Migrating from SVN to Git while Splitting Repository

As the final stage of a recent project we were transferring our large Subversion repository into multiple Git repositories. We wanted to retain as much of the Subversion history as possible. Interestingly this turned out to be considerably more complex than expected! See the end for links to pages that various other people have written about this process.

We initially considered splitting up the repository at the same time as doing the migration. However, it turned out to be very difficult to preserve the full history this way. Doing the process in two steps – migration then splitting – was a better approach. Another reason for doing two steps was that not all of the original repository was going to be split off. A large amount of content that was legacy would be left behind as an archive.

So the plan was that we would do an initial migration to Git. We would retain this as a standalone read-only archive of everything (about 1.6GB). Active ongoing content would then be split (copied) into separate Git repositories (which would only be about 100MB in total).

Subversion to Git Migration

The normal migration approach for Subversion to Git seems to be to use git svn . This tool however appears to be more of a live synchronisation tool so that people can use both Git and Subversion at the same time. For this reason, and others, we did not find it ideal for a one off migration. We eventually discovered an excellent tool called reposurgeon which is designed precisely for one-off migration. This tool is hugely configurable, fast, and supports a lot of the spurious baggage that comes with an old Subversion repository (which in our case had started out life as an even older CVS repository). Another advantage of reposurgeon is that it automatically creates a Makefile to define migration options and manage each migration step. This encapsulates the whole process and is important as we found that a number of migration attempts were necessary before it looked like everything had worked correctly.

The first problem we hit was that when our Subversion repository was created it had not been structured into trunk and branches. It was only at a later point (when a formal branch was actually needed) that it had been reformatted into this standard structure. The reposurgeon tool assumes the standard structure and as a result history prior to the restructuring and creation of trunk was effectively lost (buried in pseudo branches that reposurgeon creates to represent the previous root structure, complicating history extraction when subsequently splitting up the repository). Since we had only ever had one formal branch (albeit a big one) we opted to migrate using a flat structure (by using --nobranch as a read option in the reposurgeon Makefile). This meant that the repository was treated as an ordinary directory tree from the root – trunk and branches had no special meaning.

The second problem we had was that, by default, the Subversion revision number was not preserved. This revision number was important for us as they had been frequently used: in comments in our issue tracking system (as a reference to the commit where a fix had been made); in formal release notes; within commits (for example, simple errors like “this change is also associated with the fix in revision r12345”). To resolve this we needed to add the --legacy write option in the reposurgeon Makefile so that every commit message was automatically annotated with the original Subversion revision number.

After this the migration went pretty smoothly and we had a clean Git repository with the full Subversion repository history.

There were a couple of minor leftover problems with our new Git repository. One of the directories in our Subversion repository had contained content from a third party which was actually just a Git repository. At the end of the migration git status reported lots of error: Invalid path 'trunk/tools/.../.git/...' messages. Since we didn’t care about retaining the actual embedded .git directory content we just did git reset --hard to remove these errors and leave a clean repository.

The other issue we had was that in the Subversion repository some files had altered over time from being normal files with content to being symbolic links and then back to being normal files with content again. In the migrated Git repository some of these files (but not all) were broken as the file was still a symbolic link pointing to a file which had a name which was the concatenation of what should have been the file content! We did not pick up on this error until later when a clone of the split off repository containing these files failed to checkout because the resulting file names were too long. We remain unclear what could have caused this. Presumably something odd at the time the particular Subversion commit was made that reverted the symbolic link back to a file and something reposurgeon wasn’t consequently able to deal with automatically. We fixed this in the end by careful rewrite of the offending commit in the history of the split off Git repository.

Splitting the Git Repository

There seem to be two approaches to this, use git subtree or git filter-branch with the --subdirectory-filter option. Both scan the entire commit history and keep just those commits containing the specified path (removing the rest). Both have limitations if you want the full history. With subtree it is only possible to scan one branch, so history recorded in any other branch will be lost. This was a significant issue until we chose to migrate the Subversion repository as a flat directory structure. However, a principal failing of both approaches is that the specified path must have remained the same throughout the repository history. So, if at any point in the history, the specified path has changed name or parentage, or content in that path has arrived into it from elsewhere in the repository then all that history will be lost. Since a significant amount of re-structuring at both file and directory level had taken place in our repository over time this limitation was a significant issue to preserving full history.

One option would be to just accept the loss of history. This was not entirely unreasonable as we were retaining the original complete migration into Git as a permanent archive. However we are not able to make that archive public because of some of the legacy content. This would have made a lot of relevant history for content in active development unavailable to anyone else. Nor would it have been that convenient, even as a local developer, to not have the full history of a file self-contained within the split off repository – so you couldn’t just do git log --follow FILE  for example. Instead, having to locally clone 1.6GB just to access a tiny fraction of that to get one files full history.

In the end we managed to find a way to do what we want (with one caveat) using a combination of techniques and because we also had the advantage of access to the full file rename history in our old Subversion repository.

Our approach was to first identify all the current files in a directory that was being split off. For each of these files the Subversion repository was then accessed to extract a full file name and path history. Then every commit in the Git repository was filtered so that it contained only these files (when they existed in that commit). The result became the split repository, where every file contained its full history irrespective of whether it changed name or path in its history. The precise steps we took for each directory being split off are outlined below.

First we created a file containing a list of all the files (including all previous names and locations) in the directory to be split off, trunk/root1 in this example case. We built this from a checked out working copy of the Subversion repository.

cd /path/to/repo/svn/working/copy
REPOURL=svn info | awk '/^URL:/{print$2}' (for f in find trunk/root1 -type f; do svn log$f | grep -o -E '^r[0-9]+' | xargs -L 1 -irev svn info -rev $f | grep '^URL:'; done)| sed -e "s?^URL:$REPOURL/??1" | uniq > /tmp/root1.files

Next we created a clone of our migrated Subversion repository so everything we did was isolated and the original could be used again to split off another repository.

cd /tmp
cd clone1


Then we ran the filter. We used --index-filter as it is a lot faster than --tree-filter and has an identical result in this particular case. For each commit this filter first removes all the files and then selectively re-instates only those files which existed in that commit and which are also listed as being part of the split directory (eventually, even if not at this particular point).  The filter also removes any resulting empty commits and re-writes tags.

git filter-branch --prune-empty --index-filter 'git rm --cached -qr -- . && cat /tmp/root1.files | xargs -n 1 git reset -q $GIT_COMMIT --' --tag-name-filter cat -- --all  Now we had a modified repository just containing the current files in the root1 directory (actually trunk/root1) but with their full history. Next we tidied up by cloning (a simple way of leaving behind all the objects that have now been orphaned by the filter) and did some shuffling so that the content of trunk/root1 became the content at the top level of the new repository. cd /tmp mkdir root1 cd root1 git init git remote add origin ../clone1 git pull origin master git remote rm origin git mv trunk/root1 . git clean -d -x -n git clean -fd git mv root1/* . git clean -d -x -n git clean -fd git status -s git commit git status git log  After this we simply cloned again into a bare repository and pushed this onto our Git server. Then we moved onto the next directory to be split off. There is one major caveat with this approach. Only current files (those in the most recent commit) and their history is included. Any files that used to exist in the split directory path but have at some point been deleted will not be included. This may or may not matter. For us it did not as one of the reasons for splitting into separate repositories was to lose the 1.5GB of legacy files that were no longer of any relevance, so we did not want them retained. Also we still had the fallback of referring to the original migrated archive with all content as a last resort. Another problem with this approach was performance. Our repository had over 12,000 commits. The filter has to process each commit one by one. This was alright when the number of files (and their path and name history) for a split directory was small, but when it was in the thousands or more (as some were in our case) the time to process each commit became significant (from fractions of a second to tens of seconds). There is probably a better way to do this. Perhaps using a combination of this approach to get the unique file path changes and then using the filter to do specific commit history grafting at each change point. References http://www.catb.org/~esr/reposurgeon/reposurgeon.html https://www.atlassian.com/blog/git/tear-apart-repository-git-way https://stackoverflow.com/questions/14759345/how-to-split-a-git-repository-and-follow-directory-renames https://medium.com/making-soft-things/where-is-git-split-98dfbaec3d9a https://superuser.com/questions/241660/how-can-i-trace-the-history-of-a-files-name-in-svn https://stackoverflow.com/questions/19954485/extract-multiple-directories-using-git-filter-branch https://stackoverflow.com/questions/34568849/git-split-repository-directory-preserving-move-renames-history https://kevin.deldycke.com/2011/02/moving-git-subtree-repository/ http://git.661346.n2.nabble.com/Remove-all-files-except-a-few-files-using-filter-branch-td7567155.html using mock to test rpm packages Our PkgForge service is a locally developed build farm. We use this for automatically building .src.rpm files for the multiple platforms and architectures we often have to support concurrently . The service uses Mock underneath. It works very well for known working packages and a “send and forget” approach. It is not so good however for debugging build issues (although that was never its intended purpose). Really hit this recently when trying to build a new locally developed package for wider distribution. Being used to only having a few issues to resolve normally (commonly just missing pre-requisite packages), started off using the build farm service. But the numerous iterations needed and the slow rebuild cycles were getting tedious. Eventually an issue that really needed direct access to the build root to debug was the final nail in the coffin. Previously been put off using Mock manually by assuming that it would be complicated. Turns out it is an absolute breeze and definitely the way to pre-test, or even final build if you want speed and controlled (re)submission, new packages (particularly when you only need to worry about distribution for one architecture and platform). So here is a brief howto (as root).  yum install mock  Mock comes with the necessary configuration file for our main current platform and architecture already, so in principle you need do nothing more than:  mock -r epel-7-x86_64.cfg --rebuild PACKAGE.src.rpm  You can then follow the output live as it builds. If you are running this on your desktop you will also find it runs refreshingly fast. The build root and results are also kept around for subsequent analysis in:  /var/lib/mock/epel-7-x86_64/root /var/lib/mock/epel-7-x86_64/result/{build,root,state}.log  Commonly you may need to include our own local repositories. This is also simple, for example to use our “inf” repository:  cp /etc/mock/epel-7-x86_64.cfg /tmp  Edit and add the following within the set of repositories :  [el7inf] name=el7inf baseurl=http://http.pkgs.inf.ed.ac.uk/rpms/os/el7/x86_64/inf priority=1 enabled=1 gpgcheck=0  then:  mock -r /tmp/epel-7-x86_64.cfg --rebuild PACKAGE.src.rpm  If you get failures during build you can simply chroot and test the build environment, do further debugging and manual package/file installations and rebuilds:  chroot /var/lib/mock/epel-7-x86_64/root  Note to the wicked: if you are using Java you are likely to find that after doing the chroot you also need to do:  mount -t proc none /proc  PkgForge can automatically build multiple locally dependent packages in one go. If you need to do this with a local mock setup then you need to manually build each package one by one and submit them so they are accessible to mock (or you could investigate the feature of mock that means it can dynamically generate and update a yum repository from the results of previous builds). Python Command Scripts You would think trying to install and package a Python package and module with a command script would be easy. Turns out this is a twisty minefield on top of shifting quicksand just itching to catch the unwary. Found this article though which was most helpful. Function Conversion – Test Case 2 Next function looked at was one that runs its processing through the event queue. There were some core structural changes here because of limitations in PostgreSQL triggers. The old method was based on statement data set processing which added multiple entries for an event to the queue. The event daemon deleted events from the queue on notification which triggered the processing functions again via a statement data set. The event notification system can be replaced almost like for like in PostgreSQL with NOTIFY/LISTEN. The new structure involves the event daemon updating records in the queue to flag a particular set of records for an event to be processed. The processing function can then query the queue for these updated records associated with its event handling as a virtual data set in much the same way as the original virtual statement data set. The function must then explicitly delete the entries from the event queue after processing is finished. This is all working now and there is a PostgreSQL equivalent event daemon (still in testing) and a re-implemented function carrying out queue entry and queue removal for unit work marks. Some new issues to be aware of in conversion are below. • PostgreSQL does not do short circuit evaluation, so IF OLD IS NOT NULL AND OLD.X = S will not work when OLD is NULL as the remainder of the condition will still be evaluated causing an error. Conditions must be written like, IF OLD IS NOT NULL THEN IF OLD.X = S THEN ELSE … END IF; ELSE … END IF; which tends to be a lot more clunky. • Most of the unit processing functions are multi-talented and cover different sets of update types with the same code base. Handling of this is controlled by trigger updates based on changes to specific named columns. Of course you can’t do this in PostgreSQL, so all the separate trigger conditions must be amalgamated into one and folded into the start of the function itself which of course doesn’t do short circuit evaluation … This logic is generally attached to the function pushing the event into the queue so it is at least contained in a reasonably clear and consistent place. • Again due to the lack of a CASE statement in OpenIngres a table called “er_relop_map” was used to provide a query based mechanism to do IF A > B THEN 1 ELSE 0 type mappings by using normal joins. This could probably be replaced by simply using CASE statements in the expressions directly and remove the need for the additional table and joins. This hasn’t been done yet (waiting until a the core unit processing functions have been re-implemented). Function Conversion – Test Case 1 Some monologue for converting the first of the unit assessment functions. This was the us function, standing for unit score. It recalculates statistical values for any given “unit” (built from templates for each of a module, course and year). It is triggered via a touch process from any changes in associated unit marks. We aren’t looking at that specific trigger process here, although the re-implementation of the touch process is included. It is also triggered when unit marks are deleted or the ignore status changes. We aren’t going to cover any of the general substitution conversion parts of the process here, just interesting things specific to this function. • Using the old nomenclature, the function is triggered on U(s=s)cs; D,U(i)cm. So an UPDATE on class_score when the update affects the session@ column and the value is unchanged, a DELETE on the class_mark table or an UPDATE to the ignore column on the class_mark table. These have been replaced by Ucs; D,Ucm in PostgreSQL. These are separate triggers. In the latter case logic was added to the function so that the function will drop out when TG_OP is UPDATE and TG_TABLE_NAME is class_mark and OLD.ignore = NEW.ignore. In the former case logic was added to the function so that the function will drop out when TG_OP is UPDATE and the update changes any non-derived column or the update does not change a derived column to NULL (this latter is critical to avoid loops as the function itself updates the class_score table thus re-triggering itself). • Logic was added so that the Vsession and Vunit variables always have values (by assigning from NEW.session or OLD.session depending on TG_OP). Lets consider an alternative approach. • We could make the function an ordinary non-trigger function. It is called directly by the function that normally forces the touch and it is called indirectly via a trigger wrapper function to cover the triggers from class_mark. The wrapper function in this cases handles the ignore column check logic. The problem with this is that the function causing the touch is using statement based set processing to update batches of rows, this could be done with a select loop, however it is as yet unclear how to deal with statement based processing in PostgreSQL – that should be the next function to convert! • Alternatively the function that normally forces the touch could do so by deleting the corresponding row(s). The function would have to then re-insert, but it could do this using a single SELECT. However this approach may not work for other cases. Finally although the new function behaves the same way there are some improvements. • The original OpenIngres function executes 14 separate SELECT statements to collate numbers of different grades. In PostgreSQL with the use of the CASE expression function these could all be reduced to a single SELECT with 14 column results instead. • The original OpenIngres function executes two SELECT statements – one to collate MIN, MAX and AVG aggregates and one to calculate the standard deviation as a further aggregation using the previously calculated AVG values. However, PostgreSQL has a STDDEV aggregate function that actually produces identical results so that could be used instead to reduce this to a single SELECT. It would also remove the need to ensure that done counts of less than 2 do not need to be nullified and coalesced back to 0 (this was done to avoid a division by zero error when the standard deviation was aggregated over one row). • The two new PostgreSQL SELECT statements could be rolled into one now, so the original 16 are done as one. None of the refinements above have been made yet pending alterations to other key functions to see how everything will work together. Function/Trigger Conversion – Specific Stuff With some functional affect the following are specific implementation issues with conversion from OpenIngres rules and procedures to PostgreSQL triggers and functions. One of the things that happens a lot is the use of a “touch” to push changes. This is implemented by one function updating a column value in another table to the same value, generally session@. So the triggering SQL would be something like “UPDATE class_score SET session@ = session@ WHERE …” constrained based on current processing set of data. This triggers the procedure by a rule of the form “AFTER UPDATE(session@) ON class_score WHERE session@ = session@”. Basically the procedure is only triggered when session@ is updated to the same value. This cannot easily be mapped to PostgreSQL at all. Firstly it does not have the column restricted update nor does it have the constraining clause, you can simply do “AFTER UPDATE ON class_score”. Secondly you can’t as a result test for “session@ = session@” within the procedure body because it could very well equal itself on any other updates to that table which change other columns but not the “session@”. There is no simple (compatible) fix to this problem. Instead a different approach will be taken which also offers enhanced functionality. Changes are pushed in this way to force derived marks to update (from known underlying events to avoid too many multiple re-calculations). Instead of pushing a “touch” now a change is pushed by NULLifying all derived column values in the target table. The function triggers on update but internal logic now only triggers re-calculation when the update has only modified to NULL derived columns. This loosely replicates the previous behaviour with the advantage that derived columns are given an indeterminate state prior to re-calculation and any glitch that causes re-calculation to fail will leave them in the indeterminate state rather than with values calculated previously which may now be incorrect. To further clean things up permissions on the table should prevent derived columns being changed except via the function. Function/Trigger Conversion – Trickier General Stuff Carrying on with function conversion, again without functional change. The following are more structural changes due to language differences in OpenIngres stored procedures and PostgreSQL plpgsql. • Return value. OpenIngres does not need one. In the case of functions called via rules (triggers in PostgreSQL parlance) “before” an event the return value does not matter, OpenIngres will carry on with the event regardless (except where the function has raised an exception). In PostgreSQL a return value is required. If that return value is NULL then the event triggering the function will be aborted in the same way as an exception. Hence some care is required to ensure that the function converted to PostgreSQL behaves the same way. This means ensuring that all function exit points return either NEW or OLD depending on what they are triggered on which leads us to … this matters less for functions triggered “after” events as the return value is discarded (however a return value is still required for any trigger function, so at the least a “RETURN NULL” is required for these). • In OpenIngres on a delete or an insert triggered function the NEW and OLD content is the same and can be used interchangeably. In PostgreSQL in a delete NEW is NULL (and any reference to NEW.x would raise an exception) and in an insert OLD is NULL (and any reference to OLC.x would raise an exception). This is a bit of a pain. In general it needs the function needs to do more checking about how it has been called. This can be done using the TG_OP value. For example, a function that can be called on any event and just returns the event row value which in OpenIngres would have nothing in PostgreSQL needs to be written as “IF TG_OP = ‘DELETE’ THEN RETURN OLD; ELSE RETURN NEW; END IF;”. • Arguments. In OpenIngres these are named in the function definition, for example “CREATE PROCEDURE a ( x = CHAR(2), y = INT ) AS …”. Although you can do similarly for PostgreSQL in normal functions you cannot in trigger functions, you must look at the TG_ARGV content. The example would be re-written as “CREATE FUNCTION a() AS … DECLARE x CHAR(2) := TG_ARGV[0]; y INT := TG_ARGV[1]; …”. However in some cases you may be able to do away with any arguments at all replacing them by for example TG_NAME (name of the trigger that triggered the function), TG_TABLE_NAME (name of the table that triggered the function), TG_OP etc. • Arguments. In OpenIngres value arguments can be passed by the rule depending on the triggering row. In PostgreSQL only string literal arguments can be passed. In cases so far the equivalent NEW.x or OLD.x within the function can be used. • Update columns. In OpenIngres rules the update trigger can be constrained to work on only certain columns, for example “ON UPDATE(x)” would only trigger on an update writing a value to the “x” column (even if the value is the same as the current value which causes other problems with our local conversion). In PostgreSQL can only do “ON UPDATE” so any column changing triggers the procedure. These must be folded back into the function itself as additional logic (particularly if it assumes it can only be called on change to particular columns). For example, “IF TG_OP = ‘UPDATE’ AND OLD.x <> NEW.x THEN … END IF;”. • Contraints. In OpenIngres rules the triggering can be further constrained by a “WHERE” clause so for example where “ON UPDATE(x) WHERE x <> y” will only trigger when “x” updated to be not the same as “y”. PostgreSQL does not support this, the logic must be folded back into the function itself. • Statement processing. In PostgreSQL functions triggered by statement have no content in OLD or NEW. In OpenIngres you can pass the whole set of affected data as (in essence) a virtual table that can be used as a normal table in SQL statements within the function. This is powerful and makes large batch updates very efficient. Haven’t worked out how to translate this across to PostgreSQL yet … Function Conversion – Easy Substitution Type Stuff Started on converting triggers and functions from OpenIngres to PostgreSQL. This is a list of the main easy things to do first for function conversion. These are all non-affective (should not have any functional change) and are just substituting one thing for another in a simple way. • Convert the actual function name, in general just replace “$” with “_”.
• Convert referenced table and column names. This was covered in a previous post but there have been some additions, so below is the current rule set.
• First table name mapping.
• if OI_TABLE_NAME = “order” then PG_TABLE_NAME = “ing_order”
• else if OI_TABLE_NAME ends with “@” then PG_TABLE_NAME = OI_TABLE_NAME – “@” + “_key”
• else if OI_TABLE_NAME ends with “#” then PG_TABLE_NAME = OI_TABLE_NAME – “#” + “_num”
• else PG_TABLE_NAME = OI_TABLE_NAME
• Next column name mapping.
• repeat the following until no clause matches:
• if OI_COLUMN_NAME = “oid@” then PG_COLUMN_NAME = “_gui_oid”
• else if OI_COLUMN_NAME = “order” then PG_COLUMN_NAME = “ing_order”
• else if OI_COLUMN_NAME = “grant” then PG_COLUMN_NAME = “ing_grant”
• else if OI_COLUMN_NAME = “group” then PG_COLUMN_NAME = “ing_group”
• else if OI_COLUMN_NAME = “left” then PG_COLUMN_NAME = “ing_left”
• else if OI_COLUMN_NAME = “right” then PG_COLUMN_NAME = “ing_right”
• else if OI_COLUMN_NAME ends with “@” then PG_COLUMN_NAME = OI_COLUMN_NAME – “@”
• else if OI_COLUMN_NAME ends with “#” then PG_COLUMN_NAME = OI_COLUMN_NAME – “#”

• if ( OI_COLUMN_NAME ends with “term” and ( OI_TABLE_NAME = “class_group_event” or OI_TABLE_NAME = “module_work_event” or OI_TABLE_NAME = “class_work_event” or OI_TABLE_NAME = “module_event” or OI_TABLE_NAME = “module_group_event” or OI_TABLE_NAME = “year_group_event” ) ) or ( OI_COLUMN_NAME ends with “pages” and OI_TABLE_NAME = “publication” ) or OI_COLUMN_NAME ends with “session” and OI_TABLE_NAME = “class” ) then PG_COLUMN_NAME = OI_COLUMN_NAME + “_num”
• else if OI_COLUMN_NAME starts with “_” and OI_COLUMN_NAME ends with “$d” then PG_COLUMN_NAME = OI_COLUMN_NAME – “$d”
• else if OI_COLUMN_NAME ends with “$d” then PG_COLUMN_NAME = “_” + OI_COLUMN_NAME – “$d
• Variable name mapping. In OpenIngres variables in a function are preceeded by “:”. This has the advantage of clearly distinguishing variables with the same names as tables and columns from the actual tables and columns. PostgreSQL does not tag variables in such a way. So for example in “select * from x where a = :b and b = :a”. In OpenIngres it is clear that “:b” is a reference to variable “b” and “:a” is a reference to variable “a”. However, in PostgreSQL “select * from x where a = b and b = a” the “a” and “b” references are all interpreted as columns in table “x”. To be safe we should have a standard mapping for variable names. Since variable names often match column names the first mapping to apply is the standard column name mapping above, then preceed the variable name with a “V” (although PostgreSQL is case insensitive here uppercase helps to distinguish the variable). For example, “:mark\$d” would become “V_mark” and “:session@” would become “Vsession”. The same applies in variable declarations, here OpenIngres does not require the leading “:”, but the variable must obviously still by “V” in PostgreSQL to match later references.
• Variable declaration mapping. In OpenIngres the following syntax is used “NAME1, NAME2, NAME3 TYPE” which declares all NAME variables as the same type. In PostgreSQL this has to be rewritten as “NAME1 TYPE; NAME2 TYPE; NAME3 TYPE”.
• Variable type mapping. The following apply so far. “INTEGER2” (or “INT2”) -> “SMALLINT”. “FLOAT8” -> “FLOAT”. “FLOAT4 -> REAL”. For consistency also map all CHAR(X) types to VARCHAR(X).
• Cast mapping. The following OpenIngres functions are used for casting, “CHAR(X)”, “FLOAT4(X)”. In PostgresSQL these are re-written as “TEXT(X)” and “CAST(X AS REAL)” respectively.
• Null mapping. In OpenIngres the IFNULL(X,Y) function is used a lot. In PostgresSQL this is re-written as “COALESCE(X,Y)”. In fact “COALESCE()” can take multiple arguments so nested IFNULLs could be replaced by one COALESCE. For example, “IFNULL(X,IFNULL(Y,Z))” would just become “COALESCE(X,Y,Z)”.
• Boolean mapping. In OpenIngres there was no boolean type, instead an INT type with a value 0 for false and -1 for true was used (-1 because this worked more naturally with the graphical interface). This means there are conditional tests like “_ignore <> -1” for example. Since in PostgreSQL we have converted these columns to real boolean types conditions like this would now fail and need to be rewritten as “_ignore <> TRUE” for example. Use the literals “TRUE” and “FALSE” in place of “-1” and “0” where a comparison is made on a boolean column.
• The behaviour of MIN(), MAX() and AVG() aggregate functions in PostgreSQL differs from OpenIngres when the SELECT returns no rows. In OpenIngres they will have the value 0 whereas in PostgreSQL it will be NULL. For compatibility they should be rewritten, for example, “MAX(X)” should be re-written as “COALESCE(MAX(X),0” in PostgreSQL.
• Statement terminators. This is a semi-colon in both. However in OpenIngres you can get away without a terminator in a number of places (for example single statements within “IF … ELSE … ENDIF” clauses). In PostgreSQL make sure every statement is terminated with a semi-colon.
• Functions in PostgreSQL to to be marked as “SECURITY DEFINER” so they run as the user that created them rather than the user that invokes/triggers them. This was the default in OpenIngres. This allows the procedure to carry out any operation even if the user does not have permission to do the same manually. In addition to this change functions should have the default public right to execute revoked and specific execute permissions granted as per base tables. The Theon schema can handle this easily enough by adding capability elements to the function definition although this isn’t implemented yet.

Annotation

Ok, looks like we should use name space extensions under <xs:appinfo> to do extensions as its the standard way and won’t give us any back end problems. This does make the resulting schema definition clunky however its not necessarily the case that it would always be hand edited – we would expect a simple ER design tool to generate it for example, or we envisage translators to go from an SQL DDL or any other data model representation to a Theon XML Schema. We could even provide a slightly abstracted and more user friendly version of the schema language that is then translated via XSLT to the base standard. So not overly concerned about the clunkiness at this level. Next to define the structures under <xs:appinfo> to represent a simple RBAC (Role Based Access Control) system (because XACML looks fiendish and massively over complex for what we need), triggers and procedure wrappers (procedures themselves are code and could be in all manner of different base languages so we don’t intend to do anything other than provide a way of encapsulating them for version control and distribution) and indexes as these are used extensively for performance improvement in the current system.

Extending XML-Schema

Natively in XML Schema we can nicely express tables, columns, keys and referential integrity. However, our data model schema needs more than this – we need to express permissions/roles, triggers/procedures and indexes/views at the very least. We already use the annotation extension feature of XML Schema for adding some meta data (documentation and data supporting the existing user interface) so what are the options for adding these other structures, which are not strictly speaking meta data (although they would not be present in an XML document that instantiated content from the schema)?

• Use a separate name space with structures under <xs:appinfo>, as we do for meta data.
• Use foreign namespaced attributes. These can be added to any XSD element and work a bit like extensions under <xs:appinfo> but a more restricted since they can only take simple types.
• Extend our “implicit” structure of database->clusters->tables->columns with a parallel “abstract” structure representing the additional information. This would mean, for example, defining <xs:element name=”wibble” abstract=”true” type=”Role”> where “Role” is defined like any other complexType. Such elements could never be instantiated but are still part of the schema which we process directly. This is probably the most user friendly approach, but we would need a separate schema to define it (ie. this would move out of our existing schema representation to an “instantiation” (substitutedtype) of some kind of “ancillary data” schema.
• Copy and extend the base XML-Schema directly. Problem with this approach is it would likely break standard parsing tools.

For now we will assume a structure under <xs:appinfo> with a separate namespace defining the schema for it as we do for meta data.