Simon's Musings

May 26, 2009

Converting OpenAFS to git

Filed under: Uncategorized — sxw @ 12:46 pm

For a while now, there have been plans afoot to convert OpenAFS’s CVS repository to git. A number of attempts have been made, all of which have stalled due to the complexity of the underlying problem and issues with the existing tools. Previously, it was felt that the main hurdle to a successful conversion was OpenAFS’s use of ‘deltas’ to provide a changeset-style interface on top of CVS. A delta is a collection of related changes, grouped using a comment in the CVS revision log. However, unlike a real changeset, there is no requirement that a delta’s changes be contiguous. A file may be modified by delta A, then by delta B, and then modified by delta A again. This makes it impossible to properly represent all deltas as single changesets. In addition, abuse of deltas within OpenAFS has caused some to span branch or tag points, again making it impossible to represent those deltas as a changeset without destroying the repository history. For many months now, people have been trying to produce conversion tools that achieve as close to a one-to-one correspondence between deltas and changesets as possible, leaving just the troublesome cases as multiple commits.

Frustrated with the lack of progress of this approach, I decided to do a quick and dirty conversion, with a view to getting something completed by the start of coding for this year’s Summer of Code (which I’ve missed) and the yearly Best Practices Conference (which I might just make). I decided not to concern myself with merging deltas at all, but instead to use cvsps and the existing git-cvsimport tool to produce a tree where the branch heads and all tag points matched, and which retained enough information to reconstruct deltas without forcing them to be single changesets. In order to be able to perform simple manipulations, I decided to create a Perl script which would post-process the cvsps output before feeding it to git. I also created a tool which would check out every branch and tag from CVS, compare them to the corresponding item in git, and report any errors. Pretty straightforward, I thought …

Unfortunately, I rapidly discovered that cvsps had significant problems with the OpenAFS repository. Many tags in CVS were simply not in the cvsps output, other tags (both those marked as FUNKY and INVALID, and those not) were in the wrong place, and branchpoints were being incorrectly determined. Rather than get into cvsps’s internals, I ended up extending my post-processing script to deal with these errors. It now performs a number of tasks:

Reordering inverted patchsets: Some of cvsps’s output gets the patchset ordering wrong, such that a patchset that does fileA:1.2->1.3 comes before fileA:1.1->1.2. The script scans through all of the patchsets for this problem and swaps any that it finds.
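The scan-and-swap idea can be sketched in a few lines. This is only an illustration in Python, not the Perl script described here: the patchset layout (a dict with an "id" and a "files" map from path to an (old, new) revision pair) and the function names are assumptions made for the example, and it only compares revisions on the same line of development.

```python
def rev_key(rev):
    """Turn a CVS revision string such as '1.10' into a comparable tuple."""
    return tuple(int(part) for part in rev.split("."))


def find_inversion(ordered):
    """Return (earlier, later) indexes of two patchsets that touch the same
    file in the wrong order, or None if no such pair exists."""
    last_seen = {}  # path -> index of the latest patchset that touched it
    for i, ps in enumerate(ordered):
        for path, (_old, new) in ps["files"].items():
            j = last_seen.get(path)
            if j is not None and rev_key(ordered[j]["files"][path][1]) > rev_key(new):
                return j, i
            last_seen[path] = i
    return None


def fix_inverted(patchsets):
    """Return a copy of the list with inverted pairs swapped."""
    ordered = list(patchsets)
    # Bound the number of passes so contradictory input cannot loop forever.
    for _ in range(len(ordered) ** 2 + 1):
        pair = find_inversion(ordered)
        if pair is None:
            break
        j, i = pair
        ordered[j], ordered[i] = ordered[i], ordered[j]
    return ordered
```

Swapping whole patchsets is the simplest possible repair; a patchset that is inverted for one file but correctly placed for another would need splitting instead, which this sketch does not attempt.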

Tag point determination: Using the output from CVS’s rls command, it is possible to get the revision numbers of every file in a given tag. With this information, the set of patchsets from cvsps can be walked in order to identify the first patchset to satisfy every revision contained within the tag. Unfortunately, cvsps patchsets aren’t correctly ordered, so this process also works out how to reorder the patchsets such that no patchsets with file revisions higher than those in the tag occur before the tag point. This reordering is carefully performed in order not to break any tag or branch points which have already been placed! In addition, cvsps sometimes merges commits which occur over a tag point, so we also need to split patchsets which contain both files with revisions before the tag point and files with revisions after it.
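As a rough sketch of the walk described above (in Python for illustration; the real script is Perl and certainly differs in detail), the function below finds the first patchset after which every tagged file has reached its tagged revision, and lists the patchsets at or before that point that take a file past the tag. The data shapes and names are assumptions: each patchset is a dict with an "id" and a "files" map from path to the revision it creates, and the tag is a map from path to revision.

```python
def rev_key(rev):
    """Turn a CVS revision string such as '1.10' into a comparable tuple."""
    return tuple(int(part) for part in rev.split("."))


def find_tag_point(patchsets, tag_revs):
    """Return (index, overshoots).

    index is the position of the first patchset after which every file in
    tag_revs has reached its tagged revision, or None if that never happens.
    overshoots lists the ids of patchsets seen up to that point which move
    a file beyond its tagged revision; these are the ones that would have
    to be reordered (or split) to make the tag point valid.
    """
    pending = set(tag_revs)  # files that have not yet reached the tag
    overshoots = []
    for index, ps in enumerate(patchsets):
        for path, new_rev in ps["files"].items():
            wanted = tag_revs.get(path)
            if wanted is None:
                continue  # file is not part of the tag
            if new_rev == wanted:
                pending.discard(path)
            elif rev_key(new_rev) > rev_key(wanted) and ps["id"] not in overshoots:
                overshoots.append(ps["id"])
        if not pending:
            return index, overshoots
    return None, overshoots
```

An empty overshoot list means the tag can be placed directly after the returned patchset; anything else is work for the reordering and splitting steps.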

Branch point determination: The cvsps output incorrectly places many of OpenAFS’s branchpoints. Fortunately, many of these were marked by tags at the time they were created, and a hardcoded list of these is used to place some branch points in the correct position. For branches that don’t have a corresponding tag, a brute-force approach is used. By examining all of the patchsets on the branch, it’s possible to determine the starting revision number of every file that’s modified on that branch; combining this with the contents of the branch head tag from cvs rls gives the equivalent of a tag marking the branchpoint. This can then be processed by the tag point algorithm to work out the correct position in which to place the branch point. This gives the patchset that the branch occurs after, rather than cvsps’s “Ancestor branch” field, which gives the first patchset on the branch. Ancestor branch is fundamentally flawed, as it doesn’t allow changes to occur on HEAD after a branchpoint is created and before the first patch on that branch. As part of this process, git-cvsimport was modified to understand a new ‘post-patchset’ branch syntax.
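The brute-force construction of a branchpoint ‘tag’ might look something like the following Python sketch. It is illustrative only: the function name and data shapes are hypothetical, patchsets are assumed to be listed oldest first with a "files" map from path to an (old, new) revision pair, and an old revision of None is assumed to mean that the file was first added on the branch.

```python
def synthetic_branch_tag(head_revs, branch_patchsets):
    """Build a {path: revision} map describing the tree at the branchpoint.

    head_revs is the {path: revision} listing of the branch head.
    Files never modified on the branch keep their head revision; files
    modified on the branch are wound back to the revision they had before
    the first patchset on the branch touched them.
    """
    tag = dict(head_revs)
    seen = set()
    for ps in branch_patchsets:  # oldest first
        for path, (old_rev, _new_rev) in ps["files"].items():
            if path in seen:
                continue  # only the first change on the branch matters
            seen.add(path)
            if old_rev is None:
                tag.pop(path, None)  # added on the branch: absent at the branchpoint
            else:
                tag[path] = old_rev
    return tag
```

The resulting map has the same shape as a real tag listing, so it can be handed to whatever routine places ordinary tag points.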

Unknown branch determination: cvsps fails to place some patchsets on the correct branch. By comparing the revision numbers of files in these patchsets with those elsewhere in the list, the correct branch for all of these can be determined. This is needed both to work out tag points and for repository consistency.

We also clean up the output to deal with problems of our own making:

Delta naming: Whilst there is a style of delta name where the branch is given first, a number of deltas don’t conform to this style, and have the same name across multiple branches. All deltas are renamed such that they are something like STABLE14-delta-name-yyyy-mm-dd.
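A minimal sketch of such a renaming rule, in Python for illustration. The function name is hypothetical, and the choice to strip an existing branch prefix rather than repeat it is an assumption of the example, not something the conversion script is known to do.

```python
from datetime import date


def rename_delta(branch, name, day):
    """Return a branch-qualified, dated delta name.

    If the original name already starts with the branch (the 'branch first'
    style), the prefix is not repeated. This is an assumption of the sketch.
    """
    prefix = branch + "-"
    base = name[len(prefix):] if name.startswith(prefix) else name
    return f"{branch}-{base}-{day.isoformat()}"
```

For example, rename_delta("STABLE14", "delta-name", date(2009, 5, 26)) gives "STABLE14-delta-name-2009-05-26", so two deltas that share a name on different branches end up with distinct names.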

Mistagged files: In some places, tags have been incorrectly applied such that files on HEAD are tagged as part of a branch tag. The script contains manual overrides to fix these to tag a revision on the correct branch.

Finally, because having done all of the above I had a pretty good toolset for dealing with patchsets, I implemented support for merging deltas. This merges all bar 100 or so of the 15,000 deltas into single patchsets. The remaining items consist of deltas which span tag or branch points (and which can never be merged) and deltas which contain conflicting changes to a single file (which it might be possible to merge, but which would require manual intervention). These deltas are tagged in a set of git references at refs/deltas/branches/<branch>/<delta>. We separate them from tags so that git tag doesn’t have to deal with over 10,000 tags, and split them by branch to avoid directory size limitations.
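One way to spot deltas that cannot simply be collapsed is to look for the interleaving pattern mentioned earlier: a file changed by delta A, then by another delta, then by A again. The Python sketch below illustrates that check only; it is not the merge logic of the actual script, and the patchset layout (a "delta" name plus a list of "files") is an assumption.

```python
def interleaved_deltas(patchsets):
    """Return the names of deltas that are interrupted on at least one file.

    A delta is 'interrupted' on a file if it modifies the file, a different
    delta then modifies the same file, and the first delta modifies it again.
    """
    history = {}  # path -> delta names in the order they touched the file
    for ps in patchsets:
        for path in ps["files"]:
            history.setdefault(path, []).append(ps["delta"])

    interrupted = set()
    for deltas in history.values():
        finished = set()  # deltas already followed by a different delta
        previous = None
        for name in deltas:
            if name != previous:
                if name in finished:
                    interrupted.add(name)
                if previous is not None:
                    finished.add(previous)
            previous = name
    return interrupted
```

Deltas reported here are candidates for staying as several commits; everything else is at least not ruled out by this particular test.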

The resulting git tree isn’t a perfect replica of the CVS repository. It has a number of issues which are going to be really difficult to fix, and which probably aren’t earth-shattering:

It contains additional files: There are a number of places where a user has added additional directories to CVS. When another user has subsequently tagged their working directory for a release, they haven’t done a cvs update -d, and so these additional directories (and in a small number of cases, files) aren’t contained in the tag for that release. It’s impossible to create a patchset ordering which allows a git tag to not include these directories, so we end up with additional files in the git tag. I don’t think that this is a particular problem.

It’s missing a tag: There is a tag in the CVS repository (BP-openafs-rxkad-krb5-lha) which is so broken that it’s impossible to construct a patchset ordering that matches it. It is simply omitted from the resulting git repository.

One branch is bad: The openafs-rxkad-krb5-lha branch was created by only branching certain files in the tree. This means that it’s impossible to create a git branch which mimics this one without creating a huge number of additional ‘pull-up’ patchsets. Whilst we include all of the changes that were made on this branch, the final branch state is very different from the one in CVS.

Some deltas are disjoint: As discussed, some deltas cannot be merged into single patchsets. This is going to require a new wdelta-style tool which understands how to merge these deltas.
