Simon's Musings

February 17, 2010

Git Tip 2: Splitting up existing changes

Filed under: Uncategorized — sxw @ 5:31 pm

Today’s tip is pretty clearly spelled out in the manpage for git-rebase. But I find it so useful, I thought I would highlight it here.

From time to time, you end up with a patchset in your history that contains more than it should. It may contain files (or chunks) that were added by mistake, or a reviewer may have suggested that it would be better split into multiple changes. Git’s all-purpose rewriter of history, git rebase, can help with this.

git rebase is a fantastic tool which can let you completely change the history of your project at the drop of a hat. I’ll be writing more about it in tips to come. As with all powerful tools, you need to use its power wisely. In effect, rebasing doesn’t rewrite your history, but creates an entirely new timeline, starting at some common point in the past. If you are working on a personal repository, then this isn’t an issue. If you are sharing your repository, you should think very carefully about rebasing it as your collaborators will have based their work on the old tip of your tree. When you rebase, you create a new branch, with a new tip, of which your collaborators will be completely unaware.

How to split a change is described in detail in the git rebase manpage, so we’ll just recap it here.

  • Find the SHA1 hash of the change you wish to split – this is probably easiest done by running
    git log --oneline
  • Start up an interactive rebase, with the start point being the parent of the commit you wish to split. The git notation HASH^ (for example, deadbeef^) will give you the parent of a commit. So, run:
    git rebase -i HASH^
  • In the editor window that appears, replace the word pick beside the change you want to split with edit. This instructs git to stop the rebase operation at this point, and to allow you to modify this change.
  • Save the file, and exit your editor. Git will now start the rebase operation.
  • When it reaches the change you are modifying, git will pause the rebase, and return control. You can now modify this change. Make a note of the SHA1 id it has stopped at, as this will come in handy later!
  • git reset HEAD^ will move HEAD back to the parent commit and reset the index to match it, while leaving your files untouched. This gives you a working tree that contains the contents of the change you are splitting, with none of it committed yet.
  • Use git add, and git commit to create as many commits as you wish from this tree. If you want your original commit message, then using git commit -C HASH will commit the current index, using the same message as that in HASH.
  • When you’re done, and have committed all of the fragments, use git rebase --continue to resume the rebasing process.
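
To make the steps above concrete, here is a rough sketch that runs them end to end in a throwaway repository. The file names, commit messages and identity are invented for illustration, and the GIT_SEQUENCE_EDITOR line simply stands in for editing the todo list by hand; it assumes GNU sed (for -i without a suffix argument).

```shell
#!/bin/sh
# Sketch: split one commit into two, in a scratch repository.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"   # placeholder identity
git config user.name "Example User"

# History: a base commit, a commit that wrongly bundles two files,
# and a later commit on top of it.
echo base > base.txt
git add base.txt
git commit -q -m "Base commit"
echo one > a.txt
echo two > b.txt
git add a.txt b.txt
git commit -q -m "Add a.txt"               # oops: b.txt went in too
echo later > c.txt
git add c.txt
git commit -q -m "A later change"

# The commit to split (normally read off 'git log --oneline').
HASH=$(git rev-parse HEAD^)

# Start the interactive rebase at its parent. The sequence editor
# changes the first 'pick' to 'edit', as you would do by hand.
GIT_SEQUENCE_EDITOR="sed -i -e '1s/^pick/edit/'" git rebase -i "$HASH^"

# The rebase has stopped at the commit: undo it, but keep the files.
git reset -q HEAD^

# Recommit in pieces.
git add a.txt
git commit -q -C "$HASH"                   # reuse the original message
git add b.txt
git commit -q -m "Add b.txt"

# Replay the remaining commits on top of the split ones.
git rebase --continue
git log --oneline
```

The final log should show four commits where there used to be three, with the later change carrying a new SHA1.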

Providing all goes well, you’ll end up with a modified history, and a change that has been split into multiple parts. You’ll note that every change after the split now has a new SHA1 – to git, these are completely new changes, as they have a different history, and this is the whole reason why rebases can be dangerous.

February 16, 2010

Git tip of the day #1

Filed under: Uncategorized — sxw @ 9:24 am

This is the first of an occasional series of helpful hints and tips for the git revision control system. Look for articles tagged with ‘git’.

Cleaning your local commits

Many of the projects I submit to are picky about both trailing and embedded whitespace. git show and git diff will let you see these problems, and if you are pulling in a patchset from elsewhere, git apply --whitespace=fix will tidy them up for you. But I’d always thought that doing it in your own tree was harder. However …

git rebase -f --whitespace=fix origin/master

… will rebase your current branch (assuming that the branch point was origin/master), and also clean up any embedded whitespace problems along the way. Obviously this has all of the caveats that rebasing carries – you don’t want to do it on a tree that others are working from. But as a way of cleaning up local changes before pushing them into gerrit, it’s remarkably useful.
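
If you want to see the effect before trying it on real work, the following sketch reproduces it in a scratch repository. A local branch called upstream stands in for origin/master, and the file contents are made up; git diff --check is used only to report the whitespace problems before and after.

```shell
#!/bin/sh
# Sketch: clean trailing whitespace out of a local commit with rebase.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"   # placeholder identity
git config user.name "Example User"

printf 'clean line\n' > file.txt
git add file.txt
git commit -q -m "Upstream state"
git branch upstream                        # stands in for origin/master

# A local commit that introduces trailing spaces.
printf 'clean line\ntrailing space   \n' > file.txt
git commit -q -a -m "Local change with trailing whitespace"

# Report the problems; --check exits non-zero when it finds any.
git diff --check upstream HEAD || true

# Replay the local commits, fixing whitespace as each one is applied.
git rebase -q -f --whitespace=fix upstream

# This time the check should pass silently.
git diff --check upstream HEAD && echo "no whitespace errors left"
```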

May 26, 2009

Converting OpenAFS to git

Filed under: Uncategorized — sxw @ 12:46 pm

For a while now, there have been plans afoot to convert OpenAFS’s CVS repository to git. A number of attempts have been made, which have all stalled due to the complexity of the underlying problem, and issues with the existing tools. Previously, it was felt that the main hurdle to a successful conversion was OpenAFS’s use of ‘deltas’ to provide a changeset style interface on top of CVS. A delta is a collection of related changes, grouped using a comment in the CVS revision log. However, unlike a real changeset, there is no requirement that a delta’s changes be contiguous. A file may be modified by delta A, then by delta B, and then modified by delta A again. This makes it impossible to properly represent all deltas as single changesets. In addition, abuse of deltas within OpenAFS has caused some to span branch or tag points, again making it impossible to represent those deltas as a changeset without destroying the repository history. For many months now, people have been trying to produce conversion tools that achieve as close to a 1 to 1 correspondence between deltas and changesets as is possible, just leaving the troublesome cases as multiple commits.

Frustrated with the lack of progress of this approach, I decided to do a quick and dirty conversion, with a view to getting something completed by the start of coding for this year’s Summer of Code (which I’ve missed) and the yearly Best Practices Conference (which I might just make). I decided to not concern myself with merging deltas at all, but instead use cvsps and the existing git-cvsimport tool to produce a tree where the branch heads and all tag points matched, and which retained enough information to reconstruct deltas without forcing them to be single changesets. In order to be able to perform simple manipulations, I decided to create a perl script which would post-process the cvsps output before feeding it to git. I also created a tool which would check out every branch and tag from cvs, and compare them to the corresponding item in git, and report on any errors. Pretty straightforward, I thought …
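
The checking tool itself isn’t shown here, but the idea behind it can be sketched in a few lines of shell. This is not the actual tool: compare_tag is a hypothetical helper, and the CVS side is simulated by plain directories that, in real use, would come from something like cvs export -r TAG.

```shell
#!/bin/sh
# Sketch: compare the tree at a git tag with an exported directory.
set -e

# compare_tag REPO TAG EXPORTED_DIR  (hypothetical helper)
# Prints "OK <tag>" and succeeds only if the trees match file for file.
compare_tag() {
    ct_repo=$1
    ct_tag=$2
    ct_exported=$3
    ct_scratch=$(mktemp -d)
    # Unpack the tagged tree from git into a scratch directory.
    git -C "$ct_repo" archive "$ct_tag" | tar -x -C "$ct_scratch"
    if diff -r "$ct_scratch" "$ct_exported" >/dev/null; then
        ct_status=0
        echo "OK $ct_tag"
    else
        ct_status=1
        echo "MISMATCH $ct_tag"
    fi
    rm -rf "$ct_scratch"
    return "$ct_status"
}

# Demo data: a one-commit repository and two stand-in "CVS exports".
work=$(mktemp -d)
git init -q "$work/repo"
git -C "$work/repo" config user.email "you@example.com"
git -C "$work/repo" config user.name "Example User"
echo hello > "$work/repo/README"
git -C "$work/repo" add README
git -C "$work/repo" commit -q -m "Initial import"
git -C "$work/repo" tag v1

mkdir "$work/export-good" "$work/export-bad"
echo hello > "$work/export-good/README"
echo changed > "$work/export-bad/README"

compare_tag "$work/repo" v1 "$work/export-good"
compare_tag "$work/repo" v1 "$work/export-bad" || echo "tag v1 needs attention"
```

Running the same comparison over every branch head and tag gives a simple pass or fail report for the whole conversion.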

Unfortunately, I rapidly discovered that cvsps had significant problems with the OpenAFS repository. Many tags in CVS were simply not in the cvsps output, other tags (both those marked as FUNKY and INVALID, and those not) were in the wrong place, and branchpoints were being incorrectly determined. Rather than get into cvsps’s internals, I ended up extending my post-processing script to deal with these errors. It now performs a number of tasks:

Reordering inverted patchsets: Some of cvsps’s output gets the patchset ordering wrong, such that a patchset that does fileA:1.2->1.3 comes before fileA:1.1->1.2. The script scans through all of the patchsets for this problem and swaps any that it finds.

Tag point determination: Using the output from CVS’s rls command, it is possible to get the revision numbers of every file in a given tag. With this information, the set of patchsets from cvsps can be walked in order to identify the first patchset to satisfy every revision contained within the tag. Unfortunately, cvsps patchsets aren’t correctly ordered, so this process also works out how to reorder the patch sets such that no patchsets with file revisions higher than those in the tag occur before the tag point. This reordering is carefully performed in order to not break any tag or branch points which we have already placed! In addition, cvsps sometimes merges commits which occur over a tag point, so we also need to split patchsets which contain both files with revisions before the tag point, and files with revisions after it.

Branch point determination: The cvsps output incorrectly places many of OpenAFS’s branchpoints. Fortunately, many of these were marked by tags at the time they were created, and a hardcoded list of these is used to place some branch points in the correct position. For branches that don’t have a corresponding tag, a brute force approach is used. By examining all of the patchsets on the branch, it’s possible to determine the starting revision number of every file that’s modified on that branch – combining this with the contents of the branch head tag from cvs rls gives the equivalent of a tag marking the branchpoint. This can then be processed by the tag point algorithm to work out the correct position in which to place the branch point. This gives the patchset that the branch occurs after, rather than cvsps’s “Ancestor branch” field, which gives the first patchset on the branch. Ancestor branch is fundamentally flawed, as it doesn’t allow changes to occur on HEAD after a branchpoint is created, and before the first patch on that branch. As part of this process, git-cvsimport was modified to understand a new ‘post-patchset’ branch syntax.

Unknown branch determination: cvsps fails to place some patchsets on the correct branch. By comparing the revision numbers of files in these patchsets with those elsewhere in the list, the correct branch for all of these can be determined (this needs to be done in order that we can work out tag points, as well as being necessary for repository consistency).
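
To give a flavour of the first of these tasks, here is a small sketch of how inverted patchsets might be spotted. It is not the perl script itself: it works on an invented, much simplified listing with one line per file change (patchset number, file, old revision, new revision) rather than real cvsps output, and it only reports the inversions instead of swapping them.

```shell
#!/bin/sh
# Sketch: report patchsets that appear before the patchset creating the
# revision they start from. The input format is hypothetical.
set -e

listing=$(mktemp)
cat > "$listing" <<'EOF'
1 fileA 1.2 1.3
2 fileA 1.1 1.2
3 fileB 1.1 1.2
EOF

# Pass 1 records which patchset produces each (file, revision) pair.
# Pass 2 flags any change whose starting revision is produced later on.
result=$(awk '
    NR == FNR { producer[$2, $4] = $1; next }
    (($2, $3) in producer) && (producer[$2, $3] + 0 > $1 + 0) {
        print "patchset " $1 " must come after patchset " producer[$2, $3] " (" $2 ")"
    }
' "$listing" "$listing")

echo "$result"
```

With the sample listing this reports that patchset 1 must come after patchset 2, which is the fileA:1.2->1.3 before fileA:1.1->1.2 case described above.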

We also clean up the output to deal with problems of our own making:

Delta naming: Whilst there is a style of delta name where the branch is given first, a number of deltas don’t conform to this style, and have the same name across multiple branches. All deltas are renamed such that they are something like STABLE14-delta-name-yyyy-mm-dd.

Mistagged files: In some places, tags have been incorrectly applied such that files on HEAD are tagged as part of a branch tag. The script contains manual overrides to fix these to tag a revision on the correct branch.

Finally, because having done all of the above I had a pretty good toolset for dealing with patchsets, I implemented support for merging deltas. This merges all bar 100 or so, out of 15,000 deltas, into single patchsets. The remaining items consist of deltas which span tag or branch points (and which can never be merged) and deltas which contain conflicting changes to a single file (which it might be possible to merge, but which would require manual intervention). These deltas are tagged in a set of git references at refs/deltas/branches/<branch>/<delta>. We separate them from tags so that git tag doesn’t have to deal with over 10,000 tags, and split them into branches to avoid directory size limitations.
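
As an illustration of why a separate reference namespace helps, the sketch below records a commit under refs/deltas/ in a scratch repository and shows that it stays out of the ordinary tag listing. The branch and delta names are invented examples following the naming pattern described above; git update-ref and git for-each-ref are the standard plumbing commands for creating and listing arbitrary references, though the conversion scripts may well do this differently.

```shell
#!/bin/sh
# Sketch: keep delta markers in their own ref namespace, away from tags.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"   # placeholder identity
git config user.name "Example User"

echo one > file.txt
git add file.txt
git commit -q -m "First delta"

# Hypothetical delta name, in the <branch>/<delta> layout.
git update-ref refs/deltas/branches/STABLE14/STABLE14-example-delta-2009-05-26 HEAD

# An ordinary release tag, for comparison.
git tag release-1

echo "Tags:"
git tag -l

echo "Deltas:"
git for-each-ref --format='%(refname)' refs/deltas/
```

Only release-1 appears in the tag list, while the delta is still reachable by its full reference name.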

The resulting git tree isn’t a perfect replica of the CVS repository. It has a number of issues which are going to be really difficult to fix, and which probably aren’t earth-shattering:

It contains additional files: There are a number of places where a user has added additional directories to cvs. When another user has subsequently tagged their working directory for a release, they haven’t done a cvs update -d, and so these additional directories (and in a small number of cases, files) aren’t contained in the tag for that release. It’s impossible to create a patchset ordering which allows a git tag to not include these directories, so we end up with additional files in the git tag. I don’t think that this is a particular problem.

It’s missing a tag: There is a tag in the CVS repository (BP-openafs-rxkad-krb5-lha) which is so broken that it’s impossible to construct a patch set ordering that matches it. It is just omitted from the resulting git repository.

One branch is bad: The openafs-rxkad-krb5-lha branch was created by only branching certain files in the tree. This means that it’s impossible to create a git branch which mimics this one without creating a huge number of additional ‘pull-up’ patch sets. Whilst we include all of the changes that were made on this branch, the final branch state is very different from the one in CVS.

Some deltas are disjoint: As discussed, some deltas cannot be merged into single patchsets. This is going to require a new wdelta-style tool which understands how to merge these deltas.
