Simon's Musings

January 27, 2011

Gerrit upgrade

Filed under: Uncategorized — sxw @ 10:55 am

OpenAFS’s gerrit generally keeps pace with the latest releases from upstream, but I’ve been much more tardy about upgrading the Informatics not-a-service instance.

Last night, I took the plunge, and moved this from 2.1.1.1 to 2.1.6.1. This brings with it lots of shiny new features, in particular for access control. Full details are in the release notes.

May 16, 2010

Spotting trivial rebases in gerrit

Filed under: Uncategorized — sxw @ 4:51 pm

One of the pains with gerrit is that of rebasing. Once a change has gone through its review process, it is common to find that the tree has moved on since it was written, and the change requires updating before it can be applied. On many occasions, all that is required is a trivial rebase – just moving the patch to the latest version of the tree, without actually changing any of the code. However, whenever a new patch is uploaded, gerrit forgets the existing review and verification scores. In many cases, this is a good thing – you don’t want your “This looks great, submit it!” review attached to a piece of code that is substantially different from what you reviewed. With a trivial rebase, things are different – the code hasn’t changed, it probably doesn’t need re-reviewing, and dropping the existing review and verification information just means that information is lost when the commit is eventually applied.

So, Nasser Grainawi wrote a trivial rebase hook, which uses gerrit’s hook and command line features to identify these trivial rebases and reapply all of the review scores when it identifies one. I’ve enabled this hook for OpenAFS’s gerrit instance. It turns out that doing so isn’t quite as simple as it might appear, so I thought I’d document the hoops I jumped through.

1: Create an SSH key

The script will run as the gerrit user, so we need to create an SSH key pair for it to use as that user. Just run ssh-keygen as normal, and put the key pair somewhere sensible.
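
For example, something like the following will do (run it as the gerrit user; the path is just one sensible choice, and matches the IdentityFile used in the SSH configuration in step 3, and the empty passphrase is because the hook runs unattended):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa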

2: Create a User

A role user needs to be created for the script to use when interrogating gerrit. You need this, despite the fact that the script also uses the super-user functionality obtained from the server’s private key when reapplying existing scores. The development version of gerrit has a nice simple command to create new role accounts, but we’re running the stable series, which hasn’t yet got this feature. So, we get to play directly with the SQL database.

I did:

insert into accounts (registered_on, account_id, full_name, preferred_email)
    VALUES (now(), nextval('account_id'), 'Automated Hook Processor',
            '<email>');
insert into account_ssh_keys (valid, account_id, seq, ssh_public_key)
    VALUES ('Y', currval('account_id'), 1, '<ssh public key>');
insert into account_external_ids (account_id, external_id)
    VALUES (currval('account_id'), 'username:hooks');

… substituting <email> with an email address that the hook processor can be attributed with, and <ssh public key> with the public portion of the key we generated earlier.

3: Configure the SSH client

Let the ssh client know how to talk to gerrit. The script uses ‘localhost’ as the hostname to talk to, so we need to do something like:

Host localhost
    IdentityFile ~/.ssh/id_rsa
    User hooks

4: Connect to Gerrit

In order to populate the known hosts file, we need to connect to gerrit, and accept the host key. Something like

ssh -p 29418 localhost

will accomplish this.

5: Make the new user an administrator

Go into the gerrit front end, and make the user that we’ve just created an administrator.

6: Test the rebase script

Testing the rebase script independently from gerrit will let you know if everything above has worked. You can run it from the command line, as the gerrit user:

GIT_DIR=<gitRepository> \
    ./trivial_rebase.py --change <changeId> \
                        --project <projectName> \
                        --commit <commitSHA1> \
                        --patchset <patchSetNumber> \
                        --private-key-path=$site_path/etc/ssh_host_dsa_key
  • gitRepository is the path to the bare git repository that gerrit saves its changes into.
  • changeId is the full changeID (something like I97e07a6e730df8ac480d295b4cf30b0695ace511)
  • projectName is the name of the gerrit project the change belongs to
  • commitSHA1 is the git SHA1 for the patchset that’s just been uploaded
  • patchSetNumber is gerrit’s patchset number

The best way to test this is to pick a change that has been trivially rebased, but which isn’t marked as such. Go to its page in the web interface, get the changeID, the SHA1 of the newest patchset, and its number from that page, and plug them into this command. All being well, the change will be marked as a trivial rebase. If that doesn’t happen, then the logs in $site_path/logs/error_log will give an idea of what’s going on.

7: Write a wrapper

The trivial rebase script can’t be run directly from a gerrit hook, so we need to provide a wrapper script to fill in some of the gaps. The script OpenAFS is currently using is:

#!/usr/bin/perl

use strict;
use warnings;
use Getopt::Long;

my $gerritRoot="/srv/gerrit/cfg";
my $hookDir="/srv/gerrit/gerrit-hooks";

my $change;
my $project;
my $branch;
my $commit;
my $patchset;
my $result = GetOptions("change=s" => \$change,
                        "project=s" => \$project,
                        "branch=s" => \$branch,
                        "commit=s" => \$commit,
                        "patchset=s" => \$patchset);

system($hookDir."/trivial_rebase.py",
                "--change", $change,
                "--project", $project,
                "--commit", $commit,
                "--patchset", $patchset,
                "--private-key-path", $gerritRoot."/etc/ssh_host_dsa_key");

This script needs to go into the hooks directory below your gerrit site path as a file called patchset-created. In order for gerrit to run the hook, it needs to be executable.
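
As a concrete sketch (assuming the wrapper has been saved locally as patchset-created, and with $site_path standing for your gerrit site path, as elsewhere in this post):

cp patchset-created $site_path/hooks/patchset-created
chmod +x $site_path/hooks/patchset-created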

April 8, 2010

FOSDEM 2010

Filed under: Uncategorized — sxw @ 4:12 pm

From the better-late-than-never department

I spent the first weekend of February at FOSDEM 2010, a pretty much unique conference of free software developers held yearly in Brussels. Whilst I emailed these notes around internally upon my return, it’s taken a while to get around to putting them up on my blog. In that time, though, FOSDEM have published the videos they made of most of these talks. Links to those videos are included below.

FOSDEM is a completely free event, whose scale has to be seen to be believed – they have over 300 talks spread across the 2 days, and initial estimates suggested that there were approximately 10,000 delegates on Saturday (the wireless network had over 4000 active leases at one point). I split my time between attending main track sessions, listening to development room talks, and volunteering with conference support.

Bullet points for the pointy haired …

  • Mozilla don’t have any real interest in the enterprise space
  • Facebook are doing some amazing things, and actually talking about it.
  • Usable open source smartcard stacks are finally in today’s distributions.
  • Hadoop, and the NoSQL movement in general, are gaining a large amount of mindshare. If we’re not getting requests for Hadoop clusters now, it’s probably just a matter of time.
  • The pound is taking a serious beating.

In the corridors

I spoke to Chris Blizzard and Gervase Markham from Mozilla about our experiences with the move from Firefox 2->3. They think that the latest development versions will offer some improvement to the number of small disk writes, but admitted that the level of disk activity had never been a concern in benchmarking the application. Gervase had some helpful suggestions about things we could do to assist with this in the future.

There’s a growing interest in Europe around smartcards. This is primarily being driven by the fact that a number of European governments are adding smartcards to their existing national ID cards, and people want to exploit the authentication properties of these. One group were demonstrating an out-of-the-box Ubuntu system using a smartcard for authentication and document signing. Fedora apparently offers the same user experience. There are also powerful Open Source CA products becoming available, which would be very useful if we decide to go down the smartcard route at any point.

Talks

Brooks Davis gave an interesting talk about moving a large company (Aerospace, in the US) towards open source development models. He had some interesting observations on the uptake of new tools in big organisations, and on the reluctance to share code between development teams. Some of his horror stories were staggering – teams whose idea of revision control is hosting all of their source on a shared filesystem, and using a physical whiteboard to indicate who was modifying which files. Needless to say, code loss was common. Aerospace are rolling out an internal system based around Trac to encourage the use of revision control and other good software development practices company-wide.

Next, Richard Clayton gave a fascinating tour of ‘Evil on the Internet’. This was an eye-opening overview of the way in which criminals steal money over the internet, exploring the whole ecosystem: those who phish for credentials, the mules they recruit to launder the money, and the many, many fake websites they use to con you into participating. What was alarming was the realism of many of these sites – I’d become used to spotting phishing scams by their poorly formed emails, but much of what he demonstrated was disturbingly plausible.

I spent much of the afternoon in the XMPP devroom. They’re doing some very interesting work both on integrating XMPP with web applications, and in terms of moving voice and video forwards with Jingle. I popped out of that devroom to listen to a talk by Mitchell Baker about Mozilla’s mission, and their view of how they can contribute to ensuring an open internet. Inspiring stuff, but sadly she did confirm that Mozilla as a whole don’t have an interest in the enterprise space.

Due to lack of space (a common theme during the weekend, sadly) I couldn’t get into the Spacewalk talk. Spacewalk (http://www.redhat.com/spacewalk/) is the open source version of RedHat Network, and is designed to manage software updates across an enterprise-wide deployment of Linux. There was a lot of buzz about this at FOSDEM, and it looks like something we should definitely investigate.

On Sunday, I spent some time in the Mozilla devroom, in particular speaking to Ludo about current, and forthcoming, changes to Thunderbird. We discussed the current state of both GSSAPI and LDAP support (my patch enabling GSSAPI for LDAP in Thunderbird was part of the 3.0 release). I was hoping to listen to some of the NoSQL talks too, but sadly that room was overflowing every time I tried. I spoke with Peter St Andre (XMPP Foundation, now with Cisco) about improving XMPP’s Kerberos support – in particular how we can push support for domain-based names into the SASL software stack. I will be following that up with him and Alexey Melnikov after the conference. I also met with some other OpenAFS developers, for a brief chat about the state of the tree, and the move towards the 1.6 release.

On Sunday afternoon, I helped moderate the talks in the Janson auditorium, in the scalability conference track. Isabel Jost started with a fascinating talk about Apache Hadoop, which offers distributed petabyte-scale data processing, using tools modelled on Google’s MapReduce paradigm. HDFS looks pretty much like an implementation of what’s publicly known about Google’s GFS, and Hadoop layers Map/Reduce on top of that. The talk provided a great overview, firstly of MapReduce and its power, secondly of the flexibility of the Hadoop implementation, and finally an idea of the huge degree to which it can scale.

The next talk on this track was from Facebook. They started with some fairly blinding statistics – 8 billion minutes every day are spent on Facebook, and 2.5 billion photos are uploaded each month. They provided an overview of their whole infrastructure, highlighting the various projects that make it tick. Their development language is PHP, but due to issues with its speed and memory use, they’ve developed HipHop, a static analyzer and translator which converts PHP into optimised C++. For logging, they used to use syslog, but their log volumes (~25 terabytes per day) melted down all of the syslog servers that they tried, so they moved towards ‘Scribe’ – which is now used by both themselves and Twitter, and offers massively scalable log storage. To process these logs, and for other data analysis tasks, Facebook are big Hadoop users. They’ve built Hive, which puts an SQL-like layer on top of Hadoop, with the aim of encouraging ease of use, and internal adoption. Their Hadoop cluster currently uses more than 80,000 compute hours per day.

Facebook’s data store totals 160 billion photos and serves 1.2 million of them every second. Nobody’s NFS scales to this kind of data load – I/O bandwidth, rather than storage density, ends up being the limiting factor. They’ve built a new storage system called Haystack, which strips out a lot of metadata and reduces serving a file from 10 disk seeks to a single operation. memcache is crucial to Facebook’s interactive performance, but they have a bit of a love/hate relationship with it. They’ve done a fair bit of work on extending it, adding features such as 64-bit and multithreading support. Their permanent data storage is MySQL, which they like very much – simple, fast and very reliable. They keep their usage of it simple – they don’t do joins at the database layer, but combine datasets together in PHP.

One very interesting observation from Facebook was that most of their interesting projects happen as “hack projects” where a very small group of engineers work intensively on an exciting idea – Haystack, for example, was built by 3 people.

Finally, in the scalability track, there was a talk about Status.Net and identi.ca. Choice quote: “When web people talk about scalability – what they really mean is will it keep working” Status.Net is a twitter-like service which is available both as a hosted system (identi.ca) and as a locally installable server, designed for organisations who want to host their own. The talk dealt more with the overall architecture of Status.Net (formerly laconica), rather than detailing its scalability, but provided an interesting example of how various open source components could be stitched together to produce a compelling product, and a rallying cry for the importance of both libre web services, and the ownership of your own web content.

To round off a very long couple of days, Greg K-H gave a whistle-stop tour of writing your first patch for the Linux kernel. This was a very well delivered overview of a lot of complex topics, and was aimed at inspiring members of the audience to contribute, rather than being an in-depth description of the kernel. It was a very well chosen end to an inspiring weekend.

February 17, 2010

Git Tip 2: Splitting up existing changes

Filed under: Uncategorized — sxw @ 5:31 pm

Today’s tip is pretty clearly spelled out in the manpage for git-rebase. But I find it so useful, I thought I would highlight it here.

From time to time, you end up with a patchset in your history that contains more than it should. It may contain files (or chunks) that have been added incorrectly, or changes that a review comment has suggested would be better split into multiple commits. Git’s all-purpose rewriter of history – git rebase – can help with doing this.

git rebase is a fantastic tool which can let you completely change the history of your project at the drop of a hat. I’ll be writing more about it in tips to come. As with all powerful tools, you need to use its power wisely. In effect, rebasing doesn’t rewrite your history, but creates an entirely new timeline, starting at some common point in the past. If you are working on a personal repository, then this isn’t an issue. If you are sharing your repository, you should think very carefully about rebasing it as your collaborators will have based their work on the old tip of your tree. When you rebase, you create a new branch, with a new tip, of which your collaborators will be completely unaware.

How to split a change is described in detail in the git rebase manpage, so we’ll just recap it here.

  • Find the SHA1 hash of the change you wish to split – this is probably easiest done by running
    git log --oneline
  • Start up an interactive rebase, with the start point being the parent of the commit you wish to split. The git notation HASH^ – for example deadbeef^ – will give you the parent of a commit. So, run:
    git rebase -i HASH^
  • In the editor window that appears, replace the word pick beside the change you wish to split with edit. This instructs git to stop the rebase operation at this point, and to allow you to modify this change.
  • Save the file, and exit your editor. Git will now start the rebase operation.
  • When it reaches the change you are modifying, git will pause the rebase, and return control. You can now modify this
    change – make a note of the SHA1 id it has stopped at, as this will come in handy later!
  • git reset HEAD^ will reset the index to the state of the parent commit. This gives you a working tree that contains the contents of the change you are splitting.
  • Use git add, and git commit to create as many commits as you wish from this tree. If you want your original commit message, then using git commit -C HASH will commit the current index, using the same message as that in HASH.
  • When you’re done, and have committed all of the fragments, use git rebase --continue to resume the rebasing process.

Providing all goes well, you’ll end up with a modified history, and a change that has been split into multiple parts. You’ll note that every change after the split now has a new SHA1 – to git, these are completely new changes, as they have a different history, and this is the whole reason why rebases can be dangerous.
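
Putting the whole sequence together, a split might look something like this (deadbeef and the filenames here are placeholders for your own commit hash and fragments):

git log --oneline                  # find the hash of the commit to split, say deadbeef
git rebase -i deadbeef^            # change 'pick' to 'edit' beside deadbeef, save and exit
git reset HEAD^                    # rewind the commit, leaving its changes in the working tree
git add first.c
git commit -C deadbeef             # first fragment, reusing the original commit message
git add second.c
git commit -m "Second half of the split"
git rebase --continue              # replay the rest of the history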

February 16, 2010

Git tip of the day #1

Filed under: Uncategorized — sxw @ 9:24 am

This is the first of an occasional series of helpful hints and tips for the git revision control system – look for articles tagged with ‘git’.

Cleaning your local commits

Many of the projects I submit to are picky about both trailing, and embedded, whitespace. git show and git diff will let you see these and if you are pulling in a patchset from elsewhere, git apply --whitespace=fix will tidy them up for you. But, I’d always thought that doing it in your own tree was harder. However …

git rebase -f --whitespace=fix origin/master

… will rebase your current working tree (assuming that the branch point was origin/master), and also clean up any embedded whitespace problems along the way. Obviously this has all of the caveats that rebasing carries – you don’t want to do it on a tree that others are working from. But as a way of cleaning up local changes before pushing them into gerrit, it’s remarkably useful.
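
For example, a check-then-fix sequence on a local branch might look like this (assuming your work branched from origin/master):

git diff --check origin/master...HEAD
git rebase -f --whitespace=fix origin/master

The first command lists any trailing or embedded whitespace errors introduced since the branch point; the second rewrites those local commits with the problems fixed.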

January 24, 2010

GSSAPI Key Exchange for OpenSSH 5.3p1

Filed under: Uncategorized — sxw @ 12:00 pm

Once again, far far later than I would have liked, I’ve produced a set of updated patches for OpenSSH 5.3p1. Compared to previous releases this one is pretty simple – a resolved merge conflict, and a few one line patches. However, I really need to get quicker at doing these – it’s 4 months since 5.3p1 appeared, and 5.4 will be just around the corner. The announcement email was:

From the better-late-than-never-department, I’m pleased to announce the availability of my GSSAPI Key Exchange patches for OpenSSH 5.3p1. This is a pretty minor maintenance release – it contains a couple of fixes to take into account changes to the underlying OpenSSH code, and a compilation fix for when GSSAPI isn’t required. Thanks to Colin Wilson and Jim Basney for their bug reports.

I’d like to thank the distributors who’ve been patiently waiting for me to get this done – sorry once again for the delay.

Why?
----
Whilst OpenSSH contains support for GSSAPI user authentication, this still relies upon SSH host keys to authenticate the server to the user. For sites with a deployed Kerberos infrastructure this adds an additional, unnecessary, key management burden. GSSAPI key exchange allows the use of security mechanisms such as Kerberos to authenticate the server to the user, removing the need for trusted ssh host keys, and allowing the use of a single security architecture.

How?
----
This patch adds support for the RFC4462 GSSAPI key exchange mechanisms to OpenSSH, along with adding some additional, generic, GSSAPI features. It implements:

*) gss-group1-sha1-*, gss-group14-sha1-* and gss-gex-sha1-* key exchange mechanisms. (#1242)
*) Support for the null host key type (#1242)
*) Support for CCAPI credentials caches on Mac OS X (#1245)
*) Support for better error handling when an authentication exchange fails due to server misconfiguration (#1244)
*) Support for GSSAPI connections to hosts behind a round-robin load balancer (#1008)
*) Support for GSSAPI connections to multi-homed hosts, where each interface has a unique name (#928)
*) Support for cascading credentials renewal
*) Support for the GSSAPIClientIdentity option, to allow the user to select which client identity to use when authenticating to a server.

(bugzilla.mindrot.org bug numbers are in brackets)

Where?
------
As usual, the code is available from
http://www.sxw.org.uk/computing/patches/openssh.html

Two patches are available, one containing cascading credentials support, and one without. In addition, the quilt patch series that makes up this release is also provided, for those who wish to pick and choose!

Cheers,

Simon.

January 17, 2010

Debugging a Mac OS X Kernel Panic

Filed under: Uncategorized — sxw @ 12:40 pm

Adam Megacz posted to the openafs-info mailing list a kernel panic that he encountered with the 1.4.12 release candidate. Derrick and I debugged this over Jabber (with Derrick doing the kernel disassembly). I thought it might be useful to share the methods we used, and to record the command invocations for future googling.

The Problem

A Mac OS X kernel panic. Adam first posted the panic itself, then followed up with the results of running OpenAFS’s decode-panic tool. decode-panic basically drives gdb to turn a panic log with addresses into a readable backtrace with symbols – it needs to be run with the same kernel, and architecture, that created the panic in the first place.

The Analysis

The output from decode-panic looks something like:

Panic Date:      Interval Since Last Panic Report:  472905 sec
Kernel Version:  Darwin Kernel Version 10.2.0: Tue Nov  3 10:37:10 PST
2009; root:xnu-1486.2.11~1/RELEASE_I386
OpenAFS Version: org.openafs.filesystems.afs(1.4.12fc1)
=============
add symbol table from file "/tmp/afsdebugLAjeJl/org.openafs.filesystems.afs.sym"? 
0x21b2bd <panic+445>:                   mov    0x8011d0,%eax
0x2a7ac2 <kernel_trap+1530>:            jmp    0x2a7ade <kernel_trap+1558>
0x29d968 <lo_alltraps+712>:             mov    %edi,%esp
0x4607e500 <afs_GetDCache+7832>:   mov    0x64(%edx),%ebx
0x46078a18 <BPrefetch+144>:             mov    %eax,-0x3c(%ebp)
0x4607928d <afs_BackgroundDaemon+573>:  jmp    0x460792cb <afs_BackgroundDaemon+635>
0x460e76a7 <afsd_thread+719>:           call   0x2a013e <current_thread>
0x29d68c <call_continuation+28>:        add    $0x10,%esp

The afs_GetDCache+7832 line is the point where the panic actually occurred (all of the stuff before it is related to the trap mechanism). The lines afterwards tell us how we arrived in afs_GetDCache – for this particular problem, we’re not concerned about the callers, so we can ignore these.

If we look at the raw panic log, we see:

CR0: 0x8001003b, CR2: 0x00000064, CR3: 0x00101000, CR4: 0x000006e0
EAX: 0x00100000, EBX: 0x00000000, ECX: 0x460870e2, EDX: 0x00000000
CR2: 0x00000064, EBP: 0x34cabf1c, ESI: 0x0bff4004, EDI: 0xffffffff
EFL: 0x00010297, EIP: 0x4607e500, CS:  0x00000004, DS:  0x0000000c

So, at the time of the panic, EDX was probably NULL. This implies that the bug is a NULL pointer dereference of some kind, and that we’re looking up something that’s 0x64 bytes into the structure that’s pointed at by EDX.

If we had a kernel with debugging symbols, then we could look directly at the C source which corresponds to afs_GetDCache+7832. Unfortunately, we didn’t have one to hand, so Derrick improvised. By disassembling the current kernel module, we can see exactly what code occurs at that location…

gdb --arch=i386 /var/db/openafs/etc/afs.kext/Contents/MacOS/afs
(gdb) disassemble afs_GetDCache
[ ... ]
0x00011500 <afs_GetDCache+7832>:        mov    0x64(%edx),%ebx
0x00011503 <afs_GetDCache+7835>:        call   0x0 <afs_atomlist_create>
0x00011508 <afs_GetDCache+7840>:        cmp    0x0,%eax
0x0001150e <afs_GetDCache+7846>:        je     0x1152c <afs_GetDCache+7876>
0x00011510 <afs_GetDCache+7848>:        movl   $0x8aa,0x8(%esp)
0x00011518 <afs_GetDCache+7856>:        movl   $0x8e438,0x4(%esp)
0x00011520 <afs_GetDCache+7864>:        movl   $0x8e080,(%esp)
0x00011527 <afs_GetDCache+7871>:        call   0x642f8 <osi_AssertFailK>
0x0001152c <afs_GetDCache+7876>:        movl   $0x0,0x0
0x00011536 <afs_GetDCache+7886>:        mov    0x0,%eax
0x0001153b <afs_GetDCache+7891>:        mov    %eax,(%esp)
0x0001153e <afs_GetDCache+7894>:        call   0x0 <afs_atomlist_create>
0x00011543 <afs_GetDCache+7899>:        mov    %ebx,0x4(%esp)
0x00011547 <afs_GetDCache+7903>:        mov    -0xb8(%ebp),%ecx
0x0001154d <afs_GetDCache+7909>:        mov    %ecx,(%esp)
0x00011550 <afs_GetDCache+7912>:        call   0x4df78 <rx_EndCall>
[ ... ]

(The call 0x0 entries appear because we’re debugging a module that isn’t loaded. You can get a dump that fills in the blanks for these by doing the following. When kextutil asks for the load address, give it the one that the panic log said the OpenAFS module was loaded at, in this case 0x4606c000:

mkdir /tmp/symbols
kextutil -n -s /tmp/symbols /var/db/openafs/etc/afs.kext
cp -R /var/db/openafs/etc/afs.kext /tmp/symbols
gdb --arch=i386 /mach_kernel
(gdb) add-kext /tmp/symbols/afs.kext
(gdb) disassemble afs_GetDCache

)
We’re relatively lucky here, as near the panic location is a call to rx_EndCall(), the 2nd of 3 that occur in afs_GetDCache. Lining this up with the corresponding source code, the last call to rx_EndCall occurs at line 2219 of afs_dcache.c, which looks something like:

if (length > size) {
    /* The fileserver told us it is going to send more data
     * than we requested. It shouldn't do that, and
     * accepting that much data can make us take up more
     * cache space than we're supposed to, so error. */
    code = rx_Error(tcall);
    RX_AFS_GUNLOCK();
    code1 = rx_EndCall(tcall, code);
    RX_AFS_GLOCK();
    tcall = (struct rx_call *)0;
    code = EIO;
}

So, assuming that the compiler hasn’t reordered our code, we’ve got rx_Error(), followed by RX_AFS_GUNLOCK(). It isn’t possible for the compiler to reorder over a lock, so it’s pretty likely that the ordering is as shown. Both of these are macros masquerading as functions. Looking at the simpler of the two, rx_Error(), we have:

#define rx_Error(call)                  ((call)->error)

Now, if you recall, we panic’d because we were looking up something that has an offset of 0x64 from the base of the structure. We can verify that ‘error’ is stored 0x64 bytes into the rx_call structure by visually examining the structure definition or, if you’ve got a build tree from the same architecture, by running:

gdb src/libafs/MODLOAD/libafs.nonfs.o
(gdb) print &((struct rx_call*)0)->error
$2 = (afs_int32 *) 0x64

All of this points pretty conclusively at a NULL value for ‘tcall’. Examining the code, there is a situation where that can occur, if we get a bogus length from the network, and as a result double free the rx call. It seems highly likely that this is the bug.

December 22, 2009

Building OpenAFS on OpenSolaris

Filed under: Uncategorized — sxw @ 9:21 pm

I had a few spare moments, and an OpenAFS bug report about building on OpenSolaris seemed like it would be relatively easy to fix. So, I decided to bring up an OpenSolaris VM and go about fixing it. This is a record of all of the steps from bare metal to building a version of OpenAFS on Solaris.

The bug report was against the newest OpenSolaris development snapshot (snv_129), so that’s what I aimed to install. Much of the trauma below is because I was aiming for this, rather than the last release, 2009.06.

  • Install a VM with 2009.06. This is straightforward – boot off the image, and select the install icon on the desktop
  • Upgrade the VM to snv_129. I used the instructions at http://pkg.opensolaris.org/dev/en/index.shtml
    which boil down to:

    pfexec pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
    pfexec pkg image-update
  • Marvel as the machine now panics on boot.
  • Googling produced http://mail.opensolaris.org/pipermail/opensolaris-announce/2009-December/001343.html (to be fair, it is the release announcement), and the link to http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6820576. Adding -B disable-pcieb=true to the kernel line in grub is sufficient to get the machine up and running.
  • Now, the machine will actually boot, but attempts to log in remotely are doomed with errors like
    sshd[8714]: error: session_pty_req: session 0 alloc failed

    Fortunately, http://defect.opensolaris.org/bz/show_bug.cgi?id=12380 contains both the cause, and a workaround – once you’ve managed to log in as root on console, just

    chmod 666 /dev/ptmx
  • Now that we’ve finally got a machine that works, make sure to add -B disable-pcieb=true to the kernel lines of the grub configuration in /rpool/boot/grub/menu.lst, otherwise it will just panic when you reboot it.
  • OpenSolaris ships without a development environment, so first, let’s get one of those
    pfexec pkg install ss-dev
  • Current OpenAFS sources are in ‘git’, so get the git command suite
    pfexec pkg install SUNWgit

You should then be able to download (with git clone) and build the current OpenAFS ‘master’ branch. At least you will be able to once the bug fixes that were the reason for embarking on this waste of time get into the tree.
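
For completeness, the final steps look something like this (the repository URL is from memory, and the build steps assume a standard checkout built with the usual autotools sequence, so treat this as a sketch rather than gospel):

git clone git://git.openafs.org/openafs.git
cd openafs
./regen.sh
./configure
make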

July 26, 2009

Making the OpenAFS client faster

Filed under: Uncategorized — sxw @ 1:41 pm

During the course of a project here it became apparent that the Linux OpenAFS cache manager is slow when performing reads from the local disk. In this case, all of the data is already on the local disk, and the cache manager knows that the data is up to date. Naively, you would imagine that reading this data would take roughly the same time as if you were reading directly from the cache filesystem. However, that is not the case – in fact, reads appear to be more than twice as slow when fetched through the AFS cache manager, as compared to fetching the equivalent files from the local disk.

I’ve implemented modifications to the cache manager which attempt to reduce this speed deficit. These modifications can be broadly split into 5 sections.

Remove crref() calls

Pretty much every call into the OpenAFS VFS does a crref(), to get a reference to the user’s current credentials, despite the fact that this information isn’t always required. crref is relatively expensive – it acquires a number of locks in order to perform its copies, and can be a cause of unnecessary serialisation. By only calling crref when required, we can gain a small, but measurable, performance increase.

Reduce the code path leading to a cache hit

In readpages, we perform a lot of setup operations before we discover whether the data we’re interested in is cached or not. By making the cached case the fast path, we can gain a performance increase for cache hits, without causing a noticeable degradation for cache misses.

Remove abstraction layers, and use native interfaces

The code currently uses operating system independent abstraction layers to perform the reads from the disk cache. These don’t know anything about the way in which Linux organises its virtual memory, and do a significant amount of extra, unnecessary work. For example, we use the ‘read’ system call to read in the data, rather than the significantly faster readpages(). As we’re being invoked through the AFS module’s readpages() entry point, we can guarantee that we’re going to be fetching a page off disk. Read() also gets called from a user, rather than kernel, memory context, adding to the overhead.

Implement readahead

The Linux Cache Manager currently has no support for readpages(), instead requiring the VFS layer request each page independently with readpage(). This not only means that we can’t take advantage of cache locality, it also means that we have no support for readahead. Doing readahead is important, because it means that we can get data from the disk into the page cache whilst the application is performing other tasks. It can dramatically increase our throughput, particularly where we are serving data out to other clients, or copying it to other locations. Implementing readpages() on its own gives a small speed improvement, although blocking the client until the readpages completes kind of defeats the point, and leads to sluggish interactive performance!

Make readahead copies occur in the background

The next trick, then, is to make the readahead occur in the background. By having a background kernel thread which waits until each page of data is read from the cache, and then handles copying it over into the corresponding AFS page, the business of reading and copying data from the cache can be hidden from the user.

Conclusions

This set of changes actually makes a significant improvement to cache read speed. In simple tests where the contents of the cache are copied to /dev/null, the new cache manager is around 55% faster than the old one. Tests using Apache to serve data from AFS show significant (but slightly less dramatic, due to other overheads) performance improvements.

Sadly, the Linux Memory Management architecture means that we’re never going to obtain speeds equivalent to using the native filesystem directly. The architecture requires that a page of memory must be associated with a single filesystem. So, we end up reading a page from the disk cache, copying that page into the AFS page, and returning the AFS page to the user. Ideally, we’d be able to dispense with this copy and read directly into the AFS page by switching the page mappings once the read was complete. However, this isn’t currently an option, and the performance benefits obtained through the current approach are still significant.

May 26, 2009

Converting OpenAFS to git

Filed under: Uncategorized — sxw @ 12:46 pm

For a while now, there have been plans afoot to convert OpenAFS’s CVS repository to git. A number of attempts have been made, which have all stalled due to the complexity of the underlying problem, and issues with the existing tools. Previously, it was felt that the main hurdle to a successful conversion was OpenAFS’s use of ‘deltas’ to provide a changeset style interface on top of CVS. A delta is a collection of related changes, grouped using a comment in the CVS revision log. However, unlike a real changeset, there is no requirement that a delta’s changes be contiguous. A file may be modified by delta A, then by delta B, and then modified by delta A again. This makes it impossible to properly represent all deltas as single changesets. In addition, abuse of deltas within OpenAFS has caused some to span branch or tag points, again making it impossible to represent those deltas as a changeset without destroying the repository history. For many months now, people have been trying to produce conversion tools that achieve as close to a 1 to 1 correspondence between deltas and changesets as is possible, just leaving the troublesome cases as multiple commits.

Frustrated with the lack of progress of this approach, I decided to do a quick and dirty conversion, with the view to getting something completed by the start of coding for this year’s Summer of Code (which I’ve missed) and the yearly Best Practices Conference (which I might just make). I decided to not concern myself with merging deltas at all, but instead use cvsps and the existing git-cvsimport tool to produce a tree where the branch heads and all tag points matched, and which retained enough information to reconstruct deltas without forcing them to be single changesets. In order to be able to perform simple manipulations, I decided to create a perl script which would post-process the cvsps output before feeding it to git. I also created a tool which would check out every branch and tag from cvs, and compare them to the corresponding item in git, and report on any errors. Pretty straightforwards, I thought …

Unfortunately, I rapidly discovered that cvsps had significant problems with the OpenAFS repository. Many tags in CVS were simply not in the cvsps output, other tags (both those marked as FUNKY and INVALID, and those not) were in the wrong place and branchpoints were being incorrectly determined. Rather than get into cvsps’s internals, I ended up extending my post processing script to deal with these errors. It now performs a number of tasks:

Reordering inverted patchsets Some of cvsps’s output gets the patchset ordering wrong, such that a patchset that does fileA:1.2->1.3 comes before fileA:1.1->1.2. The script scans through all of the patchsets for this problem and swaps any that it finds.

Tag point determination Using the output from CVS’s rls command, it is possible to get the revision numbers of every file in a given tag. With this information, the set of patchsets from cvsps can be walked in order to identify the first patchset to satisfy every revision contained within the tag. Unfortunately, cvsps patchsets aren’t correctly ordered, so this process also works out how to reorder the patch sets such that no patchsets with file revisions higher than those in the tag occur before the tag point. This reordering is carefully performed in order to not break any tag or branch points which we have already placed! In addition, cvsps sometimes merges commits which occur over a tag point, so we also need to split patchsets which contain both files with revisions before the tag point, and files with revisions after it.

Branch point determination The cvsps output incorrectly places many of OpenAFS’s branchpoints. Fortunately, many of these were marked by tags at the time they were created, and a hardcoded list of these is used to place some branch points in the correct position. For branches that don’t have a corresponding tag, a brute force approach is used. By examining all of the patchsets on the branch, it’s possible to determine the starting revision number of every file that’s modified on that branch – combining this with the contents of the branch head tag from cvs rls gives the equivalent of a tag marking the branchpoint. This can then be processed by the tag point algorithm to work out the correct position in which to place the branch point. This gives the patchset that the branch occurs after, rather than cvsps’s “Ancestor branch” field, which gives the first patchset on the branch. Ancestor branch is fundamentally flawed, as it doesn’t allow changes to occur on HEAD after a branchpoint is created, and before the first patch on that branch. As part of this process, git-cvsimport was modified to understand a new ‘post-patchset’ branch syntax

Unknown branch determination cvsps fails to place some patchsets on the correct branch. By comparing the revision numbers of files in these patchsets with those elsewhere in the list, the correct branch for all of these can be determined (this needs to be done in order that we can work out tag points, as well as being necessary for repository consistency)

We also clean up the output to deal with problems of our own making:

Delta naming Whilst there is a style of delta name where the branch is given first, a number of deltas don’t conform to this style, and have the same name across multiple branches. All deltas are renamed such that they are something like STABLE14-delta-name-yyyy-mm-dd.

Mistagged files In some places, tags have been incorrectly applied such that files on HEAD are tagged as part of a branch tag. The script contains manual overrides to fix these to tag a revision on the correct branch.

Finally, because having done all of the above I had a pretty good toolset for dealing with patchsets, I implemented support for merging deltas. This merges all bar 100 or so, out of 15,000 deltas, into single patchsets. The remaining items are comprised of deltas which span tag or branch points (and which can never be merged) and deltas which contain conflicting changes to a single file (which it might be possible to merge, but which would require manual intervention). These deltas are tagged in a set of git references at refs/deltas/branches/<branch>/<delta>. We separate them from tags so that git-tags doesn’t have to deal with over 10,000 tags, and split them into branches to avoid directory size limitations.

The resulting git tree isn’t a perfect replica of the CVS repository. It has a number of issues which it’s going to be really difficult to fix, and which probably aren’t earth shattering.

It contains additional files There are a number of places where a user has added additional directories to cvs. When another user has subsequently tagged their working directory for a release, they haven’t done a cvs update -d, and so these additional directories (and in a small number of cases, files) aren’t contained in the tag for that release. It’s impossible to create a patchset ordering which allows a git tag to not include these directories, so we end up with additional files in the git tag. I don’t think that this is a particular problem.

It’s missing a tag There is a tag in the CVS repository (BP-openafs-rxkad-krb5-lha) which is so broken that it’s impossible to construct a patch set ordering that matches it. It is just omitted from the resulting git repository.

One branch is bad The openafs-rxkad-krb5-lha branch was created by only branching certain files in the tree. This means that it’s impossible to create a git branch which mimics this one without creating a huge number of additional ‘pull-up’ patch sets. Whilst we include all of the changes that were made on this branch, the final branch state is very different from the one in CVS.

Some deltas are disjoint As discussed, some deltas cannot be merged into single patchsets. This is going to require a new wdelta style tool which understands how to merge these deltas.

