Simon's Musings

January 24, 2010

GSSAPI Key Exchange for OpenSSH 5.3p1

Filed under: Uncategorized — sxw @ 12:00 pm

Once again, far far later than I would have liked, I’ve produced a set of updated patches for OpenSSH 5.3p1. Compared to previous releases this one is pretty simple – a resolved merge conflict, and a few one line patches. However, I really need to get quicker at doing these – it’s 4 months since 5.3p1 appeared, and 5.4 will be just around the corner. The announcement email was:

From the better-late-than-never-department, I’m pleased to announce the availability of my GSSAPI Key Exchange patches for OpenSSH 5.3p1. This is a pretty minor maintenance release – it contains a couple of fixes to take into account changes to the underlying OpenSSH code, and a compilation fix for when GSSAPI isn’t required. Thanks to Colin Wilson and Jim Basney for their bug reports.

I’d like to thank the distributors who’ve been patiently waiting for me to get this done – sorry once again for the delay.

Whilst OpenSSH contains support for GSSAPI user authentication, this still relies upon SSH host keys to authenticate the server to the  user. For sites with a deployed Kerberos infrastructure this adds an additional, unnecessary, key management burden. GSSAPI key exchange allows the use of security mechanisms such as Kerberos to authenticate the server to the user, removing the need for trusted ssh host keys, and allowing the use of a single security architecture.

This patch adds support for the RFC4462 GSSAPI key exchange mechanisms to OpenSSH, along with adding some additional, generic, GSSAPI features. It implements:

*) gss-group1-sha1-*, gss-group14-sha1-* and gss-gex-sha1-* key exchange mechanisms. (#1242)
*) Support for the null host key type (#1242)
*) Support for CCAPI credentials caches on Mac OS X (#1245)
*) Support for better error handling when an authentication exchange fails due to server misconfiguration (#1244)
*) Support for GSSAPI connections to hosts behind a round-robin load balancer (#1008)
*) Support for GSSAPI connections to multi-homed hosts, where each interface has a unique name (#928)
*) Support for cascading credentials renewal
*) Support for the GSSAPIClientIdentity option, to allow the user to select which client identity to use when authenticating to a server.

( bug numbers are in brackets)

As usual, the code is available from

Two patches are available, one containing cascading credentials support, and one without. In addition, the quilt patch series that makes up this release is also provided, for those who wish to pick and choose!



January 17, 2010

Debugging a Mac OS X Kernel Panic

Filed under: Uncategorized — sxw @ 12:40 pm
Tags: ,

Adam Megacz posted to the openafs-info mailing list a kernel panic that he encountered with the 1.4.12 release candidate. Derrick and I debugged this over Jabber (with Derrick doing the kernel disassembly). I thought it might be useful to share the methods we used, and to record the command invocations for future googling.

The Problem

A Mac OS X kernel panic. Adam first posted the panic itself, then followed up with the results of running OpenAFS’s decode-panic tool. decode-panic basically drives gdb to turn an panic log with addresses into a readable backtrace with symbols – it needs to be run with the same kernel, and architecture, that created the panic in the first place.

The Analysis

The output from decode-panic looks something like:

Panic Date:      Interval Since Last Panic Report:  472905 sec
Kernel Version:  Darwin Kernel Version 10.2.0: Tue Nov  3 10:37:10 PST
2009; root:xnu-1486.2.11~1/RELEASE_I386
OpenAFS Version: org.openafs.filesystems.afs(1.4.12fc1)
add symbol table from file "/tmp/afsdebugLAjeJl/org.openafs.filesystems.afs.sym"? 
0x21b2bd <panic+445>:                   mov    0x8011d0,%eax
0x2a7ac2 <kernel_trap+1530>:            jmp    0x2a7ade <kernel_trap+1558>
0x29d968 <lo_alltraps+712>:             mov    %edi,%esp
0x4607e500 <afs_GetDCache+7832>:   mov    0x64(%edx),%ebx
0x46078a18 <BPrefetch+144>:             mov    %eax,-0x3c(%ebp)
0x4607928d <afs_BackgroundDaemon+573>:  jmp    0x460792cb <afs_BackgroundDaemon+635>
0x460e76a7 <afsd_thread+719>:           call   0x2a013e <current_thread>
0x29d68c <call_continuation+28>:        add    $0x10,%esp

The line in bold is the point where the panic actually occurred (all of the stuff before it is related to the trap mechanism). The lines afterwards tell us how we arrived in afs_GetDCache – for this particular problem, we’re not concerned about the callers, so we can ignore these.

If we look at  the raw panic log, we see:

CR0: 0x8001003b, CR2: 0x00000064, CR3: 0x00101000, CR4: 0x000006e0
EAX: 0x00100000, EBX: 0x00000000, ECX: 0x460870e2, EDX: 0x00000000
CR2: 0x00000064, EBP: 0x34cabf1c, ESI: 0x0bff4004, EDI: 0xffffffff
EFL: 0x00010297, EIP: 0x4607e500, CS:  0x00000004, DS:  0x0000000c

So, at the time of the panic, EDX was probably NULL. This implies that the bug is a NULL pointer dereference of some kind, and that we’re looking up something that’s 0x64 bytes into the structure that’s pointed at by EDX.

If we had a kernel with debugging symbols, then we could look directly at the C source which corresponds to afs_GetDCache+7832. Unfortunately, we didn’t have one to hand, so Derrick improvised. By disassembling the current kernel module, we can see exactly what code occurs at that location…

gdb --arch=i386 /var/db/openafs/etc/afs.kext/Contents/MacOS/afs
(gdb) disassemble afs_GetDCache
[ ... ]
0x00011500 <afs_GetDCache+7832>:        mov    0x64(%edx),%ebx
0x00011503 <afs_GetDCache+7835>:        call   0x0 <afs_atomlist_create>
0x00011508 <afs_GetDCache+7840>:        cmp    0x0,%eax
0x0001150e <afs_GetDCache+7846>:        je     0x1152c <afs_GetDCache+7876>
0x00011510 <afs_GetDCache+7848>:        movl   $0x8aa,0x8(%esp)
0x00011518 <afs_GetDCache+7856>:        movl   $0x8e438,0x4(%esp)
0x00011520 <afs_GetDCache+7864>:        movl   $0x8e080,(%esp)
0x00011527 <afs_GetDCache+7871>:        call   0x642f8 <osi_AssertFailK>
0x0001152c <afs_GetDCache+7876>:        movl   $0x0,0x0
0x00011536 <afs_GetDCache+7886>:        mov    0x0,%eax
0x0001153b <afs_GetDCache+7891>:        mov    %eax,(%esp)
0x0001153e <afs_GetDCache+7894>:        call   0x0 <afs_atomlist_create>
0x00011543 <afs_GetDCache+7899>:        mov    %ebx,0x4(%esp)
0x00011547 <afs_GetDCache+7903>:        mov    -0xb8(%ebp),%ecx
0x0001154d <afs_GetDCache+7909>:        mov    %ecx,(%esp)
0x00011550 <afs_GetDCache+7912>:        call   0x4df78 <rx_EndCall>
[ ... ]

(The call 0x0 is because we’re debugging a module that isn’t loaded. You can get a dump that fills in the blanks for these by doing the following. When kextutil asks for the load address, give it the one that the panic log said the OpenAFS module was loaded at, in this case 0x4606c000:

mkdir /tmp/symbols
kextutil -n -s /tmp/symbols /var/db/openafs/etc/afs.kext
cp -R /var/db/openafs/etc/afs.kext /tmp/symbols
gdb –arch=i386 /mach_kernel
(gdb) add-kext /tmp/symbols/afs.kext
(gdb) disassemble afs_GetDCache

We’re relatively lucky here, as near the panic location is a call to rx_EndCall(), the 2nd of 3 that occur in that GetDCache. Lining this up with the corresponding source code, the last call to rx_EndCall occurs at line 2219 of afs_dcache.c, which looks something like:

 if (length > size) {
 /* The fileserver told us it is going to send more data
  * than we requested. It shouldn't do that, and
  * accepting that much data can make us take up more
  * cache space than we're supposed to, so error. */
 code = rx_Error(tcall);
 code1 = rx_EndCall(tcall, code);
 tcall = (struct rx_call *)0;
 code = EIO;

So, assuming that the compiler hasn’t reordered our code, we’ve got rx_Error(), followed by RX_GUNLOCK. It isn’t possible for the compiler to reorder over a lock, so it’s pretty likely that the ordering is as shown. Both of these are macros masquerading as functions. Looking at the simpler, rx_Error(), we have:

#define rx_Error(call)                  ((call)->error)

Now, if you recall, we panic’d because we were looking up something that has an offset of 0x64 from the base of the structure. We can verify that ‘error’ is stored 0x64 bytes into the rx_call structure by visually examining the structure definition or, if you’ve got a build tree from the same architecture, by running:

gdb src/libafs/MODLOAD/libafs.nonfs.o
(gdb) print &((struct rx_call*)0)->error
$2 = (afs_int32 *) 0x64

All of this points pretty conclusively at a NULL value for ‘tcall’. Examining the code, there is a situation where that can occur, if we get a bogus length from the network, and as a result double free the rx call. It seems highly likely that this is the bug.

Theme: Rubric.