Simon's Musings

January 17, 2010

Debugging a Mac OS X Kernel Panic

Filed under: Uncategorized — sxw @ 12:40 pm
Tags: ,

Adam Megacz posted to the openafs-info mailing list a kernel panic that he encountered with the 1.4.12 release candidate. Derrick and I debugged this over Jabber (with Derrick doing the kernel disassembly). I thought it might be useful to share the methods we used, and to record the command invocations for future googling.

The Problem

A Mac OS X kernel panic. Adam first posted the panic itself, then followed up with the results of running OpenAFS’s decode-panic tool. decode-panic basically drives gdb to turn an panic log with addresses into a readable backtrace with symbols – it needs to be run with the same kernel, and architecture, that created the panic in the first place.

The Analysis

The output from decode-panic looks something like:

Panic Date:      Interval Since Last Panic Report:  472905 sec
Kernel Version:  Darwin Kernel Version 10.2.0: Tue Nov  3 10:37:10 PST
2009; root:xnu-1486.2.11~1/RELEASE_I386
OpenAFS Version: org.openafs.filesystems.afs(1.4.12fc1)
=============
add symbol table from file "/tmp/afsdebugLAjeJl/org.openafs.filesystems.afs.sym"? 
0x21b2bd <panic+445>:                   mov    0x8011d0,%eax
0x2a7ac2 <kernel_trap+1530>:            jmp    0x2a7ade <kernel_trap+1558>
0x29d968 <lo_alltraps+712>:             mov    %edi,%esp
0x4607e500 <afs_GetDCache+7832>:   mov    0x64(%edx),%ebx
0x46078a18 <BPrefetch+144>:             mov    %eax,-0x3c(%ebp)
0x4607928d <afs_BackgroundDaemon+573>:  jmp    0x460792cb <afs_BackgroundDaemon+635>
0x460e76a7 <afsd_thread+719>:           call   0x2a013e <current_thread>
0x29d68c <call_continuation+28>:        add    $0x10,%esp

The line in bold is the point where the panic actually occurred (all of the stuff before it is related to the trap mechanism). The lines afterwards tell us how we arrived in afs_GetDCache – for this particular problem, we’re not concerned about the callers, so we can ignore these.

If we look at  the raw panic log, we see:

CR0: 0x8001003b, CR2: 0x00000064, CR3: 0x00101000, CR4: 0x000006e0
EAX: 0x00100000, EBX: 0x00000000, ECX: 0x460870e2, EDX: 0x00000000
CR2: 0x00000064, EBP: 0x34cabf1c, ESI: 0x0bff4004, EDI: 0xffffffff
EFL: 0x00010297, EIP: 0x4607e500, CS:  0x00000004, DS:  0x0000000c

So, at the time of the panic, EDX was probably NULL. This implies that the bug is a NULL pointer dereference of some kind, and that we’re looking up something that’s 0x64 bytes into the structure that’s pointed at by EDX.

If we had a kernel with debugging symbols, then we could look directly at the C source which corresponds to afs_GetDCache+7832. Unfortunately, we didn’t have one to hand, so Derrick improvised. By disassembling the current kernel module, we can see exactly what code occurs at that location…

gdb --arch=i386 /var/db/openafs/etc/afs.kext/Contents/MacOS/afs
(gdb) disassemble afs_GetDCache
[ ... ]
0x00011500 <afs_GetDCache+7832>:        mov    0x64(%edx),%ebx
0x00011503 <afs_GetDCache+7835>:        call   0x0 <afs_atomlist_create>
0x00011508 <afs_GetDCache+7840>:        cmp    0x0,%eax
0x0001150e <afs_GetDCache+7846>:        je     0x1152c <afs_GetDCache+7876>
0x00011510 <afs_GetDCache+7848>:        movl   $0x8aa,0x8(%esp)
0x00011518 <afs_GetDCache+7856>:        movl   $0x8e438,0x4(%esp)
0x00011520 <afs_GetDCache+7864>:        movl   $0x8e080,(%esp)
0x00011527 <afs_GetDCache+7871>:        call   0x642f8 <osi_AssertFailK>
0x0001152c <afs_GetDCache+7876>:        movl   $0x0,0x0
0x00011536 <afs_GetDCache+7886>:        mov    0x0,%eax
0x0001153b <afs_GetDCache+7891>:        mov    %eax,(%esp)
0x0001153e <afs_GetDCache+7894>:        call   0x0 <afs_atomlist_create>
0x00011543 <afs_GetDCache+7899>:        mov    %ebx,0x4(%esp)
0x00011547 <afs_GetDCache+7903>:        mov    -0xb8(%ebp),%ecx
0x0001154d <afs_GetDCache+7909>:        mov    %ecx,(%esp)
0x00011550 <afs_GetDCache+7912>:        call   0x4df78 <rx_EndCall>
[ ... ]

(The call 0x0 is because we’re debugging a module that isn’t loaded. You can get a dump that fills in the blanks for these by doing the following. When kextutil asks for the load address, give it the one that the panic log said the OpenAFS module was loaded at, in this case 0x4606c000:

mkdir /tmp/symbols
kextutil -n -s /tmp/symbols /var/db/openafs/etc/afs.kext
cp -R /var/db/openafs/etc/afs.kext /tmp/symbols
gdb –arch=i386 /mach_kernel
(gdb) add-kext /tmp/symbols/afs.kext
(gdb) disassemble afs_GetDCache

)
We’re relatively lucky here, as near the panic location is a call to rx_EndCall(), the 2nd of 3 that occur in that GetDCache. Lining this up with the corresponding source code, the last call to rx_EndCall occurs at line 2219 of afs_dcache.c, which looks something like:

 if (length > size) {
 /* The fileserver told us it is going to send more data
  * than we requested. It shouldn't do that, and
  * accepting that much data can make us take up more
  * cache space than we're supposed to, so error. */
 code = rx_Error(tcall);
 RX_AFS_GUNLOCK();
 code1 = rx_EndCall(tcall, code);
 RX_AFS_GLOCK();
 tcall = (struct rx_call *)0;
 code = EIO;
 }

So, assuming that the compiler hasn’t reordered our code, we’ve got rx_Error(), followed by RX_GUNLOCK. It isn’t possible for the compiler to reorder over a lock, so it’s pretty likely that the ordering is as shown. Both of these are macros masquerading as functions. Looking at the simpler, rx_Error(), we have:

#define rx_Error(call)                  ((call)->error)

Now, if you recall, we panic’d because we were looking up something that has an offset of 0x64 from the base of the structure. We can verify that ‘error’ is stored 0x64 bytes into the rx_call structure by visually examining the structure definition or, if you’ve got a build tree from the same architecture, by running:

gdb src/libafs/MODLOAD/libafs.nonfs.o
(gdb) print &((struct rx_call*)0)->error
$2 = (afs_int32 *) 0x64

All of this points pretty conclusively at a NULL value for ‘tcall’. Examining the code, there is a situation where that can occur, if we get a bogus length from the network, and as a result double free the rx call. It seems highly likely that this is the bug.

Theme: Rubric.