I’ve talked a lot in the past about kernel debugging, usually comparing the Solaris tools against what’s available in Linux and other Operating Systems. I thought that I’d walk through a concrete example of kernel debugging, first with our debugger KMDB (which replaced the venerable kadb in build 61 of Solaris 10), and then with the available Linux kernel debuggers (KDB and KGDB). I’ll also examine port-mortem analysis tools for Linux.
For today’s lesson, I’ve picked a problem where a port-mortem or live kernel debugger is absolutely necessary. Those of you who have done multithreaded programming in the past have most likely had to deal with a deadlock here or there. In your garden variety deadlock, one thread grabs lock A and then lock B, while another thread grabs lock B and then lock A. If they both get their first lock but not their second, they’ll both be blocked waiting for locks which will never be released. I’ve decided to spice it up a bit by looking at reader-writer locks. With an rwlock, you can have any number of readers, but only one writer. This makes problems much more difficult to debug. With a mutex, the current owner is stored as part of the data structure. But with a rwlock, the only indication you have is the number of active readers.
In our fantasy land, we have the following situation: commands are getting stuck in the kernel. Forget kill -9; if you’re stuck waiting on a kernel synchronization primitive (wihtout accepting interruption by signals), you’re in trouble. The timeline for our threads looks something like this:
thread 1 |
thread 2 |
rw_enter(rwlock, RW_READER)
<stuck somewhere>
|
rw_enter(rwlock, RW_WRITER)
<blocked>
|
Loading KMDB
Something’s ill on our system. We know that our ‘ps’ command is hanging for some reason. So we enter KMDB:
# mdb -K
Loaded modules: [ ufs unix krtld nca lofs genunix ip usba specfs nfs random sctp ]
[0]>
KMDB is ready to be loaded on every system running Solaris 10 (build 61 or later). You can also boot -k which will load kmdb in the background. If you do this, you can send a break (via a tip line or L1/F1-A) to get into KMDB.
Find our troubled thread
In this example, we know that ‘ps’ is hanging. Finding the troubled stack is easy:
[0]> ::pgrep ps | ::walk thread | ::findstack
rw_enter_sleep+0x144()
as_pageunlock+0x48()
default_physio+0x340()
pread+0x244()
syscall_trap+0x88()
Get the address of the rwlock
This takes a little bit of work because we don’t have the necessary debugging information in the kernel to restore arguments from arbitrary positions within a function. In this example we’re at rw_enter_sleep+0x144. Using ::findstack -v or $C we can get the frame pointer for each frame. At this point, we have to examine the disassembly around each call site to find out where our first parameter came from. We've been kicking around some ideas to do this automatically, but I'll owe Matt another nickel if I mention it out loud... At the moment, you need to do something similar to the following (on SPARC - x86 is similar):
[0] > as_pageunlock+48::dis
as_pageunlock+0x20: mov 3, %i4
as_pageunlock+0x24: call -0x2318
as_pageunlock+0x28: restore
as_pageunlock+0x2c: add %l5, 0x30, %l0
as_pageunlock+0x30: sethi %hi(0x1069400), %i5
as_pageunlock+0x34: sethi %hi(0x1069400), %i1
as_pageunlock+0x38: ldx [%i5 + 0x1e8], %l6
as_pageunlock+0x3c: mov %l0, %o0
as_pageunlock+0x40: ldx [%i1 + 0x1e0], %i0
as_pageunlock+0x44: add %i2, %i3, %i3
as_pageunlock+0x48: call -0x143754
as_pageunlock+0x4c: mov 1, %o1
as_pageunlock+0x50: mov %l5, %o0
as_pageunlock+0x54: mov %i2, %o1
as_pageunlock+0x58: call -0x2cdc
as_pageunlock+0x5c: clr %o2
as_pageunlock+0x60: add %i3, %i0, %l7
as_pageunlock+0x64: and %i2, %l6, %l3
as_pageunlock+0x68: ldx [%o0 + 0x38], %i2
as_pageunlock+0x6c: and %l7, %l6, %l4
as_pageunlock+0x70: mov %l3, %o1
[0]> 000002a101344ee1+7ff::print struct frame
{
fr_local = [ 0x3000eda5358, 0, 0x1, 0xfd1ff, 0x30018c3dec8, 0xb, 0x8, 0x1 ]
fr_arg = [ 0x3df8515a500, 0xb, 0x2, 0x300068202d8, 0x3000eda5358, 0x1 ]
fr_savfp = 0x2a101344f91
fr_savpc = 0x1186134
fr_argd = [ 0, 0x810, 0x102, 0, 0x3001503a1a0, 0x2a101345930 ]
fr_argx = [ 0x300021541d8 ]
}
[0]> 0x3000eda5358::rwlock
ADDR OWNER/COUNT FLAGS WAITERS
3000eda5358 READERS=1 B011 30018878f20 (R)
|| 30018c86000 (R)
WRITE_WANTED -------+| 30018c86300 (R)
HAS_WAITERS --------+ 30018c86600 (R)
30018c86900 (R)
30018c86f00 (R)
30018c87200 (R)
[0]>
Find who owns the reader lock
Here's another place where things get tough. Since we don't record readers as part of the rwlock_t (only the count), we can't just look at memory and see who owns this. We could just dump the stack of every thread and try to guess who's the owner, but on a large server this quickly becomes impractical. This is where you need ::kgrep, which will find references to a given pointer within the kernel. From there, we pipe it to ::whatis (or more recently, ::whattype), which can tell us in which stacks the value is referenced. This dramatically cuts down our list of choices. After some guess-and-check (but significantly less than looking at every thread in the system), we find the real culprit of the crime.
[0]> ::kgrep 30006820000 | ::whatis
2a100527730 is in thread 30013e04920's stack
2a100527770 is in thread 30013e04920's stack
2a100527780 is in thread 30013e04920's stack
2a100ecf730 is in thread 3000f962300's stack
2a100ecf770 is in thread 3000f962300's stack
2a100ecf780 is in thread 3000f962300's stack
2a101171730 is in thread 3001506db60's stack
2a101171770 is in thread 3001506db60's stack
[...]
[0]> 30013e05220::findstack
stack pointer for thread 30013e05220: 2a1011a09a1
[ 000002a1011a09a1 sema_p+0x140() ]
000002a1011a0a51 biowait+0x60()
000002a1011a0b01 ufs_getpage_miss+0x2f0()
000002a1011a0c01 ufs_getpage+0x6a8()
000002a1011a0d61 fop_getpage+0x48()
000002a1011a0e31 segvn_fault+0x834()
000002a1011a0fe1 as_fault+0x4a0()
000002a1011a10f1 pagefault+0xac()
000002a1011a11b1 trap+0xc14()
000002a1011a12f1 utl0+0x4c()
[0]>
Looking at this now, it makes me want a new MDB command, ::whatthread, which would find all threads with stacks containing a given pointer. With MDB's modular design and stable API, it's easy to connect ::kgrep and ::whatis into a self-contained command. But that's a RFE for another day.
Figure out what our culprit is doing
From here, you're pretty much own your own. In the case outlined about, we're stucking waiting for I/O while we have a read lock on the address space. The gory details are beyond a single blog post, but what's been covered is more than enough for a crash-course in kernel debugging. The best part about this bug is that you could never figure this out with custom instrumentation or 'classic' debugging techniques. This is a race condition of the worst kind - all the printf()s in the world wouldn't have helped you here. And if you need to reboot in order to load your custom kernel, you'll probably never see the problem again.
Post mortem debugging and Linux
On Solaris, the post-mortem case is exactly the same as the live KMDB case. Simply reboot -d (or initiate a sync from KMDB or the OK prompt if it's really hung) and then run mdb on the resulting crash dump. You can also examine the kernel while the system is still running with mdb -k, but some commands (such as ::kgrep) will fail because memory is actively changing.
In upcoming posts, I'll wade through documentation on linux kernel debuggers, trying to figure out how they would solve this same problem. Note that this could be "solved" by wading through the stack traces of every thread on the system, and picking out the troublemaker thread by eye. This requires intimate knowledge of the source code (all of it), is prone to error, and extremely tedious. It's painful on a simple desktop, and quickly becomes impossible for large servers with thousands of threads. Also, keep in mind this is just a demonstration of kernel debugging - the worst bugs are much more complicated.