Disclaimer: This is a technical post and not for the faint of heart. You have been warned.
I thought I’d share an interesting debugging session I had a while ago. Personally, I like debugging war stories. They are stories about hard problems solved by talented engineers. Any Solaris kernel engineer will tell you that we love hard problems – a necessary desire for anyone doing operating systems work. Every software discipline has its share of hard problems, but none approach the sheer volume, complexity, and variety of problems we encounter while working on an OS like Solaris.
This was one of those “head scratching” bugs that really had me baffled for a while. While by no means the most difficult bug I have seen (a demographic populated by memory corruption, subtle transient failures, and esoteric hardware interactions), it was certainly not intuitive. What made this bug really difficult was that a simple programming mistake exposed a longstanding and very subtle race condition with timing on multiprocessor x86 machines.
Of course, I didn’t know this at the time. All I knew was that a few of our amd64 build machines were randomly hanging. The system would be up and running fine for a while, and then *poof*: no more keystrokes, no more ping. With KMDB, we just sent a break to the console and took a crash dump. As you’ll see, this problem would have been virtually impossible to solve without a crash dump, due to its unpredictable nature. I loaded mdb on one of these crash dumps to see what was going on:
> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fec1f1a0 1f 16 0 0 yes no t-300013 9979c340 make.bin
1 80fdc080 1f 46 0 164 yes no t-320107 813adde0 sched
2 fec22ca8 1b 19 0 99 no no t-310905 80aa3de0 sched
3 80fce440 1f 20 0 60 yes no t-316826 812f79a0 fsflush
We have threads stuck spinning on all CPUs (as evidenced by the long time since last switch). Each thread is stuck waiting for a dispatch lock, except for one:
> 914d4de0::findstack -v
914d4a84 0x2182(914d4de0, 7)
914d4ac0 turnstile_block+0x1b9(a4cc3170, 0, 92f32b40, fec02738, 0, 0)
914d4b0c mutex_vector_enter+0x328()
914d4b2c ipcl_classify_v4+0x30c(a622c440, 6, 14, 0)
914d4b68 ip_tcp_input+0x756(a622c440, 922c80f0, 9178f754, 0, 918cc328, a622c440)
914d4bb8 ip_rput+0x623(918ff350, a622c440)
914d4bf0 putnext+0x2a0(918ff350, a622c440)
914d4d3c gld_recv_tagged+0x108()
914d4d50 gld_recv+0x10(9189d000, a622c440)
914d4d64 bge_passup_chain+0x40(9189d000, a622c440)
914d4d80 bge_receive+0x60(9179c000, 917cf800)
914d4da8 bge_gld_intr+0x10d(9189d000)
914d4db8 gld_intr+0x24(9189d000)
9e90dd64 cas64+0x1a(b143a8a0, 3)
9e90de78 trap+0x101b(9e90de8c, 8ff904bc, 1)
9e90de8c kcopy+0x4a(8ff904bc, 9e90df14, 3)
9e90df00 copyin_nowatch+0x27(8ff904bc, 9e90df14, 3)
9e90df18 instr_is_prefetch+0x15(8ff904bc)
9e90df98 trap+0x6b2(9e90dfac, 8ff904bc, 1)
9e90dfac 0x8ff904bc()
Strangely, we’re stuck in new_mstate(); the return address on the stack, turnstile_block+0x1b9, sits just past the call into it:
> turnstile_block+0x1b9::dis
turnstile_block+0x1a2: adcl $0x0,%ecx
turnstile_block+0x1a5: movl %eax,0x3bc(%edx)
turnstile_block+0x1ab: movl %ecx,0x3c0(%edx)
turnstile_block+0x1b1: pushl $0x7
turnstile_block+0x1b3: pushl %esi
turnstile_block+0x1b4: call -0x93dae
turnstile_block+0x1b9: addl $0x8,%esp
turnstile_block+0x1bc: cmpl $0x6,(%ebx)
turnstile_block+0x1bf: jne +0x1a
turnstile_block+0x1c1: movl -0x20(%ebp),%eax
turnstile_block+0x1c4: movb $0x1,0x27b(%eax)
turnstile_block+0x1cb: movb $0x0,0x27a(%eax)
turnstile_block+0x1d2: movl $0x1,0x40(%esi)
One of the small but important features in Solaris 10 is that microstate accounting is turned on by default. This means we record high-resolution timestamps on every state transition, rather than relying on clock-based process accounting. Unfortunately, I can’t post the source code to new_mstate() just yet (one more thing to look forward to with OpenSolaris: no secrets). But I can say that there is a do/while loop in which we spin waiting for the present to catch up to the past. The ia32 architecture was not designed with large MP systems in mind. In particular, the chips have a high resolution timestamp counter (tsc) with the particularly annoying property that the counters on different CPUs are not required to stay in sync. This means that a thread can read the tsc on one CPU, get bounced to another, and read a tsc value that appears to be in the past. This is very rare in practice, and we really should never go through this new_mstate() loop more than once. On these machines, we were looping for a very long time.
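To make that concrete, the loop has roughly this shape (a hand-waved sketch, not the actual new_mstate() source, and the local variable names are mine):

/*
 * Illustrative sketch only.  Grab a fresh unscaled timestamp; if it
 * appears to be earlier than the timestamp recorded when the previous
 * microstate began (possible when the thread has migrated between CPUs
 * whose tsc registers have drifted apart), keep re-reading until the
 * present catches up with the past.
 */
hrtime_t curtime;
hrtime_t oldtime = lwp->lwp_mstate.ms_state_start;

do {
	curtime = gethrtime_unscaled();
} while (curtime < oldtime);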
We read this tsc value from gethrtime_unscaled(). Slogging through the assembly for new_mstate(), we see that the result is stored as a 64-bit value in -0x4(%ebp) and -0x1c(%ebp):
new_mstate+0xe3: call -0x742d9
new_mstate+0xe8: movl %eax,-0x4(%ebp)
new_mstate+0xeb: movl %edx,-0x1c(%ebp)
Looking at our current stack, we are able to reconstruct the value:
> 0x914d4a84-4/X
0x914d4a80: d53367ab
> 0x914d4a84-1c/X
0x914d4a68: 2182
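This is the 32-bit kernel, so the 64-bit return value comes back in the %edx:%eax pair; those two stack slots are simply the high and low halves of the timestamp. Gluing them back together (illustrative C, values taken from the dump above):

uint64_t hi = 0x2182;          /* saved %edx at -0x1c(%ebp): high 32 bits */
uint64_t lo = 0xd53367ab;      /* saved %eax at -0x4(%ebp):  low 32 bits  */
uint64_t ts = (hi << 32) | lo; /* 0x2182d53367ab */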
This gives us the value 0x2182d53367ab. But our microstate accounting data tells us something entirely different:
> 914d4de0::print kthread_t t_lwp->lwp_mstate
{
t_lwp->lwp_mstate.ms_prev = 0x3
t_lwp->lwp_mstate.ms_start = 0x1b3c14bea4be
t_lwp->lwp_mstate.ms_term = 0
t_lwp->lwp_mstate.ms_state_start = 0x3678297e1b0a
t_lwp->lwp_mstate.ms_acct = [ 0, 0x1b3c14bf554d, 0, 0, 0, 0, 0, 0, 0x1052, 0x10ad ]
}
Now we see the problem. We’re trying to catch up to 0x3678297e1b0a clock ticks, but we’re only at 0x2182d53367ab. This means we’re going to stay in this loop for the next 23,043,913,724,767 clock ticks! I wouldn’t count on this routine returning any time soon, even on a 2.2GHz Opteron. Using a complicated MDB pipeline, we can print the starting microstate time for every thread in the system and sort the output to find the highest values:
> ::walk thread | ::print kthread_t t_lwp | ::grep ".!=0" | ::print klwp_t \
lwp_mstate.ms_state_start ! awk -Fx '{print $2}' | perl -e \
'while (<>) {printf "%s\n", hex $_}' | sort -n | tail
30031410702452
30031411732325
30031412153016
30031412976466
30031485108972
30031485108972
36845795578503
59889720105738
59889720105738
59889720105738
The three highest values all belong to the same thread (and hence the same LWP), and are clearly way out of line. This is where the head scratching begins. We have a thread which, at one point in time, decided to put approximately double the expected value into ms_state_start, and whose timestamps have since gone back to normal, leaving it waiting for a past which will (effectively) never arrive.
So I pore over the source code for gethrtime_unscaled(), which is really tsc_gethrtimeunscaled(). After searching for half an hour and chasing several red herrings, I finally discover that, through some botched indirection, we were actually calling tsc_gethrtime() from gethrtime_unscaled() by mistake. One returns the number of clock ticks (unscaled), while the other returns the number of nanoseconds (scaled, and more expensive to calculate). Immediately this reveals itself as the source of many other weird timing bugs we had been seeing with the amd64 kernel. Undoing this mistake brought much of our system back to normal, but I was worried that there was another bug lurking here. Even if we were getting nanoseconds instead of clock ticks, we still shouldn’t end up spinning in the kernel – something else was fishy.
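Before moving on to that, here is the net effect of the first bug (an illustrative sketch only; the real mistake was buried in a layer of indirection rather than a direct call like this):

hrtime_t
gethrtime_unscaled(void)
{
	/*
	 * In effect, the "unscaled" entry point was wired to the scaled
	 * routine: tsc_gethrtime() returns nanoseconds, while callers here
	 * expect raw tsc ticks from tsc_gethrtimeunscaled().
	 */
	return (tsc_gethrtime());
}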
The problem was very obviously a nasty race condition. Further mdb analysis, DTrace analysis, or custom instrumentation would be futile, and wouldn’t provide any more information than I already had. So I retreated to the source code, where things finally started to come together.
The implementation of tsc_gethrtime() is actually quite complicated. When CPUs are offlined/onlined, we have to do a little dance to calculate the amount of skew between the other CPUs. On top of this, we have to keep the CPUs from drifting too far from one another. We have a cyclic called once a second to record the ‘master’ tsc value for a single CPU, which is used to create a running baseline. Subsequent calls to tsc_gethrtime() use the delta from this baseline to create a unified view of time. Turns out there has been a subtle bug in this code for a long time, and only by calling gethrtime() on every microstate transition were we able to expose it.
In order to factor in power management, we have a check in tsc_gethrtime() to see if the present is less than the past, in which case we ignore the delta from the last master tsc reading. This doesn’t seem quite right. Because the tsc registers can drift between CPUs, there is the remote possibility that tsc_tick() records the master tsc on CPU 0 at nearly the same instant we read the tsc on CPU 1. If the CPUs have drifted by just the right amount, our new tsc value will appear to be in the past, even though it really isn’t. We then take the power-management path and end up nearly doubling the returned value, which is exactly the behavior we were seeing.
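I can’t paste the real routine, but the shape of the problem is roughly this (a simplified sketch; the names are illustrative, and the actual code also deals with CPU offlining, scaling to nanoseconds, and more):

hrtime_t tsc, hrt;

/*
 * Once a second, tsc_tick() records the master CPU's tsc in tsc_last
 * and folds the elapsed time into tsc_hrtime_base.  A later call to
 * tsc_gethrtime() then does, in essence:
 */
tsc = tsc_read();               /* tsc on whatever CPU we're running on */
if (tsc >= tsc_last) {
	tsc -= tsc_last;        /* normal case: small delta since last tick */
} else {
	/*
	 * Assumed to mean "power management reset the counter", so the
	 * delta is skipped and the raw value is used.  But if our CPU's
	 * tsc has drifted just barely behind the master's and we race
	 * with tsc_tick(), we land here with a full-sized tsc, and the
	 * sum below comes out roughly double the right answer.
	 */
}
hrt = tsc_hrtime_base + tsc;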
This is one of those “one in a million” race conditions. You have to be on an MP x86 system whose high resolution timestamp counters have drifted ever so slightly: too much, and a different algorithm kicks in. Then you have to call gethrtime() at almost exactly the same time as tsc_tick() is called from the cyclic subsystem (once a second). If you hit the window of a few clock cycles just right, you’ll get a single bogus value, and the next call to gethrtime() will return to normal. This isn’t fatal in most situations. The only reason we found it at all is that we were accidentally calling gethrtime() thousands of times per second, which turned a momentary glitch into a fatal microstate accounting failure.
Hopefully this has given you some insight into the debugging process that we kernel developers (and our support engineers) go through every day, as well as demonstrating the usefulness of mdb and post-mortem debugging. After six long hours of on-and-off debugging, I was finally able to nail this bug. With a one-line change, I was able to fix a glaring (but difficult to identify) bug in the amd64 kernel, while simultaneously avoiding a nasty race condition that had been hiding from us for a long time.