Eric Schrock's Blog

Month: September 2004

What am I up to these days?

Recently, I putback my last round of major bugfixes in my “traditional” area of expertise – procfs, libproc, mdb, etc. And since I’m not attached to any of the major S10 projects, I can actually pick and choose what to work on next. I have tons of projects I’d love to start, but I can’t justify new projects so late in the release cycle. There are simply too many other things that need to be fixed first. The downside of choice is that eventually, word gets out that you have copious amounts of free time with only two months of development left in the release. I’ve basically been put up for auction, except the bidding price is always the same – though the work itself becomes the reward.

I’ve been spending quality time with our bug database, as well as entertaining offers from potential suitors. I’ve been pinch hitting on the amd64 project, helping out with greenline (SMF), and most recently signed up to help design and implement a new subsystem for ZFS. I’m also getting to play with cool things like ZFS on amd64: we’re getting it up and running on some very cool (and super-secret) hardware that I unfortunately can’t talk about. I’m learning about parts of Solaris that I never thought I’d have a chance to work with. It’s all very exciting, but I will definitely be happy when S10 ships so I can get back to some personal projects that have been evolving in the depths of my mind. Stay tuned…

So, do I hear a hundred? How about two hundred?

So it’s been a while since my KMDB post, but I promised I would do some investigation into kernel debugging on the Linux side. Keep in mind that I have no Linux kernel experience. While I will try to be thorough in my research, there may be things I miss simply from lack of experience or a good test system. Feel free to comment on any errors or omissions.

We’ll try to solve the same problem that I approached with KMDB in the last post: a deadlock involving reader-writer locks. Linux has a choice of two debuggers, kdb and kgdb (though User Mode Linux presents interesting possibilities). In this post I’ll be taking a look at KDB.

Fire up KDB

Chances are you’re not running a Linux kernel with KDB installed. Some distros (like Debian) make it easier to download and apply the patch, but none seems to include it by default (admittedly, I didn’t do a very thorough search). This means you’ll have to go download the patch, apply it, tweak some kernel config variables (CONFIG_KDB and CONFIG_FRAME_POINTER), recompile/reinstall your kernel, and reboot. Hopefully you’ve done all this beforehand, because as soon as you reboot you’ve lost your bug (possibly forever: race conditions are fickle creatures). Assuming you were running a kdb-enabled kernel when you hit this bug, you then run:

# echo "1" > /proc/sys/kernel/kdb

And then press the ‘pause’ key on your keyboard. Alternatively, you can hook up a serial console, but I’ll opt for the easy way out.

Find our troubled thread

First, we need to find the pid of our offending process. The only way to do this is to use the 'ps' command to display all processes on the system, and then pick out (visually) which pid belongs to our ‘ps’ process. Once we have this information, we can then use 'btp <pid>' to get a stack trace.

Get the address of the rwlock

This step is very similar to the one we took when using kmdb. The stack trace produced by 'btp' includes frame pointers like kmdb’s $C. Looking back over my kmdb post, it wasn’t immediately clear where I got that magic starting number – it came from the frame pointer in the (verbose) stack trace. In any case, we use 'id <addr>' to disassemble the code around our call site. We then use 'mdr <addr+offset>' to examine the memory where the original value is saved. This gets much more interesting (painful) on amd64, where arguments are passed in registers and may not get pushed on the stack until several frames later.
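
To make the "frame pointer plus offset" arithmetic concrete, here is a small Python simulation of a 32-bit x86 stack built with frame pointers. It is only a sketch: the addresses, the `memory` dictionary, and the helper names are all invented for illustration, and it assumes the conventional cdecl layout in which the saved frame pointer sits at `fp`, the return address at `fp + 4`, and the first argument at `fp + 8`.

```python
# Toy model of a 32-bit x86 stack with frame pointers (cdecl layout assumed).
# All addresses and values below are made up for illustration.

WORD = 4  # bytes per stack slot on 32-bit x86

# address -> 32-bit word stored there
memory = {
    # innermost frame (e.g. the function blocked on the lock)
    0xC0FF1F00: 0xC0FF1F40,  # saved frame pointer of the caller
    0xC0FF1F04: 0xC0123456,  # return address into the caller
    0xC0FF1F08: 0xC8A40010,  # first argument: pointer to the rwlock
    # caller's frame
    0xC0FF1F40: 0x00000000,  # end of the frame chain
    0xC0FF1F44: 0xC0234567,  # return address
    0xC0FF1F48: 0x0000002A,  # first argument of the caller
}


def read_word(addr):
    """Rough stand-in for examining one word of memory in a debugger."""
    return memory[addr]


def first_argument(fp):
    """Return the first stack argument of the frame whose frame pointer is fp."""
    return read_word(fp + 2 * WORD)


def walk_frames(fp):
    """Yield (frame pointer, return address) pairs until the chain ends."""
    while fp != 0:
        yield fp, read_word(fp + WORD)
        fp = read_word(fp)


frames = list(walk_frames(0xC0FF1F00))
rwlock_addr = first_argument(0xC0FF1F00)
print(hex(rwlock_addr))
```

In a real session the frame pointer would come from the stack trace and the word would be read with the debugger's memory-examine command; the dictionary lookup here only mimics that step.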

Without a paddle?

At this point, the next step should be “Find who owns the reader lock.” But I can’t find any commands in the kdb manpages that would help us determine this. Without kmdb’s ::kgrep, we’re stuck searching for a needle in a haystack. Somewhere on this system, one or more threads have referenced this rwlock in the past. Our only course of action is to try 'bta', which will give us a stack trace of every single process on the system. With a deep understanding of the code, a great deal of persistence, and a little bit of luck, we may be able to pick out the offending stack just by sight. This quickly becomes impractical on large systems, not to mention difficult to verify and prone to error.
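
For comparison, the idea behind a ::kgrep-style search is easy to sketch. The Python below scans a byte buffer, standing in for kernel memory, for every aligned pointer-sized word equal to a target address. The buffer contents, base address, and function name are all hypothetical, and this is not a description of how kmdb actually implements ::kgrep; it only illustrates the brute-force technique.

```python
import struct


def find_pointer_refs(mem, base, target, word_size=4):
    """Return addresses of aligned little-endian words in mem equal to target."""
    fmt = "<I" if word_size == 4 else "<Q"
    hits = []
    for off in range(0, len(mem) - word_size + 1, word_size):
        (value,) = struct.unpack_from(fmt, mem, off)
        if value == target:
            hits.append(base + off)
    return hits


# Hypothetical chunk of memory starting at BASE; two slots hold the lock address.
BASE = 0xC0FF0000
RWLOCK = 0xC8A40010
words = [0, RWLOCK, 0xDEADBEEF, 0x2A, RWLOCK, 0]
mem = struct.pack("<6I", *words)

refs = find_pointer_refs(mem, BASE, RWLOCK)
print([hex(a) for a in refs])
```

Each hit is a location (a thread stack slot, a field in some structure) that holds a pointer to the lock, which narrows down which threads are worth inspecting instead of reading every stack by eye.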

With KDB we can do some basic debugging tasks, but it still relies on giant “leaps of faith” to correlate two pieces of seemingly disjoint data (two threads involved in a deadlock, for example). As a point of comparison, KDB provides 40 different commands, while KMDB provides 771 (356 dcmds and 415 walkers on my current desktop). Next week I’ll look at kgdb and see if it fills in any of these gaps.
