Eric Schrock's Blog

Month: June 2004

In the Solaris group, we take a great deal of pride in our debugging tools. Every tool we design (truss, DTrace, MDB, KMDB, ptools, libumem) helps reduce time to root cause from days to hours or minutes, in addition to solving previously unsolvable problems. Over the past week or two I’ve had some conversations about the state of Linux debugging, especially kernel debugging. Since I haven’t had much Linux experience, I thought I’d do a little research on our competitor. All I can say is that the Linux community doesn’t seem to take the same pride in debugging tools that we do. In particular, I was shocked by the lack of post-mortem debugging tools available for the Linux kernel. While this may not be immediately useful for you developers out there, I find it interesting, and later I will dig into post-mortem debugging for user processes.

Post-mortem debugging begins with a ‘dump’. For processes, this is usually called a ‘core’ (most UNIX users have probably seen something like ‘segmentation fault – core dumped’ before). For kernels, it is called a ‘crash dump’. When something bad happens, the system ‘panics’ and desperately tries to stop the world and write out the current machine state while relying on as little kernel state as possible. The world of a panicking kernel is a very scary place; I won’t go into exactly how this works. Suffice it to say that after you reboot the system, you will end up with a file such as /var/crash/machine/{vmcore,unix}.11

These crash dumps are essential to us kernel engineers, and the failure of the Linux community to recognize this2 is shocking. The reason these dumps are so important is that they are often the only evidence we will ever have of the bug. In particular, we must be able to root-cause a problem from a single crash dump. If a production system panics for one of our Fortune 500 customers, we can’t exactly go up to them and say “Hmmm, that’s too bad. Can you run this instrumented kernel with these flags and make it happen again?” Especially since we’ve already cost our customer a lot of money3 by failing in the first place. We can try to reproduce it in development, but as Bryan points out, this can be an error-prone and often futile process. With post-mortem debugging, the customer simply sends us a crash dump and we do all our analysis on this file while their system continues to (hopefully) run happily in production.

This seemingly simple idea forms the foundation of Solaris debugging (really part of a trifecta now that DTrace and KMDB have arrived). On the Linux side, there is nothing built into the base kernel. When a Linux kernel ‘oops’ happens (their name for a panic), all you get to work with is a simple message, the current machine state, and a stack backtrace. As you will see, only the most trivial problems can be solved with this limited set of information. There is a patch to add this functionality, known as Linux Kernel Crash Dumps (LKCD). Surprising as it may seem, this patch has yet to make it into the base kernel. The politics around this are simply astounding. A good overview of why is covered here, with Linus’s initial rejection here. The reasons seem largely political, but there is a definite statement that this doesn’t belong in the base kernel – it should be a vendor-added “feature”. The last patch available seems to be for kernel version 2.5.44, so the project may be falling into disrepair. Without crash dumps, your only recourse is to try to reproduce the problem on an instrumented kernel. Even if you can reproduce it, this usually means inserting printf() statements: the lack of a coherent kernel debugger such as KMDB is a subject for a future post. My favorite quote is from this email, which states:

If it becomes apparent through empirical data that crash dumps are a useful tool, I’m sure that Linus will become far more amenable. Until then, lets let him handle all of his other work which needs to get done.

You need empirical evidence to prove that crash dumps are useful? Try looking at any of the tens of thousands of kernel bugs in our database. At the bottom of this post you’ll find a random bug that I analyzed not that long ago. Don’t worry if you don’t have a clue what it means; it may not be comprehensible to anyone outside of the Solaris kernel group, but it’s a nice show of MDB’s power on a real-world bug.

Note that the line that glazes over “a little ::kgrep | ::whattype magic” is actually really interesting. ::kgrep will search for references to an address anywhere in the kernel. See Bryan Cantrill’s AADEBUG paper for information on the gory details of ::typegraph and ::whattype. With their powers combined, I was able to find the lock address in the stack of the thread shown.

I’ve read a little bit of the LKCD documentation, and I’m not convinced that the bug below could have been diagnosed with their set of commands; it certainly would have been a much more painful task. If you’ve made it this far through my post, you are probably totally confused and/or screaming “but how can I benefit from this?” Well, MDB works just as well on user processes (and cores) as it does on kernel crash dumps. A single tool can debug kernel crash dumps, live kernels, process cores, and live processes. I look forward to exploring MDB in much more detail in future posts.

Bug excerpt:

A quick cursory glance shows the following:
> ::walk thread | ::findstack ! munges
88      ##################################  tp: 300072ebb40
72      ##################################  tp: 2a100017ce0
54      ##################################  tp: 3000ee13ca0
We have 54 threads waiting for an address space lock.  Not necessarily a
smoking gun, but a good place to start.  All of these threads are hung up on
the following rwlock:
> 0x3000eda5358::print rwlock_impl_t
rw_wwwh = 0xb
So someone has a reader lock and we're all stuck waiting.  Maybe we have a
recursive rw_enter(RW_READER) somewhere?  This rwlock belongs to the following
address space:
> 3000eda5328::print -a struct as a_lock
3000eda5358 a_lock._opaque = [ 0xb ]
> 3000eda5328::as2proc | ::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R   2673      1   2673   2673     30 0x00004000 0000030011f0e030
So who owns the current reader lock?  With a little ::kgrep | ::whattype magic
the owner appears to be:
> 30013e05220::findstack
stack pointer for thread 30013e05220: 2a1011a09a1
[ 000002a1011a09a1 sema_p+0x140() ]
000002a1011a0a51 biowait+0x60()
000002a1011a0b01 ufs_getpage_miss+0x2f0()
000002a1011a0c01 ufs_getpage+0x6a8()
000002a1011a0d61 fop_getpage+0x48()
000002a1011a0e31 segvn_fault+0x834()
000002a1011a0fe1 as_fault+0x4a0()
000002a1011a10f1 pagefault+0xac()
000002a1011a11b1 trap+0xc14()
000002a1011a12f1 utl0+0x4c()
In this thread, we've taken the address space lock in order to fault in a page,
but UFS is stuck in biowait().  Poking around, we see there are a bunch of
threads similarly stuck:
> ::walk thread | ::findstack ! grep biowait | wc -l
So now something's amiss in the I/O system, which has had a cascading effect on
the VM system, and all hell breaks loose.  The particular buf that we're
interested in is:
> 30006a5a048::print struct buf b_vp | ::vnode2path
So we're doing some I/O to one of the disks (that's a shocker). There are several
more suspicious stacks in the dump with traces like:
> 2a1009d9ce0::findstack
stack pointer for thread 2a1009d9ce0: 2a1009d85c1
000002a1009d8671 swtch+0x78()
000002a1009d8721 turnstile_block+0x5f8()
000002a1009d87d1 mutex_vector_enter+0x3dc()
000002a1009d8881 aio_done+0x1c4()
000002a1009d8951 volkiodone+0x364()
000002a1009d8ad1 volsiodone+0x544()
000002a1009d8bb1 vol_subdisksio_done+0x68()
000002a1009d8c71 volkcontext_process+0x29c()
000002a1009d8d41 voldiskiodone+0x368()
000002a1009d8e41 ssd_return_command+0x198()
000002a1009d8ef1 ssdintr+0x224()
000002a1009d8fa1 qlc_fast_fcp_post+0x188()
000002a1009d9051 qlc_isr+0x378()
000002a1009d9151 qlc_intr+0xd0()
000002a1009d9221 pci_intr_wrapper+0x9c()
000002a1009d92d1 intr_thread+0x130()
000002a100969171 disp_getwork+0xe8()
This thread is blocked on the following mutex:
> 000002a1009d8881+7ff::print struct frame fr_arg[4]
fr_arg[4] = 0x3001967c060
Which is the aio_mutex for this aio structure:
> 0x3001967c000::print -a aio_t aio_mutex
3001967c060 aio_mutex._opaque = [ 0x30018878c21 ]
The owner of this mutex is here:
> 0x30018878c20::findstack
stack pointer for thread 30018878c20: 2a10175cd51
[ 000002a10175cd51 turnstile_block+0x5f8() ]
000002a10175ce01 rw_enter_sleep+0x144()
000002a10175ceb1 as_pageunlock+0x48()
000002a10175cf71 aphysio_unlock+0x6c()
000002a10175d021 aio_cleanup_cleanupq+0x2c()
000002a10175d0d1 aio_cleanup+0xc4()
000002a10175d181 aio_cleanup_thread+0x9c()
000002a10175d231 kaioc+0xe0()
000002a10175d2f1 syscall_trap+0x88()
Now it seems like we've finally found the culprit.  aio_cleanup_cleanupq() takes
the aio_mutex lock like so:
[... source code removed ...]
We then call aphysio_unlock().  This particular aio has AIO_PAGELOCKDONE set, so
we go to unlock the page:
[... source code removed ...]
For as_pageunlock(), we need the address space lock.
The address space lock is held by thread 30013e05220, which is stuck waiting for
I/O to complete.  Now we're in an ugly deadlock, since the I/O can't complete
while the as lock is held.

1 Essentially, we write out all of kernel memory to a reserved place on disk. Usually this is swap space, but it can be configured through dumpadm(1M). When the system reboots, savecore(1M) extracts this information and puts it in a more permanent location.

2 There are many people in the Linux community, so obviously this blanket statement doesn’t apply to everyone. Whatever the local politics, I can’t see how this wasn’t implemented in Linux 0.1.0. Although it did take us until Solaris 7 to enable savecore(1M) by default.

3 We have customers that lose more than $100,000 per minute that their system is down. There is a reason why Solaris has to be bulletproof.

One of the most visible features that I have integrated into Solaris 10 is the ability to store pathnames with each open file1. This allows new avenues of observability that were previously inaccessible. First off, we simply have the files as symbolic links in /proc/<pid>/path:

$ ls -l /proc/`pgrep Firebird`/path | cut -b 55-
0 -> /devices/pseudo/mm@0:null
1 -> /home/eschrock/.dt/sessionlogs/machine_DISPLAY=:0
10 -> /usr/local/MozillaFirebird/chrome/comm.jar
11 -> /usr/local/MozillaFirebird/chrome/en-US.jar
12 -> /usr/local/MozillaFirebird/chrome/embed-sample.jar
13 -> /usr/local/MozillaFirebird/chrome/pipnss.jar
14 -> /usr/local/MozillaFirebird/chrome/pippki.jar
15 -> /usr/local/MozillaFirebird/chrome/US.jar
16 -> /usr/local/MozillaFirebird/chrome/en-unix.jar
17 -> /usr/local/MozillaFirebird/chrome/classic.jar
18 -> /usr/local/MozillaFirebird/chrome/toolkit.jar
19 -> /usr/local/MozillaFirebird/chrome/browser.jar
2 -> /home/eschrock/.dt/sessionlogs/machine_DISPLAY=:0
22 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/Cache/_CACHE_MAP_
23 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/Cache/_CACHE_001_
24 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/Cache/_CACHE_002_
25 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/Cache/_CACHE_003_
27 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/formhistory.dat
28 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/history.dat
29 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/cert8.db
30 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/key3.db
4 -> /var/run/name_service_door
5 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/XUL.mfasl
a.out -> /usr/local/MozillaFirebird/MozillaFirebird-bin
cwd -> /home/eschrock
root -> /
ufs.102.0.11082 -> /usr/lib/iconv/
ufs.102.0.11521 -> /usr/lib/iconv/
[ ... output elided ... ]

As usual, Mozilla Firebird has lots of interesting stuff open. You may notice that some of the file descriptors have no path information. This is likely because they refer to a socket or FIFO (there is a small chance they refer to a file that has since been moved). The pfiles(1) command has been modified to use this information, so you can now see the path with the rest of the goodies:

$ pfiles `pgrep Firebird`
286670: /usr/local/MozillaFirebird/MozillaFirebird-bin
Current rlimit: 512 file descriptors
0: S_IFCHR mode:0666 dev:200,0 ino:6815752 uid:0 gid:3 rdev:13,2
1: S_IFREG mode:0644 dev:210,1281 ino:346 uid:138660 gid:10 size:4164
2: S_IFREG mode:0644 dev:210,1281 ino:346 uid:138660 gid:10 size:4164
3: S_IFIFO mode:0666 dev:209,0 ino:9 uid:0 gid:1 size:0
4: S_IFDOOR mode:0444 dev:209,0 ino:52 uid:0 gid:0 size:0
O_RDONLY|O_LARGEFILE FD_CLOEXEC  door to nscd[100253]
5: S_IFREG mode:0644 dev:210,1281 ino:744 uid:138660 gid:10 size:747398
6: S_IFIFO mode:0000 dev:203,0 ino:119094 uid:138660 gid:10 size:0
7: S_IFIFO mode:0000 dev:203,0 ino:119094 uid:138660 gid:10 size:0
[ ... output elided ... ]

This should be enough to get most savvy sysadmins drooling. But wait, there’s more! This feature allowed the new DTrace io provider (integrated into build 60, aka Beta 5, aka SX 07/04) to get path name information for arbitrary files in the system. This allows you to do neat stuff like:

# cat iohog.d
#!/usr/sbin/dtrace -s
io:::start
{
        @[execname, args[2]->fi_pathname] = sum(args[0]->b_bcount);
}
# ./iohog.d
sched           /home/eschrock/.dt/sessionlogs/machine_DISPLAY=:0      4096
xlp             /var/adm/utmpx                                         4096
fsflush         /export/iso/solaris_4.iso                              73728
sched           <none>                                                 82432
cp              <none>                                                 114688
fsflush         <none>                                                 177152
cp              /export/iso/solaris_4.iso                              238936064
cp              /export/iso/solaris_1.iso                              239910912

For years we’ve had the iostat(1M) utility. It’s great to know that someone is hammering away on sd0, but that’s not really the question you want answered. What you really want to know is who is hammering away on your disks. With the DTrace io provider, we’ve taken it one step further by giving you the means to answer why someone is hammering away on your disks. All of a sudden one of the most opaque problems is now completely transparent. So head on over and check it out (while the io provider is not available in Solaris Express quite yet, the documentation for it is available on the DTrace page).

1 For the curious: Solaris implements a Virtual File System (VFS) layer, which includes the notion of a vnode to represent an arbitrary file. The filesystem-dependent part is stored in a format private to the filesystem implementation (think of it in terms of inheritance if it helps). To illustrate with crude ASCII art:

USERLAND        KERNEL VFS                         KERNEL FS
fd ----+----> file_t -----+----> vnode_t ------> inode_t /
       |                  |                      prnode_t /
fd ----+                  |                      etc.
fd ---------> file_t -----+

We store a (char *) pointer at the end of the vnode_t when we go to look up the file, and now we have path information for all the open files in the kernel (even those implicitly mapped into process address space, without an associated file_t). There are some subtleties with hard links and moving files around, but it works perfectly 99% of the time, which is all we can hope for in this case.

Before I start looking at some of the problems we’re addressing in Solaris, I want to step back and examine one of the more fundamental problems I’ve been seeing in the industry. In order to develop more powerful software quickly, we insert layers of abstraction to distance ourselves from the actual implementation (let someone else worry about it). There is nothing inherently wrong with this; no one is going to argue that you should write your business-critical web service in assembly language instead of Java using J2EE. The problem comes from the disturbing trend that programmers are increasingly less knowledgeable about the layers upon which they build.

Most people can sit down and learn how to program Java or C, given enough time. The difference between an average programmer and a gifted programmer is the ability to truly understand the levels above and below where one works. For the majority of us1, this means understanding two things: our development platform and our customers. While understanding customer needs is a difficult task, a more tragic problem is the failure of programmers to understand their immediate development environment.

If you are a C programmer, it is crucial that you understand how virtual memory works, what exactly happens when you write to a file descriptor, and how threads are implemented. You should understand what a compiler does, how the dynamic linker works, and how the assembly code the compiler generates really works. If you are a Java programmer, you need to understand how garbage collection works, how Java bytecode is interpreted, and what JIT compilation really does. Just because you don’t need to know how the OS works doesn’t mean you shouldn’t. Knowing your environment encourages good software practice and makes you more effective at solving problems when they do occur.

Unfortunately, everything in the world is not somebody else’s fault. As these new layers are added, things tend to become less and less observable. Not to mention that poor documentation can ruin an otherwise great tool. You may understand how a large number of cross calls can be the product of a misbehaving application, but if you can’t determine where they’re coming from, what’s the point? We in the Solaris group develop tools to provide useful layers of abstraction, as well as tools that rip the hood off2 so you can see what’s really happening inside.

Before I start examining some of these tools, I just wanted to point out that there is a large human factor involved that is out of our hands. The most powerful tool in the world can be useless in the hands of someone without a basic understanding of their system. Hopefully, by providing these tools, we can expose the inner workings while simultaneously sparking a desire to learn about them. So take some time to read a good book once in a while.

1For us kernel developers, the landscape is a little different. We have to work with a multitude of different hardware (development platforms) in order to provide an effectively limitless number of solutions for our customers. I like to think that kernel engineers are pound-for-pound the best group of programmers around because of this, but maybe that’s just my ego talking.

2Scott is in the habit of talking about how we sell cars, not auto parts. Does that mean we also provide the Jaws of Life to save you after you crash your new Enzo?

So begins my first blog post ever.

I have been a Solaris Kernel Engineer for 10 months now after graduating from my alma mater. Since I joined so late in the Solaris 10 development cycle, I have not had the pleasure of working on one of the larger S10 projects such as DTrace, Zones (N1 Grid Containers), FMA (Predictive Self Healing), or ZFS. But this has given me the unique opportunity to attack bits and pieces of Solaris from all directions. In particular, I have spent more than a few lonely nights with mdb, procfs, and the ptools. I’ve enjoyed growing up in this playground built by Mike Shapiro, Adam Leventhal, Roger Faulkner, and those who came before me. More recently, I have been drafted into service for the AMD64 (Opteron) army, selflessly sacrificing my free time for the good of our porting effort.

From here, I will most likely continue to post about Solaris development as well as general software principles. You’ll likely see a focus on software observability, debugging, and complexity. This comes with the territory, as you can see from Bryan’s blog. It is not a coincidence that we kernel engineers share similar views and goals. It is an essential part of the philosophy that makes Solaris what it is today: a robust, reliable, manageable, serviceable, and observable operating system.
