Eric Schrock's Blog

Category: OpenSolaris

In past comments, it has been pointed out that a transition guide between GDB and MDB would be useful to some developers out there. A full comparison would also cover dbx(1), but I’ll defer this to a later point. Given the number of available commands, I’ll be dividing up this post into at least two pieces.

Before diving into too much detail, it should be noted that MDB and GDB have slightly different design goals. MDB (and KMDB) replaced the aging adb(1) and crash(1M), and was designed primarily for post-mortem analysis and live kernel analysis. To this end, MDB presents the same interface when debugging a crash dump as when examining a live kernel. Solaris corefiles have been enhanced so that all the information for the process (including library text and type information) is present in the corefile. MDB can examine and run live processes, but lacks some of the features (source level debugging, STABS/DWARF support, conditional breakpoints, scripting language) that are standard for developer-centric tools like GDB (or dbx). GDB was designed for interactive process debugging. While you can use GDB on corefiles (and even LKCD crash dumps or Linux kernels – locally and remotely), you often need the original object files to take advantage of GDB’s features.

Before going too far into MDB, be sure to check out Jonathan’s MDB Cheatsheet as a useful quick reference guide, with some examples of stringing together commands into pipelines. Seeing as how I’m not the most accomplished GDB user in the world, I’ll be basing this comparison on the equivalent GDB reference card.

GDB                          MDB

Starting Up

gdb program                  mdb path
                             mdb -p pid
    Start debugging a command or running process. GDB will treat numeric
    arguments as pids, while MDB explicitly requires the -p option.

gdb program core             mdb [ program ] core
    Debug a corefile associated with 'program'. For MDB, the program is
    optional, and is generally unnecessary given the corefile enhancements
    made during Solaris 10.

Exiting

quit                         ::quit
    Both programs also exit on Ctrl-D.

Getting Help

help                         ::help
help command                 ::help dcmd
                             ::dcmds
                             ::walkers
    In MDB, you can list all of the available walkers or dcmds, as well as
    get help on a specific dcmd. Another useful trick is '::dmods -l module',
    which lists the walkers and dcmds provided by a specific module.

Running Programs

run arglist                  ::run arglist
    Run the program with the given arguments. If the target is currently
    running, or is a corefile, MDB will restart the program if possible.

kill                         ::kill
    Forcibly kill and release the target.

show env                     ::getenv
    Display the current environment.

set env var string           ::setenv var=string
    Set an environment variable.

show env var                 ::getenv var
    Get a specific environment variable.

Shell Commands

shell cmd                    !cmd
    Execute the given shell command.

Breakpoints and Watchpoints

break func                   addr::bp
break *addr
    Set a breakpoint at the given address or function.

break file:line              -
    Break at the given line of the file. MDB does not support source level
    debugging.

break ... if expr            -
    Set a conditional breakpoint. MDB doesn't support conditional
    breakpoints, though you can get a close approximation via the -c option
    (though it's complicated enough to warrant its own post).

watch expr                   addr::wp -rwx [-L size]
    Set a watchpoint on the given region of memory.

info break                   ::events
info watch
    Display active watchpoints and breakpoints. MDB will show you signal
    events as well.

delete [n]                   ::delete n
    Delete the given breakpoint or watchpoint.
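
To tie a few of these rows together, a hypothetical MDB session against a program ./a.out might look like the following (commands only, output omitted; the event ID of 1 passed to ::delete is purely illustrative):

$ mdb ./a.out
> main::bp
> ::run
> ::events
> ::delete 1
> ::quit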

I think that's enough for now; hopefully the table is at least readable. More to come in a future post.

There are many bugs out there that are interesting, either because of an implementation detail or because of the debugging necessary to root cause the problem. As you may have noticed, I like to publicly expound upon the most interesting ones I’ve fixed (as long as they’re not security vulnerabilities). This week turned up a rather intriguing specimen:

6198523 dirfindvp() can erroneously return ENOENT

This bug was first spotted by Casper back in November last year while trying to do some builds on ZFS. The basic pathology is that at some point during the build, we’d get error messages like:

sh: cannot determine current directory

Some ideas were kicked around by the ZFS team, and when the problem seemed to go away, the team assumed that some recent batch of changes had fixed it as a side effect. Five months later, Jonathan hit the same bug on another build machine running ZFS. Since I wrote the getcwd() code, I was determined to root cause the problem this time around.

Back in build 56 of S10, I moved getcwd(3c) into the kernel, along with changes to store pathnames with vnodes (which is used by the DTrace I/O provider as well as pfiles(1)). Basically, we first try to do a forward lookup on the stored pathname; if that works, then we simply return the resolved path [1]. If this fails (vnode paths are never guaranteed to be correct), then we have to fall back into the slow path. This slow path involves looking up the parent, finding the current vnode in the parent, prepending that name to the path, and repeating. Once we reach the root of the filesystem, we have a complete path.
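
To make the slow path concrete, here is a rough userland analogue of the algorithm (my own sketch for illustration; the kernel's dirtopath() and dirfindvp() operate on vnodes rather than on pathnames, and handle many details omitted here):

/*
 * Userland sketch of the "slow path": walk up the tree via "..", and at
 * each level scan the parent directory for the entry whose dev/inode pair
 * matches the child, prepending that name to the result.
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

static int
slow_getcwd(char *buf, size_t len)
{
        char dir[PATH_MAX] = ".";       /* directory we are resolving */
        char path[PATH_MAX] = "";       /* accumulated result */
        struct stat cur, parent, ent;

        for (;;) {
                if (stat(dir, &cur) != 0)
                        return (-1);
                (void) strlcat(dir, "/..", sizeof (dir));
                if (stat(dir, &parent) != 0)
                        return (-1);

                /* If "." and ".." are the same object, we've hit the root. */
                if (cur.st_dev == parent.st_dev && cur.st_ino == parent.st_ino)
                        break;

                /* Find the child's name in the parent (cf. dirfindvp()). */
                DIR *dp;
                struct dirent *de;
                char name[PATH_MAX] = "";

                if ((dp = opendir(dir)) == NULL)
                        return (-1);
                while ((de = readdir(dp)) != NULL) {
                        char full[PATH_MAX];

                        (void) snprintf(full, sizeof (full), "%s/%s",
                            dir, de->d_name);
                        /*
                         * A failed stat() here is the analogue of the ENOENT
                         * lookup in the bug: the entry may have just been
                         * removed, so skip it rather than giving up.
                         */
                        if (stat(full, &ent) != 0)
                                continue;
                        if (ent.st_dev == cur.st_dev &&
                            ent.st_ino == cur.st_ino) {
                                (void) strlcpy(name, de->d_name,
                                    sizeof (name));
                                break;
                        }
                }
                (void) closedir(dp);
                if (name[0] == '\0')
                        return (-1);

                /* Prepend "/name" to what we've built so far. */
                char tmp[PATH_MAX];
                (void) snprintf(tmp, sizeof (tmp), "/%s%s", name, path);
                (void) strlcpy(path, tmp, sizeof (path));
        }

        (void) strlcpy(buf, path[0] != '\0' ? path : "/", len);
        return (0);
}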

To debug this problem, I used this D script to track the behavior of dirtopath(), the function that performs the dirty work of the slow path. Running this for a while produced a tasty bit of information:

dirtopath       /export/ws/build/usr/src/cmd/sgs/ld
lookup(/export/ws/build/usr/src/cmd, .make.dependency.8309dfdc.234596.166) failed (2)
dirfindvp(/export/ws/build/usr/src/cmd,/export/ws/build/usr/src/cmd/sgs) failed (2)
dirtopath() returned 2

Looking at this, it was clear that dirfindvp() (which finds a given vnode in its parent) was inappropriately failing. In particular, after a failed lookup for a temporary make file, we bail out of the loop and report failure, despite the fact that “sgs” is still sitting there in the directory. A long look at the code revealed the problem. Without revealing too much of the code (OpenSolaris, where are you?), it’s essentially structured like so:

while (!err && !eof) {
        /* ... */
        while ((intptr_t)dp < (intptr_t)dbuf + dbuflen) {
                /* ... */
                /*
                 * We only want to bail out if there was an error other
                 * than ENOENT.  Otherwise, it could be that someone
                 * just removed an entry since the readdir() call, and
                 * the entry we want is further on in the directory.
                 */
                if (err != ENOENT) {
                        break;
                }
        }
}

The code is trying to avoid exactly our situation: we fail to do a lookup of a file we just saw because the contents are rapidly changing. The bug is that the outer while loop checks for !err && !eof. If we fail to look up an entry, and it’s the last entry in the chunk we just read, then we’ll prematurely bail out of the enclosing while loop, returning ENOENT when we shouldn’t. Using this test program, it’s easy to reproduce on both ZFS and UFS. There are several noteworthy aspects of this bug:

  • The bug had been in the gate for over a year, and there hadn’t been a single reported build failure.

  • It only happens when the cached vnode value is invalid, which is rare [2].

  • It is a race condition between readdir, lookup, and remove.

  • On UFS, inodes are marked as deleted but can still be looked up until the delete queue is processed at a later point. ZFS deletes entries immediately, so this was much more apparent on ZFS.

  • Because of the above, it was incredibly transient. It would have taken an order of magnitude more time to root cause if not for DTrace, which excels at catching exactly these sorts of transient phenomena.

A three-line change fixed the bug, and the fix will make it back to S10 in time for Update 1. If it hadn’t been for those among us willing to run our builds on top of ZFS, this problem would not have been found until after ZFS integrated, or until a customer escalation cost the company a whole bunch of money.
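
For the curious, the shape of the fix looks roughly like this (a sketch of the idea only, not necessarily the exact change that integrated):

/*
 * Sketch of the fix, not necessarily the actual change: the "did we
 * actually find the entry" bookkeeping is omitted, as in the excerpt above.
 */
while (!err && !eof) {
        /* ... read the next chunk of directory entries ... */
        while ((intptr_t)dp < (intptr_t)dbuf + dbuflen) {
                /* ... look up dp->d_name; on failure, err is set ... */
                if (err != ENOENT) {
                        break;
                }
        }
        /*
         * If the last lookup in this chunk failed with ENOENT, the entry
         * was probably just removed; clear the error so the outer loop
         * keeps scanning the rest of the directory instead of bailing out.
         */
        if (err == ENOENT)
                err = 0;
}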


[1] There are many more subtleties here relating to Zones, and verifying that the path hasn’t been changed to refer to another file. The curious among you will have to wait for OpenSolaris.

[2] I haven’t yet investigated why we ended up in the slow path in this case. First things first.

In the last few days you may have noticed that Google released a site filled with Open Source applications and interfaces. First off, kudos to the Google guys for putting this together. It’s always great to see a company open sourcing their tools, as well as encouraging open standards to take advantage of their services.

That being said, I found the Google coredumper particularly amusing. From the Google page:

coredumper: Gives you the ability to dump cores from programs when it was previously not possible.

Being very close to the debugging tools on Solaris, I was a little taken aback by this statement. On Solaris, the gcore(1) command has always been a supported tool for generating standard Solaris core files readable by any debugger. Seeing as how I can’t imagine a UNIX system without this tool, I went looking in some old source trees to find out when it was originally written. While the current Solaris version has been re-written over the course of time, I did find this comment buried in the old SunOS 3.5 source:

/*
* gcore - get core images of running processes
*
* Author: Eric Cooper
* Written: Fall 1981.
*
* Inspired by a version 6 program by Len Levin, 1978.
* Several pieces of code lifted from Bill Joy's 4BSD ps.
*/

So this tool has been a standard part of UNIX since 1981, and based on sources as old as 1978. This is why the statement that it was “previously not possible” on Linux seemed shocking to me. Just to be sure, I logged into one of our machines running Linux and tried poking around:

$ find /usr/bin -name "*core*"
$

No luck. Intrigued, I took a look at the google project. From the included README:

The coredumper library can be compiled into applications to create
core dumps of the running program, without having to terminate
them. It supports both single- and multi-threaded core dumps, even if
the kernel does not have native support for multi-threaded core files.

So the design goal appears to be slightly different: being able to dump core from within the program itself. On Solaris, I would just fork/exec a copy of gcore(1), or use the (unfortunately private) libproc interface to do so. I find it hard to believe that there are kernels out there without support for multi-threaded core files, though. I did a quick Google search for ‘gcore linux’, and turned up a few mailing list articles here, here, and here. I went and downloaded the latest GDB sources, and sure enough there is a “gcore” command. I went back to our lab machine and tested it out with gdb 5.1, and it appeared to work. But reading back the file was not as successful:

# gdb -p `pgrep nscd`
...
(gdb) info threads
7 Thread 5126 (LWP 1018)  0x420e7fc2 in accept () from /lib/i686/libc.so.6
6 Thread 4101 (LWP 1017)  0x420e7fc2 in accept () from /lib/i686/libc.so.6
5 Thread 3076 (LWP 1016)  0x420e7fc2 in accept () from /lib/i686/libc.so.6
4 Thread 2051 (LWP 1015)  0x420e0037 in poll () from /lib/i686/libc.so.6
3 Thread 1026 (LWP 1014)  0x420e7fc2 in accept () from /lib/i686/libc.so.6
2 Thread 2049 (LWP 1013)  0x420e0037 in poll () from /lib/i686/libc.so.6
1 Thread 1024 (LWP 1007)  0x420e7fc2 in accept () from /lib/i686/libc.so.6
(gdb) bt
#0  0x420e7fc2 in accept () from /lib/i686/libc.so.6
#1  0x40034603 in accept () from /lib/i686/libpthread.so.0
#2  0x0804acd5 in geteuid ()
#3  0x4002ffef in pthread_start_thread () from /lib/i686/libpthread.so.0
(gdb) gcore
Saved corefile core.1014
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
# gdb core.1014.
...
"/tmp/core.1014": not in executable format: File format not recognized
(gdb) quit
# gdb /usr/sbin/nscd core.1014
...
Core was generated by `/usr/sbin/nscd'.
Program terminated with signal 17, Child status changed.
#0  0x420e0037 in poll () from /lib/i686/libc.so.6
(gdb) info threads
7 process 67109878  0x420e7fc2 in accept () from /lib/i686/libc.so.6
6 process 134284278  0x420e0037 in poll () from /lib/i686/libc.so.6
5 process 67240950  0x420e7fc2 in accept () from /lib/i686/libc.so.6
4 process 134415350  0x420e7fc2 in accept () from /lib/i686/libc.so.6
3 process 201589750  0x420e7fc2 in accept () from /lib/i686/libc.so.6
2 process 268764150  0x420e7fc2 in accept () from /lib/i686/libc.so.6
* 1 process 335938550  0x420e0037 in poll () from /lib/i686/libc.so.6
(gdb) bt
#0  0x420e0037 in poll () from /lib/i686/libc.so.6
#1  0x0804aca8 in geteuid ()
#2  0x4002ffef in pthread_start_thread () from /lib/i686/libpthread.so.0
(gdb) quit
#

This whole exercise was rather distressing, and brought me straight back to college when I had to deal with gdb on a regular basis (Brown moved to Linux my senior year, and I was responsible, together with Rob, for porting the Brown Simulator and Weenix OS from Solaris). Everything seemed fine when first attaching to the process; the gcore command appeared to work. But when reading the file back, gdb can’t understand a lone corefile, the process/thread IDs have been completely garbled, and I’ve lost floating point state (not shown above). It makes me glad that we have MDB, and configurable corefile content, in Solaris 10.
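
As an aside, dumping a core of your own process on Solaris really is as simple as fork/exec'ing gcore(1) on yourself. A minimal sketch (my illustration, not code from the original post):

/*
 * Minimal sketch: dump a core of the current process on Solaris by
 * fork/exec'ing gcore(1) on our own pid.  The process keeps running;
 * gcore writes <prefix>.<pid>.
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

static int
dump_self(const char *prefix)
{
        char pidstr[16];
        pid_t child;
        int stat;

        (void) snprintf(pidstr, sizeof (pidstr), "%d", (int)getpid());

        if ((child = fork()) == -1)
                return (-1);

        if (child == 0) {
                (void) execlp("gcore", "gcore", "-o", prefix, pidstr,
                    (char *)NULL);
                _exit(127);             /* exec failed */
        }

        while (waitpid(child, &stat, 0) == -1) {
                if (errno != EINTR)
                        return (-1);
        }

        return (WIFEXITED(stat) && WEXITSTATUS(stat) == 0 ? 0 : -1);
}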

This is likely an unfair comparison since it’s using GDB version 5.1, when the latest is 6.3, but at least it validates the existence of the Google library. I always pay attention to debugging tools around the industry, but it seems I need a little more hands-on experience to really gauge the current state of affairs. I’ll have to get access to a system running a more recent version of GDB to see if it is any better before drawing any definitive conclusions. Then again, Solaris has had a working gcore(1) and mdb(1)/adb(1) since the SunOS days back in the 80s, so I don’t see why I should have to lower my expectations just because it’s GNU/Linux.

I know it’s been a long time since I posted a blog entry; I’ve either been too busy, out of the country, working on (not yet public) projects, or fixing relatively uninteresting bugs. But last week I finally nailed down a nasty bug that had been haunting me for several weeks, so I thought I’d share some of the experience. I apologize if this post gets a little too technical and/or incomprehensible, but I found it to be an interesting exercise, and hopefully sharing it will get me back into the blogging spirit.

First a little background. In Solaris, we have a set of kernel functions known as ‘copyops’ used to transfer data between the kernel and userland. In order to support watchpoints and SunCluster, we maintain a backup vector of functions used when one of these fails. For example, if you have a piece of data on a watched page, we keep that page entirely unmapped. If the kernel tries to read data from this page, the copyin() function will initially fail, before falling back on watch_copyin(). This goes and temporarily maps in the page, does the copy (triggering a watchpoint if necessary), and then unmaps the page. In this way, the average kernel consumer has no idea that there was a watched area on the page.
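
Schematically, the fallback pattern looks something like this. This is a self-contained toy illustration of the idea; the names echo the real copyin() and copyops, but it is not the actual Solaris code:

/*
 * Toy illustration of the copyops fallback: the common copyin path tries
 * the normal case first, and only if that fails does it fall back to the
 * per-thread backup vector (watch_copyin() for watched pages, or a
 * clustering handler for pxfs kernel threads).
 */
#include <stddef.h>

typedef struct copyops {
        int     (*cp_copyin)(const void *uaddr, void *kaddr, size_t count);
} copyops_t;

/* Stand-in for the per-thread backup vector; NULL for ordinary threads. */
static const copyops_t *t_copyops;

static int
default_copyin(const void *uaddr, void *kaddr, size_t count)
{
        /* Pretend the normal path failed (e.g. the page is watched). */
        (void) uaddr; (void) kaddr; (void) count;
        return (-1);
}

int
copyin(const void *uaddr, void *kaddr, size_t count)
{
        if (default_copyin(uaddr, kaddr, count) == 0)
                return (0);

        /* Normal path failed: fall back on the backup vector, if any. */
        if (t_copyops != NULL && t_copyops->cp_copyin != NULL)
                return (t_copyops->cp_copyin(uaddr, kaddr, count));

        return (-1);
}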

Clustering uses this facility in their pxfs (proxy filesystem) implementation. In order to support ioctl() calls that access an unspecified amount of memory, they use the copyops vector to translate any reads or writes into over-the-wire requests for the necessary data. These requests are always done from kernel threads, with no attached user space, so any attempt to access userland should fault before vectoring off to their copyops vector.

OK, on to the bug. During testing, SunCluster folks found that they were getting essentially random memory corruption during some ioctl() calls over pxfs on SPARC machines. After trying in vain to understand the crash dumps, the Clustering folks were able to reproduce the problem on DEBUG bits. In addition to getting traptrace output (a black-box style record of OS traps), the kernel failed an ASSERT() deep in the sfmmu HAT (Spitfire Memory Management Unit Hardware Address Translation) layer during a copyin() call. This smoking gun pointed straight to the copyops. We expect a kernel thread accessing userland to generate a T_DATA_EXCEPTION trap, but instead we were getting a T_DATA_MMU_MISS trap, which the HAT was not prepared to handle (nor should it have to).

I spent nearly a week enhancing my copyops test suite, and following several wrong paths deep into SPARC trap tables and the HAT layer. But no amount of testing could reproduce the problem. Finally, I noticed that we had reached the sfmmu assertion as a kernel thread, but our secondary ASI was set to INVALID_CONTEXT instead of KCONTEXT. On SPARC, all addresses are implicitly tagged with an ASI (address space identifier) that lets us refer to kernel addresses and user addresses without having to share the address space like we do on x86. All kernel threads are supposed to use KCONTEXT (0) as their secondary ASI. INVALID_CONTEXT (1) is reserved for userland threads in various invalid states. Needless to say, this was confusing.

I knew that somehow we were setting the secondary ASI improperly, or forgetting to set it when we should. I began adding some ASSERTs to a custom kernel and quickly ruled out the former. Finally I booted a kernel with some debug code added to resume(), and panicked almost instantly. It was clear that we were coming out of resume() as a kernel thread, but with INVALID_CONTEXT as our secondary ASI. Many hours of debugging later, I finally found my culprit in resume_from_zombie(), which is used when resuming from an exited thread. When a user thread is exiting, we re-parent to p0 (the kernel ‘sched’ process) and set our secondary ASI to INVALID_CONTEXT. If, in resume(), we switch from one of these threads to another kernel thread, we see that they both belong to the same process (p0) and don’t bother to re-initialize the secondary ASI. We even have a function, hat_thread_exit(), designed to do exactly this, only it was a no-op on SPARC. I added a call to sfmmu_setctx_sec() to this function, and the problem disappeared. Technically, this has been a bug since the dawn of time, but it had no ill side effects until I changed the way the copyops were used, and SunCluster began testing on S10.

Besides the sheer amount of debugging effort, this bug was interesting for several reasons:

  • It was impossible to root cause on a non-DEBUG kernel. While we try to make the normal kernel as debuggable as possible, memory corruption (especially due to corruption in the HAT layer) is one of those problems that needs to be caught as close to the source as possible. Solaris has a huge amount of debug code, as well as facilities like traptrace that can only be enabled on a debug kernel due to performance overhead.
  • The cause of the problem was separated from the symptom by an arbitrary period of time. Once we switched to a kernel thread with a bad ASI, we could harmlessly switch between any number of kernel threads before we find one that actually tries to access userland.
  • It was completely unobservable in constrained test scenarios. We not only needed to create kernel threads that accessed userland, but we needed to have a userland thread exit and then switch immediately to one of these threads. Needless to say, this is not easy to reproduce, especially when you don’t understand exactly what’s going on.
  • This would have been nearly unsolvable on most other OSes. Without a kernel debugger, post mortem crash dump analysis, and tools like DTrace and traptrace records, I doubt I could have ever solved this problem. This is one of those situations where a stack trace and a bunch of printf() calls would never have solved the problem.

While this wasn’t the most difficult problem I’ve ever had to debug, it certainly ranks up there in recent memory.

A long time ago I described a debugging problem where it was necessary to determine which threads owned a reader lock. In particular, I used the heuristic that if the address of the rwlock is in a particular thread’s stack, then it is most likely held by that thread (and this can be verified by examining the thread’s stack). This works 99% of the time, because you typically have the following:

rw_enter(lock, RW_READER);
/* ... do something ... */
rw_exit(lock);

The compiler has to preserve the address of the lock across all the junk in the middle, so it almost always ends up getting pushed on the stack. As described in the previous post, this means a combination of ::kgrep and ::whatis, plus some hand-pruning, to get the threads in question. At the time, I mentioned how nice it would be to have a dedicated command do this dirty work. Now that Solaris 10 has shipped, I finally sat down and gave it a try. In a testament to MDB’s well-designed interfaces, I was able to write the entire command in under 5 minutes with just 50 lines of code. On top of that, it runs in a fraction of the time: rather than searching the entire address space, we only have to look at the stack for each thread. For example:

> c8d45bb6::kgrep | ::whatis
c8d45ae4 is c8d45aa0+44, allocated as a thread structure
cae92ed8 is in thread c8d45aa0's stack
cae92ee4 is in thread c8d45aa0's stack
cae92ef8 is in thread c8d45aa0's stack
cae92f24 is in thread c8d45aa0's stack
> c8d45bb6::whatthread
0xc8d45aa0
>

The simple output allows it to be piped to ::findstack to quickly locate questionable threads. There have been discussions about maintaining a very small set of held reader locks in the thread structure, but it’s a difficult problem to solve definitively (without introducing massive performance regressions).
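
To give a flavor of what those 50 lines look like, here is a rough sketch of such a dcmd written against the documented MDB module API (my reconstruction for illustration, not the actual ::whatthread source; the module linkage boilerplate is omitted):

/*
 * Sketch of a ::whatthread-style dcmd: walk every thread and print those
 * whose stack contains the given value.  The module linkage (_mdb_init()
 * and the mdb_dcmd_t table) is omitted.
 */
#include <sys/mdb_modapi.h>
#include <sys/thread.h>

static int
whatthread_cb(uintptr_t taddr, const void *data, void *arg)
{
        const kthread_t *t = data;
        uintptr_t value = *(uintptr_t *)arg;
        uintptr_t base = (uintptr_t)t->t_stkbase;
        uintptr_t top = (uintptr_t)t->t_stk;
        uintptr_t *stack, *ptr;
        size_t size;

        if (base == 0 || top <= base)
                return (WALK_NEXT);

        size = top - base;
        stack = mdb_alloc(size, UM_SLEEP | UM_GC);

        if (mdb_vread(stack, size, base) != (ssize_t)size) {
                mdb_warn("couldn't read stack of thread %p", taddr);
                return (WALK_NEXT);
        }

        /* Scan the stack for the value we were given. */
        for (ptr = stack; ptr < stack + size / sizeof (uintptr_t); ptr++) {
                if (*ptr == value) {
                        mdb_printf("%p\n", taddr);
                        break;
                }
        }

        return (WALK_NEXT);
}

static int
whatthread(uintptr_t addr, uint_t flags, int argc, const mdb_arg_t *argv)
{
        if (!(flags & DCMD_ADDRSPEC) || argc != 0)
                return (DCMD_USAGE);

        if (mdb_walk("thread", whatthread_cb, &addr) == -1) {
                mdb_warn("couldn't walk threads");
                return (DCMD_ERR);
        }

        return (DCMD_OK);
}

Compiled into a module and loaded with ::load, this produces the same pipeline-friendly list of thread pointers shown in the example above.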

This demonstrates an oft-overlooked benefit of MDB: though few developers outside of the Solaris group write modules, developing them is extremely simple and powerful (there are more than 500 commands and walkers in MDB today). Over time, I think I’ve almost managed to suppress all the painful GDB memories from my college years…

Today, I thought I’d share a real-world experience that might portray DTrace in a slightly different light than you’re used to. The other week, I was helping a customer with the following question:

Why is automountd constantly taking up 1.2% of CPU time?

The first thought that came to mind was a broken automountd. But if that were the case, you’d be more likely to see it spinning and stealing 100% of the CPU. Just to be safe, I asked the customer to send truss -u a.out:: output for the automountd process. As expected, I saw automountd chugging away, happily servicing each request as it came in. Automountd was doing nothing wrong – some process was indirectly sending millions of requests a day to the automounter. Taking a brief look at the kernel code, I responded with the following D script:

#!/usr/sbin/dtrace -s

auto_lookup_request:entry
{
        @lookups[execname, stringof(args[0]->fi_path)] = count();
}

The customer gave it a shot, and found a misbehaving program that was continuously restarting and causing loads of automount activity. Without any further help from me, the customer could easily see exactly which application was the source of the problem, and quickly fixed the misconfiguration.

Afterwards, I reflected on how simple this exchange was, and how difficult it would have been in the pre-Solaris 10 days. Now, I don’t expect customers to be able to come up with the above D script on their own (though industrious admins will soon be able to wade through OpenSolaris code). But I was able to resolve their problem in just two emails. I was reminded of the infamous gtik2_applet2 fiasco described in the DTrace USENIX paper: automountd was just a symptom of an underlying problem, part of an interaction that was prohibitively difficult to trace to its source. One could turn on automountd debug output, but you’d still only see the request itself, not where it came from. To top it off, the offending processes were so short-lived that they never showed up in prstat(1) output, hiding from traditional system-wide tools.

After a little thought, I imagined a few Solaris 9 scenarios where I’d either set a kernel breakpoint via kadb, or set a user breakpoint in automountd and use mdb -k to see which threads were waiting for a response. But these (and all other solutions I came up with) were:

  • Disruptive to the running system
  • Not guaranteed to isolate the particular problem
  • Difficult for the customer to understand and execute

It really makes me feel the pain our customer support staff must go through now to support Solaris 8 and Solaris 9. DTrace is such a fundamental change in the debugging and observability paradigm that it changes not only the way we kernel engineers work, but also the way people develop applications, administer machines, and support customers. Too bad we can’t EOL Solaris 8 and Solaris 9 next week for the benefit of Sun support…
