Eric Schrock's Blog

Month: March 2005

This little gem came up in conversation last night, and it was suggested that it would make a rather amusing blog entry. A Solaris project had a command line utility with the following, unspeakably horrid, piece of code:

/*
 * Use the dynamic linker to look up the function we should
 * call.
 */
(void) snprintf(func_name, sizeof (func_name), "do_%s", cmd);
func_ptr = (int (*)(int, char **))dlsym(RTLD_DEFAULT, func_name);
if (func_ptr == NULL) {
	fprintf(stderr, "Unrecognized command %s\n", cmd);
	return (-1);
}

return ((*func_ptr)(argc, argv));

So when you type “a.out foo”, the command would sprintf into a buffer to make “do_foo”, and rely on the dynamic linker to find the appropriate function to call. Before I get a stream of comments decrying the idiocy of Solaris programmers: the code will never reach the Solaris codebase, and the responsible party no longer works at Sun. The participants at the dinner table were equally disgusted that this piece of code came out of our organization. Suffice it to say that this is much better served by a table:

for (i = 0; i < sizeof (func_table) / sizeof (func_table[0]); i++) {
	if (strcmp(func_table[i].name, cmd) == 0)
		return (func_table[i].func(argc, argv));
}

fprintf(stderr, "Unrecognized command %s\n", cmd);
return (-1);

I still can’t imagine the original motivation for this code. It is more code, harder to understand, and likely slower (depending on the number of commands and how much you trust the dynamic linker’s hash table). We continually preach software observability and transparency – but I never thought I’d see obfuscation of this magnitude within a 500 line command line utility. This prevents us from even searching for callers of do_foo() using cscope.

This serves as a good reminder that the most clever way of doing something is usually not the right answer. Unless you have a really good reason (such as performance), being overly clever will only make your code more difficult to maintain and more prone to error.

Update – Since some people seem a little confused, I thought I’d elaborate on two points. First off, there is no loadable library; this is a library linked directly into the application, so there is no need to update the commands asynchronously. Second, the proposed function table does not have to live separately from the code. It would be quite simple to put the function table in the same file as the function definitions, which would improve maintainability and understandability by an order of magnitude.
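To make that concrete, here is a minimal sketch of what such a table might look like, kept in the same file as the command implementations. The struct layout, the dispatch() wrapper, and the do_foo()/do_bar() commands are all illustrative, not code from the original utility:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical command implementations, defined in this same file. */
static int do_foo(int argc, char **argv) { (void) argc; (void) argv; return (0); }
static int do_bar(int argc, char **argv) { (void) argc; (void) argv; return (0); }

/* The table lives right next to the functions it names. */
static const struct {
	const char *name;
	int (*func)(int, char **);
} func_table[] = {
	{ "foo", do_foo },
	{ "bar", do_bar },
};

static int
dispatch(const char *cmd, int argc, char **argv)
{
	size_t i;

	for (i = 0; i < sizeof (func_table) / sizeof (func_table[0]); i++) {
		if (strcmp(func_table[i].name, cmd) == 0)
			return (func_table[i].func(argc, argv));
	}
	(void) fprintf(stderr, "Unrecognized command %s\n", cmd);
	return (-1);
}
```

With this layout, cscope and grep can find every caller of do_foo() directly, and adding a command is a one-line table change.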

In the last few days you may have noticed that Google released a site filled with Open Source applications and interfaces. First off, kudos to the Google guys for putting this together. It’s always great to see a company open sourcing their tools, as well as encouraging open standards to take advantage of their services.

That being said, I found the google coredumper particularly amusing. From the google page:

coredumper: Gives you the ability to dump cores from programs when it was previously not possible.

Being very close to the debugging tools on Solaris, I was a little taken aback by this statement. On Solaris, the gcore(1) command has always been a supported tool for generating standard Solaris core files readable by any debugger. Seeing as how I can’t imagine a UNIX system without this tool, I went looking in some old source trees to find out when it was originally written. While the current Solaris version has been re-written over the course of time, I did find this comment buried in the old SunOS 3.5 source:

/*
 * gcore - get core images of running processes
 * Author: Eric Cooper
 * Written: Fall 1981.
 * Inspired by a version 6 program by Len Levin, 1978.
 * Several pieces of code lifted from Bill Joy's 4BSD ps.
 */

So this tool has been a standard part of UNIX since 1981, and based on sources as old as 1978. This is why the statement that it was “previously not possible” on Linux seemed shocking to me. Just to be sure, I logged into one of our machines running Linux and tried poking around:

$ find /usr/bin -name "*core*"

No luck. Intrigued, I took a look at the google project. From the included README:

The coredumper library can be compiled into applications to create
core dumps of the running program, without having to terminate
them. It supports both single- and multi-threaded core dumps, even if
the kernel does not have native support for multi-threaded core files.

So the design goal appears to be slightly different: being able to dump core from within the program itself. On Solaris, I would just fork/exec a copy of gcore(1), or use the (unfortunately private) libproc interface to do so. I find it hard to believe that there are kernels out there without support for multi-threaded core files, though. I did a quick Google search for ‘gcore linux’ and turned up a few mailing list articles here, here, and here. I then downloaded the latest GDB sources, and sure enough there is a “gcore” command. I went back to our lab machine and tested it out with gdb 5.1, and it worked. But reading back the file was not as successful:

# gdb -p `pgrep nscd`
(gdb) info threads
7 Thread 5126 (LWP 1018)  0x420e7fc2 in accept () from /lib/i686/
6 Thread 4101 (LWP 1017)  0x420e7fc2 in accept () from /lib/i686/
5 Thread 3076 (LWP 1016)  0x420e7fc2 in accept () from /lib/i686/
4 Thread 2051 (LWP 1015)  0x420e0037 in poll () from /lib/i686/
3 Thread 1026 (LWP 1014)  0x420e7fc2 in accept () from /lib/i686/
2 Thread 2049 (LWP 1013)  0x420e0037 in poll () from /lib/i686/
1 Thread 1024 (LWP 1007)  0x420e7fc2 in accept () from /lib/i686/
(gdb) bt
#0  0x420e7fc2 in accept () from /lib/i686/
#1  0x40034603 in accept () from /lib/i686/
#2  0x0804acd5 in geteuid ()
#3  0x4002ffef in pthread_start_thread () from /lib/i686/
(gdb) gcore
Saved corefile core.1014
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
# gdb core.1014
"/tmp/core.1014": not in executable format: File format not recognized
(gdb) quit
# gdb /usr/sbin/nscd core.1014
Core was generated by `/usr/sbin/nscd'.
Program terminated with signal 17, Child status changed.
#0  0x420e0037 in poll () from /lib/i686/
(gdb) info threads
7 process 67109878  0x420e7fc2 in accept () from /lib/i686/
6 process 134284278  0x420e0037 in poll () from /lib/i686/
5 process 67240950  0x420e7fc2 in accept () from /lib/i686/
4 process 134415350  0x420e7fc2 in accept () from /lib/i686/
3 process 201589750  0x420e7fc2 in accept () from /lib/i686/
2 process 268764150  0x420e7fc2 in accept () from /lib/i686/
* 1 process 335938550  0x420e0037 in poll () from /lib/i686/
(gdb) bt
#0  0x420e0037 in poll () from /lib/i686/
#1  0x0804aca8 in geteuid ()
#2  0x4002ffef in pthread_start_thread () from /lib/i686/
(gdb) quit

This whole exercise was rather distressing, and brought me straight back to college, when I had to deal with gdb on a regular basis (Brown moved to Linux my senior year, and I was responsible, together with Rob, for porting the Brown Simulator and Weenix OS from Solaris). Everything seemed fine when first attaching to the process, and the gcore command appeared to work. But when reading back a corefile, gdb can’t understand the corefile on its own, the process/thread IDs have been completely garbled, and I’ve lost floating point state (not shown above). It makes me glad that we have MDB and configurable corefile content in Solaris 10.

This is likely an unfair comparison, since it uses GDB version 5.1 when the latest is 6.3, but at least it validates the existence of the google library. I always pay attention to debugging tools around the industry, but it seems I need a little more hands-on experience to really gauge the current state of affairs. I’ll have to get access to a system running a more recent version of GDB to see if it is any better before drawing any definitive conclusions. Then again, Solaris has had a working gcore(1) and mdb(1)/adb(1) since the SunOS days back in the 80s, so I don’t see why I should have to lower my expectations just because it’s GNU/Linux.
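As an aside, the fork/exec approach to gcore(1) I mentioned earlier is simple enough to sketch. This is a rough illustration, not production code: error handling is minimal, the helper name is mine, and it assumes the tool is on $PATH:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Run an external tool (e.g. "gcore") against our own pid and
 * return its exit status, or -1 on fork/wait failure.
 */
static int
run_on_self(const char *tool)
{
	char pidstr[16];
	pid_t child;
	int status;

	(void) snprintf(pidstr, sizeof (pidstr), "%d", (int)getpid());

	if ((child = fork()) == -1)
		return (-1);
	if (child == 0) {
		(void) execlp(tool, tool, pidstr, (char *)NULL);
		_exit(127);	/* exec failed */
	}
	if (waitpid(child, &status, 0) == -1)
		return (-1);
	return (WIFEXITED(status) ? WEXITSTATUS(status) : -1);
}
```

Calling run_on_self("gcore") on a system that ships gcore would leave a core.&lt;pid&gt; file behind without terminating the process, which is essentially what the google library offers in-process.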

I know it’s been a long time since I posted a blog entry, but I’ve either been too busy, out of the country, working on (not yet public) projects, or fixing relatively uninteresting bugs. Last week, though, I finally nailed down a nasty bug that had been haunting me for several weeks, so I thought I’d share some of the experience. I apologize if this post gets a little too technical and/or incomprehensible, but I found it to be an interesting exercise, and hopefully sharing it will get me back into the blogging spirit.

First, a little background. In Solaris, we have a set of kernel functions known as ‘copyops’ used to transfer data between the kernel and userland. In order to support watchpoints and SunCluster, we maintain a backup vector of functions used when one of these fails. For example, if you have a piece of data on a watched page, we keep that page entirely unmapped. If the kernel tries to read data from this page, the copyin() function will initially fail before falling back on watch_copyin(). This temporarily maps in the page, does the copy (triggering a watchpoint if necessary), and then unmaps the page. In this way, the average kernel consumer has no idea that there was a watched area on the page.
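A rough user-space model of that fallback scheme might look like the following. The real copyops vector lives inside the kernel and is hung off the thread; every name here is illustrative, and "page unmapped" is simulated with a NULL source:

```c
#include <stddef.h>
#include <string.h>

typedef int (*copyin_fn)(const void *, void *, size_t);

/* Modeled backup vector; in the kernel this would be per-thread. */
struct copyops {
	copyin_fn cp_copyin;
};

static struct copyops *t_copyops;

/* Default copyin: fails when the "page" (a NULL src, here) is unmapped. */
static int
default_copyin(const void *src, void *dst, size_t len)
{
	if (src == NULL)
		return (-1);
	(void) memcpy(dst, src, len);
	return (0);
}

/* Backup routine: the real one would map the page in, copy, and unmap. */
static int
backup_copyin(const void *src, void *dst, size_t len)
{
	(void) src;
	(void) memset(dst, 0, len);
	return (0);
}

static struct copyops backup_ops = { backup_copyin };

/* Try the fast path first; on failure, fall back to the installed vector. */
static int
copyin_model(const void *src, void *dst, size_t len)
{
	if (default_copyin(src, dst, len) == 0)
		return (0);
	if (t_copyops != NULL)
		return (t_copyops->cp_copyin(src, dst, len));
	return (-1);
}
```

The point of the design is that ordinary callers of copyin() never see the fallback happen.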

Clustering uses this facility in their pxfs (proxy filesystem) implementation. In order to support ioctl() calls that access an unspecified amount of memory, they use the copyops vector to translate any reads or writes into over-the-wire requests for the necessary data. These requests are always done from kernel threads, with no attached user space, so any attempt to access userland should fault before vectoring off to their copyops vector.

OK, on to the bug. During testing, SunCluster folks found that they were getting essentially random memory corruption during some ioctl() calls over pxfs on SPARC machines. After trying in vain to understand the crash dumps, the Clustering folks were able to reproduce the problem on DEBUG bits. In addition to getting traptrace output (a black-box style record of OS traps), the kernel failed an ASSERT() deep in the sfmmu HAT (Spitfire Memory Management Unit Hardware Address Translation) layer during a copyin() call. This smoking gun pointed straight to the copyops. We expect a kernel thread accessing userland to generate a T_DATA_EXCEPTION trap, but instead we were getting a T_DATA_MMU_MISS trap, which the HAT was not prepared to handle (nor should it have to).

I spent nearly a week enhancing my copyops test suite and following several wrong paths deep into SPARC trap tables and the HAT layer. But no amount of testing could reproduce the problem. Finally, I noticed that we had reached the sfmmu assertion as a kernel thread, but our secondary ASI was set to INVALID_CONTEXT instead of KCONTEXT. On SPARC, all addresses are implicitly tagged with an ASI (address space identifier) that lets us refer to kernel addresses and user addresses without having to share the address space like we do on x86. All kernel threads are supposed to use KCONTEXT (0) as their secondary ASI. INVALID_CONTEXT (1) is reserved for userland threads in various invalid states. Needless to say, this was confusing.

I knew that somehow we were setting the secondary ASI improperly, or forgetting to set it when we should. I began adding some ASSERTs to a custom kernel and quickly ruled out the former. Finally I booted a kernel with some debug code added to resume(), and panicked almost instantly. It was clear that we were coming out of resume() as a kernel thread, but with INVALID_CONTEXT as our secondary ASI. Many hours of debugging later, I finally found my culprit in resume_from_zombie(), which is used when resuming from an exited thread. When a user thread is exiting, we re-parent to p0 (the kernel ‘sched’ process) and set our secondary ASI to INVALID_CONTEXT. If, in resume(), we switch from one of these threads to another kernel thread, we see that they both belong to the same process (p0) and don’t bother to re-initialize the secondary ASI. We even have a function, hat_thread_exit(), designed to do exactly this, only it was a no-op on SPARC. I added a call to sfmmu_setctx_sec() to this function, and the problem disappeared. Technically, this has been a bug since the dawn of time, but it had no ill side effects until I changed the way the copyops were used, and SunCluster began testing on S10.
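For the curious, here is a drastically simplified model of the bug. The real logic lives in the SPARC resume() path and the sfmmu HAT; the structs, globals, and constants below are illustrative stand-ins, not the actual kernel code:

```c
#define	KCONTEXT	0
#define	INVALID_CONTEXT	1

struct mthread {
	int t_proc;	/* stand-in for the owning process (p0 == 0) */
	int t_ctx;	/* secondary context this thread should run with */
};

static int sec_ctx;	/* modeled secondary-ASI context register */

/*
 * Modeled buggy resume(): as an optimization, it skips reloading the
 * secondary context when the incoming thread belongs to the same
 * process as the outgoing one.
 */
static void
resume_model(const struct mthread *old, const struct mthread *incoming)
{
	if (old->t_proc != incoming->t_proc)
		sec_ctx = incoming->t_ctx;
}

/*
 * Modeled fix: the thread-exit hook unconditionally restores KCONTEXT,
 * so a stale INVALID_CONTEXT can never leak to the next kernel thread.
 */
static void
hat_thread_exit_model(void)
{
	sec_ctx = KCONTEXT;
}
```

In the buggy version, an exiting user thread reparented to p0 leaves INVALID_CONTEXT behind, and a subsequent switch to a p0 kernel thread takes the "same process" shortcut and inherits it, which is exactly the state the sfmmu assertion caught.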

Besides the sheer amount of debugging effort, this bug was interesting for several reasons:

  • It was impossible to root cause on a non-DEBUG kernel. While we try to make the normal kernel as debuggable as possible, memory corruption (especially due to corruption in the HAT layer) is one of those problems that needs to be caught as close to the source as possible. Solaris has a huge amount of debug code, as well as facilities like traptrace that can only be enabled on a debug kernel due to performance overhead.
  • The cause of the problem was separated from the symptom by an arbitrary period of time. Once we switched to a kernel thread with a bad ASI, we could harmlessly switch between any number of kernel threads before we find one that actually tries to access userland.
  • It was completely unobservable in constrained test scenarios. We not only needed to create kernel threads that accessed userland, but we needed to have a userland thread exit and then switch immediately to one of these threads. Needless to say, this is not easy to reproduce, especially when you don’t understand exactly what’s going on.
  • This would have been nearly unsolvable on most other OSes. Without a kernel debugger, post mortem crash dump analysis, and tools like DTrace and traptrace records, I doubt I could have ever solved this problem. This is one of those situations where a stack trace and a bunch of printf() calls would never have solved the problem.

While this wasn’t the most difficult problem I’ve ever had to debug, it certainly ranks up there in recent memory.
