Eric Schrock's Blog

Google coredumper

March 18, 2005

In the last few days you may have noticed that Google released a site filled with Open Source applications and interfaces. First off, kudos to the Google guys for putting this together. It’s always great to see a company open sourcing their tools, as well as encouraging open standards to take advantage of their services.

That being said, I found the Google coredumper particularly amusing. From the Google page:

coredumper: Gives you the ability to dump cores from programs when it was previously not possible.

Being very close to the debugging tools on Solaris, I was a little taken aback by this statement. On Solaris, the gcore(1) command has always been a supported tool for generating standard Solaris core files readable by any debugger. Seeing as how I can’t imagine a UNIX system without this tool, I went looking in some old source trees to find out when it was originally written. While the current Solaris version has been re-written over the course of time, I did find this comment buried in the old SunOS 3.5 source:

* gcore - get core images of running processes
* Author: Eric Cooper
* Written: Fall 1981.
* Inspired by a version 6 program by Len Levin, 1978.
* Several pieces of code lifted from Bill Joy's 4BSD ps.

So this tool has been a standard part of UNIX since 1981, and is based on sources as old as 1978. This is why the statement that it was “previously not possible” on Linux seemed shocking to me. Just to be sure, I logged into one of our machines running Linux and tried poking around:

$ find /usr/bin -name "*core*"

No luck. Intrigued, I took a look at the Google project. From the included README:

The coredumper library can be compiled into applications to create
core dumps of the running program, without having to terminate
them. It supports both single- and multi-threaded core dumps, even if
the kernel does not have native support for multi-threaded core files.

So the design goal appears to be slightly different: being able to dump core from within the program itself. On Solaris, I would just fork/exec a copy of gcore(1), or use the (unfortunately private) libproc interface to do so. I find it hard to believe that there are kernels out there without support for multi-threaded core files, though. A quick Google search for ‘gcore linux’ turned up a few mailing list articles. I went and downloaded the latest GDB sources, and sure enough there is a “gcore” command. I went back to our lab machine, tested it out with gdb 5.1, and it worked. But reading back the file was not as successful:

# gdb -p `pgrep nscd`
(gdb) info threads
7 Thread 5126 (LWP 1018)  0x420e7fc2 in accept () from /lib/i686/
6 Thread 4101 (LWP 1017)  0x420e7fc2 in accept () from /lib/i686/
5 Thread 3076 (LWP 1016)  0x420e7fc2 in accept () from /lib/i686/
4 Thread 2051 (LWP 1015)  0x420e0037 in poll () from /lib/i686/
3 Thread 1026 (LWP 1014)  0x420e7fc2 in accept () from /lib/i686/
2 Thread 2049 (LWP 1013)  0x420e0037 in poll () from /lib/i686/
1 Thread 1024 (LWP 1007)  0x420e7fc2 in accept () from /lib/i686/
(gdb) bt
#0  0x420e7fc2 in accept () from /lib/i686/
#1  0x40034603 in accept () from /lib/i686/
#2  0x0804acd5 in geteuid ()
#3  0x4002ffef in pthread_start_thread () from /lib/i686/
(gdb) gcore
Saved corefile core.1014
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
# gdb core.1014.
"/tmp/core.1014": not in executable format: File format not recognized
(gdb) quit
# gdb /usr/sbin/nscd core.1014
Core was generated by `/usr/sbin/nscd'.
Program terminated with signal 17, Child status changed.
#0  0x420e0037 in poll () from /lib/i686/
(gdb) info threads
7 process 67109878  0x420e7fc2 in accept () from /lib/i686/
6 process 134284278  0x420e0037 in poll () from /lib/i686/
5 process 67240950  0x420e7fc2 in accept () from /lib/i686/
4 process 134415350  0x420e7fc2 in accept () from /lib/i686/
3 process 201589750  0x420e7fc2 in accept () from /lib/i686/
2 process 268764150  0x420e7fc2 in accept () from /lib/i686/
* 1 process 335938550  0x420e0037 in poll () from /lib/i686/
(gdb) bt
#0  0x420e0037 in poll () from /lib/i686/
#1  0x0804aca8 in geteuid ()
#2  0x4002ffef in pthread_start_thread () from /lib/i686/
(gdb) quit

This whole exercise was rather distressing, and brought me straight back to college when I had to deal with gdb on a regular basis (Brown moved to Linux my senior year, and Rob and I were responsible for porting the Brown Simulator and Weenix OS from Solaris). Everything seemed fine when first attaching to the process, and the gcore command appeared to work. But when reading back the corefile, gdb can’t understand a lone corefile, the process/thread IDs have been completely garbled, and I’ve lost floating point state (not shown above). It makes me glad that we have MDB, and configurable corefile content in Solaris 10.

This is likely an unfair comparison since it’s using GDB version 5.1, when the latest is 6.3, but at least it validates the existence of the Google library. I always pay attention to debugging tools around the industry, but it seems like I need to get a little more hands-on experience to really gauge the current state of affairs. I’ll have to get access to a system running a more recent version of GDB to see if it is any better before drawing any definitive conclusions. Then again, Solaris has had a working gcore(1) and mdb(1)/adb(1) since the SunOS days back in the 80s, so I don’t see why I should have to lower my expectations just because it’s GNU/Linux.
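
As an aside, the fork/exec approach mentioned above for dumping core from inside a running program might look roughly like the following. This is only a sketch: it is written in Python for brevity, the helper names build_gcore_command and dump_own_core are made up for illustration, and it assumes a gcore on the PATH that takes -o as an output prefix and writes PREFIX.PID (which, as far as I know, holds for both the Solaris gcore(1) and the script shipped with recent GDB).

```python
import os
import shutil
import subprocess


def build_gcore_command(pid, prefix="core"):
    """Build the argv for `gcore -o PREFIX PID` (hypothetical helper)."""
    return ["gcore", "-o", prefix, str(pid)]


def dump_own_core(prefix="core"):
    """Ask an external gcore to snapshot the calling process.

    Returns the expected core file path, or None if gcore is missing,
    fails, or did not leave a file at the assumed PREFIX.PID location.
    """
    if shutil.which("gcore") is None:
        return None
    pid = os.getpid()
    # subprocess.run forks and execs gcore, then waits for it to exit.
    result = subprocess.run(
        build_gcore_command(pid, prefix),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    path = "{}.{}".format(prefix, pid)
    if result.returncode != 0 or not os.path.exists(path):
        return None
    return path
```

Calling dump_own_core("/tmp/core") from the process of interest should leave /tmp/core.<pid> behind if everything cooperates. On Linux the child needs permission to trace its parent, which some security settings restrict, so a None return is best read as "could not dump" rather than as a bug in the caller.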

6 Responses

  1. # gdb core.1014.

    “/tmp/core.1014”: not in executable format: File format not recognized

    I think the correct command to have gdb read a core file was
    gdb /path/to/binary -c
    When you say “gdb corefile” it interprets the “corefile” as an ELF binary which it is not.
    I don’t have a box handy to verify but I am likely correct.

  2. Small correction –
    “When you say “gdb corefile” it interprets the “corefile” as an ELF binary which it is not.”
    – Should be read as –
    “When you say “gdb corefile” gdb tries to interpret the “corefile” as an executable in a platform-specific format (it happens to be an ELF executable on Linux/Solaris) which it is not.”

  3. Parag –
    Thanks for the correction; I figured this out later after the post. But I find it hard to believe that gdb can’t detect that e_type is ET_CORE in the ELF header, and do the logical thing. And if you do run gdb with just a corefile, you lose all your symbolic information, rendering debugging essentially useless.

  4. Red Hat Fedora Core 1

    [jperrie@trogdor jperrie]$ find /usr/bin -name "*core*"

    Red Hat Enterprise Linux 4

    [jperrie@trogdor jperrie]$ find /usr/bin -name "*core*"

  5. Tony –
    Thanks for the pointers. The system I tested was running RHEL3, so obviously some work has been done in this area. Good to see.
    Fazal –
    While the programs target slightly different areas (post mortem/kernel vs. userland development), they do overlap considerably. A transition guide would be pretty straightforward; I’ll see what I can put together.
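
The e_type check suggested in the reply to Parag above can be sketched in a few lines. This is an illustration under stated assumptions, not what gdb does internally: the function names elf_type and is_core_file are hypothetical, and the only facts relied on are the standard ELF header layout (the 4-byte magic, the byte-order flag at e_ident[5], and the 2-byte e_type field at offset 16, where ET_CORE is 4).

```python
import struct

ELF_MAGIC = b"\x7fELF"
ET_NAMES = {0: "ET_NONE", 1: "ET_REL", 2: "ET_EXEC", 3: "ET_DYN", 4: "ET_CORE"}


def elf_type(header):
    """Return the e_type name for the first bytes of a file, or None if not ELF."""
    if len(header) < 18 or header[:4] != ELF_MAGIC:
        return None
    # e_ident[5] (EI_DATA) gives the byte order: 1 = little, 2 = big endian.
    if header[5] == 1:
        fmt = "<H"
    elif header[5] == 2:
        fmt = ">H"
    else:
        return None
    # e_type is the 16-bit field immediately after the 16-byte e_ident array.
    (e_type,) = struct.unpack_from(fmt, header, 16)
    return ET_NAMES.get(e_type, "unknown")


def is_core_file(path):
    """True if the file at `path` has an ELF header with e_type == ET_CORE."""
    with open(path, "rb") as f:
        return elf_type(f.read(18)) == "ET_CORE"
```

A debugger front end could call is_core_file() on its lone argument and, on a match, treat it as a core rather than an executable. That still would not recover the symbol information, which lives in the binary.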
