Eric Schrock's Blog

Month: August 2004

I’ve been missing almost a week, mostly because of my involvement with the amd64 bringup effort. A while ago, I was recruited to get the ptools and mdb up and running in 64-bit mode. This certainly made me appreciate some of the old war stories – all the Solaris veterans have their favorite bug that they debugged using only hex dumps, a pocket knife, and a ball of string. Over time, you start taking the Solaris debugging tools for granted: try going back to Solaris 9 after spending a year with DTrace1. I was in for quite a shock when I learned that my spoiled lifestyle wasn’t going to cut it in the jungles of amd64.

It’s no secret that we’ve had the amd64 kernel up and running for a while now. Thankfully, I was not part of the initial bringup effort. Back when I joined, the kernel was already booting multiuser, and I never had to lay my finger on a simulator or diagnose a double fault. 64-bit applications would load and run (thanks in part to a certain linker alien2), but debugging them was basically impossible: no truss, no mdb, no pstack. So where do you begin?

Thankfully, we’ve had a 64-bit OS for years, and most of the infrastructure was already working. All our tools worked with 64-bit ELF files out of the box, for example. But a lot of things were still broken. I ended up along roughly the following path:

pstack on corefiles

So pstack segfaulted the first time I ran it. At this point I could run elfdump on the corefile, but not much else. The first task was getting pstack to run on corefiles, so I at least knew where to begin inserting my printf() statements. Walking a stack on amd64 can be a tricky thing – so I began with a simple version that works 99% of the time.
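To make the "simple version" concrete, here is a minimal sketch of the usual frame-pointer walk. It is not the pstack code, and the post doesn't say which approach was actually used. It assumes each frame stores the caller's frame pointer followed by the return address, which only holds for code compiled with frame pointers; the amd64 ABI lets compilers omit them, which is one way a walker like this ends up right most of the time rather than all of the time. The names `read_word` and `walk_stack`, and the flat `image` buffer standing in for the stack segment of a corefile, are made up for illustration.

```python
import struct

WORD = 8  # bytes per word on a 64-bit target


def read_word(image, base, addr):
    """Read one little-endian 64-bit word at virtual address addr.

    image is a bytes object holding memory starting at address base.
    Returns None if addr falls outside the image.
    """
    off = addr - base
    if off < 0 or off + WORD > len(image):
        return None
    return struct.unpack_from("<Q", image, off)[0]


def walk_stack(image, base, pc, fp, max_frames=64):
    """Return the program counters found by following saved frame pointers.

    Assumed frame layout: [fp] = caller's frame pointer,
    [fp + 8] = return address into the caller.
    """
    pcs = [pc]
    while fp != 0 and len(pcs) < max_frames:
        saved_fp = read_word(image, base, fp)
        ret_addr = read_word(image, base, fp + WORD)
        if saved_fp is None or ret_addr is None:
            break  # frame pointer leads outside the memory we have
        pcs.append(ret_addr)
        # The stack grows down, so a caller's frame must sit at a higher
        # address. Anything else means a corrupt or frameless chain.
        if saved_fp != 0 and saved_fp <= fp:
            break
        fp = saved_fp
    return pcs


# Hypothetical three-frame stack starting at address 0x1000.
image = struct.pack("<6Q",
                    0x1010, 0x400100,   # frame at 0x1000
                    0x1020, 0x400200,   # frame at 0x1010
                    0x0,    0x400300)   # outermost frame at 0x1020
frames = walk_stack(image, 0x1000, pc=0x400050, fp=0x1000)
print([hex(p) for p in frames])
# prints ['0x400050', '0x400100', '0x400200', '0x400300']
```

The two bail-out checks (an unmapped address, or a frame pointer that fails to move up the stack) are what keep a walker like this from looping or crashing on the cases it gets wrong.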

mdb for corefiles

The next step was to get MDB chewing on these corefiles. A stacktrace is all well and good, but we need to be able to examine registers and memory. This turned out to be quite a bit of work; mdb is quite a heavy consumer of libproc, and uses some little-used interfaces in libc (in particular, getcontext(2) and makecontext(3c) were annoying). But with a lot of printfs, a few fixes and a few hacks, we had post mortem debugging.


truss

Sadly, I can’t take credit for this one. This turned out to be just a bug in fork(2), and once that was fixed, truss worked flawlessly.

mdb for live processes

This was not too difficult thanks to the magic of libproc, which allows us to manipulate live processes and corefiles through the same interface. A few minor tweaks were needed here and there, and some of the finer bugs have yet to be fixed, but it’s basically working. Most of the ISA-specific actions (such as setting breakpoints) are the same on ia32 and amd64.

agent LWP and pfiles

Finally, I had to get Psyscall (the libproc internal function that executes a system call in the context of a target process) working. This was particularly annoying, mostly because the code was poorly structured – rather than having separate ISA-specific actions in different files, we had tons of #ifdefs scattered throughout the code. A large part of this was just ripping apart the code and restructuring it in a way that made porting easier. Someday when someone ports Solaris to run on Adam’s laptop, they’ll appreciate it.

In a testament to the portability of Solaris, there were no large infrastructure changes outside of Psyscall. Basically, I just fixed one small bug after another. So all the debugging tools are now up and running, and with Bryan and Matt helping, we have DTrace and KMDB as well. So now I can go back to a pampered life in my Hollywood Hills mansion, surrounded by DTrace, MDB, and a few of my closest ptools.

1 Solaris debugging can be roughly divided into three eras: pre-mdb (Paleozoic), pre-DTrace (Mesozoic), and modern day (Cenozoic). The arrival of CTF data could be seen as the end of the Triassic period and the beginning of the Jurassic, while KMDB may begin the Pleistocene (a.k.a. modern) era. Sounds like an interesting science project…

2 There were many others involved in getting the kernel this far. But Mike’s the only one with a blog, so he gets all the credit.

The other day on vacation, I ran across a Slashdot article on UNIX branding and GNU/Linux. The original article was mildly interesting, to the point where I actually bothered to read the comments. Now, long ago I learned that 99% of Slashdot comments are worthless. Very rarely do you find thoughtful and objective comments; even browsing at +5 can be hazardous to your health. Even so, I managed to find this comment, which contained in part some perspective relating to my previous post on Linux innovation:

I have been saying that for several years now. UNIX is all but dead. The only commercial UNIX likely to still be arround in ten years time as an ongoing product is OS/X. Solaris will have long since joined IRIX, Digital UNIX and VMS as O/S you can still buy and occasionaly see a minor upgrade for it.

There is a basic set of core functions that O/S do and this has not changed in principle for over a decade. Log based file systems, threads that work etc are now standard, but none of this was new ten years ago.

The interesting stuff all takes place either above or below the O/S layer. .NET, J2EE etc are where interesting stuff is happening.

Clearly, this person is not the sharpest tool in the shed when it comes to Operating Systems. But it begs the question: How widespread is this point of view? We love innovation, and it shows in Solaris 10. We have yet to demo Solaris 10 to a customer without them being completely blown away by at least one of the many features. DTrace, Zones, FMA, SMF, and ZFS are but a few reasons why Solaris won’t have “joined IRIX, Digital UNIX, and VMS” in a few years.

Part of this is simply that people have yet to experience real OS innovation such as that found in Solaris 10. But more likely this is just a fundamental disconnection between OS developers and end users. If I managed to get my mom and dad running Java Desktop System on Solaris 10, they would never know what DTrace, Zones, or ZFS is, simply because it’s not visible to the end user. But this doesn’t mean that it isn’t worthwhile innovation: as with all layered software, our innovations directly influence our immediate consumers. Solaris 10 is wildly popular among our customers, who are mostly admins and developers, with some “power users”. Even though these people are a quantitatively small portion of our user base, they are arguably the most important. OS innovation directly influences the quality of life and efficiency of developers and admins, which has a cascading effect on the rest of the software stack.

This cascading influence tends to be ignored in arguments over the commoditization of the OS. If you stand at any given layer, you can make a reasonable argument that the software two layers beneath you has become a commodity. JVM developers will argue that hardware is a commodity, while J2EE developers will argue that the OS is a commodity. Even if you’re out surfing the web and use a web service developed on J2EE, you’re implicitly relying on innovation that has its roots at the OS. Of course, the further you go from the OS the less prominent the influence is, but it’s still there.

So think twice before declaring the OS irrelevant. Even if you don’t use features directly provided by the OS, your quality of life has been improved by having them available to those that do use them.

More Solaris engineers are joining the blogging community every day. Recently we’ve been joined by Resource Management guru Andrei Dorofeev, Service Management Facility guru Liane Praza, and ZFS guru Matt Ahrens. Be sure to check out their blogs, as well as the other blogs listed at right; you’re bound to find some cool stuff and interesting discussions.

Update: Dilpreet Bindra became the first member of team FMA (a.k.a. Predictive Self Healing) to join the blogging effort.
