Eric Schrock's Blog

Month: June 2004

So a few posts ago I asked for some suggestions on improving observability in Solaris, specifically with respect to LSOF. I thought I’d summarize the responses, which fell into two basic groups:

  1. Socket and process visibility. Something along the lines of lsof -i or netstat -p on Linux.
  2. Per-process mpstat, vmstat, and iostat.

I’ll defer the first suggestion for the moment. The second suggestion is straightforward, thanks to the mystical powers of DTrace. As you can see from my previous post, it’s simple to aggregate I/O on a per-process basis. Thanks to the vminfo and sysinfo DTrace providers, we can do the same for most any interesting statistic. The problem with traditional kstats1 is that they present static state after the fact – you cannot tell why or when a counter was incremented. But for every kstat reported by vmstat and mpstat, a DTrace probe exists wherever it’s incremented. Throw in some predicates and aggregations, and we’re talking instant observability.

I envision two forms of these tools. The first, as suggested in previous comments, would present prstat(1)-style output, sorted according to the user’s choice of statistic. This would be aimed at administrators trying to understand systemic problems. The second form would take a pid and show all the relevant statistics for just that process. This would be aimed at developers trying to understand their application’s behavior.

Today, anyone can write D scripts to do this. But there’s something to be said for having a canned tool to jumpstart analysis. It doesn’t have to be too powerful; once you get beyond these basic questions you’ll be needing to write custom D scripts anyway. I’m sure the DTrace team has given this far more thought than I have, but I thought I’d let you know that your comments aren’t descending into some kind of black hole. Blogging provides a unique forum for customer conversations; somewhere between a face to face meeting (which tends to not scale well) and a newsgroup posting (which lacks organization and personal attention). Many thanks to those in Sun who pushed for this new forum, and those of you out there reading and taking advantage of it.

1 The statistics used by these tools are part of the kstat(1M) facility. The kernel provides any number of statistics from every different subsystem, which can be extracted through a library interface and processed by user applications.

In my previous blog post, it was mentioned that it would be great to have a prstat-like tool for showing the most I/O hungry processes. As the poster suggested, this is definitely possible with the new io provider for DTrace. This will be available in the next Solaris Express release; see Adam’s DTrace schedule for more information.

For fun, I hacked together a quick DTrace script as a proof of concept. Five minutes later, I had the following script:

#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN
{
        printf("%-6s  %-20s  %s\n", "PID", "COMMAND", "BYTES/SEC");
        printf("------  --------------------  ---------\n");
        last = timestamp;
}

io:::start
{
        @io[pid, execname] = sum(args[0]->b_bcount);
}

tick-5sec
{
        trunc(@io, 10);
        normalize(@io, (timestamp - last) / 1000000000);
        printa("%-6d  %-20s  %@d\n", @io);
        trunc(@io, 0);
        last = timestamp;
}

This is truly a rough cut, but it will show the top ten processes issuing I/O, summarized every 5 seconds. Here’s a little output right as I kicked off a new build:

# ./iotop.d
PID     COMMAND               BYTES/SEC
------  --------------------  ---------
216693  inetd                 5376
100357  nfsd                  6912
216644  nohup                 7680
216689  make                  8192
0       sched                 14336
216644  nightly               20480
216710  sh                    20992
216651  newtask               46336
216689  make.bin              141824
216710  java                  1107712
216746  sh                    7168
216793  make.bin              8192
216775  make.bin              9625
216781  make.bin              13926
216767  make.bin              14745
216720  ld                    32768
216713  nightly               77004
216768  make.bin              78438
216740  make.bin              174899
216796  make.bin              193740
216893  make.bin              9011
216767  make.bin              9011
216829  make.bin              9830
216872  make.bin              9830
216841  make.bin              19046
216851  make.bin              21504
216907  make.bin              31129
216805  make.bin              54476
216844  make.bin              81920
216796  make.bin              117350

In this case it was no surprise that ‘make.bin’ is doing the most I/O, but things could be more interesting on a larger machine.
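
For readers who don’t speak D yet, the aggregation steps in the script (sum b_bcount by pid and execname, truncate to the top ten, normalize by the interval) can be mimicked in ordinary Python. The event data below is made up purely for illustration:

```python
from collections import defaultdict

def iotop(events, interval_secs, top=10):
    """Mimic the D script: sum byte counts by (pid, execname),
    keep the top entries, and normalize to bytes/sec."""
    total = defaultdict(int)
    for pid, execname, bcount in events:
        total[(pid, execname)] += bcount              # sum(args[0]->b_bcount)
    ranked = sorted(total.items(), key=lambda kv: kv[1])[-top:]   # trunc(@io, 10)
    return [(pid, name, nbytes // interval_secs)      # normalize(@io, interval)
            for (pid, name), nbytes in ranked]

# Hypothetical I/O events observed over a 5-second window.
events = [(100, "make.bin", 8192), (100, "make.bin", 8192), (200, "sh", 4096)]
for pid, name, rate in iotop(events, 5):
    print("%-6d  %-20s  %d" % (pid, name, rate))
```

The D version does all of this in the kernel as the events occur; the Python here only shows the arithmetic.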

While D scripts can answer questions quickly and effectively, you run out of rope pretty quickly when trying to write a general purpose utility. We will be looking into writing a new utility based on the C library interfaces1, where we can support multiple options and output formats, and massage the data into a more concise view. DTrace is still relatively new; in many ways it’s like living your entire life inside a dome before finding the door to the outside. What would your first stop be? We’re not sure ourselves2, so keep the suggestions coming!

1 The lockstat(1M) command is written using the DTrace lockstat provider, for example.

2 One thing that’s for sure is Adam’s work on userland static tracing and the plockstat utility. Stay tuned…

Over the years, we have been keeping an eye on the LSOF (LiSt Open Files) utility written by Vic Abell. It’s a well respected utility that provides indispensable information that cannot be found any other way. We have tried to emulate some aspects of lsof, but we still fall short in several categories. As part of the RAS group, I’m curious to know what observability features our system is lacking. This is a huge open-ended question, so I’m trying to limit the scope by performing a direct comparison to lsof. If you’re not familiar with lsof, then consider it a black box that will answer any question about open files or connections on your system.

We have started to get better at this. The fuser(1M) and pfiles(1) commands are a decent start, especially since pfiles gained path information in Solaris 10. These two utilities still fail to address one major area of concern: systemic problems. We can tell you who has a given file open, and we can tell you which files a process has open, but we can’t tell you who has NFS files open on your system or which processes are bound to a particular port. Running pfiles on every process in the system is not an acceptable solution, if only because it involves stopping every single process on the system. DTrace can identify problems as they occur, but it can’t give you a snapshot of the current state of the system.

I am not looking to clone lsof, but I do think we should be able to answer some of the questions that our customers are asking. This is not the first time this has come up within the Solaris group, but it is something I want to explore as we look beyond Solaris 10. I don’t need to know all of lsof’s features; I can read the manpages easily enough. So my question to you admins and developers out there is this: What specific questions does lsof allow you to answer that you otherwise cannot determine with the stock Solaris tools, and how does it make your life easier?

One of the most powerful but least understood aspects of the Solaris /proc implementation is what’s known as the ‘agent lwp’. The agent is a special thread that can be created on-demand by external processes. There are a few limitations: only one agent thread can exist in a process, the process must be fully stopped, and the agent cannot perform fork(2) or exec(2) calls. See proc(4) for all the gory details. So what’s its purpose?

Consider the pfiles command. It’s pretty easy to get the number of file descriptors for a process, and it’s pretty easy to get their path information (in Solaris 10). But there’s a lot of information that can only be found through stat(2), fcntl(2), or getsockopt(3SOCKET). In this situation, we generally have three choices:

  1. Create a new system call. System calls are fast, but there aren’t many of them, and they’re generally reserved for something reasonably important. Not to mention the duplicated code and hairy locking problems.
  2. Expose the necessary information through /proc. This is marginally better than the above. We still have to write a good chunk of kernel code, but we don’t have to dedicate a system call for it. On the other hand, we have to expose a lot of information through /proc, which means it’s a public interface and will have to be supported for eternity. This is only done when we believe the information is useful to developers at large.
  3. Using the agent lwp, execute the necessary system call in the context of the controlled process.

For debugging utilities and various tools that are not performance critical, we typically opt for the third option above. Using the agent LWP and a borrowed stack, we do the following: first, we reserve enough stack space for all the arguments and throw in the necessary syscall instructions. We use the trace facilities of /proc to set the process running and wait for it to hit the syscall entry point. We then copy in our arguments, and wait for it to hit the syscall exit point. We then extract any altered values that we may need, clean up after ourselves, and get the return value of the system call.
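
The sequence above can be sketched as a toy state machine. To be clear, this is not real /proc or libproc code; the class and its methods below are hypothetical stand-ins that only make the ordering of the steps explicit:

```python
class StoppedProcess:
    """Toy stand-in for a /proc-controlled process. Every method here is a
    hypothetical analogue of a real libproc/proc(4) operation."""
    def __init__(self):
        self.stack = {}        # the agent LWP's borrowed stack region
        self.retval = None

    def remote_syscall(self, sysnum, args):
        # 1. Reserve stack space for the arguments and plant the
        #    syscall instructions on the borrowed stack.
        self.stack["args"] = list(args)
        self.stack["insn"] = ("syscall", sysnum)
        # 2. Set the agent running; wait for the syscall entry stop.
        self.run_until("syscall-entry")
        # 3. Copy in the arguments; wait for the syscall exit stop.
        self.run_until("syscall-exit")
        # 4. Extract the return value and clean up after ourselves.
        self.retval = self.simulate(sysnum, args)
        self.stack.clear()
        return self.retval

    def run_until(self, stop):         # stand-in for /proc run/stop control
        pass

    def simulate(self, sysnum, args):  # pretend the target executed the call
        return 0                       # a successful call returns 0 in this toy
```

In the real implementation each of these steps involves reading and writing the target’s registers and memory through /proc, which is where the 450 lines go.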

If all of this sounds complicated, it’s because it is. When you throw everything into the mix, it’s about 450 lines of code to perform a basic system call, with many subtle factors to consider. To make our lives easier, we have created libproc, which includes a generic function to make any system call. libproc is extremely powerful, and provides many useful functions for dealing with ELF files and the often confusing semantics of /proc. Things like stepping over breakpoints and watchpoints can be extremely tricky when using the raw proc(4) interfaces. Unfortunately, the libproc APIs are private to Sun. Hopefully, one of my first tasks after the Solaris 10 crunch will be to clean up this library and present it to the world.

There are those among us who have other nefarious plans for the agent LWP. It’s a powerful tool that I find interesting (and sometimes scary). Hopefully we can make it more accessible in the near future.

Those of you who have used a UNIX system before are probably familiar with the /proc filesystem. This directory provides a view of processes running on the system. Before getting into the gory details of the Solaris implementation (see proc(4) if you’re curious), I thought I would go over some of the different variants over the years. You’ll have to excuse any inaccuracies presented here; this is a rather quick blog entry that probably doesn’t do the subject justice. Hopefully you’ll be inspired to go investigate some of this on your own.

Eighth Edition UNIX

Tom Killian wrote the first implementation of /proc, explained in his paper1 published in 1984. It was designed to replace the venerable ptrace system call, which until then was used for primitive process tracing. Each process was a file in /proc, allowing the user to read and write directly to the file, rather than using ptrace‘s cumbersome single-byte transfers.


SVR4

The definitive /proc implementation, written by Roger Faulkner2 and Ron Gomes and described in their paper3 published in 1991. This was a port of the Eighth Edition /proc, with some enhancements. /proc was still a flat directory, and each process file supported the read(), write(), and ioctl() interfaces. There were 37 ioctls in total, covering basic process control, signal/fault/syscall tracing, register manipulation, and status information. This created a powerful base for building tools such as ps without needing specialized system calls. Although useful, the interface was not very user friendly, nor particularly extensible. This system was brought into Solaris 2.0 with the move to an SVR4 base.

Solaris 2.6

The birth of the modern Solaris /proc, first conceived in 1992 but not fully implemented until 1996. This represented a massive restructuring of /proc, the most important change being that each pid was now a directory. Each directory was populated with a multitude of files, which removed the need for the ioctl interface: process mappings, open files, and objects can all be examined through calls to readdir() and read(). Each LWP (thread) also has its own directory. The files are all binary files designed to be consumed by programs. The most interesting file is /proc/<pid>/ctl, which provides functionality similar to the old ioctl interfaces. The number of control commands originally stood at 27, and they were much more powerful than their ioctl forebears. Very little has changed since this original implementation; only two new entries have been added to the directory, and only four new commands have been added. The tools built upon this interface, however, have changed dramatically.
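
Since the files under /proc/<pid> are binary structures, consumers read them with a single fixed-layout unpack rather than a text parser. The Python sketch below illustrates the idea with a made-up, drastically simplified psinfo-like record; the real psinfo_t described in proc(4) is far larger:

```python
import struct

# Hypothetical simplified record: pid, ppid, nlwp, then a 16-byte name.
PSINFO_FMT = "<iii16s"

def write_fake_psinfo(pid, ppid, nlwp, fname):
    """Pack a record the way a kernel would expose it as a binary file."""
    return struct.pack(PSINFO_FMT, pid, ppid, nlwp,
                       fname.encode().ljust(16, b"\0"))

def read_fake_psinfo(data):
    """Consume it the way ps(1) or prstat(1) would: one unpack, no parsing."""
    pid, ppid, nlwp, raw = struct.unpack(PSINFO_FMT, data)
    return {"pid": pid, "ppid": ppid, "nlwp": nlwp,
            "fname": raw.rstrip(b"\0").decode()}

blob = write_fake_psinfo(216710, 216689, 1, "make.bin")
print(read_fake_psinfo(blob))
# → {'pid': 216710, 'ppid': 216689, 'nlwp': 1, 'fname': 'make.bin'}
```

The point is that the reader and writer share one structure definition; there is no ambiguity about whitespace, units, or field order.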

4.4 BSD

The BSD kernel implements a version of procfs somewhere between the SVR4 version and the Solaris 2.6 version. Each process has its own directory, but there are only 8 entries in each directory, with the ability to access memory, registers, and current status. The control commands available are fairly primitive, allowing only for attach, detach, step, run, wait, and signal posting. In later derivatives (FreeBSD 4.10), the number of directory entries has expanded slightly to 12, though the control interfaces seem to have remained the same.


Linux

Linux takes a much different approach to /proc. First of all, the Linux /proc contains a number of files and directories that don’t directly relate to processes. Some of these files are migrating to sysfs (/sys) in the 2.6 kernel, but /proc is still a dumping ground for all sorts of device and system-wide files. Secondly, the files are all plaintext, a major departure from the historical use of /proc. A good amount of information is available in the /proc/<pid> directory, but the majority of control is still done through interfaces such as ptrace. I will certainly spend some time seeing how things like gdb and strace interact with processes.

We in Solaris designed /proc as a tool for developers to build innovative solutions, not as an end-user interface. The Linux community believes that ‘cat /proc/self/maps’ is the best user interface, while we believe that pmap(1) is the right answer. The reason is that mdb(1), truss(1), dtrace(1M), and a host of other tools all make use of this same information. It would be a waste of time to take binary information in the kernel, convert it to text, and then have the userland components each write their own (error prone) parsing routines to convert this information back into a custom binary form. Plus, we can change the options and output format of pmap without breaking other applications that depend on the contents of /proc. There are some very interesting ways in which we leverage this information, which I’ll cover in future posts.
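
To make the trade-off concrete, here is the sort of ad-hoc parser that every consumer of a plaintext /proc ends up writing. The sample input below merely mimics the ‘Name: value’ layout of a Linux /proc/<pid>/status file:

```python
def parse_status(text):
    """Parse 'Name:\tvalue' lines into a dict: the text-to-binary round
    trip that every consumer of a plaintext /proc must repeat."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue          # skip anything that doesn't fit the pattern
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

# A made-up excerpt in the style of Linux /proc/<pid>/status.
sample = "Name:\tmake.bin\nPid:\t216710\nVmSize:\t  1024 kB\n"
status = parse_status(sample)
print(status["Name"], status["Pid"], status["VmSize"])
# → make.bin 216710 1024 kB
```

Note that even this toy has to make judgment calls (what to do with malformed lines, embedded colons, units like ‘kB’), and every tool makes them slightly differently.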

1 T. J. Killian. Processes as Files. Proceedings of the USENIX Software Tools Users Group Summer Conference, pp 203-207, June 1984.

2 Roger Faulkner was then at Sun, and continues to work here in a one-man race to be the oldest kernel hacker on the planet. These days you can find him in Michigan, pounding away at the amd64 port.

3 R. Faulkner and R. Gomes. The Process File System and Process Model in UNIX System V. USENIX Conference Proceedings. Dallas, Texas. January 1991.

All you developers out there are probably well acquainted with corefiles. Every CS student has had a program try to dereference a NULL pointer at least once. Frequent use of assert(3c) causes your program to abort(3c) in exceptional circumstances. Unfortunately, too many developers know little else about corefiles. They are often used for nothing more than a stack backtrace or a quick session with dbx. We in the Solaris group take corefiles very seriously – they get the same amount of attention as crash dumps. Over the past few releases, we’ve added some great features to Solaris relating to corefiles that not everyone may be familiar with, including some really great stuff in Solaris 10. Here is a short list of some of the things that can make your life easier as a developer, especially when servicing problems from the field.


gcore(1)

The gcore(1) command will generate a corefile from a running program – essentially a snapshot of the process at that point in time. The process continues on as if nothing had happened, which is crucial when an app is misbehaving in a non-fatal way and you don’t want to resort to SIGABRT. Rather than trying to reproduce the problem or get access to the system while it is running, the customer can simply gcore the process and forward the corefile.


coreadm(1M)

This is a command that system administrators and developers alike should be familiar with. On a non-development server, processes should never dump core. Unless it’s intentional (like sending SIGABRT or mucking about in /proc), every corefile produced is a bug. Admins can log all corefiles to a central location, so they know whom to blame when something goes wrong (usually us). Developers can exercise fine-grained control over the content of the corefile and where it gets saved. Having dozens of files named ‘core’ scattered across every directory usually isn’t the most helpful thing in the world.

corefile content

Starting in Solaris 10, we now have fine-grained control over the exact content of every corefile generated on the system (many thanks to Adam Leventhal for this). Read up on coreadm(1M) for all the gory details, but the most important change is that library text segments are now included in the corefile. It used to be that if you got a core from a customer, you would need to find a matching version of every library they linked to in order to decipher what was going on. This made debugging complicated customer problems extremely difficult.

CTF data for libraries

We have supported a special form of debugging information known as CTF (“Compact Type Format”) in the kernel since Solaris 9. We take the debugging information generated by the ‘-g’ compiler flag, and strip out everything but the type information. It is stored in a compact format so it is suitable for shipping with production binaries. This information is enormously useful, so we added userland support for it in Solaris 10. MDB consumes this information, so you can do interesting things like ::print a socket structure from a core generated by your production app. Unfortunately, the tools used to convert this information from STABS are not publicly available yet, so you cannot add CTF data to your own application. We’re working on it.
