Eric Schrock's Blog

Month: January 2005

Finally, Solaris 10 has been released! Go download it at:

For a little trivia: the first Solaris 10 putback was on January 22, 2002, so you can imagine how good it feels to see this out the door after three years of hard work. I’ve only been involved for the last year and a half, and I’m overwhelmed. Kudos to the entire Solaris team for putting together such an amazing OS.

P.S. Now that things have settled down a bit, and OpenSolaris is revving up, I
promise (once again) to increase my blog output.

To go along with today’s announcement (as well as associated press), we also recently provided buildable source to the OpenSolaris pilot participants. The ensuing “build race” has produced a number of very happy people, both inside and outside of Sun.

Check out Ben’s blog for a screenshot, as well as Dennis’s Blog. Dennis also put the screenshot front and center at blastwave. Very cool stuff!

More info will certainly be forthcoming, so check out the full list of OpenSolaris blogs. It’s going to be a fun year.

Last week I announced our bootchart results. Dan continued with a sample of zone boot, as well as some interesting bugs that have been found already thanks to this tool. While we’re working on getting the software released, I thought I’d go into some of the DTrace implementation.

To begin with, we were faced with the annoying task of creating a parsable log file. After looking at the existing implementation (which parses top output), and a fair amount of groaning, Dan suggested that we output XML data and leverage the existing Java APIs to make our lives easier. Faced with the marriage of something as “low-level” as DTrace and something as “abstract” as XML, my first reaction was one of revulsion and guilt. But we quickly realized this was by far the best solution. Our resulting parser was 230 lines of code, compared with 670 for the set of parsers that make up the open source version.
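For a sense of what such a log might look like, here is a hypothetical sketch; the element and attribute names are my own illustrative assumptions, not the actual schema:

```xml
<!-- Hypothetical event log; names are illustrative, not the real format. -->
<bootchart>
  <event type="fork" time="1043" ppid="1" pid="107"/>
  <event type="exec" time="1045" pid="107" name="svc.startd"/>
  <sample time="1200">
    <cpu pid="107" ns="1830000"/>
  </sample>
  <event type="exit" time="4310" pid="107"/>
</bootchart>
```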

Once we settled on an output format, we had to determine exactly what we would be tracing, and exactly how to do it. First off, we had to trace process lifetime events (fork, exec, exit, etc). With the top implementation, we cannot catch exact event times, nor can we catch short-lived processes which begin and end within a sample period. So we have the following D probes:

  • proc:::create – Fires when a new process is created. We log the parent PID, as well as the new child PID.
  • proc:::exec-success – Fires when a process calls exec(2) successfully. We log the new process name, so that we can convert between PIDs and process names at any future point.
  • proc:::exit – Fires when a process exits. We log the current PID.
  • exec_init:entry – This one is a little subtle. Due to the way in which init(1M) is started, we don’t get a traditional proc:::create probe. So we have to use FBT and catch calls to exec_init(), which is responsible for spawning init.
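The lifetime-tracing clauses above can be sketched roughly as follows. This is a minimal illustration, not the actual script; the real version emits timestamped XML records rather than plain printf() lines:

```d
#!/usr/sbin/dtrace -s

/* New process: log the parent PID and the new child PID. */
proc:::create
{
	printf("fork %d %d\n", pid, args[0]->pr_pid);
}

/* Successful exec(2): map the PID to its new process name. */
proc:::exec-success
{
	printf("exec %d %s\n", pid, execname);
}

/* Process exit: log the current PID. */
proc:::exit
{
	printf("exit %d\n", pid);
}

/* init(1M) is spawned via exec_init(), so catch it with FBT. */
fbt::exec_init:entry
{
	printf("init starting\n");
}
```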

This was the easy part. The harder part was to gather usage statistics on a regular basis. The approach we used leveraged the following probes:

  • sched:::on-cpu, sched:::off-cpu – Fire when a thread goes on or off CPU. We keep track of the time each thread spends on CPU and accumulate it into an aggregation with sum().
  • profile:::tick-200ms – Fires on a single CPU every 200 milliseconds. We use printa() to dump the contents of the CPU aggregation on every interval.
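A minimal sketch of the sampling side might look like this (illustrative only; the real script writes XML, and the exact aggregation keys here are assumptions):

```d
#!/usr/sbin/dtrace -s

/* Record when a thread comes on CPU. */
sched:::on-cpu
{
	self->ts = timestamp;
}

/* On the way off, add the elapsed time to a per-process total. */
sched:::off-cpu
/self->ts/
{
	@cpu[pid, execname] = sum(timestamp - self->ts);
	self->ts = 0;
}

/* Every 200ms, dump the totals and reset them. */
profile:::tick-200ms
{
	printa("%d %s %@d\n", @cpu);
	clear(@cpu);
}
```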

There were several wrinkles in this plan. First of all, printa() is processed entirely in userland. Given the following script:

#!/usr/sbin/dtrace -s

tick-10ms { @count["count"] = count(); }      /* fires 100 times a second */
tick-200ms { printa(@count); clear(@count); } /* print and reset every 200ms */

One would expect to see five consecutive outputs of “20”. Instead, you see one output of “100” and four more of “0”. Because the default switchrate for DTrace is one second, and aggregations are processed by the dtrace(1M) process, we only see the aggregation data once a second. This can be fixed by decreasing the switchrate tunable. Since printa() is handled by the userland consumer, we also can’t make use of it during anonymous tracing, so we had to have two separate scripts (one for early boot, one for later).
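Lowering the switchrate is a one-line pragma at the top of the script; the 10hz value here is just an example rate:

```d
#pragma D option switchrate=10hz
```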

The results are reasonable, but Ziga (the original author of bootchart) suggested a much more clever way of keeping track of samples. Instead of relying on printa(), we key the aggregation based on “sample number” (time divided by a large constant), and then dump the entire aggregation at the end of boot. Provided the amount of data isn’t too large, the entire thing can be run anonymously, and we don’t have the overhead of DTrace waking up every 10 milliseconds (in the realtime class, no less) to spit out data. We’ll likely try this approach in the future.
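That keying scheme might be sketched as follows; the details (dividing timestamp by 200000000 to get 200ms buckets) are my assumption about how it would look, not a description of the final script:

```d
#!/usr/sbin/dtrace -s

sched:::on-cpu
{
	self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
	/* Key by sample number: timestamp / 200ms yields the bucket index. */
	@cpu[pid, execname, timestamp / 200000000] = sum(timestamp - self->ts);
	self->ts = 0;
}

/* Dump the whole aggregation once, at the end of boot. */
dtrace:::END
{
	printa("%d %s %d %@d\n", @cpu);
}
```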

There’s more to be said, but I’ll leave this post to be continued later by myself or Dan. In the meantime, you can check out a sample logfile produced by the D script.

I’ve been on vacation for a while, but last time I mentioned that Dan and I had been working on a Solaris port of the open source bootchart program. After about a week of late-night hacking, we had a new version (we kept only the image renderer). You can see the results (running on my 2×2 GHz Opteron desktop) by clicking on the thumbnail below:

In the next few posts, I’ll go over some of the implementation details. We are working on open sourcing the code, but in the meantime I can talk about the instrumentation methodology and some of the hurdles that had to be overcome. A few comparisons with the existing open source implementation:

  • Our graphs show every process used during boot. Unlike the open implementation, which relies on top, we can leverage DTrace to catch every process.

  • We don’t have any I/O statistics. Part of this is due to our instrumentation methodology, and part of it is because the system-wide I/O wait statistic has been eliminated from S10 (it was never very useful or accurate). Since we can do basically anything with DTrace, we hope to include per-process I/O statistics at a future date, as well as duplicating the iostat graph with a DTrace equivalent.

  • We include absolute boot time, and show the beginning of init(1M) and those processes that run before our instrumentation can start. So if you wish to compare the above chart with the open implementation, you will have to subtract approximately 9 seconds to get a comparable time.

  • We chose an “up” system as being one where all SMF services were running. It’s quite possible to log into a system well before this point, however. In the above graph, for example, one could log in locally on the console after about 20 seconds, and log in graphically after about 30 seconds.

  • We cleaned up the graph a little, altering colors and removing horizontal guidelines. We think it’s more readable (given the large number of processes), but your opinion may differ.

Stay tuned for more details.
