Last week I announced our bootchart results. Dan followed up with a sample zone boot, as well as some interesting bugs that have already been found thanks to this tool. While we’re working on getting the software released, I thought I’d go into some of the DTrace implementation details.
To begin with, we were faced with the annoying task of creating a parsable log file. After looking at the existing implementation (which parses top output), and after a fair amount of groaning, Dan suggested that we output XML data and leverage the existing Java APIs to make our lives easier. Faced with the marriage of something as “low-level” as DTrace and something as “abstract” as XML, my first reaction was one of revulsion and guilt. But we quickly realized this was by far the best solution: our resulting parser was 230 lines of code, compared with 670 for the set of parsers that make up the open source version.
Once we settled on an output format, we had to determine exactly what we would be tracing, and exactly how to trace it. First off, we had to trace process lifetime events (fork, exec, exit, etc.). With the top-based implementation, we could catch neither exact event times nor short-lived processes that begin and end within a single sample period. So we used the following D probes (a sketch of the resulting script follows the list):
- proc:::create – Fires when a new process is created. We log the parent PID, as well as the new child PID.
- proc:::exec-success – Fires when a process calls exec(2) successfully. We log the new process name, so that we can convert between PIDs and process names at any future point.
- proc:::exit – Fires when a process exits. We log the current PID.
- exec_init:entry – This one is a little subtle. Due to the way in which init(1M) is started, we don’t get a traditional proc:::create probe. So we have to use FBT and catch calls to exec_init(), which is responsible for spawning init.
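Put together, a stripped-down version of the lifetime-tracing script looks something like the sketch below. Note that the XML element and attribute names here are invented for illustration; the actual log format we settled on differs.

#!/usr/sbin/dtrace -s

#pragma D option quiet

/* A new process has been created; log the parent and child PIDs */
proc:::create
{
        printf("<create time='%d' ppid='%d' pid='%d'/>\n",
            timestamp, pid, args[0]->pr_pid);
}

/* A process has successfully called exec(2); log its new name */
proc:::exec-success
{
        printf("<exec time='%d' pid='%d' name='%s'/>\n",
            timestamp, pid, execname);
}

/* A process has exited; log the current PID */
proc:::exit
{
        printf("<exit time='%d' pid='%d'/>\n", timestamp, pid);
}

/* init(1M) is spawned via exec_init(), so catch it with FBT */
fbt::exec_init:entry
{
        printf("<init time='%d'/>\n", timestamp);
}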
This was the easy part. The harder part was gathering usage statistics on a regular basis. The approach we used leveraged the following probes (see the sketch after the list):
- sched:::on-cpu, sched:::off-cpu – Fire when a thread goes on or off CPU. We track how long each process spends on CPU and accumulate the total in a sum() aggregation.
- profile:::tick-200ms – Fires on a single CPU every 200 milliseconds. We use printa() to dump the contents of the CPU aggregation on every interval.
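Here is a rough sketch of what the sampling half looks like; again, the XML output format shown is invented for illustration:

#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Remember when each thread last came on CPU */
sched:::on-cpu
{
        self->on = timestamp;
}

/* On the way off CPU, add the elapsed time to the per-process total */
sched:::off-cpu
/self->on/
{
        @cpu[pid] = sum(timestamp - self->on);
        self->on = 0;
}

/* Every 200ms, dump the totals and start a fresh sample */
profile:::tick-200ms
{
        printa("<sample pid='%d' cpu='%@d'/>\n", @cpu);
        clear(@cpu);
}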
There were several wrinkles in this plan. First of all, printa() is processed entirely in userland. Given the following script:
#!/usr/sbin/dtrace -s

profile:::tick-10ms
{
        @count["count"] = count();
}

profile:::tick-200ms
{
        printa(@count);
        clear(@count);
}
One would expect to see five consecutive outputs of “20” every second, since tick-10ms fires 20 times within each 200 millisecond interval. Instead, you see one output of “100” and four more of “0”. Because the default switchrate for DTrace is one second, and printa() requests are processed by the dtrace(1M) consumer, we only see the aggregation dumped once a second. This can be fixed by decreasing the switchrate tunable. It also means we can’t make use of printa() during anonymous tracing, when no consumer is running at all, so we had to have two separate scripts (one for early boot, one for later).
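In our case that meant adding a line like this to the script, dropping the switch interval to match the 200 millisecond sampling period:

/* Consume buffers every 200ms instead of the default one second */
#pragma D option switchrate=200ms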
The results are reasonable, but Ziga (the original author of bootchart) suggested a much more clever way of keeping track of samples. Instead of relying on printa(), we key the aggregation on a “sample number” (the current time divided by a large constant), and then dump the entire aggregation once at the end of boot. Provided the amount of data isn’t too large, the entire thing can be run anonymously, and we don’t have the overhead of DTrace waking up every 10 milliseconds (in the realtime class, no less) to spit out data. We’ll likely try this approach in the future.
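A minimal sketch of that idea, assuming a 200 millisecond sample period (the constant and the key layout here are my guesses, not Ziga’s actual code):

#!/usr/sbin/dtrace -s

sched:::on-cpu
{
        self->on = timestamp;
}

/* Key each sum by process and by sample number; no printa() needed */
sched:::off-cpu
/self->on/
{
        @cpu[pid, timestamp / 200000000] = sum(timestamp - self->on);
        self->on = 0;
}

/* Dump the whole aggregation once, when tracing stops at end of boot */
dtrace:::END
{
        printa("<sample pid='%d' n='%d' cpu='%@d'/>\n", @cpu);
}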
There’s more to be said, but I’ll leave that to be continued later by Dan or myself. In the meantime, you can check out a sample logfile produced by the D script.