With the tidal wave of features that is Solaris 10, it’s all too easy to miss the less visible enhancements. Once of these “lost features” is microstate accounting turned on by default for all processes. Process accounting keeps track of time spent by each thread in various states – idle, in system mode, in user mode, etc. This is used for calculating the usage percentages you see in prstat(1) as well as time(1) and ptime(1). Historically there have been two ways of doing process accounting:
Virtually every operating system in existence has a clock function – a periodic interrupt used to perform regular system activity1. On Solaris, the clock function does a number of simple tasks, mostly having to do with process and CPU accounting. Clock based accounting examines each CPU, and increments per-process statistics depending on whether the current thread is in user mode, system mode, or idle. This uses statistical sampling (taking a snapshot every 10 milliseconds) to come up with a picture of system resource usage.
With microstate accouning, the counters are updated with each state transition as they occur in real time. This results in much more accurate, if more expensive, accounting information. Previously, you could turn this on through the /proc interfaces (proc(4)) using the PR_MSACCT control. This has since become a no-op; you cann’t disable microstate accounting in Solaris 10.
The clock-based sampling method has several drawbacks. One is that it is only a guess for rapidly transitioning threads. If you’re going between system and user mode faster than once every 10 milliseconds, the clock will only account for a single state during this time period. Even worse is the fact that it can miss threads entirely if it catches a CPU in an idle state. If you have a thread that wakes up every 10 milliseconds and does 9 milliseconds of work before going to sleep, you will never account for the fact that it is using 90% of the CPU2.
These ‘phantom’ threads can hide from traditional process tools. The now-infamous gtik2_applet2 outlined in the DTrace USENIX paper is a good example of this. These inaccuracies also affect the fair share scheduler (FSS(7)), which must make resource allocation decisions based on the accuracy of these accounting statistics.
With microstate accounting, all of these artifacts disappear. Since we do the calculations in situ, it’s impossible to “miss” a transition. The reason we haven’t always relied on microstate accounting is that it used to be a fairly expensive process. Threads transition between states with high frequency, and with each transition we need to get a timestamp to determine how long we spent in the last state.
The key obversation that made this practical is that the expensive part is not reading the hardware timestamp. The expensive part is the conversion to nanoseconds – clock ticks are useless when doing any sort of real accounting. The solution was to use clock ticks for all the internal accounting, and only do the conversion to nanoseconds when actually requested. With a few other tricks, we were able to reduce the impact to virtually nil. You can see the difference in micro-benchmark performance of short system calls (such as getpid(2)), but it’s unnoticeable in any normal system call (like read(2)), and nonexistent in any macro benchmark.
All the great benefits of microstate acocunting at a fraction of the price.
1 The clock function is actually a consumer of the cyclic subsystem – a nice callout mechanism written by Bryan back in Solaris 8. Certainly worth a post of its own in the future.
2This is especially true on x86, because historically the hardware clock has been the source of all timer activity. This means every thread ran lockstep with the clock with respect to timers and wakeups. This has been addressed in Solaris 10 with the introduction of arbitrary precision timer interrupts, which makes the cyclic subsystem far more interesting on x86 machines (Solaris SPARC has had this functionality for a while).