A little while ago I mentioned the cyclic subsystem. This is an interesting little area of Solaris, written by Bryan back in Solaris 8. It is the heart of all timer activity in the Solaris kernel.
The majority of this blog post comes straight from the source. Those of you inside Sun or part of the OpenSolaris pilot should check out usr/src/uts/common/os/cyclic.c. A precursor to the infamous sys/dtrace.h, this source file has 1258 lines of source code, and 1638 lines of comments. I’m going to briefly touch on the high-level aspects of the system; but as you can imagine, it’s quite complicated.
Traditional methods
Historically, operating systems have relied a regular clock interrupt. This is different from the clock frequency of the chip – the clock interrupt typically fires every 10ms. All regular kernel activity was scheduled around this omnipresent clock. One of these activities would be to check if there any expired timeouts that need to be triggered.
This granular frequency is usually enough for average activities, but can kill realtime applications that require high-precision timing becaus it forces timers to align on these artificial boundaries. For example, imagine we need a timer to fire every 13 milliseconds. Rather than having the timers fire at 13, 26, 39, and 52 ms, we would instead see it fire at 20, 30, 40, and 60 ms. This is clealy not what we wanted. The result is known as “jitter” – timing deviations from the ideal. Timing granuality, scheduling artifacts, system load, and interrupts all introduce arbitrary latency into the system. By using existing Solaris mechanisms (processor binding, realime scheduling class) we could eliminate much of the latency, but we were still stuck with the granularity of the system clock. The frequency could be tuned up, but this would also increase the time spent doing other clock activity (such as process accounting), and induce significant load on the system.
Cyclic subsystem basics
Enter the cyclic subsystem. It provides for highly accurate interval timers. The key feature is that it is based on programmable timestamp counters, which have been available for many years. In particular, these counters can be programmed (quickly) to generate an interrupt at arbitrary and accurate intervals. Originally available only for SPARC, x86 support (based on programmable APICs) is now available in Solaris 10.
The majority of the kernel sees a very simple interface – you can add, remove, or bind cyclics. Internally, we keep around a heap of cyclics, organized by expiration time. This internal interface connects to a hardware-specific backend. We pick off the next cyclic to process, and then program the hardware to notify us after the next interval. This basic layout can be seen on any system with the ::cycinfo -v dcmd:
# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp uhci usba nca crypto random lofs nfs ptm ipc ]
> ::cycinfo -v
CPU CYC_CPU STATE NELEMS ROOT FIRE HANDLER
0 d2da0140 online 4 d2da00c0 50d7c1308c180 apic_redistribute_compute
1
|
+------------------+------------------+
0 3
| |
+---------+--------+ +---------+---------+
2
|
+----+----+
ADDR NDX HEAP LEVL PEND FIRE USECINT HANDLER
d2da00c0 0 1 high 0 50d7c1308c180 10000 cbe_hres_tick
d2da00e0 1 0 low 0 50d7c1308c180 10000 apic_redistribute_compute
d2da0100 2 3 lock 0 50d7c1308c180 10000 clock
d2da0120 3 2 high 0 50d7c35024400 1000000 deadman
>
On this system there are no realtime timers active, so the intervals (USECINT) are pretty boring. You may notice one elegant feature of this implementation – the clock() function is now just a cyclic consumer. If you’re wondering what ‘deadman’ is, and why it has such a high interval, it’s a debugging feature that saves the system from hanging indefinitely (most of the of the time). Turn it on by adding ‘set snooping = 1’ in /etc/system. If the clock cannot make forward progress in 50 seconds, a high level cyclic will fire and we’ll panic.
To register your own cyclic, use the timer_create(3RT) function with the CLOCK_HIGHRES type (assuming you have the PROC_CLOCK_HIGHRES privilege). This will create a low level cyclic with the appropriate timeout. The average latency is extremely small when done properly (bound to a CPU with interrupts disabled) – on the order of a few microseconds on modern hardware. Much better than the 10 millisecond artifacts possible with clock-based callouts.
More details
At a high level, this seems pretty straightforward. Once you figure out how to program the hardware, just toss some function pointers into an AVL tree and be done with it, right? Here are some of the significant wrinkles in this plan:
-
Fast operation – Because we’re dispatching real-time timers, we need to be able to trigger and re-schedule cyclics extremely quickly. In order to do this, we make use of per-CPU data structures, heap management that touches a minimal number of cache lines, and lock-free operation. The latter point is particularly difficult, considering the presence of low-level cyclics.
-
Low level cyclics – The cylic subsystem operates at a high interrupt level. But not all cyclics should run at such a high level (and very few do). In order to support low level cyclics, the subsystem will post a software interrupt to deal with the cyclic at a lower level interrupt. This opens up a whole can of worms, because we have to guarantee a 1-to-1 mapping, as well as maintain timing constraints.
-
Cyclic removal – While rare, it is occasionally necessary to remove pending cyclics (the most common occurence is when unloading modules with registered cyclics). This has to be done without disturbing the other running cyclics.
-
Resource resizing – The heap, as well as internal buffers used for pending lowlevel cyclics, must be able to handle any number of active cyclics. This means that they have to be resizable, while maintaining lock-free operation in the common path.
-
Cyclic jugging – In order to offline CPUs, we must be able to re-schedule cyclics on other active CPUs, without missing a timeout in the process.
As you can see, the cyclic subsystem is a complicated but well-contained subsystem. It uses a well-organized layout to expose a simple interface to the rest of the kernel, and provides great benefit to both in-kernel consumers and timing-sensitive realtime applications.