Eric Schrock's Blog

Month: November 2004

As Alan points out, Jonathan had a few comments about OpenSolaris in an interview in ComputerWorld. One interesting bit is an “official” timeline:

We will have the license announced by the end of this calendar year and the code fully available [by the] first quarter of next year.

I can tell you the OpenSolaris team is working like gangbusters to make this a reality. As soon as there is an exact date, you can bet that we’ll be making some noise. Jonathan also made a statement that seems to directly conflict with my blog post yesterday:

Is there anything preventing you from making all of Solaris open-source? Nothing at all. And let me repeat that. Nothing at all.

That’s a little different than my statement that “there are pieces of Solaris that cannot be open sourced due to encumbered licenses.” The point here is that there are no wide-ranging problems (patents, System V code, SCO, Novell) preventing us from opening all of Solaris. Nor is there any internal pressure to keep some parts of Solaris secret. The “pieces” that I mentioned are few and far between – a single file, a command, or maybe a driver. We’re also securing rights to these pieces on a monthly basis, so the number keeps dropping (and may reach zero by the time OpenSolaris goes live).

The other point is that we won’t be releasing a crippled system. Even if some pieces are not immediately available in source form, we will make sure that anyone can build the same version of Solaris that we do. We hope that these pieces can be re-written and released under the OpenSolaris license as quickly as possible.

So Novell has made a vague threat of litigation over OpenSolaris, which was promptly spun by the media into a declaration of war by Novell. But it has generated quite a bit of discussion over at OSNews. The claim itself is largely FUD (the “article” is little more than gossip), and the discussion covers a wide range of (often unrelated) topics. But I thought I’d pick out a few of the points that keep coming up and address them here, as the article over at LWN seems to make some of the same mistakes and assumptions.

Sun does not own sysV, and therefore cannot legally opensource it

No one can really say what Sun owns rights to. Unless you have had the privilege of reading the many contracts Sun has (which most Sun employees haven’t, myself included), it’s presumptuous to state definitively what we can or cannot do legally. We have spent a lot of money acquiring rights to the code in Solaris, and we have a large legal department that has been researching and extending those rights for a very long time. Novell thinks they own rights to some code that we may or may not be using, but I trust that our legal department (and the OpenSolaris team) has done due diligence in ensuring that we have the necessary rights to open source Solaris.

Sun has been “considering” open sourcing solaris for about five years now. It’s all just a PR stunt.

I can’t emphasize enough that this is not a PR stunt. We have a dozen engineers working full-time getting OpenSolaris out the door. We have fifty external customers participating in the OpenSolaris pilot. We have had discussions with dozens of customers and ISVs, as well as presenting at numerous BOFs across the country. This will happen. Yes, it has taken five years – there’s a lot of ground to cover when open sourcing 20 years of OS development.

Even if it is open source it still is proprietary, because no one can modify its code and can’t make changes, all one can do is watch and suggest to Sun.

We have already publicly stated that our goal is to build a community. There is zero benefit to us throwing source code over the wall as a half-hearted gesture towards the open source community. While it may not happen overnight, there will be community contributions to OpenSolaris. We want the responsibility to rest outside of Sun’s walls, at which point we become just another (rather large) contributor to an open source project.

However, the company has not yet announced a license, whether the license will be OSI-compliant or exactly how much of Solaris 10 will be under this open source license.

We have not announced a license, but we have also stated numerous times that it will be OSI-compliant. We know that using a non-OSI license will kill OpenSolaris before it leaves the gate. As to how much of Solaris will be released, the answer is “everything we possibly can.” There are pieces of Solaris that cannot be open sourced due to encumbered licenses. But time and again people suggest that we will open source “everything but the crown jewels” – as if we could open source everything but DTrace, or everything but x86 support. Every decision is made based on existing legal agreements, not some misguided attempt to create a crippled open source port.

OpenSolaris is still under development – some of the specifics (licensing, governance model, etc) are still being worked out. All of us are involved one way or another in the future of OpenSolaris. Our words may not carry the “official” tag associated with a press release or news conference, but we’re the ones working on OpenSolaris every single day. All of this will be settled when OpenSolaris goes live (as soon as we have a date we’ll let you know). Until then, we’ll keep trying to get the message out there. I encourage you to ignore your preconceived notions of Sun, of what has and has not been said in the media, and instead focus on the real message – straight from the engineers driving OpenSolaris.

A little while ago I mentioned the cyclic subsystem. This is an interesting little area of Solaris, written by Bryan back in Solaris 8. It is the heart of all timer activity in the Solaris kernel.

The majority of this blog post comes straight from the source. Those of you inside Sun or part of the OpenSolaris pilot should check out usr/src/uts/common/os/cyclic.c. A precursor to the infamous sys/dtrace.h, this source file has 1258 lines of source code, and 1638 lines of comments. I’m going to briefly touch on the high-level aspects of the system; but as you can imagine, it’s quite complicated.

Traditional methods

Historically, operating systems have relied on a regular clock interrupt. This is different from the clock frequency of the chip – the clock interrupt typically fires every 10ms. All regular kernel activity was scheduled around this omnipresent clock. One of these activities would be to check if there are any expired timeouts that need to be triggered.

This coarse granularity is usually enough for average activities, but can kill realtime applications that require high-precision timing because it forces timers to align on these artificial boundaries. For example, imagine we need a timer to fire every 13 milliseconds. Rather than having the timer fire at 13, 26, 39, and 52 ms, we would instead see it fire at 20, 30, 40, and 60 ms. This is clearly not what we wanted. The result is known as “jitter” – timing deviations from the ideal. Timing granularity, scheduling artifacts, system load, and interrupts all introduce arbitrary latency into the system. By using existing Solaris mechanisms (processor binding, realtime scheduling class) we could eliminate much of the latency, but we were still stuck with the granularity of the system clock. The frequency could be tuned up, but this would also increase the time spent doing other clock activity (such as process accounting), and induce significant load on the system.
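To make the arithmetic concrete, here is a minimal Python sketch of the 13 ms example. It assumes an expired timer is only noticed at the first clock tick at or after its ideal expiration; the function name and parameters are made up for illustration and have nothing to do with the actual kernel code.

```python
def tick_aligned_fire_times(period_ms, tick_ms, count):
    """Return (ideal, actual) fire times for a periodic timer.

    Assumption: an expiry is only noticed at the first clock tick
    at or after the ideal expiration time.
    """
    ideal = [period_ms * i for i in range(1, count + 1)]
    # Integer ceiling division rounds each expiry up to the next tick.
    actual = [-(-t // tick_ms) * tick_ms for t in ideal]
    return ideal, actual


ideal, actual = tick_aligned_fire_times(period_ms=13, tick_ms=10, count=4)
jitter = [a - i for i, a in zip(ideal, actual)]

print(ideal)   # [13, 26, 39, 52]
print(actual)  # [20, 30, 40, 60]
print(jitter)  # [7, 4, 1, 8]
```

The jitter is anywhere from 1 to 8 ms here, and it changes from one expiration to the next, which is exactly what a realtime application cannot tolerate.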

Cyclic subsystem basics

Enter the cyclic subsystem. It provides for highly accurate interval timers. The key feature is that it is based on programmable timestamp counters, which have been available for many years. In particular, these counters can be programmed (quickly) to generate an interrupt at arbitrary and accurate intervals. The subsystem was originally available only on SPARC; x86 support (based on programmable APICs) is now available in Solaris 10.

The majority of the kernel sees a very simple interface – you can add, remove, or bind cyclics. Internally, we keep around a heap of cyclics, organized by expiration time. This internal interface connects to a hardware-specific backend. We pick off the next cyclic to process, and then program the hardware to notify us after the next interval. This basic layout can be seen on any system with the ::cycinfo -v dcmd:

# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp uhci usba nca crypto random lofs nfs ptm ipc ]
> ::cycinfo -v
0 d2da0140  online      4 d2da00c0   50d7c1308c180 apic_redistribute_compute
0                                     3
|                                     |
+---------+--------+                  +---------+---------+
d2da00c0   0    1 high     0   50d7c1308c180   10000 cbe_hres_tick
d2da00e0   1    0  low     0   50d7c1308c180   10000 apic_redistribute_compute
d2da0100   2    3 lock     0   50d7c1308c180   10000 clock
d2da0120   3    2 high     0   50d7c35024400 1000000 deadman

On this system there are no realtime timers active, so the intervals (USECINT, the next-to-last column, in microseconds) are pretty boring. You may notice one elegant feature of this implementation – the clock() function is now just a cyclic consumer. If you’re wondering what ‘deadman’ is, and why it has such a high interval, it’s a debugging feature that saves the system from hanging indefinitely (most of the time). Turn it on by adding ‘set snooping = 1’ in /etc/system. If the clock cannot make forward progress in 50 seconds, a high level cyclic will fire and we’ll panic.
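The "heap of cyclics, organized by expiration time" idea can be sketched in a few lines of Python. This is a toy model only: the class and method names are invented for illustration, the handler names and intervals are borrowed from the ::cycinfo output above, and the real implementation is per-CPU, lock-free C that talks to a hardware backend.

```python
import heapq


class CyclicHeap:
    """Toy model: a min-heap of cyclics keyed by next expiration time."""

    def __init__(self):
        # Entries are (expire_usec, seq, interval_usec, handler).
        self._heap = []
        # seq breaks ties so equal expirations pop in insertion order.
        self._seq = 0

    def add(self, handler, interval_usec, start_usec=0):
        entry = (start_usec + interval_usec, self._seq, interval_usec, handler)
        heapq.heappush(self._heap, entry)
        self._seq += 1

    def next_deadline(self):
        """Time a (hypothetical) hardware backend would be programmed for."""
        return self._heap[0][0] if self._heap else None

    def fire(self):
        """Pop the soonest cyclic and reschedule it one interval later."""
        expire, seq, interval, handler = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (expire + interval, seq, interval, handler))
        return expire, handler


heap = CyclicHeap()
heap.add("cbe_hres_tick", 10000)
heap.add("clock", 10000)
heap.add("deadman", 1000000)

fired = [heap.fire() for _ in range(4)]
print(fired)
print(heap.next_deadline())  # 30000
```

The root of the heap is always the next cyclic to expire, so deciding what to program into the hardware is a constant-time peek, and rescheduling only disturbs a logarithmic number of entries.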

To register your own cyclic, use the timer_create(3RT) function with the CLOCK_HIGHRES type (assuming you have the PROC_CLOCK_HIGHRES privilege). This will create a low level cyclic with the appropriate timeout. The average latency is extremely small when done properly (bound to a CPU with interrupts disabled) – on the order of a few microseconds on modern hardware. Much better than the 10 millisecond artifacts possible with clock-based callouts.

More details

At a high level, this seems pretty straightforward. Once you figure out how to program the hardware, just toss some function pointers into an AVL tree and be done with it, right? Here are some of the significant wrinkles in this plan:

  • Fast operation – Because we’re dispatching real-time timers, we need to be able to trigger and re-schedule cyclics extremely quickly. In order to do this, we make use of per-CPU data structures, heap management that touches a minimal number of cache lines, and lock-free operation. The latter point is particularly difficult, considering the presence of low-level cyclics.

  • Low level cyclics – The cyclic subsystem operates at a high interrupt level. But not all cyclics should run at such a high level (and very few do). In order to support low level cyclics, the subsystem will post a software interrupt to deal with the cyclic at a lower interrupt level. This opens up a whole can of worms, because we have to guarantee a 1-to-1 mapping, as well as maintain timing constraints.

  • Cyclic removal – While rare, it is occasionally necessary to remove pending cyclics (the most common occurrence is when unloading modules with registered cyclics). This has to be done without disturbing the other running cyclics.

  • Resource resizing – The heap, as well as internal buffers used for pending low-level cyclics, must be able to handle any number of active cyclics. This means that they have to be resizable, while maintaining lock-free operation in the common path.

  • Cyclic juggling – In order to offline CPUs, we must be able to re-schedule cyclics on other active CPUs, without missing a timeout in the process.

As you can see, the cyclic subsystem is a complicated but well-contained subsystem. It uses a well-organized layout to expose a simple interface to the rest of the kernel, and provides great benefit to both in-kernel consumers and timing-sensitive realtime applications.

If you haven’t already, go check out Tom Adelstein’s article at LXer, as well as Jim’s blog entry. Quite a departure from the usual Sun analysis. It’s an interesting take: Sun is (and has been) doing great things, but corporate messaging (both our competitors and our own) has muddied the waters to the point where most people don’t know what we’re about anymore. I also like the view that Linux and Solaris can and should live in harmony – I’m tired of the “all for one and one for all” attitude when it comes to Linux. It’ll be interesting to see if OpenSolaris will sway public opinion, or whether we’re too firmly cast into the role of evil empire.

So it’s no secret that AMD and Intel are in a mad sprint to the finish for dual-core x86 chips. The official AMD roadmap, as well as public demos, have all shown AMD well on track. The latest tidbits of information indicate Linux is up and running on these dual-core systems. Very cool.

Given our close relationship with AMD and the sensitive nature of hardware plans, I’ll refrain from saying what we may or may not have running in our labs. But Solaris has some great features that make it well-suited for these dual core chips. First of all, Solaris 10 has had support for both Chip Multi Threading (hyperthreading) and Chip Multi Processing (multi core) for about a year and a half now. Solaris has also been NUMA-aware for much longer (with the current lgroups coming in mid-2001, or Solaris 9). I’m sure AMD has made these cores appear as two processors for legacy purposes, but with a few cpuid tweaks, we’ll see them as sibling cores and get all the benefits inherent in Solaris 10 CMP.

Despite this, the NUMA system in Solaris is undergoing drastic change due to the Opteron memory architecture. While Solaris is NUMA-aware, it uses a simplistic memory hierarchy based on the physical architecture of Sun’s high end SPARC systems. We have the notion of a “locality group”, which represents the logical relationship of CPUs and memory. Currently, there are only two notions of locality – “near” and “far”. Solaris tries its best to keep logically connected memory and processes in the same locality group. On Opteron, things get a bit more complicated due to the integrated memory controller and HyperTransport layout. On 4-way machines the processors are laid out in a square, and on 8-way machines we have a ladder formation. Memory transfers must pass through neighboring memory controllers, so now memory could be “near”, “far”, or “farther”. We’re revamping the current lgroup system to support arbitrary memory hierarchies, which should produce some nice performance gains on 4- and 8-way Opteron machines. Hopefully one of the NUMA folks will blog some more detailed information once this project integrates.
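To see why two levels of locality stop being enough, it helps to count hops. The Python sketch below is only an illustration: the link tables are my assumption of what a "square" and a "ladder" look like as graphs (a 4-node ring and a 2x4 grid), not the actual HyperTransport wiring of any particular machine, and nothing here reflects how lgroups are implemented.

```python
from collections import deque


def hop_distances(links, source):
    """Breadth-first search: number of links from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Assumed 4-way layout: a square, each processor linked to two neighbors.
square = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

# Assumed 8-way layout: a 2x4 ladder (two rails of four, joined by rungs).
ladder = {
    0: [1, 4], 1: [0, 2, 5], 2: [1, 3, 6], 3: [2, 7],
    4: [0, 5], 5: [4, 6, 1], 6: [5, 7, 2], 7: [6, 3],
}

square_hops = hop_distances(square, 0)
ladder_hops = hop_distances(ladder, 0)

print(sorted(set(square_hops.values())))  # [0, 1, 2]
print(sorted(set(ladder_hops.values())))  # [0, 1, 2, 3, 4]
```

Under these assumptions the square already has three distinct distances from any processor (local, one hop, two hops), and the ladder has more, which is why a simple near/far split cannot describe the machine.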

In conclusion: Opterons are cool, but dual-core Opterons are cooler. And Solaris will rip on both of them.

Given that the amd64 ABI is nearly set in stone, and (as pointed out in comments on my last entry) future OpenSolaris ports could run into similar problems on other architectures (like PowerPC), you may wonder how we can make life easier in Solaris. In this entry I’ll elaborate on two possibilities. Note that these are little more than fantasies at the moment – no real engineering work has been done, nor is there any guarantee that they will appear in a future Solaris release.

DWARF Support for MDB

Even though DWARF is a complex beast, it’s not impossible to write an interpreter. It’s just a matter of doing the work. The more subtle problem is designing it correctly, and making the data accessible in the kernel. Since MDB and KMDB are primarily kernel or post-mortem userland tools, this has not been a high priority. CTF gives us most of what we need, and including all the DWARF information in the kernel (or corefiles) is prohibitively expensive. That being said, there are those among us that would like to see MDB take a more prominent userland role (where it would compete with dbx and gdb), at which point proper DWARF support would be a very nice thing to have.

If this is done properly, we’ll end up with a debugging library that’s format-independent. Whether the target has CTF, STABS, or DWARF data, MDB (and KMDB) will just “do the right thing”. No one argues that this isn’t a cool idea – it’s just a matter of engineering resources and business justification.

Programmatic Disassembler

The alternative solution is to create a disassembler library that understands code at a semantic level. Once you have a disassembler that understands the logical breakdown of a program, you can determine (via simulation) the original argument values to functions. Of course, it’s not always guaranteed to work, but you’ll always know when you’re guessing (even DWARF can’t be correct 100% of the time). This requires no debugging information, only the machine text. It will also help out the DTrace pid provider, which has to wrestle with jump tables and other weird compiler-isms. Of course, this is monumentally more difficult than a DWARF parser – especially on x86.

This idea (along with a prototype) has been around for many years. The converted have prophesied that libdis will bring peace to the world and an end to world hunger. As with many great ideas, there just hasn’t been justification for devoting the necessary engineering resources. But if it can get the arguments to functions on amd64 correct in 98% of the situations, it would be incredibly valuable.

OpenSolaris Debugging Futures

There are a host of other ideas that we have kicking around here in the Solaris group. They range from pretty mundane to completely insane. As OpenSolaris finishes getting in gear, I’m looking forward to getting these ideas out in the public and finding support for all the cool possibilities that just aren’t high enough priority for us right now. The existence of a larger development community will also make good debugging tools a much better business proposition.
