Eric Schrock's Blog

Month: July 2005

There’s an interesting discussion over at opensolaris-code, spawned from an initial request to add some tunables to Solaris /proc. This exposes a few very important philosophical differences between Solaris and other operating systems out there. I encourage you to read the thread in its entirety, but here’s an executive summary:

  • When possible, the system should be auto-tuning – If you are creating a tunable to control fine grained behavior of your program or operating system, you should first ask yourself: “Why does this tunable exist? Why can’t I just pick the best value?” More often than not, you’ll find the answer is “Because I’m lazy” or “The problem is too hard.” Only in rare circumstances is there ever a definite need for a tunable, and such tunables almost always control coarse on-off behavior.

  • If a tunable is necessary, it should be as specific as possible – The days of dumping every tunable under the sun into /etc/system are over. Very rarely do tunables need to be system wide. Most tunables should be per process, per connection, or per filesystem. We are continually converting our old system-wide tunables into per-object controls.

  • Tunables should be controlled by a well defined interface – /etc/system and /proc are not your personal landfills. /etc/system is by nature undocumented, and designing it as your primary interface is fundamentally wrong. /proc is well documented, but it’s also well defined to be a process filesystem. Besides the enormous breakage you’d introduce by adding /proc/tunables, it’s philosophically wrong. The /system directory is a slightly better choice, but it’s intended primarily for observability of subsystems that translate well to a hierarchical layout. In general, we don’t view filesystems as a primary administrative interface, but as a programmatic API upon which more sophisticated tools can be built.

One of the best examples of these principles can be seen in the updated System V IPC tunables. Dave Powell rewrote this arcane set of /etc/system tunables during the course of Solaris 10. Many of the tunables were made auto-tuning, and those that couldn’t be were converted into resource controls administered on a per process basis using standard Solaris administrative tools. Hopefully Dave will blog at some point about this process, the decisions he made, and why.

There are, of course, always going to be exceptions to the above rules. We still have far too many documented /etc/system tunables in Solaris today, and there will always be some that are absolutely necessary. But our philosophy is focused around these principles, as illustrated by the following story from the discussion thread:

Indeed, one of the more amusing stories was a Platinum Beta customer
showing us some slideware from a certain company comparing their OS
against Solaris. The slides were discussing available tunables, and the
basic gist was something like:

“We used to have way fewer tunables than Solaris, but now we’ve caught
up and have many more than they do. Our OS rules!”

Needless to say, we thought the company was missing the point.


Like most of Sun’s US employees, I’ll be taking the next week off for vacation. On top of that, I’ll be back in my hometown in MA for the next few weeks, alternately working remotely and attending my brother’s wedding. I’ll leave you with an MDB challenge, this time much more involved than past “puzzles”. I don’t have any prizes lying around, but this one would certainly be worth one if I had anything to give.

So what’s the task? To implement munges as a dcmd. Here’s the complete description:

Implement a new dcmd, ::stacklist, that will walk all threads (or all threads within a specific process when given a proc_t address) and summarize the different stacks by frequency. By default, it should display output identical to ‘munges’:

> ::stacklist
73      ##################################  tp: fffffe800000bc80
swtch+0xdf()
cv_wait+0x6a()
taskq_thread+0x1ef()
thread_start+8()
38      ##################################  tp: ffffffff82b21880
swtch+0xdf()
cv_wait_sig_swap_core+0x177()
cv_wait_sig_swap+0xb()
cv_waituntil_sig+0xd7()
lwp_park+0x1b1()
syslwp_park+0x4e()
sys_syscall32+0x1ff()
...

The first number is the frequency of the given stack, and the ‘tp’ pointer should be a representative thread of the group. The stacks should be organized by frequency, with the most frequent ones first. When given the ‘-v’ option, the dcmd should print out all threads containing the given stack trace. For extra credit, the ability to walk all threads with a matching stack (addr::walk samestack) would be nice.

This is not an easy dcmd to write, at least when doing it correctly. The first key is to use as little memory as possible. This dcmd must be capable of being run within kmdb(1M), where we have limited memory available. The second key is to leverage existing MDB functionality without duplicating code. You should not be copying code from ::findstack or ::stack into your dcmd. Ideally, you should be able to invoke ::findstack without worrying about its inner workings. Alternatively, restructuring the code to share a common routine would also be acceptable.
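To make the memory constraint concrete, here is a minimal sketch of just the bookkeeping such a dcmd might do. It is plain C and deliberately uses no MDB interfaces: how the frame addresses for each thread are obtained (ideally by reusing ::findstack) is left out entirely. Every name here (stack_summary_t, summary_add(), summary_sort(), MAX_GROUPS) is hypothetical and invented for illustration. The sketch assumes that keeping only a hash, a count, and one representative thread per distinct stack is acceptable, and that a small fixed-size table is enough. A real implementation would need to handle hash collisions and a full table more carefully.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define	MAX_GROUPS	64	/* assumed fixed table: no allocation per thread */

/* One entry per distinct stack: a hash, a count, and one sample thread. */
typedef struct stack_group {
	uint64_t	sg_hash;	/* FNV-1a hash of the frame addresses */
	uintptr_t	sg_thread;	/* representative thread (the 'tp') */
	unsigned	sg_count;	/* threads sharing this stack */
} stack_group_t;

typedef struct stack_summary {
	stack_group_t	ss_groups[MAX_GROUPS];
	size_t		ss_ngroups;
} stack_summary_t;

/* 64-bit FNV-1a over the raw bytes of the program counters. */
static uint64_t
stack_hash(const uintptr_t *pcs, size_t depth)
{
	const unsigned char *p = (const unsigned char *)pcs;
	uint64_t h = 0xcbf29ce484222325ULL;
	size_t i;

	for (i = 0; i < depth * sizeof (uintptr_t); i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}
	return (h);
}

/*
 * Record one thread's stack. Returns 0 on success, or -1 if the table
 * is full. Two stacks with the same hash are treated as identical, so
 * a collision would merge two groups (accepted in this sketch).
 */
int
summary_add(stack_summary_t *ss, uintptr_t thread, const uintptr_t *pcs,
    size_t depth)
{
	uint64_t h;
	size_t i;
	stack_group_t *sg;

	assert(ss != NULL && pcs != NULL);
	h = stack_hash(pcs, depth);

	for (i = 0; i < ss->ss_ngroups; i++) {
		if (ss->ss_groups[i].sg_hash == h) {
			ss->ss_groups[i].sg_count++;
			return (0);
		}
	}
	if (ss->ss_ngroups == MAX_GROUPS)
		return (-1);

	sg = &ss->ss_groups[ss->ss_ngroups++];
	sg->sg_hash = h;
	sg->sg_thread = thread;	/* first thread seen represents the group */
	sg->sg_count = 1;
	return (0);
}

/* qsort(3C) comparator: most frequent stacks first. */
static int
group_cmp(const void *a, const void *b)
{
	const stack_group_t *ga = a;
	const stack_group_t *gb = b;

	if (ga->sg_count != gb->sg_count)
		return (ga->sg_count < gb->sg_count ? 1 : -1);
	return (0);
}

void
summary_sort(stack_summary_t *ss)
{
	qsort(ss->ss_groups, ss->ss_ngroups, sizeof (stack_group_t),
	    group_cmp);
}
```

A caller would zero a stack_summary_t, call summary_add() once per thread, call summary_sort(), and then print each group's count and representative thread. Because only the representative thread is stored, the stack text itself would be produced by re-walking that one thread rather than by keeping a copy of every stack in memory.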

This command would be hugely beneficial when examining system hangs or other “soft failures,” where there is no obvious culprit (such as a panicking thread). Having this functionality in KMDB (where we cannot invoke ‘munges’) would make debugging a whole class of problems much easier. This is also a great RFE to get started with OpenSolaris. It is self contained, low risk, but non-trivial, and gets you familiar with MDB at the same time. Personally, I have always found the observability tools a great place to start working on Solaris, because the risk is low while still requiring (hence learning) internal knowledge of the kernel.

If you do manage to write this dcmd, please email me (Eric dot Schrock at sun dot com) and I will gladly be your sponsor to get it integrated into OpenSolaris. I might even be able to dig up a prize somewhere…
