Eric Schrock's Blog

Category: OpenSolaris

Yes, I am still here. And yes, I’m still working on ZFS as fast as I can. But I do have a small amount of free time, and managed to pitch in with some of the OpenSolaris bug sponsor efforts over at the request-sponsor forum. I figure I can handle a bug every week or two even with ZFS in full “end game” swing, and hopefully inspire others to jump on the sponsor bandwagon during this interim period. In particular, I helped Rich Lowe integrate two basic code cleanup fixes into the Nevada gate. Nothing spectacular, but it serves as a proof of concept, and it adds another name to the list of contributors who have had fixes putback into Nevada. Next week I’ll try to grab one of the remaining bugfixes to lend a hand. Maybe someday I’ll have enough time to blog for real, but don’t expect much until ZFS is back in the gate.

Also, check out the Nevada putback logs for build 22. Very cool stuff – kudos to Steve and the rest of the OpenSolaris team. Pay attention to the fixes contributed by Shawn Walker and Jeremy Teo – it’s nice to see active work being done, despite the fact that we still have so much work left to do in building an effective community.

Thanks to Jarod Jenson (of DTrace and Aeysis fame), I now have a shiny new 512MB iPod shuffle to give away to a worthy OpenSolaris cause. Back when I posted my original MDB challenge, I had no cool stuff to entice potential suitors. So now I’ll offer this iPod shuffle to the first person who submits an acceptable solution to the problem and follows through to integrate the code into OpenSolaris (I will sponsor any such RFE). Send your diffs against the latest OpenSolaris source to me at eric dot schrock at sun dot com. We’ll put a time limit of, say, a month and a half (until 10/1) so that I can safely recycle the iPod shuffle into another challenge should no one respond.

Once again, the original challenge is here.

So besides the fame and glory of integrating the first non-bite size RFE into OpenSolaris, you’ll also walk away with a cool toy. Not to mention all the MDB knowledge you’ll have under your belt. Feel free to email me questions, or head over to the mdb-discuss forum. Good Luck!

It’s been almost a month since my last blog post, so I thought I’d post an update. I spent the month of July in Massachusetts, alternately on vacation, working remotely, and attending my brother’s wedding. The rest of the LAE (Linux Application Environment) team joined me (and Nils) for a week out there, and we made some huge progress on the project. For the curious, we’re still working out how best to leverage OpenSolaris to help the project and the community; once that’s settled, we can go into more detail about what the final product will look like. Until then, suffice it to say “we’re working on it”. All this time on LAE did prevent me from spending time with my other girlfriend, ZFS. Since getting back, I’ve caught up with most of the ZFS work in my queue, and the team has made huge progress on ZFS in my absence. As much as I’d like to talk about details (or a schedule), I can’t 🙁 But trust me, you’ll know when ZFS integrates into Nevada; there are many bloggers who will not be so quiet when that putback notice comes by. Not to mention that the source code will hit OpenSolaris shortly thereafter.

Tomorrow I’ll be up at LinuxWorld, hanging out at the booth with Ben and hosting the OpenSolaris BOF along with Adam and Bryan (Dan will be there as well, though he didn’t make the “official” billing). Whether you know nothing about OpenSolaris or are one of our dedicated community members, come check it out.

There’s an interesting discussion over at opensolaris-code, spawned from an initial request to add some tunables to Solaris /proc. This exposes a few very important philosophical differences between Solaris and other operating systems out there. I encourage you to read the thread in its entirety, but here’s an executive summary:

  • When possible, the system should be auto-tuning – If you are creating a tunable to control fine-grained behavior of your program or operating system, you should first ask yourself: “Why does this tunable exist? Why can’t I just pick the best value?” More often than not, you’ll find the answer is “Because I’m lazy” or “The problem is too hard.” Only in rare circumstances is there a genuine need for a tunable, and even then it should almost always control coarse on/off behavior.

  • If a tunable is necessary, it should be as specific as possible – The days of dumping every tunable under the sun into /etc/system are over. Very rarely do tunables need to be system wide. Most tunables should be per process, per connection, or per filesystem. We are continually converting our old system-wide tunables into per-object controls.

  • Tunables should be controlled by a well defined interface – /etc/system and /proc are not your personal landfills. /etc/system is by nature undocumented, and designing it as your primary interface is fundamentally wrong. /proc is well documented, but it’s also well defined to be a process filesystem. Besides the enormous breakage you’d introduce by adding /proc/tunables, it’s philosophically wrong. The /system directory is a slightly better choice, but it’s intended primarily for observability of subsystems that translate well to a hierarchical layout. In general, we don’t view filesystems as a primary administrative interface, but as a programmatic API upon which more sophisticated tools can be built.

One of the best examples of these principles can be seen in the updated System V IPC tunables. Dave Powell rewrote this arcane set of /etc/system tunables during the course of Solaris 10. Many of the tunables were made auto-tuning, and those that couldn’t be were converted into resource controls administered on a per-process basis using standard Solaris administrative tools. Hopefully Dave will blog at some point about this process, the decisions he made, and why.
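
To make the per-process model concrete, here is a minimal sketch that queries one of these controls programmatically through the getrctl(2) and rctlblk(3C) interfaces. I’m using process.max-sem-nsems (if memory serves, the per-process replacement for the old semsys:seminfo_semmsl /etc/system tunable) purely for illustration; any process-scoped control works the same way.

#include <rctl.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Minimal sketch: query a per-process resource control via getrctl(2).
 * process.max-sem-nsems is used purely as an example of a control that
 * replaced an old System V IPC /etc/system tunable.
 */
int
main(void)
{
        rctlblk_t *rblk;

        /* rctlblk_t is opaque; allocate it with rctlblk_size(3C) */
        if ((rblk = malloc(rctlblk_size())) == NULL) {
                perror("malloc");
                return (1);
        }

        /* Fetch the first value in the control's value sequence */
        if (getrctl("process.max-sem-nsems", NULL, rblk, RCTL_FIRST) == -1) {
                perror("getrctl");
                return (1);
        }

        (void) printf("process.max-sem-nsems = %llu\n",
            (unsigned long long)rctlblk_get_value(rblk));
        return (0);
}

From the shell, the same controls are visible and adjustable per process or per project with prctl(1) and projmod(1M), which is exactly the kind of well defined interface being argued for above.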

There are, of course, always going to be exceptions to the above rules. We still have far too many documented /etc/system tunables in Solaris today, and there will always be some that are absolutely necessary. But our philosophy is focused around these principles, as illustrated by the following story from the discussion thread:

Indeed, one of the more amusing stories was a Platinum Beta customer showing us some slideware from a certain company comparing their OS against Solaris. The slides were discussing available tunables, and the basic gist was something like:

“We used to have way fewer tunables than Solaris, but now we’ve caught up and have many more than they do. Our OS rules!”

Needless to say, we thought the company was missing the point.

Like most of Sun’s US employees, I’ll be taking the next week off for vacation. On top of that, I’ll be back in my hometown in MA for the next few weeks, alternately working remotely and attending my brother’s wedding. I’ll leave you with an MDB challenge, this time much more involved than past “puzzles”. I don’t have any prizes lying around, but this one would certainly be worth one if I had anything to give.

So what’s the task? To implement munges as a dcmd. Here’s the complete description:

Implement a new dcmd, ::stacklist, that will walk all threads (or all threads within a specific process when given a proc_t address) and summarize the different stacks by frequency. By default, it should display output identical to ‘munges’:

> ::stacklist
73      ##################################  tp: fffffe800000bc80
swtch+0xdf()
cv_wait+0x6a()
taskq_thread+0x1ef()
thread_start+8()
38      ##################################  tp: ffffffff82b21880
swtch+0xdf()
cv_wait_sig_swap_core+0x177()
cv_wait_sig_swap+0xb()
cv_waituntil_sig+0xd7()
lwp_park+0x1b1()
syslwp_park+0x4e()
sys_syscall32+0x1ff()
...

The first number is the frequency of the given stack, and the ‘tp’ pointer should be a representative thread of the group. The stacks should be organized by frequency, with the most frequent ones first. When given the ‘-v’ option, the dcmd should print out all threads containing the given stack trace. For extra credit, the ability to walk all threads with a matching stack (addr::walk samestack) would be nice.

This is not an easy dcmd to write, at least when doing it correctly. The first key is to use as little memory as possible. This dcmd must be capable of being run within kmdb(1M), where we have limited memory available. The second key is to leverage existing MDB functionality without duplicating code. You should not be copying code from ::findstack or ::stack into your dcmd. Ideally, you should be able to invoke ::findstack without worrying about its inner workings. Alternatively, restructuring the code to share a common routine would also be acceptable.
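
For those curious what the scaffolding of such a dcmd looks like, here is a minimal sketch of an MDB module that registers ::stacklist and iterates over the kernel’s “thread” walker. The stack-hashing, frequency counting, and output formatting – the actual hard parts – are left out, and the callback name is mine, not part of the challenge.

#include <sys/mdb_modapi.h>

/*
 * Sketch of the module scaffolding for a ::stacklist dcmd.  The per-thread
 * callback is where the real work goes: record each thread's stack (for
 * example, by hashing its list of PCs) and bump a frequency count, keeping
 * memory use small enough to run under kmdb(1M).
 */
static int
stacklist_thread(uintptr_t addr, const void *data, void *private)
{
        /* addr is a kthread_t pointer; summarize its stack here */
        return (WALK_NEXT);
}

static int
stacklist(uintptr_t addr, uint_t flags, int argc, const mdb_arg_t *argv)
{
        if (!(flags & DCMD_ADDRSPEC)) {
                /* no address given: walk every thread in the system */
                if (mdb_walk("thread", stacklist_thread, NULL) == -1) {
                        mdb_warn("failed to walk 'thread'");
                        return (DCMD_ERR);
                }
        } else {
                /* proc_t address given: restrict the walk to that process,
                   for example via mdb_pwalk() */
        }

        /* ... sort the collected stacks by frequency and print them ... */
        return (DCMD_OK);
}

static const mdb_dcmd_t dcmds[] = {
        { "stacklist", "?[-v]", "summarize threads by stack trace", stacklist },
        { NULL }
};

static const mdb_modinfo_t modinfo = { MDB_API_VERSION, dcmds, NULL };

const mdb_modinfo_t *
_mdb_init(void)
{
        return (&modinfo);
}

The interesting design question is how to summarize a stack cheaply; hashing the return-address list and keeping a small table of representative threads is one obvious approach that fits within kmdb’s memory constraints.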

This command would be hugely beneficial when examining system hangs or other “soft failures,” where there is no obvious culprit (such as a panicking thread). Having this functionality in KMDB (where we cannot invoke ‘munges’) would make debugging a whole class of problems much easier. This is also a great RFE to get started with OpenSolaris. It is self contained, low risk, but non-trivial, and gets you familiar with MDB at the same time. Personally, I have always found the observability tools a great place to start working on Solaris, because the risk is low while still requiring (hence learning) internal knowledge of the kernel.

If you do manage to write this dcmd, please email me (Eric dot Schrock at sun dot com) and I will gladly be your sponsor to get it integrated into OpenSolaris. I might even be able to dig up a prize somewhere…

There’s actually a decent piece over at eWeek discussing the future of Xen and LAE (the project formerly known as Janus) on OpenSolaris. Now that our marketing folks are getting the right message out there about what we’re trying to accomplish, I thought I’d follow up with a little technical background on virtualization and why we’re investing in these different technologies. Keep in mind that these are my personal beliefs based on interactions with customers and other Solaris engineers. Any resemblance to a corporate strategy is purely coincidental 😉

Before diving in, I should point out that this will be a rather broad coverage of virtualization strategies. For a more detailed comparison of Zones and Jails in particular, check out James Dickens’ Zones comparison chart.

Benefits of Virtualization

First off, virtualization is here to stay. Our customers need virtualization – it dramatically reduces the cost of deploying and maintaining multiple machines and applications. The success of companies such as VMware is proof enough that such a market exists, though we have been hearing it from our customers for a long time. What we find, however, is that customers are often confused about exactly what they’re trying to accomplish, and companies try to pitch a single solution to virtualization problems without recognizing that more appropriate solutions may exist. The most common need for virtualization (as judged by our customer base) is application consolidation. Many of the larger apps have become so complex that they are systems in themselves – and often they don’t play nicely with other applications on the box. So “one app per machine” has become the common paradigm. The second most common need is security, either for your application administrators or your developers. Other reasons certainly exist (rapid test environment deployment, distributed system simulation, etc.), but these are the two primary ones.

So what does virtualization buy you? It’s all about reducing costs, but there are really two types of cost associated with running a system:

  1. Hardware costs – This includes the cost of the machine, but also the costs associated with running that machine (power, A/C).
  2. Software management costs – This includes the cost of deploying new machines, upgrading and patching software, and observing software behavior.

As we’ll see, different virtualization strategies provide different qualities of the above savings.

Hardware virtualization

This is one of the most well-established forms of virtualization; the most common examples today are Sun Domains and IBM Logical Partitions. In each case, the hardware is responsible for dividing existing resources in such a way as to present multiple machines to the user. This has the advantages of requiring no software layer, imposing no performance impact, and providing hardware fault isolation. The downside is that it requires specialized hardware that is extremely expensive, and it provides no benefit in reducing software management costs.

Software machine virtualization

This approach is probably the one most commonly associated with the term “virtualization”. In this scheme, a software layer is created which allows multiple OS instances to run on the same hardware. The most commercialized versions are VMware and Virtual PC, but other projects exist (such as qemu and PearPC). Typically, they require a “host” operating system as well as multiple “guests” (although VMware ESX Server runs a custom kernel as the host). While Xen uses a paravirtualization technique that requires changes to the guest OS, it is still fundamentally a machine virtualization technique. Usermode Linux takes a radically different approach, but accomplishes basically the same task.

In the end, this approach has similar strengths and weaknesses to hardware virtualization. You don’t have to buy expensive special-purpose hardware, but you give up hardware fault isolation and often sacrifice performance (Xen’s approach lessens this impact, but it’s still visible). But most importantly, you still don’t save any costs associated with software management – administering software on 10 virtual machines is just as expensive as administering 10 separate machines. And you have no visibility into what’s happening within the virtual machine – you may be able to tell that Xen is consuming 50% of your CPU, but you can’t tell why unless you log into the virtual system itself.

Software application virtualization

On the grand scale of virtualization, this ranks as the “least virtualized”. With this approach, the operating system uses various tricks and techniques to present an alternate view of the machine. This can range from simple chroot(1), to BSD Jails, to Solaris Zones. Each of these provides a progressively more complete OS view with varying degrees of isolation. While Zones are the most complete and the most secure, they all use the same fundamental idea of a single operating system presenting an “alternate reality” that appears to be a complete system at the application level. The upcoming Linux Application Environment on OpenSolaris will take this approach by leveraging Zones and emulating Linux at the system call layer.

The most significant downside to this approach is the fact that there is a single kernel. You cannot run different operating systems (though LAE will add an interesting twist), and the “guest” environments have limited access to hardware facilities. On the other hand, this approach results in huge savings on the software management front. Because applications are still processes within the host environment, you have total visibility into what is happening within each guest using standard operating system tools, and you can manage the guests’ processes as you would any others, using standard resource management tools. You can deploy, patch, and upgrade software from a single point without having to physically log into each machine. While not all applications will run in such a reduced environment, those that do will benefit from vastly simplified software management. This approach also has the added bonus that it tends to make better use of shared resources. In Zones, for example, the most common configuration includes a shared /usr directory, so that no additional disk space is needed (and only one copy of each library needs to be resident in memory).
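
As a small illustration of how thin this layer looks from the application’s point of view, a process can ask which zone it is running in with a couple of libc calls, while the global zone observes it with ordinary tools like ps -Z and prstat -Z. This is just a sketch, assuming the zone.h interfaces that ship with Solaris 10:

#include <zone.h>
#include <stdio.h>

/*
 * Sketch: a process inside a zone is still just a Solaris process.
 * It can discover its zone with getzoneid(3C)/getzonenamebyid(3C);
 * everything else looks like an ordinary Solaris environment.
 */
int
main(void)
{
        char name[ZONENAME_MAX];
        zoneid_t id = getzoneid();

        if (getzonenamebyid(id, name, sizeof (name)) == -1) {
                perror("getzonenamebyid");
                return (1);
        }

        (void) printf("running in zone %d (%s)\n", (int)id, name);
        return (0);
}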

OpenSolaris virtualization in the future

So what does this all mean for OpenSolaris? Why are we continuing to pursue Zones, LAE, and Xen? The short answer is because “our customers want us to.” And hopefully, from what’s been said above, it’s obvious that there is no one virtualization strategy that is correct for everyone. If you want to consolidate servers running a variety of different operating systems (including older versions of Solaris), then Xen is probably the right approach. If you want to consolidate machines running Solaris applications, then Zones is probably your best bet. If you require the ability to survive hardware faults between virtual machines, then domains is the only choice. If you want to take advantage of Solaris FMA and performance, but still want to run the latest and greatest from RedHat with support, then Xen is your option. If you have 90% of your applications on Solaris, and you’re just missing that one last app, then LAE is for you. Similarly, if you have a Linux app that you want to debug with DTrace, you can leverage LAE without having to port to Solaris first.

With respect to Linux virtualization in particular, we are always going to pursue ISV certification first. No one at Sun wants you to run Oracle under LAE or Xen. Given the choice, we will always aggressively pursue ISVs to do a native port to Solaris. But we understand that there is an entire ecosystem of applications (typically in-house apps) that just won’t run on Solaris x86. We want users to have a choice between virtualization options, and we want all those options to be a fundamental part of the operating system.

I hope that helps clear up the grand strategy. There will always be people who disagree with this vision, but we honestly believe we’re making the best choices for our customers.

You may note that I failed to mention cross-architecture virtualization. This is most common at the system level (like PearPC), but application-level solutions do exist (including Apple’s upcoming Rosetta). This type of virtualization simply doesn’t factor into our plans yet, and it still falls under the umbrella of one of the broad virtualization types above.

I also apologize for any virtualization projects out there that I missed. There are undoubtedly many more, but the ones mentioned above serve to illustrate my point.
