Eric Schrock's Blog

Yesterday, several of us from Delphix, Nexenta, Joyent, and elsewhere convened before the OpenStorage summit as part of an illumos hackathon.  The idea was to get a bunch of illumos coders in a room, brainstorm a bunch of small project ideas, and then break off to implement them over the course of the day.  That was the idea, at least – in reality we didn’t know what to expect or how it would turn out.  Suffice it to say that the hackathon was an amazing success.  There were a lot of cool ideas, and a lot of great mentors in the room who could lead people through unfamiliar territory.

For my mini-project (suggested by ahl), I implemented MDB’s ::print functionality in DTrace via a new print() action. Today, we have the trace() action, but the result is somewhat less than useful when dealing with structs, as it degenerates into tracemem():

# dtrace -qn 'BEGIN{trace(`p0); exit(0)}'
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 00 00 00 00 00 00 00 00 60 02 c3 fb ff ff ff ff  ........`.......
        10: c8 c9 c6 fb ff ff ff ff 00 00 00 00 00 00 00 00  ................
        20: b0 ad 14 c6 00 ff ff ff 00 00 00 00 02 00 00 00  ................

The results aren’t pretty, and we end up throwing away all that useful proc_t type information. With a few tweaks to dtrace and some cribbing from mdb_print.c, we can do much better:

# dtrace -qn 'BEGIN{print(`p0); exit(0)}'
proc_t {
    struct vnode *p_exec = 0
    struct as *p_as = 0xfffffffffbc30260
    struct plock *p_lockp = 0xfffffffffbc6c9c8
    kmutex_t p_crlock = {
        void *[1] _opaque = [ 0 ]
    struct cred *p_cred = 0xffffff00c614adb0
    int p_swapcnt = 0
    char p_stat = '02'

Much better! Now, how did we get there from here? The answer was an interesting journey through libdtrace, the kernel dtrace implementation, CTF, and the horrors of bitfields.

To action or not to action?

The first question I set out to answer was what the user-visible interface should be. It seemed clear that this should be an operation on the same level as trace(), allowing arbitrary D expressions, but preserving the type of the result and pretty-printing it later. After briefly considering printt() (for “print type”), I decided upon just print(), since that seemed like the most natural name. My first inclination was to create a new DTRACEACT_PRINT, but after some discussion with Adam, we decided this was extraneous – the behavior is identical to DTRACEACT_DIFEXPR (the internal name for trace()), just with type information attached.

Through the looking glass with types and formats

The real issue is that what we compile (dtrace statements) and what we consume (dtrace EPIDs and records) are two very different things, and never the twain shall meet. At the time we generate the DIFEXPR statement in dt_cc.c, we have the CTF data in hand. We don’t want to change the DIF we generate, but simply do post-processing on the other side, so we just need some way to get back to that type information in dt_consume_cpu(). We can’t simply hang it off our dtrace statement, as that would break anonymous tracing (and violate the rest of the DTrace architecture to boot).

Thankfully, this problem had already been solved for printf() (and related actions) because we need to preserve the original format string for the exact same reason. To do this, we take the action-specific integer argument, and use it to point into the DOF string table, where we stash the original format string. I simply had to hijack dtrace_dof_create() and have it do the same thing for the type information, right?

If only it were so simple. There were two complications. First, there is a lot of code that explicitly treats these arguments as printf strings and parses them into internal argv-style representations; pretending our types were format strings would cause all kinds of problems in that code, so I had to modify libdtrace to treat this more explicitly as raw string data that is (optionally) associated with the DIFEXPR action. Second, even with that in place, the formats I was sending down were not making it back out of the kernel: because the argument is action-specific, the kernel had to be modified to recognize this new argument in dtrace_ecb_action_add(). With that change in place, I was able to get the type string back in userland when consuming the CPU buffers.
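The string-table mechanism itself is simple. As a rough userland sketch (the struct and function names here are invented for illustration, not the actual libdtrace interfaces), looking up a stashed string by the action’s integer argument amounts to bounds-checked pointer arithmetic into a buffer of NUL-terminated strings:

```c
#include <stddef.h>

/*
 * Hypothetical model of a DOF-style string table: a flat buffer of
 * concatenated NUL-terminated strings, referenced by byte offset.
 * The action-specific integer argument is such an offset, so
 * recovering the stashed type string is just a bounds check plus
 * pointer arithmetic.
 */
typedef struct strtab {
	const char *st_data;	/* concatenated NUL-terminated strings */
	size_t st_size;		/* total size of the buffer in bytes */
} strtab_t;

const char *
strtab_lookup(const strtab_t *st, size_t offset)
{
	if (offset >= st->st_size)
		return (NULL);		/* offset out of bounds */
	return (st->st_data + offset);
}
```

By convention, offset 0 points at an empty string, so a zero argument naturally means “no string attached” – which is how plain trace() records remain distinguishable from print() records.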

Bitfields, or why the D compiler cost me an hour of my life

With the trace data and type string in hand, I then proceeded to copy the mdb ::print code, first from apptrace (which turned out to be complete garbage) and then fixing it up bit by bit. Finally, after tweaking the code for an hour or two, I had it producing pretty much identical ::print output. But when I fed it a klwp_t structure, I found that the user_desc_t bitfields weren’t being printed correctly:

# dtrace -n 'BEGIN{print(*((user_desc_t*)0xffffff00cb0a4d90)); exit(0)}'
dtrace: description 'BEGIN' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0      1                           :BEGIN user_desc_t {
    unsigned long usd_lolimit = 0xcff3000000ffff
    unsigned long usd_lobase = 0xcff3000000
    unsigned long usd_midbase = 0xcff300
    unsigned long usd_type = 0xcff3
    unsigned long usd_dpl :64 = 0xcff3
    unsigned long usd_p :64 = 0xcff3
    unsigned long usd_hilimit = 0xcf
    unsigned long usd_avl :64 = 0xcf
    unsigned long usd_long :64 = 0xcf
    unsigned long usd_def32 :64 = 0xcf
    unsigned long usd_gran :64 = 0xcf
    unsigned long usd_hibase = 0

I spent an hour trying to debug this, only to find that the CTF IDs weren’t matching what I expected from the underlying object. I finally tracked it down to the fact that the D compiler, by virtue of processing the /usr/lib/dtrace files, pulls in its own version of klwp_t from the system header files. But it botches the bitfields, leaving the user with subtly incorrect data. Switching the type to be genunix`user_desc_t fixed the problem.

What’s next

Given the usefulness of this feature, the next steps are to clean up the code, get it reviewed, and push to the illumos gate. It should hopefully be finding its way to an illumos distribution near you soon. Here’s a final print() invocation to leave you with:

# dtrace -n 'zio_done:entry{print(*args[0]); exit(0)}'
dtrace: description 'zio_done:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  42594                   zio_done:entry zio_t {
    zbookmark_t io_bookmark = {
        uint64_t zb_objset = 0
        uint64_t zb_object = 0
        int64_t zb_level = 0
        uint64_t zb_blkid = 0
    zio_prop_t io_prop = {
        enum zio_checksum zp_checksum = ZIO_CHECKSUM_INHERIT
        enum zio_compress zp_compress = ZIO_COMPRESS_INHERIT
        dmu_object_type_t zp_type = DMU_OT_NONE
        uint8_t zp_level = 0
        uint8_t zp_copies = 0
        uint8_t zp_dedup = 0
        uint8_t zp_dedup_verify = 0
    zio_type_t io_type = ZIO_TYPE_NULL
    enum zio_child io_child_type = ZIO_CHILD_VDEV
    int io_cmd = 0
    uint8_t io_priority = 0
    uint8_t io_reexecute = 0
    uint8_t [2] io_state = [ 0x1, 0 ]
    uint64_t io_txg = 0
    spa_t *io_spa = 0xffffff00c6806580
    blkptr_t *io_bp = 0
    blkptr_t *io_bp_override = 0
    blkptr_t io_bp_copy = {
        dva_t [3] blk_dva = [ 
            dva_t {
                uint64_t [2] dva_word = [ 0, 0 ]

With our first illumos-based distribution (2.6) out the door, we’ve posted the illumos-derived sources to github:

This repository contains the following types of changes from the illumos gate:

  • Changes that are complete and generally useful to the illumos community.  We have been (and will continue to be) proactive about pushing these changes to the illumos trunk ourselves.  We missed a few this time around, so we’ll be going back through to pick those up.
  • Changes that are sufficient to meet the needs of our product, but are not complete or generally useful for the larger community.  Our hope is that by pushing these changes to github, others can pick up such pieces of work and integrate them in a form that is acceptable to the illumos community at large.
  • Changes that represent distro-specific changes unique to our product.  It is unlikely that these will be of interest to anyone except the morbidly curious.

We will post updates with each release of the software.  This allows us to make sure the code is fully baked and tested, while still allowing us to proactively push complete pieces of work more frequently.

If you have questions about any particular change, feel free to email the author for more information.  You can also find us on the illumos developer mailing list and the #illumos IRC channel on freenode.

It’s been a little over six months since I left Oracle to join Delphix.  I’m not here to dwell on the reasons for my departure, as I think the results speak for themselves.

It is with a sad heart, however, that I look at the work so many put into making OpenSolaris what it was, only to see it turned into the next HP-UX – a commercially relevant but ultimately technologically uninteresting operating system.  This is not to denigrate the work of those at Oracle working on Solaris 11, but I personally believe that a truly innovative OS requires an engaged developer base interacting with the source code, and unique technologies that are adopted across multiple platforms.  With no one inside or outside of Oracle believing the unofficial pinky swear to release source code at some abstract future date, many may wonder what will happen to the bevy of cutting edge technologies that made up OpenSolaris.

The good news is that those technologies are alive and well in the illumos project, and many of us who left Oracle have joined companies such as Delphix, Joyent, and Nexenta that are building innovative solutions on top of the enterprise-quality OS at the core of illumos.  Combined with those dedicated souls who continue to tirelessly work on the source in their free time, the community continues to grow and evolve.  We are here today because we stand on the shoulders of giants, and we will continue to improve the source and help the community make the platform successful in whatever form it may take in the future.

And the contributions continue to pour in.  There are nasty DTrace bugs and new features, new COMSTAR protocol support, TCP/IP stability and scalability fixes, ZFS data corruption and performance improvements, and much more.  And there is a ZFS working group spanning multiple platforms and companies with a diverse set of interests helping to coordinate future ZFS development.

Suffice it to say that OpenSolaris is alive and well outside the walls of Oracle, so give it a spin and get involved!

In my seven years at Sun and Oracle, I’ve had the opportunity to work with some incredible people on some truly amazing technology. But the time has come for me to move on, and today will be my last day at Oracle.

When I started in the Solaris kernel group in 2003, I had no idea that I was entering a truly unique environment – a culture of innovation and creativity that is difficult to find anywhere, let alone working on a system as old and complex as Solaris in a company as large as Sun. While there, I worked with others to reshape the operating system landscape through technologies like Zones, SMF, DTrace, and FMA, and I was fortunate to be part of the team that created ZFS, one of the most innovative filesystems ever. From there I became a member of the Fishworks team that created the Sun Storage 7000 series; working with such a close-knit, talented team to create a groundbreaking integrated product was an experience that I will never forget.

I learned so much and grew in so many ways thanks to the people I had the chance to meet and work with over the past seven years. I would not be the person I am today without your help and guidance. While I am leaving Oracle, I will always be part of the community, and I look forward to our paths crossing in the future.

Despite not updating this blog as much as I’d like, I do hope to blog in the future at my new home:

Thanks for all the memories.

When the Sun Storage 7000 was first introduced, a key design decision was to allow only a single ZFS storage pool per host. This forced users to take full advantage of the ZFS pool storage model, and prevented them from adopting ludicrous schemes such as “one pool per filesystem.” While RAID-Z has non-trivial performance implications for IOPS-bound workloads, the hope was that by allowing logzilla and readzilla devices to be configured per-filesystem, users could adjust relative performance and implement different qualities of service within a single pool.

While this works for the majority of workloads, there are still some that benefit from mirrored performance even in the presence of cache and log devices. And as the maximum size of Sun Storage 7000 systems increased, it became apparent that we needed a way to allow pools with different RAS and performance characteristics in the same system. With this in mind, we relaxed the “one pool per system” rule¹ in the 2010.Q1 release.

The storage configuration user experience is relatively unchanged. Instead of having a single pool (or two pools in a cluster), and being able to configure one or the other, you can simply click the ‘+’ button and add pools as needed. When creating a pool, you can now specify a name for the pool. When importing a pool, you can either accept the existing name or give it a new one at the time you select the pool. Ownership of pools in a cluster is now managed exclusively through the Configuration -> Cluster screen, as with other shared resources.

When managing shares, there is a new dropdown menu at the top left of the navigation bar. This controls which shares are shown in the UI. In the CLI, the equivalent setting is the ‘pool’ property at the ‘shares’ node.

While this gives some flexibility in storage configuration, it also allows users to create poorly constructed storage topologies. The intent is to allow the user to create pools with different RAS and performance characteristics, not to create dozens of different pools with the same properties. If you attempt to do this, the UI will present a warning summarizing the drawbacks if you were to continue:

  • Wastes system resources that could be shared in a single pool.
  • Decreases overall performance.
  • Increases administrative complexity.
  • Log and cache devices can be enabled on a per-share basis.

You can still commit the operation, but such configurations are discouraged. The exception is when configuring a second pool on one head in a cluster.

We hope this feature will allow users to continue to consolidate storage and expand use of the Sun Storage 7000 series in more complicated environments.

  1. Clever users figured out that this mechanism could be circumvented in a cluster to have two pools active on the same host in an active/passive configuration.

In my previous entry, I described the overall architecture of shadow migration. This post will dive into the details of how it’s actually implemented, and the motivation behind some of the original design decisions.

VFS interposition

A very early goal was to be able to migrate data from many different sources. And while ZFS is the primary filesystem for Solaris, we also wanted to allow for arbitrary local targets. For this reason, the shadow migration infrastructure is implemented entirely at the VFS (Virtual FileSystem) layer. At the kernel level, there is a new ‘shadow’ mount option whose value is the path to another filesystem on the system. The kernel has no notion of whether a source filesystem is local or remote, and doesn’t differentiate between synchronous access and background migration. Any filesystem access, whether local or over some other protocol (CIFS, NFS, etc.), will use the VFS interfaces and therefore be fed through our interposition layer.

The only other work the kernel does when mounting a shadow filesystem is to check whether the root directory is empty. If it is, we create a blank SUNWshadow extended attribute on the root directory. Once set, this attribute will trigger all subsequent migration, as long as the filesystem is always mounted with the ‘shadow’ option. Each VFS operation first checks whether the filesystem is shadowing another (a quick check), and then whether the file or directory has the SUNWshadow attribute set (slightly more expensive, but cached with the vnode). If the attribute isn’t present, we fall through to the local filesystem. Otherwise, we migrate the file or directory and then fall through to the local filesystem.
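As a rough userland model of that per-operation logic (all of the names here are invented for illustration; the real checks live in the illumos VFS layer), each operation performs the cheap filesystem-level test first, then the cached per-vnode attribute test, and only migrates when both hold:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative model only: shadow_vfs_t, shadow_vnode_t, and
 * vn_op_pre() are not the actual illumos interfaces.
 */
typedef struct shadow_vfs {
	const char *sv_shadow_path;	/* 'shadow' mount option, or NULL */
} shadow_vfs_t;

typedef struct shadow_vnode {
	shadow_vfs_t *sn_vfs;		/* containing filesystem */
	bool sn_has_attr;		/* cached SUNWshadow attribute */
} shadow_vnode_t;

static int migrate_count;		/* counts simulated migrations */

static void
vn_migrate(shadow_vnode_t *vp)
{
	/* migrate the object from the source, then clear the attribute */
	migrate_count++;
	vp->sn_has_attr = false;
}

void
vn_op_pre(shadow_vnode_t *vp)
{
	/* quick check: is this filesystem shadowing another at all? */
	if (vp->sn_vfs->sv_shadow_path == NULL)
		return;
	/* per-vnode check: has this object already been migrated? */
	if (!vp->sn_has_attr)
		return;
	vn_migrate(vp);			/* migrate, then fall through */
}
```

The important property is that a second access to the same object takes only the two cheap checks and never migrates twice.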

Migration of directories

In order to migrate a directory, we have to migrate all of its entries. When migrating an entry for a file, we don’t want to migrate the complete contents until the file is accessed, but we do need to migrate enough metadata that access control can be enforced. We start by opening the directory on the remote side whose relative path is indicated by the SUNWshadow attribute. For each directory entry, we create a sparse file with the appropriate ownership, ACLs, system attributes, and extended attributes.

Once the entry has been migrated, we then set a SUNWshadow attribute that is the same as the parent’s but with “/name” appended, where “name” is the directory entry name. This attribute always represents the relative path of the unmigrated entity on the source. This allows files and directories to be arbitrarily renamed without losing track of where they are located on the source. It also allows the source to change (i.e. be restored to a different host) if needed. Note that there are also types of files (symlinks, devices, etc.) that have no contents, in which case we simply migrate the entire object at once.
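Constructing the child’s attribute value is just string concatenation. A minimal sketch, assuming the parent’s SUNWshadow value is already in hand (the function name is hypothetical; the caller frees the result):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build the SUNWshadow attribute value for a directory entry: the
 * parent's relative source path with "/name" appended.  Illustrative
 * only; returns a heap-allocated string, or NULL on failure.
 */
char *
shadow_child_path(const char *parent, const char *name)
{
	size_t len = strlen(parent) + 1 + strlen(name) + 1;
	char *path = malloc(len);

	if (path == NULL)
		return (NULL);
	(void) snprintf(path, len, "%s/%s", parent, name);
	return (path);
}
```

Because the attribute stores a source-relative path rather than a local one, renaming the local file never invalidates it.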

Once the directory has been completely migrated, we remove the SUNWshadow attribute so that future accesses all use the native filesystem. If the process is interrupted (system reset, etc.), the attribute will still be on the parent directory, so we will migrate it again when the user (or the background process) next accesses it.

Migration of files

Migrating a plain file is much simpler. We use the SUNWshadow attribute to locate the file on the source, and then read the source file and write the corresponding data to the local filesystem. In the current software version, this happens all at once, meaning the first access of a large file will have to wait for the entire file to be migrated. Future software versions will remove this limitation and migrate only enough data to satisfy the request, as well as allowing concurrent accesses to the file. Once all the data is migrated, we remove the SUNWshadow attribute and future accesses will go straight to the local filesystem.
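The copy itself can be sketched as an ordinary read/write loop (illustrative only; the real implementation operates on vnodes inside the kernel, and clears the SUNWshadow attribute once the copy succeeds):

```c
#include <stdio.h>

/*
 * Whole-file migration sketch: copy everything from the source stream
 * to the local stream.  Returns the number of bytes migrated, or -1
 * on a short write.  The current software migrates the entire file on
 * first access; a chunked variant would copy only the range needed to
 * satisfy the request.
 */
long
migrate_file(FILE *src, FILE *dst)
{
	char buf[8192];
	size_t n;
	long total = 0;

	while ((n = fread(buf, 1, sizeof (buf), src)) > 0) {
		if (fwrite(buf, 1, n, dst) != n)
			return (-1);	/* short write: migration failed */
		total += n;
	}
	return (total);
}
```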

Dealing with hard links

One issue we knew we’d have to deal with is hard links. Migrating a hard link requires that the same file appear in multiple locations within the filesystem. Obviously, we do not know every reference to a file in the source filesystem, so we need to migrate these links as we discover them. To do this, we have a special directory in the root of the filesystem where we create files named by their source FID. A FID (file ID) is a unique identifier for a file within its filesystem. Each time we encounter a file with a link count greater than 1, we look up the source FID in this special directory. If an entry exists, we create a link to it instead of migrating a new instance of the file. This way, it doesn’t matter if files are moved around, removed from the local filesystem, or additional links are created – we can always recreate a link to the original file. The one wrinkle is that we can migrate from nested source filesystems, so we also need to track the FSID (filesystem ID), which, while not persistent, can be stored in a table and reconstructed from source path information.
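The bookkeeping reduces to lookup-or-insert keyed by source FID. A toy model (fixed-size table, invented names, purely for illustration):

```c
#include <stdbool.h>

/*
 * Toy model of the hard-link directory: files with a link count > 1
 * are recorded by source FID.  On a later encounter we link to the
 * existing entry instead of migrating a second copy of the data.
 */
#define	MAX_LINKS	16

typedef struct linktab {
	unsigned long lt_fids[MAX_LINKS];
	int lt_count;
} linktab_t;

/* Returns true if the FID was already seen (link, don't copy). */
bool
linktab_lookup_or_add(linktab_t *lt, unsigned long fid)
{
	int i;

	for (i = 0; i < lt->lt_count; i++) {
		if (lt->lt_fids[i] == fid)
			return (true);	/* existing entry: create a link */
	}
	if (lt->lt_count < MAX_LINKS)
		lt->lt_fids[lt->lt_count++] = fid;
	return (false);			/* first sighting: migrate the file */
}
```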

Completing migration

A key feature of the shadow migration design is that it treats all accesses the same, and allows background migration of data to be driven from userland, where it’s easier to control policy. The downside is that we need the ability to know when we have finished migrating every single file and directory on the source. Because the local filesystem is actively being modified while we traverse it, it’s impossible to know whether you’ve visited every object based only on walking the directory hierarchy. To address this, we keep a “pending” list of files and directories with shadow attributes. Every object with a shadow attribute must be present in this list, though the list can also contain objects without shadow attributes, or non-existent objects. This allows us to be synchronous when it counts (appending entries) and lazy when it doesn’t (rewriting the file with entries removed). Most of the time, we’ll find all the objects during traversal and the pending list will contain no entries at all. In case we missed an object, we can issue an ioctl() to the kernel to do the work for us. When the list is empty, we know that we are finished and can remove the shadow setting.
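A minimal model of the pending-list semantics (invented names and a fixed-size in-memory table; the point is only that appends are synchronous, removals are lazy, and stale entries are harmless):

```c
#include <stdbool.h>

/*
 * Illustrative model of the pending list.  Every object that still
 * carries a shadow attribute must be listed; entries for objects that
 * have since been migrated (or removed) are tolerated and ignored.
 */
#define	MAX_PENDING	32

typedef struct pending_entry {
	int pe_id;		/* stands in for a file/directory path */
	bool pe_has_attr;	/* does the object still carry SUNWshadow? */
} pending_entry_t;

typedef struct pending {
	pending_entry_t p_ents[MAX_PENDING];
	int p_count;
} pending_t;

/* Synchronous append: done before the attribute is ever set. */
void
pending_append(pending_t *p, int id)
{
	if (p->p_count < MAX_PENDING) {
		p->p_ents[p->p_count].pe_id = id;
		p->p_ents[p->p_count].pe_has_attr = true;
		p->p_count++;
	}
}

/* Lazy removal: just mark migrated; the list is rewritten later. */
void
pending_migrated(pending_t *p, int id)
{
	int i;

	for (i = 0; i < p->p_count; i++) {
		if (p->p_ents[i].pe_id == id)
			p->p_ents[i].pe_has_attr = false;
	}
}

/* Migration is complete when no listed entry still has the attribute. */
bool
pending_done(const pending_t *p)
{
	int i;

	for (i = 0; i < p->p_count; i++) {
		if (p->p_ents[i].pe_has_attr)
			return (false);
	}
	return (true);
}
```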

ZFS integration

The ability to specify the shadow mount option for arbitrary filesystems is useful, but is also difficult to manage. It must be specified as an absolute path, meaning that the remote source of the mount must be tracked elsewhere, and has to be mounted before the filesystem itself. To make this easier, a new ‘shadow’ property was added for ZFS filesystems. This can be set using an abstract URI syntax (“nfs://host/path”), and libzfs will take care of automatically mounting the shadow filesystem and passing the correct absolute path to the kernel. This way, the user can manage a semantically meaningful relationship without worrying about how the internal mechanisms are connected. It also allows us to expand the set of possible sources in the future in a compatible fashion.

Hopefully this provides a reasonable view into how exactly shadow migration works, and the design decisions behind it. The goal is to eventually have this available in Solaris, at which point all the gritty details should be available to the curious.
