Eric Schrock's Blog

I’ve been heads down for a long time on a new project, but occasionally I do put something back to ON worth blogging about. Recently I’ve been working on some problems which leverage sysevents (libsysevent(3LIB)) as a common transport mechanism. While trying to understand exactly what sysevents were being generated from where, I found the lack of observability astounding. Even after poking around with DTrace, tracking down the exact semantics was not straightforward. First of all, there are two orthogonal sysevent mechanisms: the original legacy syseventd mechanism, and the more recent general purpose event channel (GPEC) mechanism used by FMA. On top of this, the sysevent_impl_t structure isn’t easy to pick apart, because all the data is packed together in a single block of memory. Knowing that this would be important for my upcoming work, I decided that adding a stable DTrace sysevent provider would be useful.

The provider has a single probe, sysevent:::post, which fires whenever a sysevent post attempt is made. It doesn’t necessarily indicate that the sysevent was successfully queued or received. The probe has the following semantics:

# dtrace -lvP sysevent
   ID   PROVIDER            MODULE                          FUNCTION NAME
44528   sysevent            genunix                   queue_sysevent post

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: syseventchaninfo_t *
                args[1]: syseventinfo_t *

The ‘syseventchaninfo_t’ translator has a single member, ‘ec_name’, which is the name of the event channel. If the event is being posted via the legacy sysevent mechanism, this member will be NULL. The ‘syseventinfo_t’ translator has three members, ‘se_publisher’, ‘se_class’, and ‘se_subclass’, which mirror the arguments to sysevent_post(). The following script will dump all sysevents posted to syseventd(1M):

#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN
{
        printf("%-30s  %-20s  %s\n", "PUBLISHER", "CLASS",
            "SUBCLASS");
}

sysevent:::post
/args[0]->ec_name == NULL/
{
        printf("%-30s  %-20s  %s\n", args[1]->se_publisher,
            args[1]->se_class, args[1]->se_subclass);
}

And the output during a cfgadm -c unconfigure:

PUBLISHER                       CLASS                 SUBCLASS
SUNW:usr:devfsadmd:100237       EC_dev_remove         disk
SUNW:usr:devfsadmd:100237       EC_dev_branch         ESC_dev_branch_remove
SUNW:kern:ddi                   EC_devfs              ESC_devfs_devi_remove
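
The same probe fires for events posted to GPEC channels as well; the only difference is that ‘ec_name’ is non-NULL. A minimal sketch of a one-liner (built on the same translator members described above) that prints the channel name alongside each event might look like:

# dtrace -qn 'sysevent:::post /args[0]->ec_name != NULL/
{ printf("%-20s  %-30s  %-20s  %s\n", args[0]->ec_name,
    args[1]->se_publisher, args[1]->se_class, args[1]->se_subclass); }'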

This has already proven quite useful in my ongoing work, and hopefully some other developers out there will also find it useful.

I’ve been meaning to get around to blogging about these features that I
putback a while ago, but have been caught up in a few too many things.
In any case, the following new ZFS features were putback to build 48 of
Nevada, and should be available in the next Solaris Express release.

Create Time Properties

An old RFE has been to provide a way to specify properties at create
time. For users, this simplifies administration by reducing the number
of commands which need to be run. It also eliminates some race
conditions. For example, if you wanted to create a new dataset with a
mountpoint of ‘none’, you previously had to create it (mounting it at
its inherited mountpoint in the process), only to undo that afterwards
by invoking ‘zfs set mountpoint=none’.
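
With create-time properties, the same thing collapses into a single
command (a sketch, using a hypothetical dataset name):

# zfs create -o mountpoint=none tank/container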

From an implementation perspective, this allows us to unify our
implementation of the ‘volsize’ and ‘volblocksize’ properties, and pave
the way for future create-time only properties. Instead of having a
separate ioctl() to create a volume and passing in the two size
parameters, we simply pass them down as create-time options.
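
For example, with volsize and volblocksize treated as create-time
properties, a volume with a non-default block size can be created in one
step (a sketch; the pool name, dataset name, and sizes are made up):

# zfs create -V 10g -o volblocksize=8k tank/vol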

The end result is pretty straightforward:

# zfs create -o compression=on tank/home
# zfs create -o mountpoint=/export -o atime=off tank/export

‘canmount’ property

The ‘canmount’ property allows you to create a ZFS dataset that serves
solely as a mechanism for inheriting properties. When we first created the
hierarchical dataset model, we had the notion of ‘containers’:
filesystems with no associated data. Only these datasets could contain
other datasets, and you had to make the decision at create time.

This turned out to be a bad idea for a number of reasons. It
complicated the CLI, forced the user to make a create-time decision that
could not be changed, and led to confusion when files were accidentally
created on the underlying filesystem. So we made every filesystem able
to have child filesystems, and all seemed well.

However, there is power in having a dataset that exists in the hierarchy
but has no associated filesystem data (or effectively none, by
preventing it from being mounted). One can do this today by setting the
‘mountpoint’ property to ‘none’. However, this property is inherited by
child datasets, and the administrator cannot leverage the power of
inherited mountpoints. In particular, some users have expressed a desire
to have two sets of directories, belonging to different ZFS parents (or
even to UFS filesystems), share the same inherited directory. With the
new ‘canmount’ property, this becomes trivial:

# zfs create -o mountpoint=/export -o canmount=off tank/accounting
# zfs create -o mountpoint=/export -o canmount=off tank/engineering
# zfs create tank/accounting/bob
# zfs create tank/engineering/anne

Now, both anne and bob have directories under ‘/export’ (at
‘/export/anne’ and ‘/export/bob’), yet they inherit ZFS properties from
different datasets in the hierarchy. The administrator may decide to
turn compression on for one group but not the other, set a quota to
limit the amount of space consumed by each group, or simply view the
total amount of space consumed by each group without resorting to
scripted du(1).
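
For example, the per-group policies mentioned above might look something
like this (a sketch; the quota value is arbitrary):

# zfs set compression=on tank/engineering
# zfs set quota=100g tank/accounting
# zfs list -o name,used,quota tank/accounting tank/engineering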

User Defined Properties

The last major RFE in this wad added the ability to set arbitrary
properties on ZFS datasets. This gives administrators a way to annotate
their own filesystems, and lets ISVs layer intelligent software on top
of ZFS without having to modify the ZFS code to introduce a new
property.

A user-defined property name is one which contains a colon (:). This
provides a unique namespace which is guaranteed to not overlap with
native ZFS properties. The emphasis is to use the colon to separate a
module and property name, where ‘module’ should be a reverse DNS name.
For example, a theoretical Sun backup product might do:

# zfs set com.sun.sunbackup:frequency=1hr tank/home

The property value is an arbitrary string, and no additional validation
is done on it. These values are always inherited. A local administrator
might do:

# zfs set localhost:backedup=9/19/06 tank/home
# zfs list -o name,localhost:backedup
NAME            LOCALHOST:BACKEDUP
tank            -
tank/home       9/19/06
tank/ws         9/10/06

The hope is that this will serve as a basis for some innovative products
and home grown solutions which interact with ZFS datasets in a
well-defined manner.
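
Since user properties are inherited and retrieved like native ones, a
backup product or home-grown script could check its own annotations
recursively with something like:

# zfs get -r localhost:backedup tank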

More exciting news on the ZFS OpenSolaris front. In addition to the existing ZFS on FUSE/Linux work, we now have a second active port of ZFS, this time to FreeBSD. Pawel Dawidek has been hard at work, and has made astounding progress after just 10 days (!). This is a testament both to his ability and to the portability of ZFS. As with any port, the hard part comes down to integrating the VFS layer, but Pawel has already made good progress there. The current prototype can already mount filesystems, create files, and list directory contents. Of course, our code isn’t completely without portability headaches, but thanks to Pawel (and Ricardo on FUSE/Linux), we can take patches and implement the changes upstream to ease future maintenance. You can find the FreeBSD repository here. If you’re a FreeBSD developer or user, please give Pawel whatever support you can, whether it’s code contributions, testing, or just plain old compliments. We’ll be helping out where we can on the OpenSolaris side.

In related news, Ricardo Correia has made significant progress on the FUSE/Linux port. All the management functionality of zfs(1M) and zpool(1M) is there, and he’s working on mounting ZFS filesystems. All in all, it’s an exciting time, and we’re all crossing our fingers that ZFS will follow in the footsteps of its older brother DTrace.

As Jeff mentioned previously, Ricardo Correia has been working on porting ZFS to FUSE/Linux as part of Google SoC. Last week, Ricardo got libzpool and ztest running on Linux, which is a major first step of the project.

The interesting part is the set of changes that he had to make in order to get it working. libzpool was designed from the start to run in both userland and the kernel, so we had already done most of the work of separating out the OS-dependent interfaces. The most numerous changes were to satisfy GCC warnings. We do compile ON with gcc, but not using the default options. I’ve since updated the ZFS porting page with info about gcc use in ON, which should make future ports easier. The second most common change involved header files that are available in both userland and the kernel on Solaris, but should nevertheless be pulled in via zfs_context.h, concentrating platform-specific knowledge in that one file. Finally, there were some simple changes we could make (such as using pthread_create() instead of thr_create()) to make ports of the tools easier. It would also be helpful to have ports of libnvpair and libavl, much like some have done for libumem, so that developers don’t have to port the same libraries over and over.

The next step (getting zfs(1M) and zpool(1M) working) is going to require significantly more changes to our source code. Unlike libzpool, these tools (libzfs in particular) were not designed to be portable. They include a number of Solaris specific interfaces (such as zones and NFS shares) that will be totally different on other platforms. I look forward to seeing Ricardo’s progress to know how this will work out.

It’s been a long time since the last time I wrote a blog entry. I’ve been working heads-down on a new project and haven’t had the time to keep up my regular blogging. Hopefully I’ll be able to keep something going from now on.

Last week the ZFS team put the following back to ON:

PSARC 2006/223 ZFS Hot Spares
PSARC 2006/303 ZFS Clone Promotion
6276916 support for "clone swap"
6288488 du reports misleading size on RAID-Z
6393490 libzfs should be a real library
6397148 fbufs debug code should be removed from buf_hash_insert()
6405966 Hot Spare support in ZFS
6409302 passing a non-root vdev via zpool_create() panics system
6415739 assertion failed: !(zio->io_flags & 0x00040)
6416759 ::dbufs does not find bonus buffers anymore
6417978 double parity RAID-Z a.k.a. RAID6
6424554 full block re-writes need not read data in
6425111 detaching an offline device can result in import confusion

There are a couple of cool features mixed in here. Most importantly, hot spares, clone swap, and double-parity RAID-Z. I’ll focus this entry on hot spares, since I wrote the code for that feature. If you want to see the original ARC case and some of the discussion behind the feature, you should check out the original zfs-discuss thread.

The following features make up hot spare support:

Associating hot spares with pools

Hot spares can be specified when creating a pool or adding devices by using the spare vdev type. For example, you could create a mirrored pool with a single hot spare by doing:

# zpool create test mirror c0t0d0 c0t1d0 spare c0t2d0
# zpool status test
pool: test
state: ONLINE
scrub: none requested
config:
NAME        STATE     READ WRITE CKSUM
test        ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c0t0d0  ONLINE       0     0     0
    c0t1d0  ONLINE       0     0     0
spares
  c0t2d0    AVAIL
errors: No known data errors

Notice that there is one spare, and it is currently available for use. Spares can be shared between multiple pools, allowing a single set of global spares on systems with multiple pools.
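
For example, the same disk could also be made available to a second pool (a sketch, assuming a hypothetical pool named ‘test2’):

# zpool add test2 spare c0t2d0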

Replacing a device with a hot spare

There is now an FMA agent, zfs-retire, which subscribes to vdev failure faults and automatically initiates replacements if there are any hot spares available. But if you want to play around with this yourself (without forcibly faulting drives), you can just use ‘zpool replace’. For example:

# zpool offline test c0t0d0
Bringing device c0t0d0 offline
# zpool replace test c0t0d0 c0t2d0
# zpool status test
pool: test
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scrub: resilver completed with 0 errors on Tue Jun  6 08:48:41 2006
config:
NAME          STATE     READ WRITE CKSUM
test          DEGRADED     0     0     0
  mirror      DEGRADED     0     0     0
    spare     DEGRADED     0     0     0
      c0t0d0  OFFLINE      0     0     0
      c0t2d0  ONLINE       0     0     0
    c0t1d0    ONLINE       0     0     0
spares
  c0t2d0      INUSE     currently in use
errors: No known data errors

Note that the offline is optional, but it helps visualize what the pool would look like should an actual device fail. Even though the resilver has completed, the ‘spare’ vdev stays in place (unlike a ‘replacing’ vdev), because the replacement is only temporary. Once the original device is replaced, the spare will be returned to the pool.

Relieving a hot spare

A hot spare can be returned to its previous state by replacing the original faulted drive. For example:

# zpool replace test c0t0d0 c0t3d0
# zpool status test
pool: test
state: DEGRADED
scrub: resilver completed with 0 errors on Tue Jun  6 08:51:49 2006
config:
NAME             STATE     READ WRITE CKSUM
test             DEGRADED     0     0     0
  mirror         DEGRADED     0     0     0
    spare        DEGRADED     0     0     0
      replacing  DEGRADED     0     0     0
        c0t0d0   OFFLINE      0     0     0
        c0t3d0   ONLINE       0     0     0
      c0t2d0     ONLINE       0     0     0
    c0t1d0       ONLINE       0     0     0
spares
  c0t2d0         INUSE     currently in use
errors: No known data errors
# zpool status test
pool: test
state: ONLINE
scrub: resilver completed with 0 errors on Tue Jun  6 08:51:49 2006
config:
NAME        STATE     READ WRITE CKSUM
test        ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c0t3d0  ONLINE       0     0     0
    c0t1d0  ONLINE       0     0     0
spares
  c0t2d0    AVAIL
errors: No known data errors

The drive is actively being replaced for a short period of time. Once the replacement is completed, the old device is removed, and the hot spare is returned to the list of available spares. If you want a hot spare replacement to become permanent, you can zpool detach the original device, at which point the spare will be removed from the hot spare list of any active pools. You can also zpool detach the spare itself to cancel the hot spare operation.
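
Referring back to the degraded configuration above, where c0t2d0 was sparing for c0t0d0, the two alternatives would look something like this (one or the other, not both):

# zpool detach test c0t0d0
# zpool detach test c0t2d0

The first makes the spare replacement permanent; the second cancels it and resumes using the original device.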

Removing a spare from a pool

To remove a hot spare from a pool, simply use the zpool remove command. For example:

# zpool remove test c0t2d0
# zpool status
pool: test
state: ONLINE
scrub: resilver completed with 0 errors on Tue Jun  6 08:51:49 2006
config:
NAME        STATE     READ WRITE CKSUM
test        ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c0t3d0  ONLINE       0     0     0
    c0t1d0  ONLINE       0     0     0
errors: No known data errors

Unfortunately, we don’t yet support removing anything other than hot spares (it’s on our list, we swear). But you can see how hot spares naturally fit into the existing ZFS scheme. Keep in mind that to use hot spares, you will need to upgrade your pools (via ‘zpool upgrade’) to version 3 or later.
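
The upgrade itself is a single command per pool, for example:

# zpool upgrade test

(‘zpool upgrade -v’ lists the versions supported by your software and the features each one adds.)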

Next Steps

Despite the obvious usefulness of this feature, one more step needs to be done for it to be truly useful: phase two of the ZFS/FMA integration. Currently, a drive is only considered faulted if it ‘goes away’ completely (i.e. ldi_open() fails). This covers only a subset of known drive failure modes. It’s possible for a drive to continually return errors and yet still be openable. The next phase of ZFS and FMA will introduce a more intelligent diagnosis engine that watches I/O and checksum errors, as well as the SMART predictive failure bit, in order to proactively offline devices when they are experiencing an abnormal number of errors or appear to be headed toward failure. With this functionality, ZFS will be able to respond better to failing drives, making hot spare replacement much more valuable.

What a party.

I can’t say much more, because I’m afraid I won’t do it justice. Suffice it to say that we rolled back into our hotel room at 5:00 AM after hanging out with Swedish royalty and Nobel laureates in what has to be one of the most amazing ceremonies and banquets ever conceived.

You’ll have to ask me in person for some of the details, but a few highlights include my father escorting Princess Madeleine down the staircase, and my mother being escorted by (and talking with) the King of Sweden at the more private dinner the next night. Not to mention way too many late night drinking escapades with the likes of Grubbs and Nocera.

The only downside is that my flight home was delayed 5 hours (while we were on the plane), so I missed my connection and am now hanging out in a Newark hotel for a night. At least I get a midway point to adjust to the new timezone…

Hopefully the banquet footage will be available soon at nobelprize.org. Check it out when it is.
