Eric Schrock's Blog


In the last few weeks, I’ve been completely re-designing the ZFS commands from the ground up[1]. When I stood back and looked at the current state of the utilities, several glaring deficiencies jumped out at me[2]. I thought I’d use this blog entry to focus on one that is near and dear to me. Having spent a great deal of time with the debugging and observability tools, I’ve invariably focused on answering the question “How do I diagnose and fix a problem when something goes wrong?”. When it comes to command line utilities, the core of this problem is well-designed error messages. To wit, running the following (former) ZFS command demonstrates the number one mistake when reporting error messages:

# zfs create -c pool/foo pool/bar
zfs: Can't create pool/bar: Invalid argument
#

The words “Invalid argument” should never appear as an error message. This means that at some point in the software stack, you were able to determine there was a specific problem with an argument. But in the course of passing that error up the stack, any semantic information about the exact nature of the problem has been reduced to simply EINVAL. In the above case, all we know is that one of the two arguments was invalid for some unknown reason, and we have no way of knowing how to fix it. When choosing to display an error message, you should always take the following into account:

An error message must clearly identify the source of the problem in a way that the user can understand.

An error message must suggest what the user can do to fix the problem.

If you print an error message that the administrator can’t understand or doesn’t suggest what to do, then you have failed and your design is fundamentally broken. All too often, error semantics are given a back seat during the design process. When approaching the ZFS user interface, I made sure that error semantics were a fundamental part of the design document. Every command has complete usage documentation, examples, and every possible error message that can be emitted. By making this part of the design process, I was forced to examine every possible error scenario from the perspective of an administrator.
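
To make this concrete, here is a hypothetical sketch (the names and checks are mine, not the actual ZFS source) of what it looks like when the layer that detects a problem also reports it. The code still knows exactly which argument is bad and why, so it can name the specific cause and suggest a fix instead of reducing everything to EINVAL:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/*
 * Hypothetical sketch, not the actual ZFS source: the layer that
 * detects the problem still knows exactly what is wrong, so it can
 * report the specific cause and a suggested fix instead of reducing
 * everything to EINVAL / "Invalid argument".
 */
static int
create_dataset(const char *name)
{
        const char *p;

        for (p = name; *p != '\0'; p++) {
                if (!isalnum((unsigned char)*p) &&
                    strchr("-_.:/", *p) == NULL) {
                        (void) fprintf(stderr,
                            "cannot create '%s': invalid character "
                            "'%c' in dataset name\n"
                            "dataset names may contain only "
                            "alphanumeric characters, "
                            "'-', '_', '.', and ':'\n", name, *p);
                        return (-1);
                }
        }
        /* ... proceed with the actual creation ... */
        return (0);
}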

A grand vision of proper failure analysis can be seen in the Fault Management Architecture in Solaris 10, part of Predictive Self Healing. A complete explanation of FMA and its ramifications is beyond the scope of a single blog entry, but the basic premise is to move from a series of unrelated error messages to a unified framework of fault diagnosis. Historically, when hardware errors would occur, an arbitrary error message may or may not have been sent to the system log. The error may have been transient (such as an isolated memory CE), or the result of some other fault. Administrators were forced to make costly decisions based on a vague understanding of our hardware failure semantics. When error messages did succeed in describing the problem sufficiently, they invariably failed in suggesting how to fix the problem. With FMA, the sequence of errors is instead fed to a diagnosis engine that is intimately familiar with the characteristics of the hardware, and is able to produce a fault message that both adequately describes the real problem, as well as how to fix it (when it cannot be automatically repaired by FMA).

Such a wide-ranging problem doesn’t necessarily compare to a simple set of command line utilities. A smaller scale example can be seen with the Service Management Facility. When SMF first integrated, it was incredibly difficult to diagnose problems when they occurred[3]. The result, after a few weeks of struggle, was one of the best tools to come out of SMF, svcs -x. If you haven’t tried this command on your Solaris 10 box, you should give it a shot. It does automated gathering of error information and combines it into output that is specific, intelligible, and repair-focused. During development of the final ZFS command line interface, I’ve taken a great deal of inspiration from both svcs -x and FMA. I hope that this is reflected in the final product.

So what does this mean for you? First of all, if there’s any Solaris error message that is unclear or uninformative, that is a bug. There are some rare cases where we have no other choice (because we’re relying on an arbitrary subsystem that can only communicate via errno values), but 90% of the time it’s because the system hasn’t been sufficiently designed with failure in mind.

I’ll also leave you with a few cardinal[4] rules of proper error design beyond the two principles above:

  1. Never distill multiple faults into a single error code. Any error that gets passed between functions or subsystems must be traceable back to a single specific failure.
  2. Stay away from strerror(3c) at all costs. Unless you are truly interfacing with an arbitrary UNIX system, the errno values are rarely sufficient (see the sketch after this list).
  3. Design your error reporting at the same time you design the interface. Put all possible error messages in a single document and make sure they are both consistent and effective.
  4. When possible, perform automated diagnosis to reduce the amount of unimportant data or give the user more specific data to work with.
  5. Distance yourself from the implementation and make sure that any error message makes sense to the average user.
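
As an illustration of rules 1 and 2 (with hypothetical names and error codes, not actual ZFS interfaces), a subsystem can define one error code per distinct failure and map each code to a specific, actionable message at the reporting boundary, rather than funneling everything through errno and strerror(3c):

/*
 * Sketch of rules 1 and 2 with hypothetical names (these are not
 * actual ZFS interfaces): one error code per distinct failure, and a
 * single place that maps each code to a specific, actionable message.
 */
typedef enum {
        ERR_NAME_TOO_LONG,      /* dataset name exceeds the limit */
        ERR_NAME_BAD_CHAR,      /* illegal character in the name */
        ERR_NO_SUCH_PARENT,     /* parent dataset does not exist */
        ERR_POOL_READONLY       /* pool is imported read-only */
} create_err_t;

static const char *
create_err_msg(create_err_t err)
{
        switch (err) {
        case ERR_NAME_TOO_LONG:
                return ("dataset name is too long; use a shorter name");
        case ERR_NAME_BAD_CHAR:
                return ("invalid character in dataset name; use only "
                    "alphanumerics, '-', '_', '.', and ':'");
        case ERR_NO_SUCH_PARENT:
                return ("parent dataset does not exist; create the "
                    "parent first");
        case ERR_POOL_READONLY:
                return ("pool is imported read-only; re-import it "
                    "read-write to make changes");
        }
        return ("unknown error");
}

Because every failure has its own code, the reporting layer can always name the real problem; nothing is ever distilled down to EINVAL along the way.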

[1] No, I cannot tell you when ZFS will integrate, or when it will be available. Sorry.

[2] This is not intended as a jab at the ZFS team. They have been working full steam on the (significantly more complicated) implementation. The commands have grown organically over time, and are beginning to show their age.

[3] Again, this is not meant to disparage the SMF team. There were many more factors here, and all the problems have since been fixed.

[4] “cardinal” might be a stretch here. A better phrase is probably “random list of rules I came up with on the spot”.

There are many bugs out there that are interesting, either because of an implementation detail or the debugging necessary to root cause the problem. As you may have noticed, I like to publicly expound upon the most interesting ones I’ve fixed (as long as it’s not a security vulnerability). This week turned up a rather interesting specimen:

6198523 dirfindvp() can erroneously return ENOENT

This bug was first spotted by Casper back in November last year while trying to do some builds on ZFS. The basic pathology is that at some point during the build, we’d get error messages like:

sh: cannot determine current directory

Some ideas were kicked around by the ZFS team, and when the problem seemed to go away, the team assumed that some recent mass of changes had also fixed this bug. Five months later, Jonathan hit the same bug on another build machine running ZFS. Since I wrote the getcwd() code, I was determined to root cause the problem this time around.

Back in build 56 of S10, I moved getcwd(3c) into the kernel, along with changes to store pathnames with vnodes (which is used by the DTrace I/O provider as well as pfiles(1)). Basically, we first try to do a forward lookup on the stored pathname; if that works, then we simply return the resolved path[1]. If this fails (vnode paths are never guaranteed to be correct), then we have to fall back to the slow path: look up the parent, find the current vnode in the parent, prepend its name to the path, and repeat. Once we reach the root of the filesystem, we have a complete path.
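
For illustration, here is a rough user-level sketch of that slow-path algorithm (my own simplification using readdir(3c) and stat(2), not the in-kernel dirtopath() code, which operates on vnodes):

#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of the slow path: repeatedly stat the parent (".."), scan it
 * for the entry whose (dev, ino) matches the current directory,
 * prepend that name, and repeat until we reach the root (which is its
 * own parent).
 */
static int
slow_getcwd(char *result, size_t len)
{
        char up[PATH_MAX] = ".";
        char path[PATH_MAX] = "";
        char full[PATH_MAX], tmp[PATH_MAX];
        struct stat cur, parent, st;
        struct dirent *de;
        DIR *dp;
        int found;

        if (stat(up, &cur) != 0)
                return (-1);

        for (;;) {
                if (strlen(up) + 4 > sizeof (up))
                        return (-1);
                (void) strcat(up, "/..");
                if (stat(up, &parent) != 0)
                        return (-1);

                /* The root directory is its own parent; we're done. */
                if (parent.st_dev == cur.st_dev &&
                    parent.st_ino == cur.st_ino)
                        break;

                /* The dirfindvp() analogue: find our name in the parent. */
                if ((dp = opendir(up)) == NULL)
                        return (-1);
                found = 0;
                while ((de = readdir(dp)) != NULL) {
                        (void) snprintf(full, sizeof (full), "%s/%s",
                            up, de->d_name);
                        if (stat(full, &st) == 0 &&
                            st.st_dev == cur.st_dev &&
                            st.st_ino == cur.st_ino) {
                                /* Prepend "/name" to the path so far. */
                                (void) snprintf(tmp, sizeof (tmp), "/%s%s",
                                    de->d_name, path);
                                (void) snprintf(path, sizeof (path),
                                    "%s", tmp);
                                found = 1;
                                break;
                        }
                }
                (void) closedir(dp);
                if (!found)
                        return (-1);
                cur = parent;
        }

        (void) snprintf(result, len, "%s", path[0] != '\0' ? path : "/");
        return (0);
}

The "find our name in the parent" step is exactly where dirfindvp() comes in, and where the bug lives.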

To debug this problem, I used this D script to track the behavior of dirtopath(), the function that performs the dirty work of the slow path. Running this for a while produced a tasty bit of information:

dirtopath       /export/ws/build/usr/src/cmd/sgs/ld
lookup(/export/ws/build/usr/src/cmd, .make.dependency.8309dfdc.234596.166) failed (2)
dirfindvp(/export/ws/build/usr/src/cmd,/export/ws/build/usr/src/cmd/sgs) failed (2)
dirtopath() returned 2

Looking at this, it was clear that dirfindvp() (which finds a given vnode in its parent) was inappropriately failing. In particular, after a failed lookup for a temporary make file, we bail out of the loop and report failure, despite the fact that “sgs” is still sitting there in the directory. A long look at the code revealed the problem. Without giving away too much of the code (OpenSolaris, where are you?), it’s essentially structured like so:

while (!err && !eof) {
        /* ... */
        while ((intptr_t)dp < (intptr_t)dbuf + dbuflen) {
                /* ... */
                /*
                 * We only want to bail out if there was an error other
                 * than ENOENT.  Otherwise, it could be that someone
                 * just removed an entry since the readdir() call, and
                 * the entry we want is further on in the directory.
                 */
                if (err != ENOENT) {
                        break;
                }
        }
}

The code is trying to avoid exactly our situation: we fail to do a lookup of a file we just saw because the contents are rapidly changing. The bug is that the outer while loop checks !err && !eof. If we fail to look up an entry, and it’s the last entry in the chunk we just read, then we’ll prematurely bail out of the enclosing while loop, returning ENOENT when we shouldn’t. Using this test program, it’s easy to reproduce on both ZFS and UFS. There are several noteworthy aspects of this bug:

  • The bug had been in the gate for over a year, and there hadn’t been a single reported build failure.

  • It only happens when the cached vnode value is invalid, which is rare[2].

  • It is a race condition between readdir, lookup, and remove.

  • On UFS, inodes are marked as deleted but can still be looked up until the delete queue is processed at a later point. ZFS deletes entries immediately, so this was much more apparent on ZFS.

  • Because of the above, it was incredibly transient. It would have taken an order of magnitude more time to root cause if not for DTrace, which excels at examining these transient phenomena.

A three-line change fixed the bug, and the fix will make it back to S10 in time for Update 1. If it hadn’t been for those among us willing to run our builds on top of ZFS, this problem wouldn’t have been found until ZFS integrated, or until a customer escalation cost the company a whole bunch of money.
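
For the curious, the fix amounts to something of this shape (a hypothetical sketch mirroring the excerpt above, not the literal Solaris change): if the inner scan ends with ENOENT, the entry we failed to look up may simply have been removed between the readdir() and the lookup, so we clear the error and keep reading the directory rather than letting it terminate the outer loop.

while (!err && !eof) {
        /* ... read the next chunk of directory entries ... */
        while ((intptr_t)dp < (intptr_t)dbuf + dbuflen) {
                /* ... look up each entry, setting err ... */
                if (err != ENOENT) {
                        break;
                }
        }
        if (err == ENOENT) {
                /* Stale entry; the one we want may be further on. */
                err = 0;
        }
}
/* Only if we reach EOF without finding the entry do we return ENOENT. */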


[1] There are many more subtleties here relating to Zones, and verifying that the path hasn’t been changed to refer to another file. The curious among you will have to wait for OpenSolaris.

[2] I haven’t yet investigated why we ended up in the slow path in this case. First things first.
