Eric Schrock's Blog

It’s hard to believe that this day has finally come. After more
than two and a half years, our first Fishworks-based product has been
released. You can keep up to date with the latest info at the
Fishworks blog.

For my first technical post, I’d thought I’d give an introduction to
the chassis subsystem at the heart of our hardware integration strategy. This
subsystem is responsible for gathering, cataloging, and presenting a unified
view of the hardware topology. It
underwent two major rewrites (one by myself and one by Keith) but the
fundamental design has remained the same. While it
may not be the most glamorous feature (no one’s going to purchase a box
because they can get model information on their DIMMs), I found it an
interesting cross-section of disparate technologies and awash in subtle
complexity. You can find a video of myself talking about and
demonstrating this feature

libtopo discovery

At the heart of the chassis subsystem is the FMA topology as
exported by
This library is already
capable of enumerating hardware in a physically meaningful manner, and
FMRIs (fault managed resource identifiers) form the basis of
FMA fault diagnosis. This alone provides us the following basic

  • Discover external storage enclosures
  • Identify bays and disks
  • Identify CPUs
  • Identify power supplies and fans
  • Manage LEDs
  • Identify PCI functions beneath a particular slot

Much of this requires platform-specific XML files, or leverages IPMI
behind the scenes, but this minimal integration work is common to Solaris. Any
platform supported by Solaris is supported by the FishWorks software

Additional metadata

Unfortunately, this falls short of a complete picture:

  • No way to identify absent CPUs, DIMMs, or empty PCI slots
  • DIMM enumeration not supported on all platforms
  • Human-readable labels often wrong or missing
  • No way to identify complete PCI cards
  • No integration with visual images of the chassis

To address these limitations (most of which lie outside the purview of
libtopo), we leverage additional metadata for each supported
chassis. This metadata identifies all physical slots (even those that
may not be occupied), cleans up various labels, and includes visual
information about the chassis and its components. And
we can identify physical cards based on devinfo properties extracted
from firmware and/or the pattern of PCI functions and their attributes
(a process worthy of its own blog entry). Combined with libtopo, we have
images that we can assemble into a
complete view based on the current physical layout, highlight
components within the image, and respond to user mouse clicks.

Supplemental information

However, we are still missing many of
the component details. Our goal is to be able to provide complete
information for every FRU on the system. With just libtopo, we can get
this for disks but not much else. We need to look to alternate
sources of information.


For CPUs, there is a rather rich set of information available via
traditional kstat interfaces. While we use libtopo to identify CPUs
(it lets us correlate physical CPUs), the
bulk of the information comes from kstats. This is used to get model,
speed, and the number of cores.


The device tree snapshot provides additional information for PCI
devices that can only be retrieved by private driver interfaces.
Despite the existence of a VPD (Vital Product Data)
standard, effectively no vendors implement it. Instead, it is read by some firmware-specific
mechanism private to the driver. By exporting these as properties in
the devinfo snapshot, we can transparently pull in dynamic FRU
information for PCI cards. This is used to get model, part, and
revision information for HBAs and 10G NICs.


IPMI (Intelligent Platform Management Interface) is used to
communicate with the service processor on most enterprise class
systems. It is used within libtopo for power supply and fan
enumeration in libtopo as well as LED management. But IPMI
also supports FRU data, which includes a lot of juicy tidbits
that only the SP knows. We reference this FRU information directly to
get model and part information for power supplies and DIMMs.


Even with IPMI, there are bits of information that exist only in SMBIOS,
a standard is supposed to provide information about the physical
resources on the system. Sadly, it does not provide enough information
to correlate OS-visible abstractions with their underlying physical
counterparts. With metadata, however, we can use SMBIOS to make this
correlation. This is used to enumerate DIMMs on platforms not
supported by libtopo, and to supplement DIMM information with data
available only via SMBIOS.


Last but not least, there is chassis-specific metadata. Some
components simply don’t have FRUID information, either because they are
too simple (fans) or there exists no mechanism to get the information
(most PCI cards). In this situation, we use metadata to provide
vendor, model, and part information as that is generally static for a
particular component within the system. We cannot get information
specific to the component (such as a serial number), but at least the
user will be able to know what it is and know how to order another

Putting it all together

With all of this information tied together under one subsystem, we
can finally present the user complete information about their hardware,
including images showing the physical layout of the system. In addition,
this also forms the basis for reporting problems and analytics (using
labels from metadata), manipulating chassis state (toggling LEDs, setting
chassis identifiers), and making programmatic distinctions about the
hardware (such as whether external HBAs are present). Over the
next few weeks I hope to expound on some of these details in further
blog posts.

Last week, Rob Johnston and I coordinated two putbacks to Solaris to further the cause of Solaris platform integration, this time focusing on sensors and indicators. Rob has a great blog post with an overview of the new sensor abstraction layer in libtopo. Rob did most of the hard work- my contribution consisted only of extending the SES enumerator to support the new facility infrastructure.

You can find a detailed description of the changes in the original FMA portfolio here, but it’s much easier to understand via demonstration. This is the fmtopo output for a fan node in a J4400 JBOD:

group: protocol                       version: 1   stability: Private/Private
resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
label             string    Cooling Fan  0
FRU               fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
group: authority                      version: 1   stability: Private/Private
product-id        string    SUN-Storage-J4400
chassis-id        string    2029QTF0000000005
server-id         string
group: ses                            version: 1   stability: Private/Private
node-id           uint64    0x1f
target-path       string    /dev/es/ses3
group: protocol                       version: 1   stability: Private/Private
resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=ident
group: authority                      version: 1   stability: Private/Private
product-id        string    SUN-Storage-J4400
chassis-id        string    2029QTF0000000005
server-id         string
group: facility                       version: 1   stability: Private/Private
type              uint32    0x1 (LOCATE)
mode              uint32    0x0 (OFF)
group: ses                            version: 1   stability: Private/Private
node-id           uint64    0x1f
group: protocol                       version: 1   stability: Private/Private
resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=fail
group: authority                      version: 1   stability: Private/Private
product-id        string    SUN-Storage-J4400
chassis-id        string    2029QTF0000000005
server-id         string
group: facility                       version: 1   stability: Private/Private
type              uint32    0x0 (SERVICE)
mode              uint32    0x0 (OFF)
group: ses                            version: 1   stability: Private/Private
node-id           uint64    0x1f
group: protocol                       version: 1   stability: Private/Private
resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
group: authority                      version: 1   stability: Private/Private
product-id        string    SUN-Storage-J4400
chassis-id        string    2029QTF0000000005
server-id         string
group: facility                       version: 1   stability: Private/Private
sensor-class      string    threshold
type              uint32    0x4 (FAN)
units             uint32    0x12 (RPM)
reading           double    3490.000000
state             uint32    0x0 (0x00)
group: ses                            version: 1   stability: Private/Private
node-id           uint64    0x1f
group: protocol                       version: 1   stability: Private/Private
resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
group: authority                      version: 1   stability: Private/Private
product-id        string    SUN-Storage-J4400
chassis-id        string    2029QTF0000000005
server-id         string
group: facility                       version: 1   stability: Private/Private
sensor-class      string    discrete
type              uint32    0x103 (GENERIC_STATE)
state             uint32    0x1 (DEASSERTED)
group: ses                            version: 1   stability: Private/Private
node-id           uint64    0x1f

Here you can see the available indicators (locate and service), the fan speed (3490 RPM) and if the fan is faulted. Right now this is just interesting data for savvy administrators to play with, as it’s not used by any software. But that will change shortly, as we work on the next phases:

  • Monitoring of sensors to detect failure in external components which have no visibility in Solaris outside libtopo, such as power supplies and fans. This will allow us to generate an FMA fault when a power supply or fan fails, regardless of whether it’s in the system chassis or an external enclosure.
  • Generalization of the disk-monitor fmd plugin to support arbitrary disks. This will control the failure indicator in response to FMA-diagnosed faults.
  • Correlation of ZFS faults with the associated physical disk. Currently, ZFS faults are against a “vdev” – a ZFS-specific construct. The user is forced to translate from this vdev to a device name, and then use the normal (i.e. painful) methods to figure out which physical disk was affected. With a little work it’s possible to include the physical disk in the FMA fault to avoid this step, and also allow the fault LED to be controlled in response to ZFS-detected faults.
  • Expansion of the SCSI framework to support native diagnosis of faults, instead of a stream of syslog messages. This involves generating telemetry in a way that can be consumed by FMA, as well as a diagnosis engine to correlate these ereports with an associated fault.

Even after we finish all of these tasks and reach the nirvana of a unified storage management framework, there will still be lots of open questions about how to leverage the sensor framework in interesting ways, such as a prtdiag-like tool for assembling sensor information, or threshold alerts for non-critical warning states. But with these latest putbacks, it feels like our goals from two years ago are actually within reach, and that I will finally be able to turn on that elusive LED.

Over the past few years, I’ve been working on various parts of Solaris platform integration, with an emphasis on disk monitoring. While the majority of my time has been focused on fishworks, I have managed to implement a few more pieces of the original design.

About two months ago, I integrated the libscsi and libses libraries into Solaris Nevada. These libraries, originally written by Keith Wesolowski, form an abstraction layer upon which higher level software can be built. The modular nature of libses makes it easy to extend with vendor-specific support libraries in order to provide additional information and functionality not present in the SES standard, something difficult to do with the kernel-based ses(7d) driver. And since it is written in userland, it is easy to port to other operating systems. This library is used as part of the fwflash firmware upgrade tool, and will be used in future Sun storage management products.

While libses itself is an interesting platform, it’s true raison d’etre is to serve as the basis for enumeration of external enclosures as part of libtopo. Enumeration of components in a physically meaningful manner is a key component of the FMA strategy. These components form FMRIs (fault managed resource identifiers) that are the target of diagnoses. These FMRIs provide a way of not just identifying that “disk c1t0d0 is broken”, but that this device is actually in bay 17 of the storage enclosure whose chassis serial number is “2029QTF0809QCK012”. In order to do that effectively, we need a way to discover the physical topology of the enclosures connected to the system (chassis and bays) and correlate it with the in-band I/O view of the devices (SAS addresses). This is where SES (SCSI enclosure services) comes into play. SES processes show up as targets in the SAS fabric, and by using the additional element status descriptors, we can correlate physical bays with the attached devices under Solaris. In addition, we can also enumerate components not directly visible to Solaris, such as fans and power supplies.

The SES enumerator was integrated in build 93 of nevada, and all of these components now show up in the libtopo hardware topology (commonly referred to as the “hc scheme”). To do this, we walk over al the SES targets visible to the system, grouping targets into logical chassis (something that is not as straightforward as it should be). We use this list of targets and a snapshot of the Solaris device tree to fill in which devices are present on the system. You can see the result by running fmtopo on a build 93 or later Solaris machine:

# /usr/lib/fm/fmd/fmtopo

To really get all the details, you can use the ‘-V’ option to fmtopo to dump all available properties:

# fmtopo -V '*/ses-enclosure=0/bay=0/disk=0'
TIME                 UUID
Jul 14 03:54:23 3e95d95f-ce49-4a1b-a8be-b8d94a805ec8
group: protocol                       version: 1   stability: Private/Private
resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
ASRU              fmri      dev:///:devid=id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________//scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
label             string    SCSI Device  0
FRU               fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
group: authority                      version: 1   stability: Private/Private
product-id        string    SUN-Storage-J4400
chassis-id        string    2029QTF0809QCK012
server-id         string
group: io                             version: 1   stability: Private/Private
devfs-path        string    /scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
devid             string    id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________
phys-path         string[]  [ /pci@0,0/pci10de,377@a/pci1000,3150@0/disk@1c,0 /pci@0,0/pci10de,375@f/pci1000,3150@0/disk@1c,0 ]
group: storage                        version: 1   stability: Private/Private
logical-disk      string    c0tATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3Xd0
manufacturer      string    SEAGATE
model             string    ST37500NSSUN750G 0720A0PC3X
serial-number     string    5QD0PC3X
firmware-revision string       3.AZK
capacity-in-bytes string    750156374016

So what does this mean, other than providing a way for you to finally figure out where disk ‘c3t0d6’ is really located? Currently, it allows the disks to be monitored by the disk-transport fmd module to generate faults based on predictive failure, over temperature, and self-test failure. The really interesting part is where we go from here. In the near future, thanks to work by Rob Johnston on the sensor framework, we’ll have the ability to manage LEDs for disks that are part of external enclosures, diagnose failures of power supplies and fans, as well as the ability to read sensor data (such as fan speeds and temperature) as part of a unified framework.

I often like to joke about the amount of time that I have spent just getting a single LED to light. At first glance, it seems like a pretty simple task. But to do it in a generic fashion that can be generalized across a wide variety of platforms, correlated with physically meaningful labels, and incorporate a diverse set of diagnoses (ZFS, SCSI, HBA, etc) requires an awful lot of work. Once it’s all said and done, however, future platforms will require little to no integration work, and you’ll be able to see a bad drive generate checksum errors in ZFS, resulting in a FMA diagnosis indicating the faulty drive, activate a hot spare, and light the fault LED on the drive bay (wherever it may be). Only then will we have accomplished our goal of an end-to-end storage strategy for Solaris – and hopefully someone besides me will know what it has taken to get that little LED to light.

For those of you who have been following my recent work with Solaris platform integration, be sure to check out the work Cindi and the FMA team are doing as part of the Sensor Abstraction Layer project. Cindi recently posted an initial version of the Phase 1 design document. Take a look if you’re interested in the details, and join the discussion if you’re interested in defining the Solaris platform experience.

The implications of this project for unified platform integration are obvious. With respect to what I’ve been working on, you’ll likely see the current disk monitoring infrastructure converted into generic sensors, as well as the sfx4500-disk LED support converted into indicators. I plan to leverage this work as well as the SCSI FMA work to enable correlated ZFS diagnosis across internal and external storage.

Two weeks ago I putback PSARC 2007/202, the second step in generalizing the x4500 disk monitor. As explained in my previous blog post, one of the tasks of the original sfx4500-disk module was reading SMART data from disks and generating associated FMA faults. This platform-specific functionality needed to be generalized to effectively support future Sun platforms.

This putback did not add any new user-visible features to Solaris, but it did refactor the code in the following ways:

  • A new private library, libdiskstatus, was added. This generic library uses uSCSI to read data from SCSI (or SATA via emulation) devices. It is not a generic SMART monitoring library, focusing only on the three generally available disk faults: over temperature, predictive failure, and self-test failure. There is a single function, disk_status_get() that reurns an nvlist describing the current parameters reported by the drive and whether any faults are present.

  • This library is used by the SATA libtopo module to export a generic TOPO_METH_DISK_STATUS method. This method keeps all the implementation details within libtopo and exports a generic inerface for consumers.

  • A new fmd module, disk-transport, periodically iterates over libtopo nodes and invokes the TOPO_METH_DISK_STATUS method on any supported nodes. The module generates FMA ereports for any detected errors.

  • These ereports are translated to faults by a simple eversholt DE. These are the same faults that were originally generated by the sfx4500-disk module, so the code that consumes them remains unchanged.

These changes form the foundation that will allow future Sun platforms to detect and react to disk failures, eliminating 5200 lines of platform-specific code in the process. The next major steps are currently in progress:

The FMA team, as part of the sensor framework, is expanding libtopo to include the ability to represent indicators (LEDs) in a generic fashion. This will replace the x4500 specific properties and associated machinery with generic code.

The SCSI FMA team is finalizing the libtopo enumeration work that will allow arbitrary SCSI devices (not just SATA) to be enumerated under libtopo and therefore be monitored by the disk-transport module. The first phase will simply replicate the existing sfx4500-disk functionality, but will enable us to model future non-SATA platforms as well as external storage devices.

Finally, I am finishing up my long-overdue ZFS FMA work, a necessary step towards connecting ZFS and disk diagnosis. Stay tuned for more info.

As I continue down the path of improving various aspects of ZFS and Solaris platform integration, I found myself in the thumper (x4500) fmd platform module. This module represents the latest attempt at Solaris platform integration, and an indication of where we are headed in the future.

When I say “platform integration”, this is more involved than the platform support most people typically think of. The platform teams make sure that the system boots and that all the hardware is supported properly by Solaris (drivers, etc). Thanks to the FMA effort, platform teams must also deliver a FMA portfolio which covers FMA support for all the hardware and a unified serviceability plan. Unfortunately, there is still more work to be done beyond this, of which the most important is interacting with hardware in response to OS-visible events. This includes ability to light LEDs in response to faults and device hotplug, as well as monitoring the service processor and keeping external FRU information up to date.

The sfx4500-disk module is the latest attempt at providing this functionality. It does the job, but is afflicted by the same problems that often plague platform integration attempts. It’s overcomplicated, monolithic, and much of what it does should be generic Solaris functionality. Among the things this module does:

  • Reads SMART data from disks and creates ereports
  • Diagnoses ereports into corresponding disk faults
  • Implements an IPMI interface directly on top of /dev/bmc
  • Responds to disk faults by turning on the appropriate ‘fault’ disk LED
  • Listens for hotplug and DR events, updating the ‘ok2rm’ and ‘present’ LEDs
  • Updates SP-controlled FRU information
  • Monitors the service process for resets and resyncs necessary information

Needless to say, every single item on the above list is applicable to a wide variety of Sun platforms, not just the x4500, and it certainly doesn’t need to be in a single monolithic module. This is not meant to be a slight against the authors of the module. As with most platform integration activities, this effort wasn’t communicated by the hardware team until far too late, resulting in an unrealistic schedule with millions of dollars of revenue behind it. It doesn’t help that all these features need to be supported on Solaris 10, making the schedule pressure all the more acute, since the code must soak in Nevada and then be backported in time for the product release. In these environments even the most fervent pleas for architectural purity tend to fall on deaf ears, and the engineers doing the work quickly find themselves between a rock and a hard place.

As I was wandering through this code and thinking about how this would interact with ZFS and future Sun products, it became clear that it needed a massive overhaul. More specifically, it needed to be burned to the ground and rebuilt as a set of distinct, general purpose, components. Since refactoring 12,000 lines of code with such a variety of different functions is non-trivial and difficult to test, I began by factoring out different pieces individually, redesigning the interfaces and re-integrating them into Solaris on a piece-by-piece basis.

Of all the functionality provided by the module, the easiest thing to separate was the IPMI logic. The Intelligent Platform Management Interface is a specification for communicating with service Pprocessors to discover and control available hardware. Sadly, it’s anything but “intelligent”. If you had asked me a year ago what I’d be doing at the beginning of this year, I’m pretty sure that reading the IPMI specification would have been at the bottom of my list (right below driving stakes through my eyeballs). Thankfully, the IPMI functionality needed was very small, and the best choice was a minimally functional private library, designed solely for the purpose of communicating with the Service Processor on supported Sun platforms. Existing libraries such as OpenIPMI were too complicated, and in their efforts to present a generic abstracted interface, didn’t provide what we really needed. The design goals are different, and the ON-private IPMI library and OpenIPMI will continue to develop and serve different purposes in the future.

Last week I finally integrated libipmi. In the process, I eliminated 2,000 lines of platform-specific code and created a common interface that can be leveraged by other FMA efforts and future projects. It is provided for both x86 and SPARC, even though there are currently no supported SPARC machines with an IPMI-capable service processor (this is being worked on). This library is private and evolving quite rapidly, so don’t use it in any non-ON software unless you’re prepared to keep up with a changing API.

As part of this work, I also created a common fmd module, sp-monitor, that monitors the service processor, if present, and generates a new ESC_PLATFORM_RESET sysevent to notify consumers when the service processor is reset. The existing sfx4500-disk module then consumes this sysevent instead of monitoring the service processor directly.

This is the first of many steps towards eliminating this module in its current form, as well as laying groundwork for future platform integration work. I’ll post updates to this blog with information about generic disk monitoring, libtopo indicators, and generic hotplug management as I add this functionality. The eventual goal is to reduce the platform-specific portion of this module to a single .xml file delivered via libtopo that all these generic consumers will use to provide the same functionality that’s present on the x4500 today. Only at this point can we start looking towards future applications, some of which I will describe in upcoming posts.

Recent Posts

April 21, 2013
February 28, 2013
August 14, 2012
July 28, 2012