Eric Schrock's Blog

What is Shadow Migration?

September 16, 2009

In the Sun Storage 7000 2009.Q3 software release, one of the major new features I worked on was the addition of what we termed “shadow migration.” When we launched the product, there was no integrated way to migrate data from existing systems to the new systems. This left customers rolling their own migrations by hand (rsync, tar, etc.) or paying for professional services to do the work for them. We felt we could present a superior model that would provide a more integrated experience and let customers leverage their investment in the new system even before the migration was complete.

The idea in and of itself is not new, and various prototypes of it had been kicking around inside of Sun under various monikers (“brain slug”, “hoover”, etc.) without ever becoming a complete product. When Adam and I sat down shortly before the initial launch of the product, we decided we could do this without too much work by integrating the functionality directly into the kernel. The basic design requirements we had were:

  • We must be able to migrate over standard data protocols (NFS) from arbitrary data sources without the need to have special software running on the source system.
  • Migrated data must be available before the entire migration is complete, and must be accessible with native performance.
  • All the data required to migrate a filesystem must be stored within the filesystem itself; the migration must not rely on an external database to ensure consistency.

With these requirements in hand, our key insight was that we could create a “shadow” filesystem that could pull data from the original source if necessary, but then fall through to the native filesystem for reads and writes once the file has been migrated. What’s more, we could leverage the NFS client on Solaris and do this entirely at the VFS (virtual filesystem) layer, allowing us to migrate data between shares locally or (eventually) over other protocols as well without changing the interpositioning layer. The other nice thing about this architecture is that the kernel remains ignorant of the larger migration process. Both synchronous requests (from clients) and background requests (from the management daemon) appear the same. This allows us to control policy within the userland software stack, without pushing that complexity into the kernel. It also allows us to write a very comprehensive automated test suite that runs entirely on local filesystems without needing a complex multi-system environment.
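To make the fall-through behavior concrete, here is a minimal userland sketch in Python. The real implementation lives in C at the Solaris VFS layer; `ShadowFS` and its methods are hypothetical names invented purely for illustration, with ordinary directories standing in for the NFS-mounted source and the native filesystem:

```python
import os
import shutil

class ShadowFS:
    """Toy model of the shadow-migration read path: pull a file from
    the source on first access, then serve all later reads natively."""

    def __init__(self, source_root, local_root):
        self.source_root = source_root   # stand-in for an NFS mount of host X
        self.local_root = local_root     # the new native filesystem

    def _migrated(self, path):
        # The real appliance tracks this with private on-disk metadata;
        # here we simply check for the file's presence locally.
        return os.path.exists(os.path.join(self.local_root, path))

    def read(self, path):
        local = os.path.join(self.local_root, path)
        if not self._migrated(path):
            # Fault the file in from the source before answering.
            src = os.path.join(self.source_root, path)
            os.makedirs(os.path.dirname(local), exist_ok=True)
            shutil.copy2(src, local)
        # Fall through to the native filesystem.
        with open(local, "rb") as f:
            return f.read()
```

The property worth noticing is that the check-and-copy happens per file on first access; after that, the source is never consulted again for that file.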

So what’s better (and worse) about shadow migration compared to other migration strategies? For that, I’ll defer to the documentation, which I’ve reproduced here for those who don’t have a (virtual) system available to run the 2009.Q3 release:

Migration via synchronization

This method works by taking an active host X and migrating data to the new host Y while X remains active. Clients still read and write to the original host while this migration is underway. Once the data is initially migrated, incremental changes are repeatedly sent until the delta is small enough to be sent within a single downtime window. At this point the original share is made read-only, the final delta is sent to the new host, and all clients are updated to point to the new location. The most common way of accomplishing this is through the rsync tool, though other integrated tools exist. This mechanism has several drawbacks:

  • The anticipated downtime, while small, is not easily quantified. If a user commits a large amount of change immediately before the scheduled downtime, this can increase the downtime window.
  • During migration, the new server is idle. Since new servers typically come with new features or performance improvements, this represents a waste of resources during a potentially long migration period.
  • Coordinating across multiple filesystems is burdensome. When migrating dozens or hundreds of filesystems, each migration will take a different amount of time, and downtime will have to be scheduled across the union of all filesystems.
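The converging-delta loop described above can be sketched in a few lines of Python. `converge` is a hypothetical toy model (real deployments would use rsync or a similar tool), with dict snapshots standing in for filesystem contents:

```python
def converge(source, replica, incoming_changes,
             cutover_threshold=2, max_rounds=50):
    """Toy model of migration via synchronization: copy everything,
    then keep applying incremental deltas until the remaining delta is
    small enough to send within a single downtime window.
    `incoming_changes` yields the writes that land on the live source
    while each pass runs."""
    for rounds in range(1, max_rounds + 1):
        delta = {k: v for k, v in source.items() if replica.get(k) != v}
        if len(delta) <= cutover_threshold:
            replica.update(delta)   # final delta, sent during downtime
            return rounds
        replica.update(delta)       # incremental pass; source stays live
        source.update(next(incoming_changes, {}))
    raise RuntimeError("delta never converged")
```

Note that nothing bounds the number of rounds if the source changes faster than deltas can be sent, which is exactly the unquantifiable-downtime drawback described above.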

Migration via external interposition

This method works by taking an active host X and inserting a new appliance M that migrates data to a new host Y. All clients are updated at once to point to M, and data is automatically migrated in the background. This provides more flexibility in migration options (for example, being able to migrate to a new server in the future without downtime), and leverages the new server for already migrated data, but also has significant drawbacks:

  • The migration appliance represents a new physical machine, with associated costs (initial investment, support costs, power and cooling) and additional management overhead.
  • The migration appliance represents a new point of failure within the system.
  • The migration appliance interposes on already migrated data, incurring extra latency, often permanently. These appliances are typically left in place, though it would be possible to schedule another downtime window and decommission the migration appliance.

Shadow migration

Shadow migration uses interposition, but is integrated into the appliance and doesn’t require a separate physical machine. When shares are created, they can optionally “shadow” an existing directory, either locally (see below) or over NFS. In this scenario, downtime is scheduled once where the source appliance X is placed into read-only mode, a share is created with the shadow property set, and clients are updated to point to the new share on the Sun Storage 7000 appliance. Clients can then access the appliance in read-write mode.

Once the shadow property is set, data is transparently migrated in the background from the source appliance. If a request comes from a client for a file that has not yet been migrated, the appliance will automatically migrate the file locally before responding to the request. This may incur some initial latency for some client requests, but once a file has been migrated all accesses are local to the appliance and have native performance. It is often the case that the current working set for a filesystem is much smaller than the total size, so once this working set has been migrated, regardless of the total size on the source, there will be no perceived impact on performance.
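The cost profile described here — only the first touch of a file pays the migration penalty — can be illustrated with a small sketch. `make_reader` and `remote_fetch` are hypothetical stand-ins, with a dict playing the role of the local filesystem:

```python
def make_reader(remote_fetch, cache=None):
    """Sketch of the per-file cost profile: the first access to a file
    pays the migration cost (modeled by calling remote_fetch), and
    every later access is served locally."""
    cache = {} if cache is None else cache
    stats = {"remote": 0, "local": 0}

    def read(path):
        if path not in cache:
            cache[path] = remote_fetch(path)  # slow: copy from source
            stats["remote"] += 1
        else:
            stats["local"] += 1               # fast: native access
        return cache[path]

    return read, stats
```

Once the working set has all been touched once, `stats["remote"]` stops growing no matter how large the source filesystem is.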

The downside to shadow migration is that it requires a commitment before the data has finished migrating, though this is the case with any interposition method. During the migration, portions of the data exist in two locations, which means that backups are more complicated, and snapshots may be incomplete and/or exist only on one host. Because of this, it is extremely important that any migration between two hosts first be tested thoroughly to make sure that identity management and access controls are set up correctly. This need not cover the entire data set, but it should verify that files or directories that are not world-readable are migrated correctly, that ACLs (if any) are preserved, and that identities are properly represented on the new system.

Shadow migration is implemented using on-disk data within the filesystem, so there is no external database and no data stored locally outside the storage pool. If a pool is failed over in a cluster, or both system disks fail and a new head node is required, all the data necessary to continue the shadow migration without interruption remains with the storage pool.
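As a rough illustration of keeping migration state inside the filesystem itself, the sketch below uses a JSON sidecar file as a stand-in for the appliance's private on-disk attributes; `save_pending` and `load_pending` are invented names, not the appliance's actual interfaces:

```python
import json
import os

def save_pending(fs_root, pending):
    """Persist the set of not-yet-migrated entries inside the
    filesystem being migrated, so the state travels with the pool."""
    state = os.path.join(fs_root, ".shadow_pending")
    with open(state, "w") as f:
        json.dump(sorted(pending), f)

def load_pending(fs_root):
    """Recover the pending set after a failover or head-node swap;
    an absent state file means migration never started here."""
    state = os.path.join(fs_root, ".shadow_pending")
    if not os.path.exists(state):
        return set()
    with open(state) as f:
        return set(json.load(f))
```

Because the pending list lives in the pool rather than on the system disks, whichever head imports the pool can pick the migration up where it left off.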

In a subsequent post, I’ll discuss some of the thorny implementation details we had to solve, as well as provide some screenshots of migration in progress. In the meantime, I suggest folks download the simulator and upgrade to the latest software to give it a try.

13 Responses

  1. Do this with FC and you have a lot of interesting possibilities… I will note that certain competitors sell linux boxes that are mainly interposers for FC for a _very_ hefty markup (for storage ‘virtualization’ and migration). That alone could be a separate product (as well as part of an array).

  2. I wonder if Solaris Engineering could use this work for a CacheFS follow-on …. at the end this would be not much more than a perpetual shadow migration plus the LRU-stuff.

  3. This release absolutely rocks.
    Live migration came just in time, as we have a large migration ahead.
    BTW: Did you also improve the gzip algorithm in this release? Seems much faster now…

  4. @Jason – The current migration only works with filesystems, but definitely on my list of things for a future release is migration of LUNs. While the basic idea is the same, the mechanism is quite different. We don’t have the same ability to store metadata with devices, so it will have to be baked into ZFS, as opposed to the generic VFS level.
    @Joerg – We do have some crazy ideas about possible future directions for the technology. We’ll see where it leads.
    @Anonymous – There were no specific changes to the gzip algorithm, but depending on what release you’re coming from, you may be noticing the effects of 6586537, which dedicates more threads to the task on larger systems.

  5. Will this also preserve the UFS extended ACL permission converting them to ZFS compatible permission (clients are solaris10 hosts)?

  6. @Eli – No, it does not do any ACL conversion. It will preserve basic UNIX permissions, as well as NFSv4 ACLs. One strategy might be to mount the shares over NFSv4 and have the server do the conversion before it goes over the wire. I don’t know if the Solaris 10 server supports this.

  7. This is starting to look a lot like an archiving filesystem (with the "old" server acting as a backing store), albeit only archiving for the duration of a migration. Very interesting…

  8. This sounds like a dream come true for us, with one possible gotcha: When you say "If a request comes from a client for a file that has not yet been migrated, the appliance will automatically migrate this file to the local server before responding to the request." do you mean "whole file" ?
    I ask because our typical S7410 usage is with databases, where the files are hundreds of gigabytes apiece. So "some initial latency", in our case, is actually a lot.
    Also, can the shadow migration be throttled? We have issues with rsync’ing from one S7410 to another and it saturating the disks such that other clients get wrecked.

  9. @Don –
    Currently, files are migrated all at once. I am working on adding partial migration in the 2009.Q4 release for exactly the reasons you describe.
    The background migration can be controlled by specifying the number of threads devoted to the task. Any given file (synchronous or background) is always migrated at the maximum speed of a single thread. More aggressive throttling will most likely be done through IP QoS controls in a future software release.

  10. Any chance of turning this around? It could make a nice archiving solution.
    I’m thinking along these lines:
    Specify the size of the front end FS (quota)
    Specify the back end FS
    Specify the migration rule (lru, etc)
    The main advantage is that backups of the full front end will be much faster since it is smaller and there are fewer files. The front end could also exist on higher end hardware like a 7410 since it is dealing with the ‘hot’ data while a 7210 could serve up the static content.
    Of course, the devil is in the details 🙂

  11. @Alessandro –
    Do you have a support case open? This is almost certainly because you have ‘.zfs/snapshot’ set to ‘visible’ at the project level, a bug which is fixed in the upcoming minor release. If you set it to ‘hidden’ it should work. If not, please work the issue through the support channels.
