Eric Schrock's Blog

Applications are the nexus of the modern enterprise. They simplify operations, speed execution, and drive competitive advantage. Accelerating the application lifecycle means accelerating the business. Increasingly, organizations turn to public and private clouds, SaaS offerings, and outsourcing to hasten development and reduce risk, only to find themselves held hostage by their data.

Applications are nothing without data. Enterprise applications have data anchored in infrastructure, tied down by production requirements, legacy architecture, and security regulations. But projects demand fresh data under the control of their developers and testers, requiring processes to work around these impediments. The suboptimal result leads to cost overruns, schedule delays, and poor quality.

Agile development requires agile data. Agile data empowers developers and testers to control their data on their schedule. It unburdens IT by efficiently providing data where it is needed independent of underlying infrastructure. And it accelerates application delivery by providing fresh and complete data whenever necessary. It grants its users super powers.

Many technologies can solve part of the agile data problem, but a partial solution still leaves you with suboptimal processes that impede your business. A complete agile data solution must embrace the following attributes.

Non-Disruptive Synchronization

Production data is sensitive. The environment has been highly optimized and secured, and its continued operation is critical to the success of the business – introducing risk is unacceptable. An agile data solution must automatically synchronize with production data such that it can provide fresh and relevant data copies, but it cannot mandate changes to how the production environment is managed, nor can its operation jeopardize the performance or success of business critical activities.

Service Provisioning

Data is more than just a sequence of bits. Projects access data through relational databases, NoSQL databases, REST APIs, or other APIs. An agile data solution must move beyond copying the physical representation of the data by instantiating and configuring the systems to access that data. Leaving this process to the end users induces delays and amplifies risk.

Source Freedom

Data is pervasive. Efforts to mandate a single data representation, be it a particular relational or NoSQL system, rarely succeed and limit the ability of projects to choose the data representation most appropriate for their needs. As project needs diversify the data landscape, the ability to manage all data through a single experience becomes essential. This unified agile experience necessitates a solution not tied to a single data source.

Platform Independence

The premier storage, virtualization, and compute platforms of today may be next year’s legacy architecture. Solutions limited to a single platform inhibit the ability of organizations to capitalize on advances in the industry, be it a high performance flash array or new private cloud software. Agility over time requires a solution that is not tied to the implementation of a particular hardware or software platform.

Efficient Copies

Storage costs money, and time costs the business. Agile development requires a proliferation of data copies for each developer and tester, magnifying these effects. Working around the issue with partial data leads to costly errors that are caught late in the application lifecycle, if at all. An agile solution must be able to create, refresh, and roll back copies of production data in minutes while consuming a fraction of the space required for a full copy.

Workflow Customization

Each development environment has its own application lifecycle workflow. Data may need to be masked, projects may need multiple branches with different schemas, or developers may need to restart services as data is refreshed. Pushing responsibility to the end user is error prone and impedes application delivery. An agile solution must provide stable interfaces for automation and customization such that it can adapt to any development workflow.

Self-Service Data

Developers and testers dictate the pace of their assigned tasks, and each task affects the data. Agile development mandates that developers have the ability to transform, refresh, and roll back their data without interference. This experience should shield the user from the implementation details of the environment to limit confusion and reduce opportunity for error.

Resource Management

Each data copy consumes resources through storage, memory, and compute. Once developers experience the power of agile data, they will want more copies, run workloads on them for which they were not designed, and forget to delete them when they are through. As these resources become scarce, the failure modes (such as poor performance) become more expensive to diagnose and repair. Combating this data sprawl requires visibility into performance and capacity usage, accountability through auditing and reports, and proactive resource constraints.

Delphix is the agile data platform of the future. You can sync to your production database, instantly provision virtual databases where they are needed using a minuscule amount of space, and provide each developer their own copy of the data that can be refreshed and rolled back on demand. This platform will only become more powerful over time as we add new data sources, provide richer workflows targeting specific applications and use cases, and streamline the self-service model. An enterprise data strategy without Delphix is just a path to more data anchors, necessitating suboptimal processes that continue to slow application development and your business.

At Delphix, we just concluded one of our recurring Engineering Kickoff events where we get everyone together for a few days of collaboration, discussion, idea sharing, and fun. In this case it included, for the first time, an all-day hackathon event. To be honest, it was a bit of an experiment, and one where we were unsure how it would be received. We had all read about, participated in, or heard praise of, hackathons at other companies, but those companies were always more consumer-focused or had technologies that were more easily assembled into different creations. As an enterprise software company, we were concerned that even the simplest projects would be too complex to turn around over the course of a day. Given the potential benefit, however, it was clearly something we wanted to experiment with – any failure would also be a learning opportunity.

Some companies go big or go home when it comes to hackathons – week-long activities, physical hacks, etc. We wanted to preserve freedom but be a little more targeted. The directive was simple: spend a day doing something unrelated to your normal day job that in some way connects to the business. People volunteered ideas and mentorship ahead of time so that even the newest engineers could meaningfully participate. The result was a resounding success. Whether people were able to give a demo, sketch on a whiteboard, or just speak to their ideas and the challenges they faced, everyone pushed themselves in new directions and walked away having learned something through the experience.

The set of activities covered a wide swath of engineering, including:

  • Using D3.js for visualizing analytics data
  • “zero copy” iSCSI in illumos
  • Web portal for customer data analysis
  • “zpool dump” to store pool metadata for offline zdb(1M) use
  • Real-time engineering dashboard to aggregate commits, bugs, reviews, and more
  • “D++” DTrace syntactic sugar: function elapsed time, unrolled while loops, callers array
  • Mobile application to monitor Delphix alerts and faults
  • Global symbol tab completion for MDB
  • Network performance tool
  • Speeding up unit tests
  • Browser usage analytics
  • ‘zfs send’ to a POSIX filesystem
  • BTrace++ (a.k.a. CATrace) to make java tracing safe and easy
  • New V2P (virtual to physical) mechanisms in Delphix
  • Tools to more easily deploy changes to VMs

For myself, I put together a prototype of a hosted SSH/HTTP proxy for use by our support organization. This was my first real foray into the world of true PaaS cloud software – running node.js, redis, and cloudAMQP in a heroku instance, and it’s been incredibly interesting to finally play with all these tools I’ve read about but never had a reason to use. I will post details (and hopefully code) once I get it into slightly better shape.

Only a fraction of these are really what I would consider contributions to the product itself, which is where our initial trepidation about the hackathon proved misplaced. No matter how complex your product or how high the barriers to entry, engineers will find a way to build cool things and try out new ideas in a hackathon setting. Everything that people did, from learning how to make changes to our OS to improving our quality of life as engineers to testing new product ideas, will provide real value to the engineering organization. On top of that, it was incredibly fun and a great way to get everyone working together in different ways.

It’s something we’ll certainly look at doing again, and I’d recommend that every company, organization, or group find some way to get engineers together with the express purpose of working on ideas not directly related to their regular work. You’ll end up with some cool ideas and prototypes, and everyone will learn new things while having fun doing it.

The other week I had a particularly disheartening discussion with a potential new hire. I typically describe our engineering organization at Delphix as a bottom-up meritocracy where smart people seize opportunities to impact the company through thoughtful execution and data-driven methodology (a.k.a. buzzword bingo gold). In this case, after hours of discussions, I couldn’t shake this engineer’s fixation with understanding how his title would affect his ability to have impact at Delphix. After deciding that it was not a good cultural fit, I spent some time thinking about what defines our engineering culture and what exactly it was that I felt was such a mismatch. Rather than writing some pseudo-recruiting material extolling the virtues of Delphix, I thought I’d take a cue from Bryan’s presentation on corporate open source anti-patterns (video) and instead look at some engineering cultural anti-patterns that I’ve encountered in the past. What follows is a random collection of cultural engineering pathologies that I’ve observed and have worked to eschew at Delphix.

The Thinker

This engineer believes his responsibility is to think up new ideas, while others actually execute those ideas. While there are certainly execution-oriented engineers with an architect title out there who do great work, at Delphix we intentionally don’t have such a title because it can send the wrong message: that “architecting” is somehow separate from executing, and permission to architect is something to be given as a reward. The hardest part of engineering comes through execution – plotting an achievable path through a maze of code and possible deliverables while maintaining a deep understanding of the customer problem and constraints of the system. It’s important to have people who can think big, deeply understand customer needs, and build complex architectures, but without a tangible connection to the code and engineering execution those ideas are rarely actionable.

The Talker

Often coinciding with “The Thinker”, this engineer would rather talk in perpetuity than sit down and do actual work. Talkers will give plenty of excuses about why they can’t get something done, but the majority of the time those problems could be solved simply by standing up and doing things themselves. Even more annoying is their penchant for refusing to concede any argument, resulting in orders of magnitude more verbiage with each ensuing email, despite attempts to bring the discussion to a close. In the worst case the talker will provide tacit agreement publicly but fume privately for inordinate amounts of time. In many cases the sum total of time spent talking about the problem exceeds the time it would take to simply fix the underlying issue.

The Entitled

This engineer believes that titles are granted to individuals in order to expand her influence in the company; that being a Senior Staff Engineer enables her to do something that cannot be accomplished as a Staff Engineer. Titles should be a reflection of the impact already achieved through hard work, not a license granted by a benevolent management. When someone is promoted, the reasons should be obvious to the organization as a whole, not a stroke of luck or the result of clever political maneuvering. Leadership is something earned by gaining the respect of your peers through execution, and people who would use their title to make up for a lack of execution and respect of their peers can do an incredible amount of damage within an enabling culture.

The Owner

This engineer believes that the path to greater impact in the organization is through “owning” ideas and swaths of technology. While ownership of execution is key to any successful engineering organization (clear responsibility and accountability are a necessity), ownership of ideas is toxic. This can lead to passive-aggressive, counter-productive acts by senior engineers, and an environment where junior engineers struggle to get their ideas heard. The owner rarely takes code review comments well, bullies colleagues who encroach on her territory, and generally holds parts of the product hostage to her tirades. Metastasized in middle management, this leads to ever-growing fiefdoms where technical decisions are made for entirely wrong organizational reasons. Ideas and innovation come from everywhere, and while different parts of the organization are better suited to execution of large projects based on their area of expertise, no one should be forbidden from articulating their ideas due to arbitrary assignment.

The Recluse

Also known as “the specialist”, this engineer defines his role in the most narrow fashion possible, creating an ivory tower limited by his skill set and attitude. Good engineers seize a hard problem and drive it to completion, even when that problem pushes them well beyond their comfort zone. The recluse, however, will use any excuse to make something someone else’s problem. Even when the problem falls within his limited domain, he will solve only the smallest portion of the problem, preferring to file a bug to have someone else finish the overall work. When confronted on architectural issues, he will often agree to do it your way, but then do it his way anyway. Months later you may discover that he never understood what you said in the first place and never discussed it in the interim, and by then it’s too late to undo the damage.

All of us have the potential for these anti-patterns in us. It’s only through regular introspection and frank discussions with colleagues that we can hope to have enough self-awareness to avoid going down these paths. Most importantly, we all need to work to create a strong engineering culture where it is impossible for these pathologies to thrive. Once these pathologies become a fixture in a culture, they breed similar mentalities as the organization grows and can be impossible to eradicate at scale.

Over the past several months, one of the new features I’ve been working on for the next release is the development of the new CLI for our appliance. While the CLI is the epitome of a checkbox item to most users, as a completely different client-side consumer of web APIs it can have a staggering maintenance cost. Our upcoming release introduces a number of new concepts that required that we gut our web services and rebuild them from the ground up – retrofitting the old CLI was simply not an option.

What we ended up building was a local node.js CLI that takes programmatically defined web service API definitions and dynamically constructs a structured environment for manipulating state through those APIs. Users can log in with their Delphix credentials over SSH or the console and be presented with this CLI via custom PAM integration. Whenever I describe this to people, however, I get more than a few confused looks and questions:

  • Isn’t node.js for writing massively scalable cloud applications?
  • Does anyone care about a CLI?
  • Are you high?

Yes, node.js is so hot right now. Yes, a bunch of Joyeurs (the epicenter of node.js) are former Fishworks colleagues. But the path here, as with all engineering at Delphix, is ultimately data-driven based on real problems, and this was simply the best solution to those problems. I hope to have more blog posts about the architecture in the future, but as I was writing up a README for our own developers, I thought the content would make a reasonable first post. What follows is an engineering FAQ. From here I hope to describe our schemas and some of the basic structure of the CLI, so stay tuned.

Why a local CLI?

The previous Delphix CLI was a client-side java application written in groovy. Running the CLI on the client both incurs a cost to users (they need to download and manage additional software) and makes it a nightmare to manage different versions across different servers. Nearly every appliance shipped today has an SSH interface; doing something different just increases customer confusion. The purported benefit (there is no native Windows SSH client) has proven insignificant in practice, and there are other more scalable solutions to this problem (such as distributing a java SSH client).

Why Javascript?

We knew we would be manipulating a lot of dynamic state, and that the scope of the CLI would remain relatively small. A dynamic scripting language makes for a far more compelling environment for rapid development, at the cost of needing a more robust unit test framework to catch what would otherwise be compile-time errors in a strongly typed, statically compiled language. We explicitly chose javascript because our new GUI will be built in javascript; this both keeps the number of languages and environments used in production to a minimum and allows these clients to share code where applicable.
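
As a contrived illustration of the kind of sharing this enables (the module name and validation rules below are hypothetical, not actual Delphix code), a helper written once can be required by the CLI under node and loaded by the browser-based GUI with a small shim:

```javascript
// shared/validate.js (hypothetical): a rule written once, used by both the
// node-based CLI (via require()) and the browser GUI (via a small shim).
exports.validateName = function (name) {
  if (!name || name.length > 256) {
    return 'name must be between 1 and 256 characters';
  }
  if (!/^[a-zA-Z0-9_-]+$/.test(name)) {
    return 'name may only contain letters, digits, "_", and "-"';
  }
  return null; // null means the name is valid
};

// CLI-side usage: validate locally before ever issuing the web service call.
var err = exports.validateName('my database');
if (err) {
  console.log('invalid name: ' + err); // the space is not in the allowed set
}
```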

Why node.js?

We knew v8 was the best-of-breed javascript runtime, and we actually started with a custom v8 wrapper. As a single-threaded environment it was pretty straightforward, but once we started considering background tasks we knew we’d need to move to an asynchronous model. Between the cost of building infrastructure already provided by node (HTTP request handling, file manipulation, etc.) and the desire to support async activity, node.js was the clear choice of runtime.
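
To make the asynchronous requirement concrete, here is a minimal sketch (not the actual Delphix CLI; the host, port, and job path are invented) of a prompt that stays responsive while a long-running job is polled in the background:

```javascript
// Minimal sketch: a readline prompt that keeps accepting commands while an
// HTTP request for job status completes in the background.
var http = require('http');
var readline = require('readline');

var rl = readline.createInterface({ input: process.stdin, output: process.stdout });
rl.setPrompt('delphix> ');
rl.prompt();

function pollJob(jobId) {
  // http.get() returns immediately; node invokes the callbacks when the
  // response arrives, so the user can keep issuing commands meanwhile.
  http.get({ host: 'localhost', port: 8080, path: '/api/job/' + jobId }, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      console.log('\njob ' + jobId + ': ' + body);
      rl.prompt();
    });
  }).on('error', function (err) {
    console.log('\njob ' + jobId + ' failed: ' + err.message);
    rl.prompt();
  });
}

rl.on('line', function (line) {
  var cmd = line.trim();
  if (cmd.indexOf('poll ') === 0) {
    pollJob(cmd.slice(5));
  } else if (cmd === 'exit') {
    rl.close();
    return;
  }
  rl.prompt();
});
```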

Why auto-generated?

Historically, the cost of maintaining the CLI at Delphix (and elsewhere) has been very high. CLI features lag behind the GUI, and developers face an additional burden to port their APIs to multiple clients. We wanted to build a CLI that would be almost completely generated from a shared schema. When developers change the schema in one place, we auto-generate the backend infrastructure (java objects and constants), the GUI data model bindings, and the complete CLI experience.
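
As a toy illustration of the approach (the schema format, endpoint, and operation names below are invented for this sketch, not the real Delphix definitions), a single shared definition can drive the set of commands the CLI exposes for an object type:

```javascript
// A hypothetical shared schema: the same definition that could generate
// backend objects and GUI bindings also describes a CLI context.
var groupSchema = {
  name: 'group',
  root: '/resources/json/group',          // hypothetical API endpoint
  operations: ['list', 'select', 'update', 'delete'],
  properties: {
    name:        { type: 'string', required: true },
    description: { type: 'string' }
  }
};

// Derive the commands available in the "group" context from the schema,
// rather than hand-coding each one in the CLI.
function commandsFor(schema) {
  var cmds = {};
  schema.operations.forEach(function (op) {
    cmds[op] = function (args) {
      console.log(op.toUpperCase() + ' ' + schema.root +
                  (args ? ' ' + JSON.stringify(args) : ''));
    };
  });
  return cmds;
}

var group = commandsFor(groupSchema);
group.list();                              // LIST /resources/json/group
group.update({ description: 'demo' });     // UPDATE /resources/json/group {"description":"demo"}
```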

Why a modal hierarchy?

The look and feel of the Delphix CLI is in many ways inspired by the Fishworks CLI. As engineers and users of many (bad) CLIs, our experience has led to the belief that a CLI with integrated help, tab completion, and a filesystem-like hierarchy promotes exploration and is more natural than a CLI with dozens of commands each with dozens of options. It also makes for a better representation of the actual web service APIs (and hence easier auto-generation), with user operations (list, select, update) that mirror the underlying CRUD API operations.
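
A rough sketch of that mapping (the paths, verbs, and object names here are illustrative, not the actual Delphix web service API): navigating the hierarchy selects a context, and the modal operations translate directly into CRUD calls against it.

```javascript
// Map the CLI's modal operations onto CRUD-style HTTP calls; each context
// (e.g. "database") corresponds to a node in the command hierarchy.
var crudMap = {
  list:     { method: 'GET',    path: function (ctx)     { return ctx.root; } },
  select:   { method: 'GET',    path: function (ctx, id) { return ctx.root + '/' + id; } },
  update:   { method: 'POST',   path: function (ctx, id) { return ctx.root + '/' + id; } },
  'delete': { method: 'DELETE', path: function (ctx, id) { return ctx.root + '/' + id; } }
};

function run(ctx, op, id) {
  var entry = crudMap[op];
  console.log(entry.method + ' ' + entry.path(ctx, id));
}

// Descending into "database" and selecting an object mirrors walking a
// filesystem hierarchy, then operating on the current node.
var databaseCtx = { root: '/resources/json/database' };
run(databaseCtx, 'list');            // GET /resources/json/database
run(databaseCtx, 'select', 'db-1');  // GET /resources/json/database/db-1
run(databaseCtx, 'update', 'db-1');  // POST /resources/json/database/db-1
```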

In my previous post I outlined some of the challenges faced when building a data replication solution, how the first Delphix implementation missed the mark, and how we set out to build something better the second time around.

The first thing that became clear after starting on the new replication subsystem was that we needed a better NDMP implementation. A binary-only separate daemon with poor error semantics that routinely left the system in an inconsistent state was not going to cut it. NDMP is a protocol built for a singular purpose: backing up files using a file-specific format (dump or tar) over arbitrary topologies (direct attached tape, 3-way restore, etc.). By being simultaneously so specific in its data semantics and so general in its control protocol, we end up with the worst of both worlds: baked-in concepts (such as file history, complete with inode numbers) that prevent us from adequately expressing Delphix concepts, and a limited control protocol (lacking multiple streams or resumable streams) with terrible error semantics. While we will ultimately replace NDMP for replication, we knew that we still needed it for backup, and that we didn’t have the time to replace both the implementation and the data protocol for the current release.

Illumos, the open source operating system our distribution is based on, provides an NDMP implementation, one that I had previously dealt with while at Fishworks (though Dave Pacheco was the one who did the actual NDMP integration). I spent some time looking at the implementation and came to the conclusion that it suffered from a number of fatal flaws:

  • Poor error semantics – The strategy was “log everything and worry about it later”. For an implementation shipped with a roll-your-own OS this was not a terrible strategy, but it was a deal breaker for an appliance implementation. We needed clear, concise failure modes that appeared integrated with our UI.
  • Embedded data semantics – The notion of tar as a backup format (or raw zfs send) was built very deeply into the architecture. We needed our own data protocol, but replacing the data operations without major surgery was out of the question. While raw ZFS send seems appealing, it still assumes ownership and control of the filesystem namespace, something that wouldn’t fly in the Delphix world.
  • Unused code – There was tons of dead code, ranging from protocol varieties that were unnecessary (NDMPv2) to swaths of device handling code that did nothing.
  • Standalone daemon – A standalone daemon makes it difficult to exchange data across the process boundary, and introduces new complex failure modes.

With this in mind I looked at the ndmp.org SDK implementation, and found it to suffer from the same pathologies (and to be a much worse implementation to boot). It was clear that the Solaris implementation was derived from the SDK, and that there was no mythical “great NDMP implementation” waiting to be found. I was going to have to suck it up and get back to my Solaris roots to eviscerate this beast.

The first thing I did was recast the daemon as a library, eliminating any code that dealt with daemonizing, running a door server to report statistics, and the existing Solaris commands that communicated with the server. This allowed me to add a set of client-provided callback vectors and configuration options to control state. With this library in place, we could use JNA to easily call into C code from our java management stack without having to worry about marshaling data to and from an external daemon.

The next step was to rip out all the data-handling functionality, instead creating a set of callback vectors in the library registration mechanism to start and stop backup. This left the actual implementation of the over-the-wire format up to the consumer. The sheer amount of code used to support tar and zfs send was staggering, and it had its tendrils all across the implementation. As I started to pull on the thread, more and more started to unravel. Data-specific operations would call into the “tape library management” code (which had very little to do with tape library management), which would then call back into common NDMP code, which would then do nothing.

With the data operations gone, I then had to finally address the hard part: making the code usable. The old error semantics were terrible. I had to go through every log call and non-zero return value, analyze its purpose, and restructure it to use the consumer-provided vector so that we could log such messages natively in the Delphix stack. This general cleanup also led me to rip out huge swaths of unused code, from buffer management to NDMPv2 support (v3 has been in common use for more than a decade). It was rather painful, but the result has been quite a usable product. While the old Delphix implementation would have reported “NDMP server error CRITICAL: consult server log for details” (of course, there was no way for the customer to get to the “server log”), we now get much more helpful messages like “NDMP client reported an error during data operation: out of space”.

The final piece of the puzzle was something that surprised me. By choosing NDMP as the replication protocol (again, a temporary choice), we needed a way to drive the 3-way restore operation from within the Delphix stack. This meant that we wanted to act as a DMA. As I looked at the unbelievably awful ‘ndmpcopy’ implementation shipped with the NDMP SDK, I noticed a lot of similarity between what we needed on the client and what we had on the server (processing requests was identical, even if the set of expected requests was quite different). Rather than build an entirely separate implementation, I converted libndmp such that it could act as a server or a client. This allowed us to build an NDMP copy operation in Java, as well as simulate a remote DMA (an invaluable testing tool).

It took more than a month of solid hard work and several more months of cleanup here and there, but the result was worth it. The new implementation clocks in at just over 11,000 lines of code, while the original was a staggering 43,000 lines of code. Our implementation doesn’t include any actual data handling, so it’s perhaps an unfair comparison. But we also include the ability to act as a full-featured DMA client, something the illumos implementation lacks.

The results of this effort will be available on github as soon as we release the next Delphix version (within a few weeks). While interesting, it’s unlikely to be useful to the general masses, and certainly not something that we’ll try to push upstream. I encourage others looking for an open-source embedded NDMP implementation to fork and improve what we have in Delphix – it’s a very flexible NDMP implementation that can be adapted for a variety of non-traditional NDMP scenarios. But with no built-in data processing, and no standalone daemon implementation, it’s a long way from replacing what can be found in illumos. If someone were so inspired, they could build a daemon on top of the current library – one that provides support for tar, dump, ZFS, and whatever other formats are supported by the current illumos implementation. It would not be a small amount of work, but I am happy to lend advice (if not code) to anyone interested.

Next up will be a post whose working title is “Data Replication: Metadata + Data = Crazy Pain in My Ass”.

With our next Delphix release just around the corner, I wanted to spend some time discussing the engineering process behind one of the major new features: data replication between servers. The current Delphix version already has a replication solution, so how does this constitute a “new feature”? The reason is that it’s an entirely new system, the result of an endeavor to create something more reliable, maintainable, and extensible. How we got here makes for an interesting tale of business analysis, architecture, and implementation.

Where did we come from?

Before we begin looking at the current implementation, we need to understand why we started with a blank sheet of paper when we already had a shipping solution. The short answer is that what we had was unusable: it was unreliable, undebuggable, and unmaintainable. And when you’re in charge of data consistency for disaster recovery, “unreliable” is not an acceptable state. While I had not written any of the replication infrastructure at Fishworks (my colleagues Adam Leventhal and Dave Pacheco deserve the credit for that), I had spent a lot of time in discussions with them, as well as thinking about how to build a distributed data architecture there. So it seemed natural for me to take on this project at Delphix. As I started to unwind our current state, I found a series of decisions that, in hindsight, led to the untenable state we found ourselves in.

  • Analysis of the business problem – At the core of the current replication system was the notion that its purpose was disaster recovery. This is indeed a major use case of replication, but it’s not the only one (geographical distribution of data being another strong contender). While picking one major problem to tackle first is a reasonable approach to constrain scope, by not correctly identifying future opportunities we ended up with a solution that could only be used for active/passive disaster recovery.
  • Data protocol choice – There is another problem that is very similar to replication: offline backup/restore. Clearly, we want to leverage the same data format and serialization process, but do we want to use the same protocol? NDMP is the industry standard for backups, but it’s tailored to a very specific use case (files and filesystems). By choosing to use NDMP for replication, we sacrificed features (resumable operations, multiple streams), usability (poor error semantics), and maintainability (unnecessarily complicated operation).
  • Outsourcing of work – At the time this architecture was created, it was decided that NDMP was not part of the company’s core competency, and we should contract with a third party to provide the NDMP solution. I’m a firm believer that engineering work should never be outsourced unless it’s known ahead of time that the result will be thrown away. Otherwise, you’re inevitably saddled with a part of your product that you have limited ability to change, debug, and support. In our case, this was compounded by the fact that the deliverable was binary objects – we didn’t even have source available.
  • Architectural design – By having a separate NDMP daemon we were forced to have an arcane communication mechanism (local HTTP) that lost information with each transition, resulting in a non-trivial amount of application logic resting in a binary we didn’t control. This made it difficult to adapt to core changes in the underlying abstractions.
  • Algorithmic design – There was a very early decision made that replication would be done on a per-group basis (Delphix arranges databases into logical groups). This was divorced from the reality of the underlying ZFS data dependencies, resulting in numerous oddities, such as being unable to replicate non-self-contained groups or cyclic dependencies between groups. This abstraction was baked in so deeply that it was impossible to fix within the original architecture.
  • Implementation – The implementation itself was built to be “isolated” from any other code in the system. When one is replicating the core representation of system metadata, this results in an unmaintainable and brittle mess. We had a completely separate copy of our object model that had to be maintained and updated along with the core model, and changes elsewhere in the system (such as deleting objects while replication was ongoing) could lead to obscure errors. The most egregious problems led to unrecoverable state – the target and source could get out of sync such that the only resolution was a new full replication from scratch.
  • Test infrastructure – There was no unit test infrastructure, no automated functional test infrastructure, and no way to test the majority of functionality without manually setting up multi-machine replication or working with a remote DMA. As a result only the most basic functionality worked, and even then it was unreliable most of the time.

Ideals for a new system

Given this list of limitations, I (later joined by Matt) sat down with a fresh sheet of paper. The following were some of the core ideals we set forth as we built this new system:

  • Separation of mechanism from protocol – Whatever choices we make in terms of protocol and replication topologies, we want the core serialization infrastructure to be entirely divorced from the protocol used to transfer the data.
  • Support for arbitrary topologies – We should be able to replicate from a host to any number of other hosts and vice versa, as well as provision from replicated objects.
  • Robust test infrastructure – We should be able to run protocol-level tests, simulate failures, and perform full replication within a single-system unit test framework.
  • Integrated with core object model – There should be one place where object definitions are maintained, such that the replication system can’t get out of sync with the primary source code.
  • Resilient to failure – No matter what, the system must maintain consistent state in the face of failure. This includes both catastrophic system failure and ongoing changes to the system (i.e. objects being created and deleted). At any point, we must be able to resume replication from a previously known good state without user intervention.
  • Clear error messages – Failures, when they do occur, must present a clear indication of the nature of the problem and what actions must be taken by the user, if any, to fix the underlying problem.

At the same time, we were forced to limit the scope of the project so we could deliver something in an appropriate timeframe. We stuck with NDMP as a protocol despite its inherent problems, as we needed to fix our backup/restore implementation as well. And we kept the active/passive deployment model so that we did not require any significant changes to the GUI.

Next, I’ll discuss the first major piece of work: building a better NDMP implementation.
