We need a backup data deduplication layer

As backup data deduplication matures, it’s still very much a proprietary technology. We need standardization to eliminate some of today’s software-hardware headaches.

Just about everyone who works with disk-based backup understands the need for data deduplication. For many, that includes the use of a deduplication storage appliance as a data backup target. Most backup software products can use deduplication appliances that present themselves as either a file share (NFS or CIFS) or a tape device (virtual tape library or VTL). The challenge with those approaches is that the backup software doesn’t know it’s writing to a deduplication target. All the data is sent from the backup software to the appliance, and most of it is then discarded once the appliance determines it already has identical data stored.
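To make that flow concrete, here’s a minimal Python sketch of the “Dedupe 1.0” pattern. It isn’t any vendor’s actual protocol; it’s just an illustration of a target that chunks and hashes everything it receives and quietly drops the duplicates after they have already crossed the wire.

```python
# Illustrative sketch only (not any vendor's actual protocol): "Dedupe 1.0".
# The backup software streams every byte to the appliance; the appliance chunks
# the stream, hashes each chunk, and stores only the chunks it hasn't seen before.

import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking, purely for illustration


class DedupeAppliance:
    """Stands in for a target-side deduplication appliance."""

    def __init__(self):
        self.store = {}          # hash -> chunk data actually kept on disk
        self.bytes_received = 0  # everything that crossed the wire
        self.bytes_stored = 0    # what survived deduplication

    def ingest(self, stream: bytes) -> None:
        for i in range(0, len(stream), CHUNK_SIZE):
            chunk = stream[i:i + CHUNK_SIZE]
            self.bytes_received += len(chunk)
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.store:       # new data: keep it
                self.store[digest] = chunk
                self.bytes_stored += len(chunk)
            # duplicate data is discarded, but it was still sent in full


if __name__ == "__main__":
    appliance = DedupeAppliance()
    friday_full = b"A" * 40_000 + b"B" * 4_000
    monday_full = b"A" * 40_000 + b"C" * 4_000   # mostly unchanged since Friday
    appliance.ingest(friday_full)
    appliance.ingest(monday_full)
    print(f"received {appliance.bytes_received} bytes, stored {appliance.bytes_stored}")
```

Run it and the gap between bytes received and bytes stored is exactly the traffic the backup software wasted because it didn’t know it was talking to a deduplication target.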

If simply leveraging a deduplication appliance was “Dedupe 1.0,” then “Dedupe 2.0” means optimizing the process by making the backup software deduplication-aware. It seems as if almost every deduplication array now offers API libraries that enable backup software to optimize the backup process, such as:

  • EMC Data Domain with DD Boost
  • HP StoreOnce Catalyst
  • Quantum DXi with Accent

The list goes on, but the point is that for backup software users to better leverage their deduplication hardware, their software provider has to embrace that particular hardware vendor’s accelerator APIs. Of course, many of those hardware providers also sell data backup software that leverages those APIs, such as EMC NetWorker with Data Domain or HP Data Protector with StoreOnce.

We can classify the alternatives a little differently, on a “good-better-best” scale:

  • “Good” deduplication is simply using a deduplication appliance.
  • “Better” deduplication involves a backup server that’s dedupe-aware.
  • “Best” deduplication enables deduplication at the production source server within the backup agents.

Unfortunately, there are very few hardware plus software “best” offerings. For example, EMC Data Domain offers a “best” solution with NetWorker (meaning its deduplication can occur client-side), whereas other software solutions that leverage Data Domain only offer “better” deduplication from the backup server. This isn’t a knock on EMC, but on the complexity of adding those deduplication APIs to the client agents. If a third-party software vendor wanted to deliver a “best” deduplication experience that still appealed to that vendor’s broad customer base, it would have to engineer its agents to use DD Boost, Catalyst, Accent and others, and then absorb appreciable development, testing and support costs.
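To see what “best” deduplication buys you, here’s a deliberately generic sketch. The DedupeTarget class and its missing and put calls are invented stand-ins, not DD Boost, Catalyst or Accent, but they capture the idea those accelerator APIs enable: hash the chunks on the production server, ask the target which ones it lacks, and send only those.

```python
# Illustrative sketch only: generic source-side ("best") deduplication.
# Not any vendor's actual accelerator API; the names are invented stand-ins.

import hashlib

CHUNK_SIZE = 4096


class DedupeTarget:
    """Stands in for the appliance's 'which chunks are you missing?' interface."""

    def __init__(self):
        self.store = {}

    def missing(self, digests):
        """Return the subset of digests the target does not already hold."""
        return [d for d in digests if d not in self.store]

    def put(self, digest, chunk):
        self.store[digest] = chunk


def source_side_backup(data: bytes, target: DedupeTarget) -> int:
    """Hash at the client, ship only unknown chunks; return bytes actually sent."""
    chunks = {}
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        chunks[hashlib.sha256(chunk).hexdigest()] = chunk
    bytes_sent = 0
    for digest in target.missing(list(chunks)):
        target.put(digest, chunks[digest])
        bytes_sent += len(chunks[digest])
    return bytes_sent


if __name__ == "__main__":
    target = DedupeTarget()
    print("first full:", source_side_backup(b"A" * 40_000 + b"B" * 4_000, target), "bytes sent")
    print("next full: ", source_side_backup(b"A" * 40_000 + b"C" * 4_000, target), "bytes sent")
```

Run back-to-back fulls of nearly identical data and the second backup sends almost nothing, which is exactly why pushing deduplication into the backup agent is worth the engineering pain.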

One backup software vendor, Symantec, takes a different approach through its OpenStorage Technology (OST) mechanisms. Instead of the deduplication appliance presenting itself as a VTL or file share, or offering its own API accelerator, it can support Symantec’s OST standard, which provides interoperability with Backup Exec and NetBackup. Essentially, instead of the software vendors writing to the APIs of one or more hardware vendors, the hardware vendors write to Symantec’s OST specifications. Hardware vendors do this because of the Symantec products’ longevity and market presence, but what if other software vendors each published specifications similar to OST? That would create the same challenge described above, with each hardware vendor having to develop and support multiple software vendors’ specifications.

So, what’s the answer? In a perfect world, there would be a data deduplication API layer that works across a wide range of backup software and hardware vendors. Symantec OST is used by many hardware vendors, but only with Symantec software products. EMC DD Boost has a broad ecosystem of software partners, but it only works with EMC Data Domain appliances. What would happen if Symantec or EMC licensed their API libraries for interoperability across all hardware and software players? Who would support it, and what would happen to the differentiation in the “better together” stacks? At first glance, it appears to benefit a variety of constituents, including participating vendors, partners, and IT organizations struggling with the current mixing and matching. But in reality, it’s a chicken-and-egg challenge and nobody is moving.
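If such a layer ever materialized, it might look something like the following purely hypothetical sketch. Every name in it (DedupeProvider, negotiate, chunk_exists, write_chunk, commit_backup) is invented for illustration; no such standard exists today.

```python
# Purely hypothetical sketch of what a vendor-neutral deduplication API layer
# might define. No such standard exists; all names here are invented.

from abc import ABC, abstractmethod
from typing import Iterable, Sequence


class DedupeProvider(ABC):
    """Interface a hardware vendor would implement once, for all backup software."""

    @abstractmethod
    def negotiate(self) -> dict:
        """Report capabilities: chunking scheme, hash algorithm, replication support."""

    @abstractmethod
    def chunk_exists(self, digests: Sequence[str]) -> Iterable[bool]:
        """Answer, per digest, whether the appliance already stores that chunk."""

    @abstractmethod
    def write_chunk(self, digest: str, data: bytes) -> None:
        """Store a chunk the appliance reported as missing."""

    @abstractmethod
    def commit_backup(self, manifest: Sequence[str]) -> str:
        """Seal a backup image as an ordered list of chunk digests; return an image ID."""
```

The appeal is the same as any standard layer: a hardware vendor implements the interface once, and any backup application that speaks it gets source-side, “best”-style deduplication on that appliance without bespoke engineering on either side.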

A standard backup data deduplication layer may seem like a pipe dream, but this wouldn’t be the first time a common layer has been suggested. At one time, backup software vendors each wrote their own backup agents per application workload. Each of those software vendors did its own engineering to figure out how to get the data out of databases and other production applications, as well as how to restore it. Meanwhile, storage hardware vendors used to create their own application agents to enable their own data protection solutions. All those vendors talked about innovation and differentiation (and they were right), but it made solution design and support challenging not just for IT teams, but for partners and vendors, too. What if someone had suggested a common layer for backup software, storage hardware and applications?

Actually, someone did. With Windows as the primary OS, and Exchange and SQL Server as key applications that would benefit, Microsoft introduced Volume Shadow Copy Service (VSS) several years ago to provide that common layer from within the Windows OS. Backup software vendors include VSS requestors in their agents, storage hardware vendors supply VSS providers, and application vendors ship VSS writers. Adoption was slow initially, but now almost every Windows application uses VSS, and everyone benefits. A common layer for backup software and storage hardware to interoperate happened once for application backups. Let’s hope it happens again for deduplication.

 [Originally posted on TechTarget as a recurring columnist]
