As more and more servers are virtualized in data centers, deduplication needs to play a bigger role in protecting their data.
“Improving data backup and recovery” and “Increased use of server virtualization” tied for the top priority in an IT spending survey conducted by Enterprise Strategy Group in 2012. While server virtualization regularly occupies the top spot, it’s interesting to see backup listed there, but not surprising. As server virtualization adoption continues, users are discovering that while it solves many problems, backup isn’t one of them.
VM sprawl and backup I/O
It’s arguably too easy to create new virtual machines (VMs), often resulting in “VM sprawl” and a seemingly unending stream of new virtual hard disk files. With only a few mouse clicks, a new VM is created and large chunks of storage are consumed.
Historically, backup was one of the most I/O-intensive operations for a server, but because physical servers were usually underutilized, they had I/O, memory and compute resources to spare. Server virtualization changes all that by consolidating workloads so that the physical hardware’s I/O and computing capability is used far more fully. That’s fine until the backup begins, which may overwhelm not only the virtualized server, but also its hypervisor host, other VMs on that host and even the storage system they (and perhaps the backup server) depend on.
So it’s no surprise that virtual server data protection strategies are top priorities. And deduplication and optimization are key design criteria for protecting a virtual infrastructure.
Optimizing for data protection
It’s imperative to bring those optimizations as close to the production workload as possible. To enable this, the VMware vStorage APIs in vSphere 5 provide Changed Block Tracking (CBT), which lets the hypervisor and storage track which disk blocks have been written to since the last backup, eliminating much of the I/O that would otherwise be spent comparing data to find what changed. Other hypervisors’ file systems provide similar block-level tracking by enabling file-system filter drivers to monitor I/O and create volume or file-system bitmaps for block selection. Some storage products also integrate with the hypervisors so that tracking is achieved in part through the array’s storage layer, perhaps including the spawning of an offline copy of the backup. Regardless of how it’s done, the result is a significant reduction in I/O that can otherwise severely burden the hypervisor and all its VMs.
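To make the idea of changed-block tracking concrete, here is a minimal sketch in Python. It is not VMware’s CBT or any vendor’s API; the `TrackedDisk` class, the `incremental_backup` function and the 4-byte block size are all hypothetical, chosen only to show why a backup that consults a change record reads far fewer blocks than one that scans the whole disk.

```python
BLOCK_SIZE = 4  # bytes per block; deliberately tiny so the example is readable


class TrackedDisk:
    """Hypothetical virtual disk that records which blocks were written."""

    def __init__(self, num_blocks):
        self.data = bytearray(num_blocks * BLOCK_SIZE)
        self.changed = set()  # indexes of blocks written since the last backup

    def write(self, block_index, payload):
        if len(payload) != BLOCK_SIZE:
            raise ValueError("payload must be exactly one block")
        start = block_index * BLOCK_SIZE
        self.data[start:start + BLOCK_SIZE] = payload
        self.changed.add(block_index)  # the tracking step

    def read(self, block_index):
        start = block_index * BLOCK_SIZE
        return bytes(self.data[start:start + BLOCK_SIZE])


def incremental_backup(disk):
    """Read only the blocks marked as changed, then reset the tracking state."""
    delta = {i: disk.read(i) for i in sorted(disk.changed)}
    disk.changed.clear()
    return delta


disk = TrackedDisk(num_blocks=1000)
disk.write(3, b"AAAA")
disk.write(700, b"BBBB")
first = incremental_backup(disk)   # reads 2 blocks instead of scanning 1,000

disk.write(3, b"CCCC")
second = incremental_backup(disk)  # reads 1 block
```

A real implementation would persist the change record and tie each reset to a specific backup or snapshot; the sketch keeps everything in memory to stay short.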
Dedupe in a virtual world
Along with overall optimization, deduplication has particular ramifications for protecting virtual environments. Aside from where dedupe happens (source, backup server or storage), the “how” and “how wide” must also be considered.
At its simplest, some deduplication works only on successive versions of each file being protected. For example, if a VM is made up of two virtual hard disk (VHD) files, then somewhere on a production hypervisor’s storage system are two VHD files. If you were to back up a Word document and then change only a portion of it, it might be acceptable to keep only the new blocks of the .doc file. Some dedupe methods apply that same per-file logic, and nothing more, to VHDs, so each time a VHD is backed up only its new blocks are retained. But a VHD is constantly changing and, moreover, many of the blocks within it are likely identical to blocks in many other VHDs, such as the blocks that make up the OS for each VM.
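The limitation of per-file deduplication can be sketched in a few lines of Python. Everything here is illustrative: the `PerFileStore` class, the 4-byte fixed block size and the sample “VHD” contents are assumptions made for the example, not any product’s design. Each file gets its own block index, so a second version of the same VHD adds only its new block, while a second VHD that shares the same OS blocks is stored in full.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny fixed-size blocks


def split_blocks(data):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


class PerFileStore:
    """Hypothetical store with one block index per file; nothing is shared."""

    def __init__(self):
        self.blocks_by_file = {}  # file name -> {digest: block}

    def backup(self, name, data):
        """Store blocks not yet seen for this file; return how many were new."""
        known = self.blocks_by_file.setdefault(name, {})
        new_blocks = 0
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            if digest not in known:
                known[digest] = block
                new_blocks += 1
        return new_blocks

    def stored_blocks(self):
        return sum(len(blocks) for blocks in self.blocks_by_file.values())


os_image = b"OS01OS02OS03"  # three blocks standing in for a shared OS
store = PerFileStore()

new_vm1_v1 = store.backup("vm1.vhd", os_image + b"APP1")  # first full backup
new_vm1_v2 = store.backup("vm1.vhd", os_image + b"APP2")  # only one block changed
new_vm2_v1 = store.backup("vm2.vhd", os_image + b"DATA")  # same OS, different file
```

In this sketch the second backup of `vm1.vhd` stores a single block, but `vm2.vhd` stores all four of its blocks, so the three OS blocks end up held twice.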
Thinking more broadly than per-VHD deduplication, other methods retain only the unique blocks across VHDs, but limit the comparison to a single LUN or volume on the hypervisor. So, if a hypervisor has four volumes, each with 10 VMs all running Windows, volume-centric deduplication would end up with four copies of the Windows OS or other application binaries. That’s better than 40 separate copies, but not ideal.
Other dedupe approaches might keep the unique blocks per hypervisor (so one copy of the Windows OS across those 40 VMs) but not deduplicate across the multiple hypervisors that are likely being protected by the same backup server or appliance. This usually stems from a source-side deduplication design that has no awareness of what else is protected beyond what it can see from its hypervisor-centric view.
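The arithmetic behind these scopes is easy to check. The following Python sketch counts how many copies of an identical OS image would be retained under each deduplication scope; the `VM` tuple, the `SCOPES` table and the farm layout are hypothetical and exist only to reproduce the 40-VM example and extend it to a second hypervisor.

```python
from collections import namedtuple

VM = namedtuple("VM", ["host", "volume", "name"])


def build_farm(hosts, volumes_per_host, vms_per_volume):
    """Hypothetical farm in which every VM runs the same OS image."""
    return [
        VM(f"host{h}", f"vol{v}", f"vm{n}")
        for h in range(hosts)
        for v in range(volumes_per_host)
        for n in range(vms_per_volume)
    ]


# Each scope maps a VM to the boundary inside which duplicates are detected.
SCOPES = {
    "per-VHD": lambda vm: (vm.host, vm.volume, vm.name),
    "per-volume": lambda vm: (vm.host, vm.volume),
    "per-hypervisor": lambda vm: (vm.host,),
    "global": lambda vm: (),
}


def os_copies_stored(vms, scope):
    """One copy of the shared OS is kept per distinct dedupe boundary."""
    boundary = SCOPES[scope]
    return len({boundary(vm) for vm in vms})


one_host = build_farm(hosts=1, volumes_per_host=4, vms_per_volume=10)
two_hosts = build_farm(hosts=2, volumes_per_host=4, vms_per_volume=10)

for scope in SCOPES:
    print(scope, os_copies_stored(one_host, scope), os_copies_stored(two_hosts, scope))
```

With one hypervisor the counts are 40, 4, 1 and 1; with two hypervisors they become 80, 8, 2 and 1. Only the widest scope holds the count at a single copy as more hosts are added.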
The last deduplication consideration is file-/object-level deduplication across both physical and virtual servers. Most environments aren’t 100% virtualized, so some physical servers will remain. Moreover, the block-level logic used by some deduplication mechanisms may not identify files that reside on physical servers as matching those that reside within VHDs, or across VHDs when spread across a hypervisor farm with varying storage systems.
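One plausible reason block-level logic can miss such matches, assumed here for illustration rather than taken from any particular product, is alignment: the same file may start at a different offset inside a VHD than it does on a physical disk, so fixed-size blocks cut it at different points. The Python sketch below uses an 8-byte block size and a made-up 3-byte header to show the effect; the helper name and sample data are hypothetical.

```python
import hashlib

BLOCK_SIZE = 8  # deliberately tiny fixed-size blocks


def block_digests(data):
    """Hash each fixed-size block, starting from offset 0 of the container."""
    return {
        hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(data), BLOCK_SIZE)
    }


file_content = b"quarterly-report-contents-v1"

# On the physical server the file is read on its own.
physical_copy = file_content

# Inside the (hypothetical) VHD the same bytes sit after a 3-byte header.
HEADER = b"HDR"
vhd_image = HEADER + file_content + b"\x00" * 5

# Block-level comparison: the shifted offset means no block hashes match.
shared_blocks = block_digests(physical_copy) & block_digests(vhd_image)

# File-level comparison: extract the file from the VHD, then hash it whole.
file_in_vhd = vhd_image[len(HEADER):len(HEADER) + len(file_content)]
file_match = (
    hashlib.sha256(physical_copy).hexdigest()
    == hashlib.sha256(file_in_vhd).hexdigest()
)
```

Here `shared_blocks` is empty while `file_match` is true. The file-level check depends on the backup product being able to read the file system inside the VHD, which the sketch stands in for with a simple slice.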
The call to action is to understand:
- What optimization methods are in use to reduce potential I/O impact on the protected VM, its neighboring VMs, the host and its storage?
- Which method(s) are in use to determine what can be deduped within VHDs, across VHDs, and across hosts and their storage?
You’ll also need to watch for the evolution of hypervisors, whose plumbing (such as VMware vStorage APIs for Data Protection or Microsoft Hyper-V VSS) enables most of the backup software and hardware products to achieve what they do for better backups in virtualized environments.
[Originally posted on TechTarget as a recurring columnist]