With Data Deduplication, Less Is More

Deduplication technology is moving beyond backup and into other applications.

Howard Marks

May 12, 2008

11 Min Read

Ever-growing storage farms, shrinking backup windows, and increasing demand for nearly instantaneous data restores have led to some impressive innovation in the past half decade. Four short years ago, disk-to-disk backup was the hot new technology. The falling price and rising quality of commodity disk arrays made backing up to disk cost-effective, and the advantages, especially on the restore side, were obvious.

Into this hot market in 2004 came a startup called Data Domain, making what seemed to be outlandish claims: its DD400 appliance could not only hold users' backups but could also achieve data reduction ratios of 10- or 20-to-1. Users, accustomed to getting less than the 2-to-1 compression that tape drive vendors had been promising for years, had a hard time believing Data Domain's claims.

Fast-forward to the present, and data deduplication has become de rigueur for backup targets. With the possible exception of high-end virtual tape libraries (VTLs) for petabyte data centers, where high performance trumps capacity, the question isn't "Does your backup target deduplicate data?" but "How does your backup target deduplicate data?"

Deduplication technology also is escaping the backup ghetto, spreading to other applications, from archiving to primary storage and WAN optimization. Before moving on to those state-of-the-art applications, let's take a look at how backup data gets duplicated in the first place.

Duplicate data makes its way into your backup data store in two primary ways. Data gets duplicated in the temporal realm as the same files are backed up repeatedly over time. That could be the four identical copies of an unchanged file kept in the vault by a weekly full backup with a 30-day retention policy. It might also be the first 900 MB of the 1-GB mailbox file that stores your CEO's e-mail: since she receives new mail every day, the nightly incremental backup copies the whole file every night, even though most of it remains unchanged.

Data also can be duplicated in the geographical realm. If you back up the system drive of 50 Windows servers, you now have 50 copies of the Windows server distribution in your backup vault taking up space. At the subfile level, think about how many copies of your company's logo are embedded within the hundreds of thousands of memos, letters, and other documents that fill your file servers.

While data deduplication is generally accepted as valuable technology, there are still a few debates over how and where it should be applied, and none is as heated as the in-line vs. post-process debate. In-line devices, such as those from Data Domain or VTLs based on Diligent Technologies' ProtecTier software, process and deduplicate data in real time, storing only deduplicated data. Since deduplication takes a lot of compute power, the maximum rate at which these devices can deduplicate data limits in-line device performance.
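
To make the distinction concrete, here's a minimal sketch of the two pipelines, assuming a simple in-memory store and hash fingerprints (how fingerprinting works is covered below); the function names are ours, not any vendor's:

```python
import hashlib

store = {}            # fingerprint -> block (the deduplicated pool)
raw_staging = []      # post-process only: raw blocks waiting to be deduplicated

def fingerprint(block: bytes) -> str:
    return hashlib.sha1(block).hexdigest()

def inline_ingest(block: bytes) -> str:
    """In-line: deduplicate in real time; only unique blocks ever reach disk."""
    fp = fingerprint(block)
    store.setdefault(fp, block)      # keep the block only if it's new
    return fp                        # the backup catalog keeps just the reference

def postprocess_ingest(block: bytes) -> int:
    """Post-process: land the raw block now, deduplicate it later."""
    raw_staging.append(block)        # needs extra disk to hold raw backup data
    return len(raw_staging) - 1

def postprocess_sweep() -> None:
    """Later pass that folds the staging area into the deduplicated pool."""
    while raw_staging:
        block = raw_staging.pop()
        store.setdefault(fingerprint(block), block)
```

The trade-off falls straight out of the sketch: the in-line path pays the fingerprinting cost during ingest, while the post-process path pays in staging space and in the risk that the sweep falls behind the incoming backups.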

Promising faster backup rates, post-processing appliances, including those from ExaGrid Systems, Sepaton, and FalconStor Software OEMs such as Sun Microsystems and Copan Systems, first store the backup data and then deduplicate it. This requires that the system dedicate disk space to caching backup data until it's deduplicated, thereby reducing the benefit of deduplication.

In addition, an in-line device begins replicating deduplicated data to a remote site for disaster recovery as it accepts additional backup data, but post-processing could delay replication for several hours. Post-processing also could reduce performance or otherwise complicate spooling secondary copies off to tape.

The nightmare scenario for post-processing is one in which deduplication isn't fast enough to finish processing last night's backups before today's backups begin. The system then needs even more space to hold today's backups until they can be deduped, and it will eventually run out of space unless the deduplication process catches up.

The question users have to ask when choosing between an in-line and a post-process approach is whether the in-line system is fast enough for their backups to complete within the backup window. If it is, the simplicity of the in-line approach is a clear winner. If not, post-processing may be easier than managing multiple deduplication domains on multiple devices.

Today, higher-end in-line systems can ingest data at 200 MBps to 400 MBps per node. Data Domain and Quantum can build multinode clusters that are faster, but each node in the cluster represents its own deduplication domain, using more disk space and complicating management. Since the process is mostly CPU-bound, every time a new generation of processors comes out, vendors can boost performance by about 50%. Later this year, Diligent plans to release a two-node cluster version of ProtecTier that will share storage and act as a single deduplication domain while ingesting data at about 700 MBps and eliminating the gateway as a single point of failure.

[Chart: Impact Assessment: Data Deduplication]

PICK AND CHOOSE
While we talk about data deduplication as if it were a single technique, there are several ways to slice and dice data. The simplest is single-instance storage, which uses symbolic links so one file can appear to be in multiple places with different names. Microsoft's Windows Storage Server has a background task that finds and eliminates duplicate files this way. Single-instance storage is a good start, but to get really impressive data reduction, you have to work at a finer level.
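
As a rough illustration only--not how the Windows Storage Server groveler actually works--a whole-file deduplicator can hash each file's contents and replace later duplicates with links to the first copy:

```python
import hashlib
import os

def single_instance(root: str) -> None:
    """Replace duplicate files under root with hard links to the first copy seen."""
    seen = {}  # content hash -> path of the first file with that content
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            if digest in seen:
                os.remove(path)                 # drop the duplicate...
                os.link(seen[digest], path)     # ...and point at the original
            else:
                seen[digest] = path
```

Production implementations run as background tasks, verify contents byte for byte before linking, and use filesystem mechanisms such as SIS reparse points rather than plain hard links.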

Hash-based systems such as NEC's Hydrastor and Quantum's DXi series divide data into blocks and use an algorithm like MD5 or SHA-1 to create a fingerprint for each block. When a second block generates the same fingerprint, the system creates a pointer to the first block rather than saving the second.
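
A minimal sketch of that bookkeeping, assuming fixed 8-KB blocks and SHA-1 fingerprints (shipping products choose their own block sizes and hash functions):

```python
import hashlib

BLOCK_SIZE = 8 * 1024          # assumed block size
blocks = {}                    # fingerprint -> stored block

def write_stream(data: bytes) -> list[str]:
    """Store a data stream; return its list of fingerprints (the 'recipe')."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fp = hashlib.sha1(block).hexdigest()
        if fp not in blocks:       # first time we've seen this block: keep it
            blocks[fp] = block
        recipe.append(fp)          # duplicates become pointers, not copies
    return recipe

def read_stream(recipe: list[str]) -> bytes:
    """Rebuild the original stream from its recipe of fingerprints."""
    return b"".join(blocks[fp] for fp in recipe)
```

Restores simply walk the recipe; much of the engineering effort in real systems goes into keeping the fingerprint index fast as it grows.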

While hash-based deduplication sounds simple, the devil, as they say, is in the details. The first gotcha is figuring out where the block boundaries should go. Fixed-length blocks are easier to implement, but a small amount of data inserted into an existing file shifts everything after it relative to the block boundaries, so otherwise identical data goes unrecognized. Variable block schemes are more complex but usually deliver greater deduplication.
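
One common way to get variable blocks is content-defined chunking: a rolling hash over a small window decides where boundaries fall, so an insertion only disturbs the chunks around it. A simplified sketch (real systems typically use Rabin fingerprints and enforce minimum and maximum chunk sizes, which this toy version omits):

```python
WINDOW = 48                     # bytes in the rolling window (an assumption)
PRIME = 1_000_003               # base of the toy Rabin-Karp style rolling hash
MOD = (1 << 31) - 1
MASK = (1 << 13) - 1            # cut when the low 13 bits are zero: ~8-KB average chunks

def chunks(data: bytes):
    """Yield variable-size chunks whose boundaries depend on content, not offsets."""
    pow_w = pow(PRIME, WINDOW - 1, MOD)   # factor for removing the byte leaving the window
    start, h = 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:                       # slide the window forward
            h = (h - data[i - WINDOW] * pow_w) % MOD
        h = (h * PRIME + byte) % MOD
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            yield data[start:i + 1]                   # the content says: cut here
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]                            # whatever is left over
```

Because the cut points depend on the bytes themselves rather than on offsets, data after an insertion realigns with blocks already in the store, which is where the extra deduplication comes from.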

The other snag is the possibility of hash collisions. While the likelihood of a collision is on the order of 1 in 10^20 per petabyte of data--a few billion times less likely than being hit by lightning--vendors have recognized that customers are concerned about even this low probability and have added byte-by-byte comparisons or a second hash, calculated with a different algorithm, as a double check.
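
The arithmetic behind figures like that is the birthday bound: with an n-bit hash and k stored blocks, the chance of any collision is roughly k^2/2^(n+1). A quick sketch with assumed parameters--128-bit fingerprints and 8-KB average blocks; vendors' own figures vary with both choices:

```python
# Birthday-bound estimate of a fingerprint collision somewhere in the store.
# The parameters here are assumptions for illustration; real products differ.
HASH_BITS = 128                 # e.g., MD5-sized fingerprints
AVG_BLOCK = 8 * 1024            # assumed average block size in bytes
PETABYTE = 10 ** 15

blocks = PETABYTE // AVG_BLOCK                       # ~1.2e11 blocks per petabyte
p_collision = blocks ** 2 / 2 ** (HASH_BITS + 1)
print(f"{blocks:.2e} blocks -> collision probability ~ {p_collision:.1e}")
# ~2e-17 with these assumptions; a 160-bit hash such as SHA-1 pushes the
# figure below 1e-26, which is why vendors call the risk negligible.
```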

Hashing isn't the only way to deduplicate data. Content-aware systems like ExaGrid's and Sepaton's DeltaStor understand the format used by the backup applications that write to them. They compare the version of each file backed up to the version from the last backup, then store only the changes.
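
A crude sketch of that idea: given the previous and current versions of a backed-up file, keep only the regions that changed plus a recipe for rebuilding the new version (real products parse the backup application's data format to find those file versions in the first place; the function names here are invented):

```python
def delta(old: bytes, new: bytes, block: int = 4096):
    """Compare versions block by block; keep only the blocks that differ."""
    recipe, changed = [], {}
    for i in range(0, len(new), block):
        chunk = new[i:i + block]
        if old[i:i + block] == chunk:
            recipe.append(("old", i))       # unchanged: reference the prior version
        else:
            changed[i] = chunk              # changed: store the new bytes
            recipe.append(("new", i))
    return recipe, changed

def rebuild(old: bytes, recipe, changed, block: int = 4096) -> bytes:
    """Reassemble the new version from the prior version plus the stored changes."""
    return b"".join(old[i:i + block] if kind == "old" else changed[i]
                    for kind, i in recipe)
```

Real delta differencing tolerates insertions and deletions rather than comparing at fixed offsets, but the comparison is always against an earlier version of the same file.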

The rub is that content awareness alone only identifies data duplication in the temporal realm. It can be efficient at storing the 15 new messages in a huge e-mail file, but it won't reduce the amount of space that 400 copies of the corporate memo template take up across all users' home directories.

Whereas hash-based systems can deduplicate any data you throw at them, the makers of content-aware systems have to explicitly build in support for the file and/or tape data format of each backup application they support. This has caused some consternation--for example, among users who bought Sepaton VTLs and then tried to use EMC's NetWorker or CommVault Galaxy, which Sepaton doesn't yet support. They're stuck waiting for Sepaton to add support for their backup programs.

[Diagram: Global Data Deduplication]

Saving disk space in the data center is a pretty neat trick, but it's in remote-office and branch-office backups that deduplication really shines, reducing not just the amount of disk space needed at headquarters to back up all the remote offices, but also the WAN bandwidth it takes to get the data there. Organizations can eliminate their remote-office tape-handling problems by replacing the tape drives in remote offices with a deduplicating appliance, such as Quantum's DXi or a Data Domain box, that replicates its data back to another appliance in the data center while keeping existing backup software and processes in place.

The other solution is to use remote-office and branch-office backup software like EMC's Avamar, Symantec's NetBackup PureDisk, or Asigra's Televaulting, which perform hash-based data deduplication at the source to vastly reduce the WAN bandwidth needed to transfer the backup data to company headquarters.

Like any conventional backup application making an incremental backup, remote-office and branch-office backup products use the usual methods--archive bits, last-modified dates, and the file system change journal--to identify the files that have changed since the last backup. They then slice, dice, and julienne each file into smaller blocks and calculate a hash for each block.

Why Deduplicate?

LOCAL BACKUP
Deduplication lets virtual tape libraries and other disk-based backup systems store 10 to 30 times as much data as nondeduplicated systems, thereby extending retention and simplifying restores.

REMOTE BACKUP
Replicating deduplicated data makes backup across the WAN practical. Global deduplication even eliminates duplicates across multiple remote offices.

ARCHIVES
In addition to saving disk space in the archive, deduplication hashes serve as data signatures that help ensure data integrity.

WAN ACCELERATION
Hashing data, eliminating duplicate blocks, and caching in WAN acceleration appliances speed applications and replication without the cache coherency problems associated with wide area file services.

The hashes are then compared with a local cache of the hashes of blocks that have already been backed up at that site. Hashes that don't appear in the local cache, along with file system metadata, are sent to a grid of servers that acts as the central backup data store, which compares them against its own hash tables. The central store sends back the list of hashes it hasn't seen before; the server being backed up then transmits the data blocks represented by those hashes to the central store for safekeeping.
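
A minimal sketch of that exchange--the class and function names are invented, and Avamar, PureDisk, and Televaulting each have their own wire protocols:

```python
import hashlib

# --- central data store (headquarters) ---
class CentralStore:
    def __init__(self):
        self.blocks = {}                       # fingerprint -> block

    def unknown(self, fingerprints):
        """Return the fingerprints this store has never seen."""
        return [fp for fp in fingerprints if fp not in self.blocks]

    def put(self, blocks_by_fp):
        self.blocks.update(blocks_by_fp)

# --- remote-office client ---
def backup(data_blocks, local_cache: set, store: CentralStore):
    fps = {hashlib.sha1(b).hexdigest(): b for b in data_blocks}
    candidates = [fp for fp in fps if fp not in local_cache]   # skip blocks known locally
    needed = store.unknown(candidates)         # server answers: "send me these"
    store.put({fp: fps[fp] for fp in needed})  # only brand-new blocks cross the WAN
    local_cache.update(fps)                    # remember what this site has seen
```

Only the fingerprints of locally new blocks, plus the blocks the central store has never seen, cross the WAN, which is where the bandwidth savings come from.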

These backup systems could reach even higher data-reduction levels than the backup targets by deduplicating not just the data from the set of servers that are backed up to a single target or even a cluster of targets, but across the entire enterprise. If the CEO sends a 100-MB PowerPoint presentation to all 500 branch offices, it will be backed up from the one whose backup schedule runs first. All the others will just send hashes to the home office and be told, "We already got that, thanks."

This approach is also less susceptible to the hash-index scalability issues that affect hash-based backup targets. Since each remote server caches only the hashes for its local data, that hash table shouldn't outgrow available space, and since the disk I/O system at the central site is far faster than the WAN feeding the backups, even searching a huge hash index on disk beats sending the data itself.

Although Avamar, NetBackup PureDisk, and Televaulting all share a similar architecture and are priced based on the size of the deduplicated data store, there are some differences. NetBackup PureDisk uses a fixed 128-KB block size, whereas Televaulting and Avamar use variable block sizes, which could result in greater deduplication. Asigra also markets Televaulting for service providers so small businesses that don't want to set up their own infrastructure can take advantage of deduplication, too.

PACKAGE DEAL
We're also starting to see deduplication techniques appear to varying degrees in the established backup applications familiar to system administrators. This trend lets users keep backups online longer without the expense of a deduplicating target or the headache of replacing their backup software and processes.

CommVault's Galaxy now includes single-instance storage as a standard part of its backup-to-disk functions, fingerprinting each file's contents and eliminating duplicate copies from the backup set. This approach can't achieve the data reduction rates that subfile deduplication can, but it does eliminate duplicate files in both the temporal and geographical realms, letting Galaxy users retain data on disk longer.

EMC and Symantec have both started integrating their source deduplicating remote-office and branch-office packages with their enterprise backup systems. Companies can use EMC's NetWorker to schedule, manage, and monitor Avamar backup jobs so a single console can manage both local and remote backup.

Still other vendors are offering deduplication technology for primary storage. NetApp, for example, added its deduplication technology, previously called A-SIS, to its Data ONTAP operating system last year, aiming it at archival data.

However companies tackle data deduplication, it's clear that the technology has become an important part of their backup strategies. Look for it to spread even further through the network ecosystem, preserving valuable disk space and bandwidth.

Illustration by John Hersey

Continue to the sidebar:
Deduplication Checklist

About the Author(s)

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at DeepStorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems, and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino, and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real-world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop, and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at http://www.deepstorage.net/NEW/GBoS
