Primary Storage Dedupe For Backup?

Howard Marks

February 22, 2011


Now that data deduplication for primary storage is going mainstream, I'm starting to get questions from students at backup school about using storage systems designed to deduplicate the files stored on them as backup targets. While the idea of using the same technology for your primary and backup data sounds attractive, some of the folks I've spoken to who have tried substituting a ZFS box for a Data Domain appliance have seen disappointing results.

For a block-and-hash deduplication scheme to identify, and therefore deduplicate, duplicate data, it has to break the data up into blocks so that the same data falls into the same blocks the same way every time. If data in one block is identical to data in another, but offset by just one byte, the two blocks will generate different hashes and the deduplication system will store them both. Some backup dedupe vendors, including Quantum and Data Domain, have sophisticated algorithms for figuring out where their variable-size blocks should begin and end to maximize the probability of recognizing duplicate data. Quantum even holds a patent on the technique, which Data Domain licensed in exchange for stock before EMC bought Data Domain.
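To make that offset problem concrete, here's a minimal Python sketch (the block size and data are invented for illustration) of a fixed-block, block-and-hash check. Shift the same data by a single byte and every block hash changes:

```python
import hashlib
import os

def fixed_block_hashes(data, block_size=4096):
    """Hash each fixed-size block of the data."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

payload = os.urandom(16 * 4096)          # 16 blocks worth of data
shifted = b"X" + payload                 # identical data, offset by one byte

before = fixed_block_hashes(payload)
after = fixed_block_hashes(shifted)

# Every block boundary now slices the data differently, so no hashes match
# and a fixed-block system would store the whole stream a second time.
matches = sum(a == b for a, b in zip(before, after))
print(f"{matches} of {len(before)} block hashes match after a 1-byte shift")
```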

Variable-block deduplication requires a lot of compute power to figure out where the block boundaries should be. Since primary storage deduplication has to balance write latency against dedupe efficiency, and because it's usually implemented as an extension of an existing file system that wasn't designed to store one block of 3,654 bytes and the next of 1,892, primary storage deduplication systems usually use fixed block sizes.
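For contrast, here's a toy content-defined chunker, again just a sketch: the rolling value, mask and size limits are made up and bear no resemblance to any vendor's patented boundary logic. It shows where the extra per-byte compute goes, and why the resulting blocks come out at odd sizes like 3,654 and 1,892 bytes:

```python
import hashlib
import os

def content_defined_chunks(data, mask=0x0FFF, min_size=1024, max_size=16384):
    """Cut a chunk wherever a cheap rolling value over recent bytes hits a
    boundary pattern, so boundaries follow the content, not absolute offsets."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) ^ byte) & 0xFFFFFFFF   # per-byte work
        size = i - start + 1
        if (size >= min_size and (rolling & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

data = os.urandom(256 * 1024)
shifted = b"X" + data                     # same data, offset by one byte
before = {hashlib.sha256(c).hexdigest() for c in content_defined_chunks(data)}
after = {hashlib.sha256(c).hexdigest() for c in content_defined_chunks(shifted)}
print(f"{len(before & after)} of {len(before)} chunks recognized after the shift")
```

Because the boundary decision depends only on nearby bytes, the one-byte shift disturbs little more than the first chunk, which is exactly the property the fixed-block approach lacks.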

That existing file system provides some assistance to the dedupe system by aligning the beginning of each file with the beginning of a block. This means primary storage dedupe will always identify duplicate files and will work well with file systems that hold a large number of small files. In addition, since many applications, like databases, read and write fixed-size pages, duplicate data within a database or across databases will also be detected as long as the page size is a multiple of the underlying block size.
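Here's a rough sketch of why that alignment matters; the 4KB dedupe block and 8KB database page sizes are assumptions for illustration. Because the page size is a clean multiple of the block size, a duplicated page produces the same block hashes wherever it sits in either file:

```python
import hashlib
import os

BLOCK = 4096      # assumed primary storage dedupe block size
PAGE = 8192       # assumed database page size, a clean multiple of BLOCK

def block_hashes(data, block=BLOCK):
    return [hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]

page = os.urandom(PAGE)                                # one database page
db_one = os.urandom(PAGE) + page + os.urandom(PAGE)    # page appears here...
db_two = page + os.urandom(2 * PAGE)                   # ...and in another database

already_stored = set(block_hashes(db_one))
new_blocks = [h for h in block_hashes(db_two) if h not in already_stored]

# The duplicated page lines up on block boundaries in both files, so its two
# 4KB blocks hash identically and only the genuinely new data gets stored.
print(f"{len(new_blocks)} of {len(block_hashes(db_two))} blocks are new")
```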

The problem primary storage dedupe systems have with backup data is that most conventional backup applications don't write the files they're backing up to the storage system intact, but instead create aggregate files that are the logical equivalent of .tar or .zip files. The media management functions of the backup apps then pretend that each aggregate file is a tape cartridge.

Data within one of these aggregate files, even when it's the same data that we backed up last week, isn't necessarily in the same place it was the last time. If a new 27-byte file, say c:\aaaa.txt, is backed up at the beginning of the job, all the rest of the data will be offset by 27 bytes, confusing the fixed-block dedupe system. To add insult to injury, the aggregate file formats include metadata about both the data being backed up and the progress of the backup job. This metadata, which is interspersed with the real data, can also confuse a simple dedupe process.
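Here's a toy simulation of that effect. The header format, file names and sizes are invented, and real backup containers are far more elaborate, but the result is the same: one small file added at the front of this week's aggregate shifts everything behind it, and a fixed-block comparison of the two weekly streams finds almost nothing in common:

```python
import hashlib
import os

BLOCK = 4096

def block_hashes(data, block=BLOCK):
    return {hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)}

def aggregate(files):
    """Concatenate files tar-style with a fake per-file metadata header."""
    stream = b""
    for name, payload in files:
        stream += b"HDR:" + name.ljust(60, b" ") + payload
    return stream

unchanged = [(b"reports.mdb", os.urandom(512 * 1024)),
             (b"mail.pst", os.urandom(512 * 1024))]

week1 = aggregate(unchanged)
# This week a new 27-byte file lands first; it (plus its header) shifts
# every byte of the unchanged data relative to the fixed block grid.
week2 = aggregate([(br"c:\aaaa.txt", b"x" * 27)] + unchanged)

old, new = block_hashes(week1), block_hashes(week2)
print(f"{len(new & old)} of {len(new)} blocks deduplicated against last week")
```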

The vendors that make deduping appliances for backup data spend a lot of time reverse-engineering the common backup apps' aggregate formats to ensure that their systems can get the maximum deduplication ratios for the data they're storing. Vendors like Sepaton and ExaGrid that do delta differencing, as opposed to block-and-hash deduplication, get much, if not most, of their deduplication by examining the backup stream, seeing that WINSOCK.DLL is being backed up yet again, and saving a pointer to the last copy.
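A minimal sketch of that file-level idea (the catalog and storage layout are invented, and a real delta-differencing engine also handles files that have changed slightly, which this skips): if the whole file's content hash has been seen before, save a pointer instead of the data:

```python
import hashlib

catalog = {}          # content hash -> where we already stored that file

def store_file(name, payload):
    digest = hashlib.sha256(payload).hexdigest()
    if digest in catalog:
        return ("pointer", catalog[digest])    # seen it before: just point
    location = f"store/{digest[:12]}"
    catalog[digest] = location                 # first time: keep the real bytes
    return ("stored", location)

dll = b"MZ" + b"the same library contents as last week " * 1000
print(store_file("WINSOCK.DLL", dll))   # first backup: full copy stored
print(store_file("WINSOCK.DLL", dll))   # next backup: pointer to the last copy
```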

So while we can dedupe both primary storage and backup data, a tool designed for the kind of data you're trying to store will probably give you a better combination of data reduction and performance than one designed for another task, or than a Swiss Army knife.

About the Author

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at DeepStorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M. and concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
