More On Chunking


Howard Marks

March 28, 2011


The last time we looked at the chunking process in data deduplication engines ("Your Mileage Will Vary: Chunking"), we were looking pretty favorably at variable chunking that used the contents of the data to assign chunk boundaries. However, as deduplication moves from backup appliances accepting tape, or other backup application-specific format data, into backup applications and primary storage, the advantages of fixed-chunk deduplication start to become apparent.

The primary advantage of fixed-chunk deduplication is lower CPU overhead. Fixed-chunk systems don't have to spend any CPU cycles examining data to determine where the chunk boundaries should be; they just break data up into chunks like any other file system. In fact, some primary storage deduplication, like NetApp's, simply uses the underlying file system's blocks as its chunks.
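To make that concrete, here is a minimal sketch of fixed-chunk deduplication in Python. The 4KB chunk size, SHA-256 fingerprints and in-memory dictionary standing in for the chunk store are illustrative assumptions, not any vendor's implementation; the point is that boundary placement costs almost nothing because it's pure arithmetic on offsets.

    import hashlib

    CHUNK_SIZE = 4096  # illustrative: match the underlying file system's block size

    def fixed_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
        """Split a byte stream into fixed-size chunks; no per-byte analysis needed."""
        for offset in range(0, len(data), chunk_size):
            yield data[offset:offset + chunk_size]

    def dedupe(data: bytes, store: dict) -> int:
        """Store only chunks whose fingerprint hasn't been seen; return bytes saved."""
        saved = 0
        for chunk in fixed_chunks(data):
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint in store:
                saved += len(chunk)         # duplicate chunk: just reference it
            else:
                store[fingerprint] = chunk  # new chunk: store it once
        return saved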

Lower overhead also means lower latency; computing where to put the chunk boundaries takes some time. While vendors have done their best to reduce this additional latency, and will claim it's not noticeable, it exists and might be a problem for primary storage deduplication systems.
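For comparison, a content-defined chunker has to run a rolling hash over every byte just to decide where a chunk ends. The sketch below uses a generic Rabin-Karp-style rolling hash with an illustrative 48-byte window and 13-bit boundary mask (roughly 8KB average chunks); real engines use their own, often proprietary, variations, but the per-byte work is the same in spirit, and that's where the extra CPU time and latency come from.

    def variable_chunks(data: bytes, window: int = 48, mask: int = (1 << 13) - 1):
        """Content-defined chunking: update a rolling hash for every byte and cut a
        chunk wherever the hash's low bits are all zero."""
        base, mod = 257, (1 << 61) - 1
        top = pow(base, window - 1, mod)   # weight of the byte leaving the window
        h, start = 0, 0
        for i, byte in enumerate(data):
            if i >= window:
                h = (h - data[i - window] * top) % mod  # drop the oldest byte
            h = (h * base + byte) % mod                 # add the newest byte
            if i + 1 - start >= window and (h & mask) == 0:
                yield data[start:i + 1]                 # boundary found
                start = i + 1
        if start < len(data):
            yield data[start:]                          # final partial chunk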

Backup applications are a simple lot. In their heart of hearts they just want to be sending a stream of data to a tape drive somewhere. Since they're making large sequential write requests to a small number of large files, a few milliseconds of latency per request won't have a big impact. For conventional backup applications like NetBackup or Networker, throughput is all-important, and latency less so.

Primary storage applications, even simple ones like hosting users' home folders, are much more latency-sensitive. In addition, rather than writing to a small number of very large files as backup applications do, primary storage environments hold millions of files of all sizes. Since each file begins on a fresh data chunk, an insertion or other change that could throw off the chunk alignment affects only one file's worth of data; every new file realigns the process.

Software-based deduplication--especially applications that deduplicate at the source server, like Avamar, PureDisk or Asigra's Cloud Backup--also uses the file start and end to determine chunk boundaries. These applications first identify files that have changed, like a conventional incremental backup, then start the chunking process on each file.
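A rough sketch of that per-file approach, again with illustrative fixed 4KB chunks and SHA-256 fingerprints: the caller passes in the files an incremental-style scan has already flagged as changed, and chunking restarts at every file, so a change in one file can't shift the alignment of any other.

    import hashlib
    from pathlib import Path

    CHUNK_SIZE = 4096  # illustrative chunk size

    def chunk_file(path: Path, store: dict) -> None:
        """Chunking restarts at each file boundary, so one file's changes never
        shift another file's chunk alignment."""
        data = path.read_bytes()
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)

    def backup_changed_files(changed_paths, store: dict) -> None:
        """Source-side sketch: files already identified as changed (as in an
        incremental backup) are each chunked independently."""
        for path in changed_paths:
            chunk_file(Path(path), store)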

Using file boundaries can also optimize fixed-chunk deduplication on backup targets, if the deduplication engine in the backup target knows the format of the tape-image or aggregate, tarball-like files your backup application writes its data in. The dedupe engine can then determine the start and end of each file within the tarball and realign chunks to those boundaries. Content awareness also allows backup appliances to recognize the index marks and catalog data that backup applications insert into the tarball and keep them from throwing off the chunking.
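Real appliances parse vendor-specific tape and aggregate formats, but Python's standard tarfile module makes a reasonable stand-in to show the idea: walk the members of the archive and chunk each one's payload separately, so the archive's own headers and catalog records can't throw off the alignment of the file data.

    import hashlib
    import tarfile

    CHUNK_SIZE = 4096  # illustrative chunk size

    def chunk_tar_members(tar_path: str, store: dict) -> None:
        """Content-aware sketch: chunk each member of a tar-format backup stream
        on its own, realigning chunks to file boundaries inside the archive."""
        with tarfile.open(tar_path) as archive:
            for member in archive.getmembers():
                if not member.isfile():
                    continue                      # skip directories, links, header-only entries
                payload = archive.extractfile(member).read()
                for offset in range(0, len(payload), CHUNK_SIZE):
                    chunk = payload[offset:offset + CHUNK_SIZE]
                    store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)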

However, fixed-chunk systems can choke on some data. I know of one Data Domain user that used Exchange backups to test Symantec's PureDisk deduplication. They were retaining 40 backups of their Exchange servers in a given amount of storage on the Data Domains, but were unable to store four backups of the Exchange data in the same amount of storage deduped by PureDisk. Exchange data is a small number of large database files where the files change internally between backups, the worst case for PureDisk's dedupe engine. Now, if you used a fixed-chunk dedupe engine where the chunk was smaller than a database page ...

Disclosure: DeepStorage.net has done work for NetApp, Symantec and EMC, whose products were mentioned in this post.

About the Author(s)

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at Deepstorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., and concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real-world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
