More On Chunking

The last time we looked at the chunking process in data deduplication engines ("Your Mileage Will Vary: Chunking"), we came away looking pretty favorably on variable chunking, which uses the contents of the data itself to assign chunk boundaries. However, as deduplication moves from backup appliances that accept tape-format or other backup-application-specific data into backup applications and primary storage, the advantages of fixed-chunk deduplication start to become apparent.
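
To make the distinction concrete, here's a rough sketch of content-defined (variable) chunking in Python. The gear-style rolling hash, the mask and the size limits are arbitrary choices for illustration, not any particular vendor's algorithm; the point is that every single byte gets hashed and tested before a boundary can be declared.

import random

# 256 random 32-bit "gear" values; the fixed seed just keeps the sketch repeatable.
random.seed(1)
GEAR = [random.getrandbits(32) for _ in range(256)]

def variable_chunks(data, mask=0x1FFF, min_size=2048, max_size=65536):
    # A boundary is declared wherever the low bits of the rolling hash are all
    # zero, so boundary positions depend only on the bytes just before them.
    # That's what lets boundaries resynchronize after an insertion and keep
    # finding duplicates in edited data.
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF   # older bytes shift out of the hash
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks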

The primary advantage of fixed-chunk deduplication is lower CPU overhead. Fixed-chunk systems don't have to spend any CPU cycles examining data to determine where the chunk borders should be; they just break data up into fixed-size chunks like any other file system. In fact, some primary storage deduplication, like NetApp's, simply uses the underlying file system's blocks as its chunks.
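
By comparison, a fixed-chunk engine's inner loop can be as simple as the sketch below. The 4 KB block size and SHA-256 fingerprints are my assumptions for illustration; what matters is that there's no per-byte boundary math at all, just one hash per block.

import hashlib

BLOCK_SIZE = 4096   # assumed 4 KB blocks, in line with a typical file system block

def dedupe_fixed(data, store):
    # Split the data at fixed offsets, fingerprint each block, and keep a
    # block only if its fingerprint hasn't been seen before. The returned
    # "recipe" is the list of fingerprints needed to rebuild the data.
    recipe = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)     # new blocks go in, duplicates don't
        recipe.append(fp)
    return recipe

Write the same 16 KB of data twice with this sketch and the store ends up holding just the four unique blocks, not eight; the second write simply produces another recipe pointing at the blocks that are already there.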

Lower overhead also means lower latency; computing where to put the chunk boundaries takes some time. While vendors have done their best to reduce this additional latency, and will claim it's not noticeable, it exists and might be a problem for primary storage deduplication systems.

Backup applications are a simple lot. In their heart of hearts, they just want to be sending a stream of data to a tape drive somewhere. Since they're making large sequential write requests to a small number of large files, a few milliseconds of latency per request won't have a big impact. For conventional backup applications like NetBackup or Networker, throughput is all-important and latency much less so.

Primary storage applications, even simple ones like hosting users' home folders, are much more latency-sensitive. In addition, rather than writing to a small number of very large files the way backup applications do, primary storage environments hold millions of files of all sizes. That mix actually plays to fixed chunking's strength: since each file begins on a fresh data chunk, an insertion or other change that would throw off the chunk alignment affects only that one file's worth of data, and every new file realigns the process.
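
Here's a quick, hypothetical illustration of that alignment point, reusing fixed 4 KB blocks: inserting a single byte at the front of a file shifts every block in that file, but every other file is chunked from its own offset zero, so its blocks, and its deduplication, are untouched.

import hashlib

def fingerprints(data, block=4096):
    # Fingerprint each fixed-size block of one file, starting from its own offset zero.
    return [hashlib.sha256(data[o:o + block]).hexdigest()
            for o in range(0, len(data), block)]

original = bytes(range(256)) * 64        # a 16 KB file
edited = b"!" + original                 # the same file with one byte inserted up front

shared = set(fingerprints(original)) & set(fingerprints(edited))
print(len(shared))   # 0 -- the insertion misaligned every block in this one file

# Any other file is fingerprinted from its own first byte, so the edit above
# doesn't move its blocks or hurt its deduplication.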

Howard Marks is founder and chief scientist at Deepstorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., concentrating on storage and data center networking.