Howard Marks | Commentary
Your Mileage Will Vary: Chunking

I've said many times that nowhere in the field of information technology are the words "your mileage may vary" truer than when discussing data deduplication. How much your data will shrink when run through a given vendor's deduplication engine can vary significantly depending on the data you're trying to dedupe and on how well that particular engine handles that kind of data. One critical factor is how the deduping engine breaks your data down into chunks.


Most data deduplication engines work by breaking the data into chunks and using a hash function to identify which chunks contain the same data. Once the system has identified duplicate data, it stores a single copy and uses pointers in its internal file or chunk management system to keep track of where that chunk was in the original data set.
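
To make the bookkeeping concrete, here's a minimal sketch in Python. The function names and the plain dictionary standing in for the chunk store are my own illustration of the hash-and-pointer idea, not any vendor's actual implementation:

```python
import hashlib


def dedupe_stream(chunks, chunk_store):
    """Store each unique chunk once; return a pointer list recording
    where every chunk sat in the original data stream."""
    pointers = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()  # the hash identifies the chunk's contents
        if digest not in chunk_store:               # only previously unseen data gets stored
            chunk_store[digest] = chunk
        pointers.append(digest)                     # a duplicate just becomes another pointer
    return pointers


def rehydrate(pointers, chunk_store):
    """Rebuild the original data set by following the pointers."""
    return b"".join(chunk_store[p] for p in pointers)
```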

While most deduplication systems use this basic technique, the details of how they decide what data goes into which chunk vary significantly. Some systems just take your data and break it into fixed-size chunks. The system may, for example, decide that a chunk is 8 KBytes or 64 KBytes and then break your data into chunks of that size, regardless of the content of the data.
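
A fixed-size chunker really is that simple; a quick sketch, assuming an 8 KByte chunk size:

```python
def fixed_chunks(data, chunk_size=8 * 1024):
    """Slice data (a bytes object) into fixed-size chunks, ignoring content.
    The final chunk may be shorter if the data isn't an even multiple."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
```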

Other systems analyze the data mathematically, choosing the spots where their secret chunk-making function generates especially high or low values as the boundaries between chunks. On these systems, chunk sizes vary with the data, but within limits set by the magic formula, so a chunk may be anywhere from 8 KBytes to 64 KBytes, depending on the data.
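
Here's a generic sketch of that variable-size, content-defined approach built on a simple rolling hash. The window size, boundary mask, and 8 KByte-to-64 KByte limits are illustrative assumptions; real products use their own proprietary fingerprinting functions and parameters:

```python
WINDOW = 48            # bytes covered by the rolling hash window
MIN_CHUNK = 8 * 1024   # never cut a chunk smaller than 8 KBytes
MAX_CHUNK = 64 * 1024  # never let a chunk grow past 64 KBytes
MASK = 0x1FFF          # a cut point fires when the low 13 bits of the hash are zero
PRIME = 31
MOD = 1 << 32
POW = pow(PRIME, WINDOW - 1, MOD)  # coefficient of the byte about to leave the window


def variable_chunks(data):
    """Cut data (a bytes object) wherever the rolling hash hits the boundary
    condition, keeping every chunk between MIN_CHUNK and MAX_CHUNK bytes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:                  # slide the window: drop the oldest byte
            h = (h - data[i - WINDOW] * POW) % MOD
        h = (h * PRIME + byte) % MOD             # ...and fold in the newest byte
        size = i - start + 1
        if ((h & MASK) == 0 and size >= MIN_CHUNK) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])     # the content, not a byte offset, picked this spot
            start, h = i + 1, 0                  # restart the window for the next chunk
    if start < len(data):
        chunks.append(data[start:])              # whatever is left becomes the final chunk
    return chunks
```

Because the cut points come from the data itself rather than from fixed byte offsets, an edit near the front of a file only disturbs the chunks around the change; the rest of the chunks still hash to the same values.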

If we implement these two techniques on backup appliances and back up a set of servers with a conventional backup application like NetBackup or Arcserve, the backup app will walk the file system and concatenate the data into a tape-format file on the backup appliance.

Howard Marks is founder and chief scientist at Deepstorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., concentrating on storage and data center networking.