Data De-Dupe & Archiving

De-duplication in archiving can show great rewards, but it should not be the primary determining factor in selecting a system

George Crump

November 8, 2008


10:15 AM -- Data de-duplication made its first real market inroads as a backup target. It provided an alternative to standard disk-to-disk backup that allowed you to retain data for a longer period of time. Backup is tailor-made for de-duplication because of the high degree of similarity between successive full backup jobs. But does de-duplication make sense in archiving?

As is always the case, where you end up in this discussion depends on how you define archiving, how long you need to retain data, and what your motivation is for retaining it.

De-duplication devices in the backup market will claim 20X or more storage efficiency, but most leaders in this market are assuming that full backups are run at a certain frequency. Between daily incremental jobs, you may typically only achieve 4X to 6X efficiency. On average, we tend to see about 12X to 16X storage efficiency from a backup data de-duplication system. (In an upcoming entry we will go into detail on backup de-dupe rates.)
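
To see why the blend lands well below the headline number, consider a back-of-the-envelope model: overall efficiency is total logical data divided by total physical, post-de-dupe data across the backup cycle. The Python sketch below uses purely hypothetical job sizes and per-job ratios (they are not measurements from any product) and still comes out near 9X, far from the 20X claim.

```python
def blended_ratio(jobs):
    """jobs: list of (logical_gb, dedupe_ratio) pairs, one per backup job.
    Returns overall efficiency: logical data over physical data stored."""
    logical = sum(gb for gb, _ in jobs)
    physical = sum(gb / ratio for gb, ratio in jobs)
    return logical / physical

# Hypothetical week: one 1,000 GB full de-duping at ~20X, plus six
# 100 GB daily incrementals de-duping at ~5X. All figures illustrative.
week = [(1000, 20.0)] + [(100, 5.0)] * 6
print(f"Blended efficiency: {blended_ratio(week):.1f}X")  # ~9.4X
```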

Archiving today has many use cases, but two of the more common motivations are getting older data off of primary storage to reduce costs and storing data to fulfill a legal or corporate governance requirement. In both cases, data is specifically placed on the device for a purpose, and the files are often unique; as a result, the amount of commonality between them is limited -- 2X to 4X storage efficiency is a typical average.

There are exceptions where de-dupe efficiencies can be fairly high in archive storage. I know of several organizations that are creating an archive of their production databases every night so that they can view that data at any point in time. For example, one uses a database to track trading activity. They want the ability to backtrack any inconsistencies in trading or malicious activity within the database. While this database receives thousands of updates a day, as a percentage it does not change much from day to day. The archive system that they are using can do sub-file-level data de-duplication and, as a result, the de-duplication efficiency on that system is well over 30X.

Another example is VMware Inc. (NYSE: VMW). I know of several organizations that are archiving their VMware VMDK files to an archive system, either for OS preservation or for actual virtual machine archiving to limit virtual machine sprawl.
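
To make the database example concrete, here is a minimal sketch of sub-file de-duplication using fixed-size blocks and SHA-256 fingerprints. This is an assumption for illustration -- the article does not name the archive system's actual method, and real products often use variable, content-defined chunking -- but thirty simulated nightly archives of a file that changes only slightly each day come out close to 30X.

```python
import hashlib
import os

BLOCK = 4096  # fixed-size blocks for simplicity; real systems often
              # use variable, content-defined chunking instead

def store_blocks(data: bytes, store: dict) -> list:
    """Keep only blocks not already in `store`; return the file's
    recipe of block fingerprints."""
    recipe = []
    for i in range(0, len(data), BLOCK):
        chunk = data[i:i + BLOCK]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # physical write for new blocks only
        recipe.append(digest)
    return recipe

# Simulate 30 nightly archives of a database file that receives a
# small burst of updates each day (one block touched per night).
store, logical = {}, 0
db = bytearray(os.urandom(BLOCK * 1000))  # ~4 MB stand-in "database"
for night in range(30):
    offset = night * BLOCK
    db[offset:offset + 12] = b"trade-update"  # tiny daily change
    store_blocks(bytes(db), store)
    logical += len(db)

physical = sum(len(c) for c in store.values())
print(f"De-dupe efficiency: {logical / physical:.1f}X")  # ~29X
```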

A final possible use case where there can be benefits is a combined data backup and archive system. If you have an archive process that moves a file into the archive, and the archive already contains the byte-level information about that file from the backup, you can create your archive while consuming little or no net new storage. A couple of considerations here: Make sure your de-dupe system can scale to retain both the backup and the archive data for their full retention periods. Also make sure your backup software does not write data in a different byte-pattern stream than your archive software does; if the two streams do not align, the common content will not de-duplicate, as the sketch below illustrates.
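
The byte-stream caveat is easy to demonstrate. In this hypothetical sketch, the backup software wraps the file in a container with a 512-byte header while the archive software writes the raw file; with fixed-block de-duplication, the shifted alignment means the two copies of identical data share no blocks at all. (Content-defined chunking reduces this effect but does not always eliminate it.)

```python
import hashlib
import os

BLOCK = 4096

def block_hashes(data: bytes) -> set:
    """Fingerprint every fixed-size block in a stream."""
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

payload = os.urandom(BLOCK * 100)        # the file's actual contents
backup_stream = b"\x00" * 512 + payload  # hypothetical backup format:
                                         # a 512-byte container header
archive_stream = payload                 # archive stores the raw file

shared = block_hashes(backup_stream) & block_hashes(archive_stream)
print(f"Blocks shared between backup and archive: {len(shared)}")
# Prints 0: the header shifts every block boundary, so the identical
# payload does not de-duplicate at all across the two streams.
```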

While there are other use cases where de-dupe in archiving shows great reward, it should not be the primary determining factor in selecting an archive system, unless you have one of the specific requirements above. Archive systems need to be examined for scalability, data safety, data security, retention capabilities, non-proprietary access, and power efficiency.

George Crump is founder of Storage Switzerland, which provides strategic consulting and analysis to storage users, suppliers, and integrators. Prior to Storage Switzerland, he was CTO at one of the nation's largest integrators. Previous installments of his discussion on data de-duplication can be found here.
