Too Much De-Dupe

Suppliers need to name their data reduction solutions for what they are and what they do, not for the hottest marketing term

George Crump

January 30, 2009

In the race to offer some form of data reduction, it seems that every supplier wants to call its solution "Data De-Duplication" to capitalize on the term. The generic term should be Data Reduction, NOT Data De-Duplication.

Data de-duplication, popularized by Avamar and Data Domain in the backup space, is the identification of redundant segments of information within files and across volumes. The comparison should always be at a sub-file level, allowing two consecutive backups of a database, for example, to store only the components within that database that changed. Even with this, there is room for suppliers to differentiate: there is source-side de-dupe vs. target-side, and of course the now-famous inline vs. post-processing debate.
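
To make the sub-file idea concrete, here is a minimal sketch of a de-duplicating store. The fixed 4 KB chunk size and SHA-256 fingerprints are illustrative assumptions on my part; the article does not describe any vendor's implementation, and shipping products typically use variable-size, content-defined chunking.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative; real products typically use variable-size, content-defined chunks

class DedupeStore:
    """Toy sub-file de-duplication store: each unique chunk is kept only once."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes, stored once
        self.catalog = {}  # backup name -> ordered list of fingerprints

    def backup(self, name, data):
        """Ingest one backup; return how many new bytes actually had to be stored."""
        fingerprints, new_bytes = [], 0
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:  # only previously unseen chunks consume space
                self.chunks[fp] = chunk
                new_bytes += len(chunk)
            fingerprints.append(fp)
        self.catalog[name] = fingerprints
        return new_bytes

    def restore(self, name):
        """Reassemble a backup from its chunk fingerprints."""
        return b"".join(self.chunks[fp] for fp in self.catalog[name])
```

Running the database example through such a store, the second backup would consume new space only for the chunks that changed between the two runs.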

Data de-duplication is also moving beyond the backup space into the archive space, with suppliers like Permabit and Nexsan. There are even content-aware solutions, like those from Ocarina Networks, that work at a sub-file level and can provide data reduction on data that appears identical to the user but differs at the byte level, such as rich media, Windows fileshares, and audio and video, prior to moving that data to an archive.

A device that can identify duplicate files -- i.e., the same copy of a file across multiple volumes -- should really be called "single instancing," not "de-duplication." While this technology can reduce the size of backup storage somewhat, it would still have to store each night's copy of the database in the example above in full. This type of technology makes sense in email, email archive solutions, and, in some cases, primary storage, since it should have minor performance implications.
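
For contrast with the chunk-level sketch above, here is a minimal single-instancing store, again with hashing details assumed purely for illustration. The whole file is the unit of comparison, so identical copies across volumes are stored once, but a nightly database copy that differs by even one byte is stored again in full.

```python
import hashlib

class SingleInstanceStore:
    """Toy single-instance store: the whole file is the unit of comparison."""

    def __init__(self):
        self.files = {}  # whole-file fingerprint -> file bytes, stored once
        self.refs = {}   # (volume, path) -> fingerprint

    def ingest(self, volume, path, data):
        """Store a file; return how many new bytes actually had to be stored."""
        fp = hashlib.sha256(data).hexdigest()
        new_bytes = 0
        if fp not in self.files:  # an identical copy on another volume costs nothing extra
            self.files[fp] = data
            new_bytes = len(data)
        self.refs[(volume, path)] = fp
        return new_bytes
```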

In here somewhere goes the conversation about block-level incremental backup and replication technologies, but the term needs a refresh. Most of these technologies work by making a block-level copy, volume by volume, of a server on a secondary storage target. Now, however, some of these suppliers can snapshot the secondary storage and either automatically roll that snapshot to tape or present the snapshot as a read/writable volume. Clearly, this is more than backup.

The newest entry into the data reduction field is real-time data compression. Storwize was one of the first, and it looks as if it will be joined soon by others. These solutions are interesting in that they compress everything -- active storage, secondary storage, and backup storage. They are complementary to de-duplication: compressing the data before de-duplication has actually shown better results. Most impressive is how well these solutions perform, with almost no effect on the performance of the application accessing them.
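
Conceptually, a real-time compression layer sits in the data path and compresses on write and expands on read, with the application seeing only the original bytes. The sketch below uses Python's zlib purely as a stand-in; the article says nothing about any vendor's actual compression engine, so the algorithm choice is an assumption for illustration only.

```python
import zlib

class CompressedVolume:
    """Toy real-time compression layer: compress on write, expand on read."""

    def __init__(self):
        self.blocks = {}  # logical block number -> compressed bytes

    def write(self, block_no, data):
        # The application hands over plain bytes; only the compressed form is stored.
        self.blocks[block_no] = zlib.compress(data)

    def read(self, block_no):
        # The application gets the original bytes back, unaware of the compression.
        return zlib.decompress(self.blocks[block_no])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())
```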

Suppliers need to name their data reduction solutions for what they are and what they do, not for the hottest marketing term. The categories above should help them do so.

Please register to attend our upcoming seminar on Primary Storage Optimization.

George Crump is founder of Storage Switzerland, which provides strategic consulting and analysis to storage users, suppliers, and integrators. Prior to Storage Switzerland, he was CTO at one of the nation's largest integrators.
