
Too Much De-Dupe

In the race to offer some form of data reduction, it seems that every supplier wants to call its solution "Data De-Duplication" to capitalize on the term. The generic term should be Data Reduction, NOT Data De-Duplication.

Data de-duplication, popularized by Avamar and Data Domain in the backup space, is the identification of redundant segments of information within files and across volumes. The comparison should always happen at a sub-file level, allowing two consecutive backups of a database, for example, to store only the components within that database that changed. Even here there is room for suppliers to differentiate: you have source-side vs. target-side de-dupe, and of course the now-famous inline vs. post-processing debate.
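To make the sub-file distinction concrete, here is a minimal Python sketch of hash-based chunk de-duplication. The fixed 4 KB chunk size, SHA-256 fingerprints, and function names are illustrative assumptions, not any vendor's implementation; shipping products typically use variable-size, content-defined chunking, but the principle is the same: a chunk already seen is stored once and referenced thereafter.

```python
import hashlib

CHUNK = 4096  # fixed-size chunks for illustration; real products often
              # use variable-size, content-defined chunking

def dedupe_file(path, store):
    """Chunk a file and store only chunks not already in `store`.

    `store` maps SHA-256 digests to chunk bytes. Returns the ordered
    list of digests (the "recipe") needed to reconstruct the file.
    """
    recipe = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in store:
                store[digest] = chunk  # new data: stored exactly once
            recipe.append(digest)      # repeated data: just a reference
    return recipe
```

Run it over the same database file on two consecutive nights and the second pass stores only the chunks that actually changed; everything else is a reference back into `store`.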

Data de-duplication is also moving beyond the backup space and into the archive space, with suppliers like Permabit and Nexsan. There are even content-aware solutions, like those from Ocarina Networks, that work at a sub-file level and can provide data reduction on data that is visually similar but different at the byte level, such as rich media, Windows fileshares, and audio and video, prior to moving that data to an archive.

A device that can identify duplicate files, i.e., the same copy of a file across multiple volumes, should really be called "single instancing," not "de-duplication." While this technology can reduce the size of backup storage somewhat, it would still have to store each night's copy of the database in the example above in full. This type of technology makes sense in email and email archive solutions and, in some cases, on primary storage, since it should have minimal performance implications.
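By contrast with the chunk-level sketch above, a single-instancing sketch hashes the whole file, so it can only collapse exact duplicates. Again, the helper names and the SHA-256 choice are assumptions for illustration:

```python
import hashlib

def file_digest(path):
    """SHA-256 of an entire file: identical files, identical digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def single_instance(paths):
    """Keep one copy per unique file content; map every path to it."""
    store, index = {}, {}
    for path in paths:
        digest = file_digest(path)
        store.setdefault(digest, path)  # first occurrence is the single instance
        index[path] = digest            # later duplicates only add a pointer
    return store, index
```

This exposes the limitation described above: change one byte in tonight's database copy and the whole file hashes differently, so it is stored again in full.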

Somewhere in here belongs the conversation about block-level incremental backup and replication technologies, but the term needs a refresh. Most of these technologies work by making a block-level copy, volume by volume, of a server onto a secondary storage target. Now, however, some of these suppliers can snapshot the secondary storage and either automatically roll that snapshot to tape or present it as a read/writable volume. Clearly, this is more than backup.
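For the block-level incremental idea, a rough sketch under the same caveats: keep a digest per block from the last pass and copy only the blocks whose digest changed. The 4 KB block size and function names are hypothetical, chosen only to show the mechanism.

```python
import hashlib

BLOCK = 4096  # block size is an assumption for illustration

def block_digests(path):
    """Map each block offset in a volume image to its SHA-256 digest."""
    digests, offset = {}, 0
    with open(path, "rb") as f:
        while block := f.read(BLOCK):
            digests[offset] = hashlib.sha256(block).digest()
            offset += BLOCK
    return digests

def changed_blocks(path, prev_digests):
    """Yield (offset, data) only for blocks that differ from the last pass."""
    offset = 0
    with open(path, "rb") as f:
        while block := f.read(BLOCK):
            if prev_digests.get(offset) != hashlib.sha256(block).digest():
                yield offset, block  # only changed blocks hit secondary storage
            offset += BLOCK
```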
