Network Computing is part of the Informa Tech Division of Informa PLC

Analysis: Data De-Duping: Page 2 of 9

Point Of Origin

Duplicate data makes its way into backups in two ways: across the temporal realm, as your backup program backs up the same file from the same directory multiple times, and across locations, as the same files are backed up from multiple places in your network. Most networks have a surprising amount of duplicate data, from the holiday party invitation PDF that 56 users saved to their home directories to the 3 GB of Windows files on the system drive of every server.

One solution to file duplication in the temporal realm is incremental backup. Although we're big fans of this, especially the incremental-forever approach used by Tivoli Storage Manager and others, we don't consider incremental backups to be data de-duplication any more than we consider RAID to be disaster recovery. Incremental backups fall in the realm of duplicate avoidance.

The most basic form of data de-duplication is the file-level single-instance store found in CAS (content-addressable storage) devices, such as EMC's Centera. As each file is stored on a CAS system, the device generates a hash of the file's contents; should a file with the same hash already exist, rather than saving another copy, the system just creates another pointer to the copy it already has.
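
To make the hash-and-pointer idea concrete, here is a minimal sketch of a file-level single-instance store. It is an illustration only, not how Centera or any other CAS product is built: the class name, the in-memory dictionaries and the choice of SHA-256 are all our assumptions.

```python
import hashlib


class SingleInstanceStore:
    """Toy file-level single-instance store keyed by content hash.

    Hypothetical illustration: names and the SHA-256 choice are assumptions.
    """

    def __init__(self):
        self.blobs = {}     # content digest -> file contents, stored once
        self.pointers = {}  # file name -> content digest

    def put(self, name, data):
        """Store data under name, keeping only one copy per distinct content."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            # First time this content is seen: save the actual bytes.
            self.blobs[digest] = data
        # Every name, new content or not, just records a pointer to the digest.
        self.pointers[name] = digest
        return digest

    def get(self, name):
        """Return the contents that name points to."""
        return self.blobs[self.pointers[name]]


store = SingleInstanceStore()
invite = b"Holiday party invitation"
store.put("home/alice/party.pdf", invite)
store.put("home/bob/party.pdf", invite)
store.put("home/bob/notes.txt", b"Something else entirely")
```

After these three calls the store holds three pointers but only two stored objects, because both copies of the invitation resolve to the same digest. A production system also has to decide what to do about the (remote) possibility of two different files producing the same hash, which this sketch ignores.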

Microsoft's latest version of Windows Storage Server (WSS), the OEM NAS (network-attached storage) version of Windows Server, takes a slightly different approach to eliminating duplicate files. Rather than identify duplicates as they're written, WSS runs a background process, the SIS (single-instance storage) Groveler. The Groveler identifies duplicate files using a partial file hash followed by a full binary comparison, moves one copy to a common storage area and replaces the files in their original locations with links to the copy in the common store.
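
The two-stage check, a cheap partial hash to find candidates followed by a full binary comparison to confirm them, can be sketched as a post-process scan. This is not Microsoft's Groveler code: the function names, the 4 KB partial-hash window, the use of SHA-256 and the extra grouping by file size are our assumptions, and the final step of moving files to a common store and leaving links behind is left out.

```python
import filecmp
import hashlib
import os
from collections import defaultdict

PARTIAL_BYTES = 4096  # assumption: hash only the first 4 KB of each file


def partial_hash(path, nbytes=PARTIAL_BYTES):
    """Hash just the leading bytes of a file (cheap, but not conclusive)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(nbytes)).hexdigest()


def find_duplicates(root):
    """Return (kept_file, duplicate_file) pairs for identical files under root."""
    # Stage 1: group files by size and partial hash to find candidates.
    candidates = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in sorted(filenames):
            path = os.path.join(dirpath, fname)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            key = (os.path.getsize(path), partial_hash(path))
            candidates[key].append(path)

    # Stage 2: confirm each candidate with a full byte-for-byte comparison.
    pairs = []
    for paths in candidates.values():
        keepers = []  # distinct contents seen so far within this group
        for path in paths:
            for keeper in keepers:
                if filecmp.cmp(keeper, path, shallow=False):
                    pairs.append((keeper, path))
                    break
            else:
                keepers.append(path)
    return pairs
```

Calling `find_duplicates` on a directory tree returns the pairs a background process could then act on. The partial hash keeps the scan cheap, since most files are ruled out after reading only a few kilobytes, while the full comparison guards against files that merely start with the same bytes.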