Analysis: Data De-Duping

Although file-level SIS can save some space, things get really interesting if we eliminate not only duplicate files but also duplicate data within files. Think of Outlook's lowly .PST file. A typical user may have a 300-MB or larger .PST holding all his e-mail from time immemorial; he receives one or more new messages every day, and because the .PST file changes that day, your backup program includes the whole thing in the incremental backup even though only 25 KB of the 300-MB file has changed.

A de-duping product that could identify that 25 KB of new data and store it without the rest of the baggage could save lots of disk space. Extend that concept so duplicate data, such as the 550-KB attachment that sits in 20 users' .PST files, is stored only once, and you could achieve staggering data-reduction factors. One group of such solutions is the data de-duping backup target pioneered by Data Domain. These devices look to a backup application like a VTL (virtual tape library) or NAS device. They take in data from the backup app and do their de-duplication magic on it transparently.
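To get a feel for why those reduction factors get so large, the arithmetic below plugs in the article's own numbers. It's a back-of-envelope sketch only, not a model of any particular product.

```python
# Back-of-envelope arithmetic using the figures from the examples above.
KB = 1024
MB = 1024 * KB

# Daily incremental of one 300-MB .PST with 25 KB of new mail:
file_level_backup = 300 * MB   # a file-level incremental copies the whole file
sub_file_backup = 25 * KB      # a sub-file de-duper stores only the new data
print(f"Per-PST daily reduction: {file_level_backup / sub_file_backup:,.0f}x")

# A 550-KB attachment sitting in 20 users' .PST files:
stored_naively = 20 * 550 * KB   # 20 copies without de-duplication
stored_deduped = 550 * KB        # one copy, referenced 20 times
print(f"Shared-attachment reduction: {stored_naively / stored_deduped:.0f}x")
```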

Modus Operandi

Vendors have taken three basic approaches to the data de-duplication process. The hash-based approach, used by Data Domain, FalconStor Software in its VTL software and Quantum in its new DXi-series appliances, breaks the data stream from the backup app into blocks and generates a hash for each block, using SHA-1, MD5 or a similar algorithm. If the hash for a new block matches a hash that's already in the device's hash index, the data has already been backed up, and the device just updates its tables to say the data exists in the new location too.
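Here's a minimal sketch of that hash-based flow, assuming fixed 8-KB blocks, SHA-1 digests and a plain in-memory dictionary standing in for the hash index; the class, block size and test data are illustrative choices, not a description of any vendor's implementation.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # assumed fixed block size, purely for illustration


class HashDedupTarget:
    """Toy de-duping backup target: stores one copy of each unique block."""

    def __init__(self):
        self.index = {}   # SHA-1 digest -> location of the stored block
        self.store = []   # simulated disk holding the unique blocks

    def ingest(self, stream: bytes) -> list[int]:
        """Break a backup stream into blocks, storing only new ones."""
        locations = []
        for offset in range(0, len(stream), BLOCK_SIZE):
            block = stream[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha1(block).digest()
            if digest in self.index:
                # Hash matches: the block is already stored, so just record
                # that the data also exists at this new location.
                locations.append(self.index[digest])
            else:
                # New block: write it and add its hash to the index.
                self.store.append(block)
                self.index[digest] = len(self.store) - 1
                locations.append(self.index[digest])
        return locations


# Two backup streams sharing a large chunk of data (say, the same attachment):
target = HashDedupTarget()
shared = b"attachment" * 10_000
target.ingest(b"user A mail " * 1_000 + shared)
target.ingest(b"user B mail " * 1_000 + shared)
print(f"Unique blocks stored: {len(target.store)}")
```

The second stream adds only the blocks that actually differ; everything covering the shared attachment hits the index and is recorded as a reference rather than stored again.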

The hash-based approach has a built-in scalability issue: to quickly tell whether a given block of data has already been backed up, the device must hold the hash index in memory. As the number of backed-up blocks grows, so does the index. Once the index grows beyond the device's ability to hold it in memory, performance falls off, because disk lookups are much slower than memory lookups. As a result, most hash-based systems are self-contained appliances that balance the amount of memory against the amount of disk space for storing data, so the hash table never grows too big.
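A rough sizing calculation shows why the index outgrows RAM; the 8-KB block size and the 40 bytes of assumed per-entry overhead are illustrative figures, not numbers from any shipping appliance.

```python
# Rough sizing of the in-memory hash index. Block size and per-entry
# overhead are assumptions for illustration only.
TB = 1024 ** 4
GB = 1024 ** 3

block_size = 8 * 1024    # assumed average block size
entry_size = 20 + 40     # 20-byte SHA-1 digest + assumed pointer/overhead

for unique_tb in (1, 10, 50):
    blocks = unique_tb * TB // block_size
    index_bytes = blocks * entry_size
    print(f"{unique_tb:>2} TB of unique blocks -> ~{index_bytes / GB:.0f} GB of hash index")
```

Even at these rough numbers, the index quickly exceeds what an appliance can keep in memory, which is why vendors size RAM against usable disk capacity rather than letting either grow independently.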