Network Computing is part of the Informa Tech Division of Informa PLC

Analysis: Data De-Duping: Page 8 of 9

We've heard several comments from users afraid to use hash-based de-duping because there's a possibility of a hash collision--two sets of data generating the same hash--and, therefore, data corruption. Although there's some risk of data corruption through a hash collision, it's much smaller than the risks storage admins live with every day.
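To make the collision worry concrete, here is a minimal sketch of how hash-based de-duping works: blocks are stored once, keyed by their hash, and files are kept as lists of hashes. The function names and fixed-size chunking are our own simplification; real products typically use variable-size chunking.

```python
import hashlib

BLOCK_SIZE = 4096  # 4 KB blocks, a common de-duping granularity


def dedupe(data: bytes) -> tuple[dict, list]:
    """Split data into fixed-size blocks and store each unique block once.

    Returns the block store (digest -> block) and the recipe of digests
    needed to reconstruct the original data.
    """
    store, recipe = {}, []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha1(block).hexdigest()  # 160-bit hash
        # A collision here -- two different blocks producing the same
        # digest -- would silently keep only the first block's contents.
        # That is the corruption case users worry about.
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe


def restore(store: dict, recipe: list) -> bytes:
    """Reassemble the original data from the store and the recipe."""
    return b"".join(store[d] for d in recipe)
```

Note that identical blocks (the three "A" blocks in a backup of unchanged data, say) are stored exactly once; only the recipe grows.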

De-duping setups typically use MD5 (a 128-bit hash) or SHA-1 (a 160-bit hash). The probability of two random blocks of data generating identical MD5 hashes is approximately 1 in 10^37. If a petabyte of data were de-duped using MD5 with an average block size of 4 KB, the odds would still be about 10^20 to 1 (a hundred billion billion to 1) against any two blocks having the same hash.
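A back-of-the-envelope version of this estimate uses the standard birthday approximation, P ≈ n²/2^(bits+1), for n blocks and a b-bit hash. The block count and the approximation are our assumptions; the exact exponent depends on how the estimate is set up, but any reasonable set of assumptions lands in the same vanishingly-small territory.

```python
HASH_BITS = 128               # MD5 digest size
PETABYTE = 2 ** 50            # bytes
BLOCK = 4 * 1024              # assumed 4 KB average block size

n_blocks = PETABYTE // BLOCK  # ~2.7e11 blocks in a petabyte
# Birthday approximation: P(any two of n blocks collide) ~ n^2 / 2^(bits+1)
p_collision = n_blocks ** 2 / 2 ** (HASH_BITS + 1)
print(f"{n_blocks:.2e} blocks -> collision probability ~ {p_collision:.0e}")
```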

By comparison, the probability that both drives of a mirrored set of drives with an MTBF of 1 million hours will fail within 1 hour of each other is 1 in 10^12--roughly a hundred million times more likely than a hash collision. Data sent across Ethernet or Fibre Channel is protected by a CRC-32 checksum, which has a probability of undetected data errors of approximately 1 in 4x10^9 (or 1 in 4 billion).
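The mirrored-drive figure is easy to check, assuming independent failures and a constant failure rate of one failure per MTBF hours (both simplifying assumptions):

```python
MTBF_HOURS = 1_000_000    # 1 million hours per drive
p_hour = 1 / MTBF_HOURS   # chance a given drive fails in a given hour
p_both = p_hour * p_hour  # both drives of the mirror fail in the same hour
print(p_both)             # ~1e-12, i.e., "1 in 10^12"
```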

It's also important to remember that a hash collision, however unlikely, doesn't mean a total loss of data. If a de-duping system incorrectly identifies two data blocks as containing the same data when they don't, the system will continue operating. When the data is restored, the one file whose data was misidentified will be corrupted, but all the other data will be restored correctly. We put hash collisions on our list of worries somewhere below asteroid strikes and the mega-volcano at Yellowstone erupting.

The larger risk inherent in data de-duplication is catastrophic data loss from a hardware failure. Since the data from any given backup job--and, in fact, any given large file--is broken up into blocks and spread across the whole de-duping data store, it doesn't matter how many times you backed up a server: lose a RAID set in the de-duping device and you lose data from every backup that referenced the lost blocks. This makes enhanced data-protection features, such as battery-backed cache and RAID 6, even more important for de-duping targets than for primary storage applications.
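This amplification effect can be sketched in a few lines. The backup names, block contents, and recipes below are hypothetical; the point is that one shared block, once lost, renders every backup that referenced it unrestorable.

```python
import hashlib


def h(b: bytes) -> str:
    """Digest used as the block's key in the de-dupe store."""
    return hashlib.sha1(b).hexdigest()


# Hypothetical store: three nightly backups whose recipes share one block.
shared = b"common OS block"
store = {h(shared): shared,
         h(b"mon"): b"mon", h(b"tue"): b"tue", h(b"wed"): b"wed"}
backups = {
    "monday":    [h(shared), h(b"mon")],
    "tuesday":   [h(shared), h(b"tue")],
    "wednesday": [h(shared), h(b"wed")],
}

# Simulate a failed RAID set taking out the one stored copy of the block.
del store[h(shared)]

unrestorable = [name for name, recipe in backups.items()
                if any(d not in store for d in recipe)]
print(unrestorable)  # every backup depended on the lost block
```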