Network Computing is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

When Hashes Collide: Page 2 of 2

Curtis even had a math Ph.D. create a spreadsheet to calculate the odds of a hash collision, which you can download from his Website BackupCentral.com/hashodds.xls. In order for the probability of a hash collision to equal the 10^15 odds of a disk read error, you would need 5x10^16 data blocks or 432 yottabytes of data in 8K blocks. I cheated and used the high-precision calculator at www.ttmath.org/online_calculator to compute that, for a deduping system with four petabytes of stored data in 8K blocks, the probability of a hash collision is 4.5x10^26, or about the same as a tape read error with perfect media.

Now, it's true that people tend to avoid catastrophic events, even if they're very unlikely, while accepting much higher probabilities of events that have lesser consequences. As a result, we mine coal for electricity knowing miners will die and people will get asthma, but we won't build nuclear power plants. But a hash collision doesn't ruin all your backup data. It just means that one block of data will be restored with the wrong data, just like a tape or disk read error.

One hash collision, one corrupt file to restore once every 10^26 times you backup 3PB of data. Seems like a reasonable risk to me. After all, I cross the street every morning to walk the dog, and I could get run over by a streetcar--or by Jessica Alba, if she reads this blog. No, I won't calculate the probability of that.