When Hashes Collide

When Hashes Collide: Page 2 of 2

If there was any doubt in my mind that data deduplication is a mainstream technology, it was wiped out when I saw--in the business section of The New York Times last week--a full-page ad from Symantec touting its deduplication technology. Even so, I still occasionally run into people who consider deduplication to be a dangerous form of black magic that is likely to mangle their data and end their careers. This attitude represents an overestimation of the likelihood of a hash collision in dedupli

Howard Marks

December 15, 2010

Curtis even had a math Ph.D. create a spreadsheet to calculate the odds of a hash collision, which you can download from his website at BackupCentral.com/hashodds.xls. In order for the probability of a hash collision to equal the 10^-15 odds of a disk read error, you would need 5x10^16 data blocks, or about 432 exabytes of data in 8K blocks. I cheated and used the high-precision calculator at www.ttmath.org/online_calculator to compute that, for a deduping system with four petabytes of stored data in 8K blocks, the probability of a hash collision is 4.5x10^-26, or about the same as a tape read error with perfect media.
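For readers who would rather check this kind of arithmetic themselves than trust a spreadsheet, here is a minimal Python sketch of the standard birthday-bound approximation. It is not Curtis's spreadsheet or the ttmath calculation. It assumes a 160-bit hash (the size of SHA-1, which this page does not actually name) whose outputs are uniformly distributed, and the function names are purely illustrative.

```python
import math

def collision_probability(num_blocks: int, hash_bits: int = 160) -> float:
    """Approximate chance of at least one collision among num_blocks hashes.

    Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).
    hash_bits=160 is an assumption (SHA-1 sized), not a figure from the article.
    """
    exponent = -num_blocks * (num_blocks - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is far below 1e-16
    return -math.expm1(exponent)

def blocks_for_probability(p: float, hash_bits: int = 160) -> float:
    """Approximate number of blocks at which the collision odds reach p."""
    return math.sqrt(2.0 ** (hash_bits + 1) * -math.log1p(-p))

BLOCK_SIZE = 8 * 1024          # 8K blocks
STORED_BYTES = 4 * 10**15      # four petabytes, counted in decimal units

n = STORED_BYTES // BLOCK_SIZE
print(f"blocks stored: {n:.3e}")
print(f"collision probability: {collision_probability(n):.3e}")
print(f"blocks needed for 1e-15 odds: {blocks_for_probability(1e-15):.3e}")
```

Under these assumptions the results should land in the same ballpark as the figures quoted above; the exact values shift with the hash width and with whether a petabyte is counted in decimal or binary units.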

Now, it's true that people tend to avoid catastrophic events, even if they're very unlikely, while accepting much higher probabilities of events that have lesser consequences. As a result, we mine coal for electricity knowing miners will die and people will get asthma, but we won't build nuclear power plants. But a hash collision doesn't ruin all your backup data. It just means that one block of data will be restored with the wrong data, just like a tape or disk read error.

One hash collision, one corrupt file to restore once every 10^26 times you back up 3PB of data. Seems like a reasonable risk to me. After all, I cross the street every morning to walk the dog, and I could get run over by a streetcar--or by Jessica Alba, if she reads this blog. No, I won't calculate the probability of that.
