Data Deduplication: 7 Factors To Consider

Deduplication has become a standard feature in storage arrays. Here are some issues to weigh with the technology.

Jim O'Reilly

September 21, 2016


Almost every all-flash array (AFA) or storage appliance offers deduplication as a way to make the available space go further. Deduplication computes a hash that is almost certainly unique for any given object or file, and compares it with the hashes of data already stored in the system. If the hash of a new file or object matches an existing hash, the system creates a pointer to the existing copy instead of storing the new object, saving space.
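The hash-and-pointer mechanism described above can be sketched in a few lines of Python. This is a toy illustration only, not how any particular array implements it: the `DedupStore` class, its method names, and the choice of SHA-256 are assumptions made for the example, and real products typically deduplicate at the block or chunk level rather than on whole objects.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical): one stored copy per unique hash."""

    def __init__(self):
        self.blobs = {}     # hash digest -> stored bytes (one copy each)
        self.pointers = {}  # object name -> hash digest

    def put(self, name, data):
        """Store data under name; return True if new content was written."""
        digest = hashlib.sha256(data).hexdigest()  # SHA-256 is an assumption
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data    # first time this content is seen
        self.pointers[name] = digest     # a duplicate only adds a pointer
        return is_new

    def get(self, name):
        """Follow the pointer back to the single stored copy."""
        return self.blobs[self.pointers[name]]

    def stored_bytes(self):
        """Physical bytes actually held, after deduplication."""
        return sum(len(b) for b in self.blobs.values())


store = DedupStore()
store.put("vm1/kernel.img", b"A" * 1000)
store.put("vm2/kernel.img", b"A" * 1000)  # duplicate content, pointer only
store.put("vm1/user.dat", b"B" * 500)
print(store.stored_bytes())  # 1500 bytes stored for 2500 bytes written
```

Three objects totaling 2,500 bytes are written, but only the two unique payloads occupy space; both kernel image names resolve to the same stored copy.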

The efficiency of this process clearly depends heavily on the data. Virtual desktop servers hold hundreds of duplicate files, so the space savings can be large, while, at the other end of the spectrum, high-performance computing works with huge files in which duplication is rare. That’s why vendor claims for the amount of space saved with deduplication tend to be all over the map.
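To see how strongly the savings depend on the data, the sketch below computes a deduplication ratio (bytes written divided by bytes actually stored) for two synthetic workloads. The `dedup_ratio` helper and both datasets are invented for illustration; they are not measurements from any real virtual desktop or HPC environment.

```python
import hashlib


def dedup_ratio(objects):
    """Return logical bytes / physical bytes for a list of byte strings."""
    logical = sum(len(o) for o in objects)
    unique = {}
    for o in objects:
        # Keep one size entry per distinct content hash.
        unique[hashlib.sha256(o).digest()] = len(o)
    physical = sum(unique.values())
    return logical / physical if physical else 1.0


# Synthetic "virtual desktop" set: 50 copies of one image plus 50 small unique files.
vdi = [b"os-image" * 100] * 50 + [bytes([i]) * 100 for i in range(50)]

# Synthetic "HPC" set: 100 files, every one different.
hpc = [bytes([i]) * 1000 for i in range(100)]

print(round(dedup_ratio(vdi), 2))  # well above 1: most bytes are duplicates
print(dedup_ratio(hpc))            # 1.0: nothing to deduplicate
```

Running the same measurement against a sample of your own data is a more reliable guide than a vendor's headline ratio.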

In the real world, most deduplication runs as a background task after the data is written to the AFA or storage appliance. That will change radically over the next few years as algorithm accelerators kick in, so monitor changes in your options going forward.

Deduplication usage is by no means ubiquitous, even when supported by the array or appliance. Worries about data integrity due to duplicated hashes or fear of the reliance on single copies still abound, though, frankly, they are easily shown to be urban myths. The benefits in capacity savings alone make deduplication important.
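A rough birthday-bound estimate shows why the duplicated-hash worry is so remote. The figures here are assumptions for illustration, since hash widths vary by product: a 256-bit hash that behaves like a uniform random function, and a system holding about a trillion ($2^{40}$) distinct objects.

```latex
% Birthday bound: probability that any two of n distinct objects
% share a b-bit hash, assuming uniformly distributed hash values.
P_{\text{collision}} \;\le\; \frac{n(n-1)}{2^{\,b+1}} \;\approx\; \frac{n^{2}}{2^{\,b+1}}

% Assumed values: n = 2^{40} objects, b = 256 bits.
P_{\text{collision}} \;\approx\; \frac{2^{80}}{2^{257}} \;=\; 2^{-177}
```

Under these assumptions the chance of a collision anywhere in the system is far smaller than the undetected error rates usually quoted for disks and memory.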

Those benefits flow downstream from where the deduplication occurs. Network loads for backup are reduced, as are WAN bottlenecks. Cloud storage rental charges drop considerably, as do data retrieval costs. Moreover, the deduplication process can track any changes that occur as data is modified and stored back into the system.

Continue on to learn some key considerations when implementing deduplication.


About the Author(s)

Jim O'Reilly

President

Jim O'Reilly was Vice President of Engineering at Germane Systems, where he created ruggedized servers and storage for the US submarine fleet. He has also held senior management positions at SGI/Rackable and Verari; was CEO at startups Scalant and CDS; headed operations at PC Brand and Metalithic; and led major divisions of Memorex-Telex and NCR, where his team developed the first SCSI ASIC, now in the Smithsonian. Jim is currently a consultant focused on storage and cloud computing.
