Deduplication has become a standard feature in storage arrays. Here are some issues to weigh with the technology.
Almost every all-flash array (AFA) or storage appliance offers deduplication as a way to make the available space go further. Deduplication computes a hash value that is almost certainly unique for any given object or file and compares it with the hashes of data already stored in the system. If a new object's hash matches an existing one, the object is a duplicate; instead of storing it again, a pointer is created to the existing copy, saving space.
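The hash-and-pointer mechanism can be sketched in a few lines. This is a toy content-addressed store, not any vendor's implementation; the class and method names are invented for illustration, and SHA-256 stands in for whatever hash the array actually uses.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: one physical copy per unique hash."""

    def __init__(self):
        self.blocks = {}  # hash -> data (the single physical copy)
        self.index = {}   # name -> hash (a pointer to that copy)

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blocks:
            self.blocks[digest] = data  # new data: store it once
        self.index[name] = digest       # duplicate: only a pointer is added

    def get(self, name):
        return self.blocks[self.index[name]]

store = DedupStore()
store.put("a.txt", b"same payload")
store.put("b.txt", b"same payload")  # duplicate: no second copy stored
print(len(store.blocks))  # 1 physical copy backs both names
```

Both names resolve to the same stored bytes, which is why a duplicate write costs only a pointer's worth of space.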
It’s clear that the efficiency of this process is very data dependent. Virtual desktop servers have hundreds of duplicate files, so there can be a lot of space saving, while, at the other end of the spectrum, high-performance computing has huge files and duplication is rare. That’s why vendor claims for the amount of space saved with deduplication tend to be all over the map.
In the real world, most deduplication is a background task after the data is written to the AFA or storage appliance. That will change radically over the next few years as algorithm accelerators kick in, so monitor changes in your options going forward.
Deduplication usage is by no means ubiquitous, even when supported by the array or appliance. Worries about data integrity due to hash collisions, or fear of relying on a single copy, still abound, though, frankly, they are easily shown to be urban myths. The benefits in capacity savings alone make deduplication important.
Those benefits flow downstream from where the deduplication occurs. Network loads for backup are reduced, as are WAN bottlenecks. Cloud storage rental charges drop considerably, as do data retrieval costs. Moreover, the deduplication process can track any changes that occur as data is modified and stored back into the system.
Continue on to learn some key considerations when implementing deduplication.
Test your solution
A good rule of thumb is that the deduplication savings ratio is 5:1 with “general” office files, 100:1 with virtual desktops, 1:1 for high-performance computing, and anywhere from 1:1 to 100s:1 for media files, depending on how they're stored and who stores them. Having said that, your actual results will differ. Perhaps you should conduct a test using a large block of real data on a new software or hardware solution before shelling out the bucks!
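One way to estimate your own ratio before buying is to chunk a sample of real data, hash the chunks, and count unique bytes. This sketch assumes fixed-size chunking; real arrays often use variable-size chunking and will report different numbers, so treat the result as a ballpark, not a quote.

```python
import hashlib

def estimate_dedup_ratio(payloads, chunk_size=4096):
    """Rough chunk-level dedup estimate: split each payload into
    fixed-size chunks, hash them, and tally unique bytes."""
    total = 0
    unique = {}  # chunk hash -> chunk length (counted once)
    for data in payloads:
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            total += len(chunk)
            unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    stored = sum(unique.values())
    return total / stored if stored else 1.0

# Two identical 8 KB "files": every chunk of the second is a duplicate
files = [b"a" * 4096 + b"b" * 4096] * 2
print(f"{estimate_dedup_ratio(files):.1f}:1")  # 2.0:1
```

Run this over a representative slice of your actual file set (read the files into `payloads`) and you'll get a first-order sense of which end of the 1:1-to-100:1 spectrum your workload sits on.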
Vendors’ deduplication methods tend to use the same underlying algorithms, so performance variations mostly creep in from coding efficiency and data handling.
Where to deduplicate
Deduplication is often performed in an all-flash array as a post-processing step after data is written to the array. Data written to the array goes into a journal file, which allows the write completion to be posted very quickly. Then, the journal file is deduplicated and the data or pointer stored away.
Today, post-processing in the AFA or storage appliance is often the only viable method for deduplication, simply because servers lack the processors or accelerator hardware able to keep up with real-time transfers. An AFA acting as primary network storage, for example, will either hold deduplicated data onboard, or it can deduplicate data being sent to secondary storage, backup, or the cloud.
An alternative to post processing is to deduplicate when the data is being written to the external storage. This inline processing has the tremendous advantage of avoiding transmission of duplicate files over the network, which could reduce traffic by roughly the same factor as the capacity-saving ratio. With network bandwidth at a premium today, that saving is as important as saving storage space. However, there is a downside. Write operations take much longer to complete, since the deduplication process is relatively slow on most systems, and moreover, gets slower as the amount of stored data grows.
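The inline bandwidth saving comes from checking the hash before shipping the data. Here is a minimal sketch of that idea, with invented names throughout: `remote_hashes` models the array's hash index and `transmit` models the expensive network write.

```python
import hashlib

def send_if_new(remote_hashes, data, transmit):
    """Inline-dedup sketch: hash locally, ship the payload only if the
    target doesn't already hold it."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in remote_hashes:
        return digest        # duplicate: only the reference crosses the wire
    transmit(digest, data)   # unique: pay the network cost once
    remote_hashes.add(digest)
    return digest

sent = []
index = set()
send_if_new(index, b"payload", lambda h, d: sent.append(d))
send_if_new(index, b"payload", lambda h, d: sent.append(d))
print(len(sent))  # 1 transmission for two identical writes
```

The hash lookup is exactly the slow step the paragraph above describes: it sits in the write path, and the index it consults grows with the amount of stored data.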
The pressure for deduplication-at-source will grow over the next two years as hyperconverged architecture begins to eat into market share. In this type of system, local storage is shared over a virtual SAN with all the other servers in the cluster, obviating the need for distinct storage appliances, at least for primary storage.
Another source of pressure comes from the software-defined data center movement. Data services will be abstracted from storage appliances, which points to source deduplication as optimal. If you have to execute deduplication on some computer core, do it at the source, where the network will see a load reduction.
(Image: Matej Moderc/iStockphoto)
Compression isn’t deduplication
Data compression is a type of deduplication within objects, but it doesn’t replace deduplication. Compression algorithms build tables of repeated byte-level sequences. These can be as short as a few bytes or as long as whole files. Compression is further complicated by nesting the tables, so that a small object is often contained in larger objects.
This byte-level compression process is more compute-intensive and operates across whole streams of bytes rather than just removing duplicate objects. Space savings for compression can be similar to deduplication and are multiplicative, so a 5:1 ratio for each results in a 25:1 reduction. Source compression is best, because the network load is reduced.
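The multiplicative stacking is simple arithmetic, but it's worth making explicit, since it's the reason modest individual ratios produce large effective capacities:

```python
def combined_ratio(dedup, compression):
    """Savings multiply: dedup removes duplicate objects first,
    then compression shrinks the unique copies that remain."""
    return dedup * compression

# 5:1 dedup stacked with 5:1 compression, as in the text:
print(f"{combined_ratio(5, 5)}:1")  # 25:1
# At that ratio, 10 TB of raw flash holds 250 TB of logical data
print(f"{10 * combined_ratio(5, 5)} TB effective")
```

The same arithmetic explains why vendors quote "effective" capacity rather than raw capacity, and why those quotes swing so widely with workload.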
I’ve heard arguments that compressed data shouldn’t be deduplicated. I think the right process is deduplication, followed by compression if the data is to be written out. That’s because deduplication should be a much faster operation.
In the past, there were some issues with duplicate hash codes created by the original SHA-1 algorithm used to generate hashes. Though rare, such collisions could lead to data loss. The improved SHA-2 family resolves this problem, and it’s no longer a practical consideration in deduplication processes.
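To see why collisions stop being a practical worry at SHA-2 digest lengths, the standard birthday-bound approximation is enough. This is a back-of-envelope sketch, not a formal proof:

```python
def collision_probability(n_blocks, hash_bits):
    """Birthday-bound approximation: P(any collision) ~ n^2 / 2^(b+1)
    for n random blocks hashed with a b-bit hash."""
    return n_blocks ** 2 / 2 ** (hash_bits + 1)

# A billion unique 4 KB blocks (~4 PB of data) under 256-bit SHA-2:
p = collision_probability(10 ** 9, 256)
print(p)  # on the order of 1e-60: astronomically unlikely
```

At those odds, an accidental hash collision is far less likely than undetected bit rot in the flash itself, which is why single-copy integrity fears qualify as urban myth.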
The idea of a single copy also daunts some admins. The reality is that a single copy of a data object, well protected by redundant storage or erasure coding, is ideal. Sorting out near-duplicates isn’t trivial, and multiple copies create security issues as well as using space.
Because hyperconverged systems are limited in storage capacity and network connections compared with typical storage appliances, both effective capacity and network bandwidth are critical. As a result, we need better solutions for the source deduplication speed problem, such as SIMD extensions for x86 architecture CPUs or co-processors/FPGAs as accelerators, all of which appear to be on the horizon.
It’s worth noting that these accelerators have at least two other key functions. One is data compression, which is a type of deduplication within objects; the other is data encryption.
We also face a new type of memory, where space will be at a premium. This is the non-volatile DIMM, which is ultra-fast, but likely to be expensive and relatively small, at least initially. Source deduplication and compression would be very useful here.
Ironically, the NVDIMM is so fast that we’ll likely see post-processing come back: data is written to the DIMMs, deduplicated, and then stored in the main part of the NVDIMM or sent over a LAN.
(Image: HPE 8 GB NVDIMM)