The Reality Of Primary Storage Deduplication

Should you be deduplicating your primary storage? For storage-obsessed IT, primary deduplication technology is too sweet to ignore: eliminate duplicate data on your high-priced, tier one storage and cut capacity, and with it costs, by as much as 20:1. Deduplication for backup and offline storage is a natural fit, but the access demands of primary storage mean the reality of primary storage deduplication is a lot less rosy than you might expect.

July 16, 2010

4 Min Read

With deduplication, the software breaks files down into blocks and then inspects those blocks for duplicate patterns in the data. Once a duplicate is found, the software replaces that copy with a pointer in its file system to the initial instance. Deduplication began in backup storage, but with IT's storage worries it was only a matter of time before the technology was applied to primary storage. Today EMC, Ocarina, Nexenta, GreenBytes and HiFn, to name a few, are all bringing deduplication to primary storage.
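To make that mechanism concrete, here is a minimal sketch of block-level deduplication in Python. It is not any vendor's implementation; the fixed 4 KB block size, the SHA-256 fingerprint and the DedupStore class are illustrative assumptions, and real products typically use variable-size chunking and far more robust metadata. Writing the same data twice stores each unique block only once; the second copy costs little more than a list of pointers.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size; real systems often use variable-size chunking


class DedupStore:
    """Toy block store that keeps one copy of each unique block."""

    def __init__(self):
        self.blocks = {}   # block hash -> block bytes, stored once
        self.files = {}    # file name -> list of block hashes (the "pointers")

    def write(self, name, data):
        hashes = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # Store the block only if this pattern has not been seen before.
            self.blocks.setdefault(digest, block)
            hashes.append(digest)
        self.files[name] = hashes

    def read(self, name):
        # Reassemble the file by following the pointers back to the unique blocks.
        return b"".join(self.blocks[h] for h in self.files[name])


store = DedupStore()
store.write("copy1.bin", b"A" * 8192 + b"B" * 4096)
store.write("copy2.bin", b"A" * 8192 + b"B" * 4096)  # duplicate data
assert store.read("copy2.bin") == b"A" * 8192 + b"B" * 4096
print(len(store.blocks), "unique blocks stored for 6 logical blocks")  # prints 2
```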

More specifically, George Crump and Howard Marks, Network Computing contributors, point to three factors driving this push. First, storage is growing too fast for IT staffs to manage, so extra copies of data inevitably accumulate: repeated data dumps, multiple versions of files and duplicate image files. Primary storage deduplication catches these instances. The second play for primary storage deduplication is storing virtualized server and desktop images, where the redundancy between image files is very high. Primary storage deduplication eliminates that redundancy as well, potentially saving terabytes of capacity, and in many cases reading back deduplicated data carries little or no performance penalty.

The third, and potentially the biggest, payoff is that deduplicating primary storage optimizes everything downstream: copies of data, backups, snapshots and even replication jobs should all require less capacity. This does not remove the need for a secondary backup; it is still a good idea to keep a stand-alone copy of data that is not tied back to any deduplication or snapshot metadata. But deduplicating data earlier in the process can reduce how often a separate device is used, especially if the primary storage system replicates to a similarly enabled system at a DR location.

Deduplicating primary storage isn't without risks and misconceptions. Deduplication ratios depend on the type of data being operated on. Backup data, for example, is highly repetitive, which lets deduplication ratios run as high as 20:1, but those opportunities don't exist in primary storage, where ratios tend to run closer to 2:1. There is also a performance penalty with deduplication that won't be acceptable in certain situations, such as online transaction processing (OLTP) applications. Finally, as Marks points out, deduplication systems aren't all alike, and using a primary deduplication system with the wrong backup application could cause significant problems. "A lot has to occur for this level of data optimization to become a reality. First, the primary storage vendors need to offer a deduplication engine in their storage solution. Second, the deduplication process and its handling of the metadata will also need to prove its reliability," Crump said.
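To put those ratios in perspective, a quick back-of-the-envelope calculation (the 10 TB figure is hypothetical) shows why the same technology that reclaims 95 percent of backup capacity reclaims only half of primary capacity:

```python
def physical_capacity(logical_tb, dedup_ratio):
    """Physical capacity needed after deduplication at a given ratio (e.g. 20 for 20:1)."""
    return logical_tb / dedup_ratio

logical = 10.0  # hypothetical 10 TB of logical data
for ratio in (20, 2):
    physical = physical_capacity(logical, ratio)
    saved_pct = 100 * (1 - physical / logical)
    print(f"{ratio}:1 -> {physical:.1f} TB on disk, {saved_pct:.0f}% capacity saved")

# 20:1 -> 0.5 TB on disk, 95% capacity saved  (typical of repetitive backup data)
#  2:1 -> 5.0 TB on disk, 50% capacity saved  (closer to what primary storage sees)
```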

If all of that sounds a lot like compression, that's because it is; strip away the marketing and the difference is largely a matter of scope. Deduplication finds redundancy across files and blocks; compression finds redundancy within a file. A number of vendors offer both, deduplicating and then compressing for added benefit. What's not so simple is figuring out exactly who will benefit from deduplication and compression. A test done over at Edugeek, for example, showed that compression might save you five or six percent on storage, but write times also go up by five or six percent. (Obviously, the numbers will change with the algorithm used and the data set.)
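As a rough illustration of how the two layer together, the sketch below deduplicates blocks across files and then zlib-compresses each unique block, the general pattern behind "dedupe then compress." The block size, hash choice and function name are illustrative, not any product's design.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # illustrative

def dedupe_then_compress(files):
    """Toy pipeline: deduplicate blocks across files, then zlib-compress each unique block."""
    unique = {}          # block hash -> compressed block
    logical_bytes = 0
    for data in files:
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            logical_bytes += len(block)
            digest = hashlib.sha256(block).hexdigest()
            if digest not in unique:
                # Compression squeezes out redundancy *within* the block that dedupe can't see.
                unique[digest] = zlib.compress(block)
    stored_bytes = sum(len(b) for b in unique.values())
    return logical_bytes, stored_bytes

logical, stored = dedupe_then_compress([b"hello world " * 1000] * 3)
print(f"{logical} logical bytes reduced to {stored} stored bytes")
```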

Solving the performance issue may well be the deciding factor in whether enterprises adopt deduplication on their primary storage. Solid state disks (SSDs), with their high performance, could help here, and Crump thinks SSDs will be the perfect complement to deduplication technologies. Some vendors are already using high-speed SSDs as a cache to absorb writes from the application while writing them to disk in the background. NetApp and startup StorSimple, for example, are two vendors that have integrated SSDs into their deduplication platforms. Whether these tweaks are sufficient to give enterprises the performance they need remains to be seen.
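A simplified sketch of that caching idea follows: writes are acknowledged as soon as they land in a fast tier standing in for the SSD cache, and a background thread destages them through the slower deduplicate-and-write path. The class and function names are hypothetical, and this is not how NetApp or StorSimple actually implement it.

```python
import queue
import threading
import time

class WriteBackCache:
    """Toy write-back cache: acknowledge writes from a fast tier, destage to slow storage later."""

    def __init__(self, destage_fn):
        self.fast_tier = {}            # stands in for the SSD cache
        self.pending = queue.Queue()   # blocks waiting to be written to disk
        self.destage_fn = destage_fn   # slow path, e.g. deduplicate-and-write-to-disk
        threading.Thread(target=self._destage_loop, daemon=True).start()

    def write(self, key, block):
        # The application sees SSD latency: the write returns as soon as the cache holds it.
        self.fast_tier[key] = block
        self.pending.put((key, block))

    def _destage_loop(self):
        while True:
            key, block = self.pending.get()
            self.destage_fn(key, block)   # dedup and disk write happen off the critical path
            self.pending.task_done()

def slow_dedup_write(key, block):
    time.sleep(0.05)  # simulate the deduplication and disk-write overhead
    print(f"destaged {key} ({len(block)} bytes)")

cache = WriteBackCache(slow_dedup_write)
for n in range(3):
    cache.write(f"block-{n}", b"x" * 4096)   # returns immediately
print("application writes acknowledged")
cache.pending.join()  # wait for background destaging to finish (for the demo only)
```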
