Deduplication software breaks files down into blocks and then inspects those blocks for duplicate patterns in the data. Once a duplicate is found, the software replaces copies of the pattern with pointers in its file system back to the initial instance. Deduplication began in backup storage, but given IT's mounting storage concerns, it was only a matter of time before the technology was applied to primary storage. Today, EMC, Ocarina, Nexenta, GreenBytes and HiFn, to name a few, are all bringing deduplication to primary storage.
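The block-and-pointer mechanism described above can be sketched in a few lines of Python. This is a simplified model assuming fixed-size blocks and SHA-256 fingerprints; shipping products typically use variable-length chunking and far more elaborate metadata:

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real systems often use variable-length chunking


def deduplicate(data: bytes):
    """Split data into blocks, storing each unique block only once.

    Returns (block_store, pointers): the unique blocks keyed by hash,
    and the ordered list of hashes ("pointers") that reference them.
    """
    block_store = {}  # hash -> block contents (the initial instance)
    pointers = []     # per-block references back to the unique instance
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in block_store:
            block_store[digest] = block  # first occurrence: store the block
        pointers.append(digest)          # duplicates cost only a pointer
    return block_store, pointers


def reconstruct(block_store, pointers) -> bytes:
    """Read the data back by following the pointers to the stored blocks."""
    return b"".join(block_store[d] for d in pointers)
```

Feeding this four blocks of data where three are identical would leave only two unique blocks in the store, while `reconstruct` still returns the original bytes, which is why read-back from deduplicated data can cost so little.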
More specifically, Network Computing contributors George Crump and Howard Marks point out three factors driving this phenomenon. First, storage is growing too fast for IT staffs to manage, so extra copies of data are bound to occur: multiple copies of data dumps, multiple versions of files and duplicate image files. Primary storage deduplication catches these instances. The second use case for primary storage deduplication is the storage of virtualized server and desktop images, where the redundancy between image files is very high. Primary storage deduplication eliminates this redundancy as well, potentially saving terabytes of capacity. In many cases, reading back deduplicated data carries little or no performance impact.
The third and potentially biggest payoff is that deduplicating primary storage optimizes everything downstream: copies of data, backups, snapshots and even replication jobs should all require less capacity. This does not remove the need for a secondary backup; it is still a good idea to periodically keep a stand-alone copy of data that is not tied back to any deduplication or snapshot metadata. But deduplicating data earlier in the process can reduce how often a separate device is needed, especially if the primary storage system replicates to a similarly enabled system in a DR location.
Deduplicating primary storage isn't without risks and misconceptions. Deduplication ratios depend on the type of data being operated on. Backup data, for example, is highly repetitive, allowing deduplication ratios to run as high as 20:1, but those opportunities don't exist in primary storage, where ratios tend to run closer to 2:1. There's also a performance penalty with deduplication that won't be acceptable in certain situations, such as online transaction processing (OLTP) applications. Finally, as Marks points out, deduplication systems aren't all alike, and pairing a primary deduplication system with the wrong backup application could result in significant problems.
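To make those ratios concrete: a deduplication ratio is simply logical (pre-dedup) capacity divided by physical (post-dedup) capacity. The helper below is a hypothetical illustration using the article's own figures, not a measurement:

```python
def dedup_ratio(logical_bytes: float, physical_bytes: float) -> float:
    """Deduplication ratio: logical capacity stored per unit of physical capacity."""
    return logical_bytes / physical_bytes


# Backup data is highly repetitive: e.g. 20 TB of logical data
# held in 1 TB of physical capacity is a 20:1 ratio.
backup_ratio = dedup_ratio(20.0, 1.0)    # 20.0

# Primary data is mostly unique: e.g. 2 TB logical in 1 TB physical is 2:1.
primary_ratio = dedup_ratio(2.0, 1.0)    # 2.0
```

The same math also shows why the payoff flattens quickly: going from 2:1 to 4:1 saves another 25% of physical capacity, not another half.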