Data deduplication holds promise for IT groups trying to get a handle on ever-growing storage volumes, and we clearly need the help: More than half of respondents to our InformationWeek Analytics Data Deduplication Survey manage more than 10 TB of data; 15% manage 201 TB or more.
Deduplication systems look for repeating patterns of data at the block and bit levels. When multiple instances of the same pattern are discovered, the system stores a single copy of the pattern. Deduplication functionality is available both in appliances and in software. For certain applications, such as disk-to-disk backups--particularly backups of data sets that change slowly over time, such as e-mail systems--deduplication can be extremely effective. Compression rates of 30 to 1 or even higher aren't uncommon.
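To make the single-copy idea concrete, here's a minimal sketch of block-level deduplication. It assumes fixed 4 KB blocks and SHA-256 fingerprints; both are illustrative choices rather than any vendor's method, and the function names (dedupe, restore, dedupe_ratio) are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


def dedupe(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep one copy of each unique block."""
    store = {}   # fingerprint -> the single stored copy of that block
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)  # store only the first occurrence
        recipe.append(fingerprint)
    return store, recipe


def restore(store: dict, recipe: list) -> bytes:
    """Rebuild the original byte stream from the stored blocks."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


def dedupe_ratio(data: bytes, store: dict) -> float:
    """Logical bytes divided by bytes actually stored."""
    stored = sum(len(block) for block in store.values())
    return len(data) / stored if stored else 0.0


if __name__ == "__main__":
    # 30 identical blocks plus one different block: 31 logical, 2 stored.
    sample = b"A" * (BLOCK_SIZE * 30) + b"B" * BLOCK_SIZE
    store, recipe = dedupe(sample)
    assert restore(store, recipe) == sample
    print(f"{dedupe_ratio(sample, store):.1f} to 1")
```

The recipe list is what lets the system hand back the original data on restore; the ratio you actually see depends entirely on how repetitive the data is.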
Some vendors dedupe the data stream as it's sent to the appliance and before it's written to disk, referred to as an in-line setup. Others perform post-process deduplication after data is written to disk. The best choice depends on your needs. In-line deduplication may have lower initial storage capacity requirements since a full "un-deduped" copy of the data is never written to disk. Post-process deduplication requires more initial space, but it's more easily adapted to and integrated with various storage systems.
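The capacity difference between the two approaches can be sketched in a few lines. This toy model assumes fixed 4 KB blocks and assumes a post-process system reclaims space in place after its dedupe pass; real products differ, and the function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def unique_bytes(data: bytes) -> int:
    """Bytes left on disk once every duplicate block has been removed."""
    seen = set()
    total = 0
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).digest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            total += len(block)
    return total


def inline_footprint(data: bytes):
    """In-line: duplicates are dropped before anything is written to disk."""
    final = unique_bytes(data)
    return final, final  # (peak bytes on disk, final bytes on disk)


def post_process_footprint(data: bytes):
    """Post-process: the full copy lands on disk first, then is deduped."""
    peak = len(data)            # the un-deduped copy must fit initially
    final = unique_bytes(data)  # assumed to be reclaimed in place afterward
    return peak, final
```

For a stream of 10 identical blocks, this model gives the in-line path a peak of one block, while the post-process path peaks at all 10 before settling to the same single block.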
Another advantage of dedupe is streamlined replication and disaster recovery capabilities. After an initial backup, only changed blocks are written to disk during subsequent jobs, consuming significantly less storage.
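As a rough sketch of why subsequent jobs are so much cheaper, the hypothetical DedupeTarget class below keeps its block store between backup jobs and reports how many bytes each job actually adds. Fixed 4 KB blocks and SHA-256 fingerprints are assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


class DedupeTarget:
    """Hypothetical backup target that keeps its block store across jobs."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> stored block

    def backup(self, data: bytes) -> int:
        """Ingest one backup job and return the bytes newly written."""
        written = 0
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).digest()
            if fingerprint not in self.blocks:  # only unseen blocks hit disk
                self.blocks[fingerprint] = block
                written += len(block)
        return written


if __name__ == "__main__":
    target = DedupeTarget()
    # Night one: 100 distinct blocks. Night two: the same data, one block changed.
    blocks = [i.to_bytes(4, "big") * 1024 for i in range(100)]
    first = target.backup(b"".join(blocks))
    blocks[7] = (999).to_bytes(4, "big") * 1024
    second = target.backup(b"".join(blocks))
    print(first, second)
```

The same logic is what streamlines replication in this model: only the blocks the remote site hasn't already seen would need to cross the wire.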
Of course, there are legitimate concerns about excessive reliance on deduplication appliances as part of the backup and disaster recovery process; the worry is that the appliance will fail at the most inopportune moment. Cost is also a concern. Deduplication storage costs significantly more than traditional storage on a per-gigabyte basis--usually orders of magnitude more. And encryption systems intentionally remove repeating patterns and randomize data, which makes deduplication ineffective when applied to encrypted data.
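The encryption problem is easy to demonstrate. The sketch below uses a toy XOR stream cipher built from SHA-256 as a stand-in for real encryption; it is not secure and is not how any production system encrypts data, but it shows the same effect on repeating patterns. All names here are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def unique_blocks(data: bytes) -> int:
    """Count the distinct fixed-size blocks a deduper would have to store."""
    return len({
        hashlib.sha256(data[offset:offset + BLOCK_SIZE]).digest()
        for offset in range(0, len(data), BLOCK_SIZE)
    })


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy XOR stream cipher for illustration only. NOT secure."""
    out = bytearray()
    for position in range(0, len(data), 32):
        # Derive a fresh 32-byte pad for each position in the stream.
        pad = hashlib.sha256(key + position.to_bytes(8, "big")).digest()
        chunk = data[position:position + 32]
        out.extend(a ^ b for a, b in zip(chunk, pad))
    return bytes(out)


if __name__ == "__main__":
    plaintext = b"A" * (BLOCK_SIZE * 50)  # 50 identical blocks
    ciphertext = toy_encrypt(plaintext, b"example key")
    print(unique_blocks(plaintext), unique_blocks(ciphertext))
```

Fifty identical plaintext blocks collapse to one stored block; after the toy cipher runs, every block looks different and nothing dedupes. That's one reason to consider where in the data path encryption happens relative to deduplication.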
That said, given the right set of data and proper project execution, you can reach extremely high compression/deduplication ratios that offset the higher cost of raw storage capacity. Fewer spindles take up less space, use less electricity, and generate less heat.
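The offset is simple arithmetic: divide the raw per-gigabyte price by the ratio you achieve. The sketch below uses made-up prices purely for illustration; the function names and dollar figures are assumptions, not survey data or vendor pricing.

```python
def effective_cost_per_gb(raw_cost_per_gb: float, dedupe_ratio: float) -> float:
    """Cost per logical (pre-dedupe) gigabyte at a given ratio."""
    return raw_cost_per_gb / dedupe_ratio


def break_even_ratio(dedupe_cost_per_gb: float, traditional_cost_per_gb: float) -> float:
    """Ratio at which dedupe storage matches traditional storage per logical GB."""
    return dedupe_cost_per_gb / traditional_cost_per_gb


if __name__ == "__main__":
    # Hypothetical prices: $1/GB traditional, $10/GB deduplicating storage.
    traditional, dedupe = 1.00, 10.00
    print(f"break-even ratio: {break_even_ratio(dedupe, traditional):.0f} to 1")
    print(f"effective cost at 30 to 1: ${effective_cost_per_gb(dedupe, 30):.2f}/GB")
```

With those assumed prices, anything better than 10 to 1 comes out ahead on capacity cost alone, before counting space, power, and cooling.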
Deduplication appliances are best suited as targets for nightly backups and related DR replication. Another sweet spot is backup of highly redundant data, such as a large number of similarly configured physical or virtual servers or workstations. You won't go wrong by focusing your implementation efforts in these areas now, while deduplication vendors continue to improve their technology, lower costs, and increase suitability for other applications.