Expanding Role Of Data Deduplication

Data volumes continue to explode: Of the 437 business technology professionals InformationWeek Analytics surveyed for our data deduplication report (available free for a limited time), more than half manage more than 10 TB of data, compared with just 10% who control less than 1 TB. Seven percent manage between 201 TB and 500 TB, and 8% are charged with wrangling more than 500 TB of data. These massive volumes may be a recent development--25% of the 328 business technology pros we surveyed for our 2009 InformationWeek Analytics State of Storage Survey managed less than 1 TB of data--but all indications point to this level of growth being the new normal.

The applications most responsible for the data deluge include the usual suspects: Enterprise databases and data warehouse apps (33%) and e-mail (23%) are cited most in our survey. Rich media, mainly voice and video, was cited by just 16%, but we think the recent surge in voice and video applications will put increasing demands on storage. And yes, we've been warned before about huge looming increases in video traffic that never materialized. But there are good reasons to believe this time may be different given an increased focus on telecommuting and multimedia. In addition, the American Recovery and Reinvestment Act aims to have up to 90% of healthcare providers in the United States using electronic medical records by 2020.

That's a potential tsunami of high-value, regulated--and huge--files.

As more companies jump on the fast track to petabyte land, a variety of vendors have emerged with technologies and management approaches aimed at helping us more efficiently administer large storage pools while lowering costs and increasing security. In our survey on data deduplication, we asked IT pros about their use of some of these technologies, including compression, disk-to-disk-to-tape backups, encryption, virtual tape libraries, thin provisioning, massive array of idle disks (MAID), and data deduplication specifically. Of those, compression is the most commonly used, with 64% of respondents employing the technology in their environments. Survey results show relatively low current adoption rates for data deduplication, with just 24% of respondents using the technology. However, the good news is that 32% are evaluating dedupe, and just 10% say definitively that they won't consider adoption. Only 17% of respondents have deployed thin provisioning, while 15% say they flat out won't; and only 12% say they have deployed MAID, while 17% say they won't.

We found the low adoption rates for these three promising technologies surprising because business as usual is no longer a realistic option. The price of storage in the data center isn't limited to hardware. Escalating power and cooling costs and scarce floor space pose a serious challenge to the "just buy more disks" approach. These three technologies could enhance a well-designed storage plan and--along with increasing disk/platter densities, larger disk drives, and faster performing drives such as solid-state disks--reduce storage hardware requirements.

Of course, compatibility with legacy systems is always an issue.

McCarthy Building, a St. Louis-based construction firm with $3.5 billion in annual revenue, uses SATA disks in DualParity RAID configurations for its Tier 2 storage (more on tiers later). "We replicate production data to a remote site on the same storage," says Chris Reed, director of infrastructure IT. "We deduplicate everywhere we can, especially since the cost is still $0 from NetApp and we haven't seen a performance downside."

However, Reed has run into problems with legacy applications, such as the company's Xerox DocuShare document management system, that must run on super-large volumes. That system has grown to more than 4 TB on a single Windows iSCSI volume on a NetApp 3040 cluster, which doesn't support deduplication on volumes larger than 4 TB.

Data deduplication is particularly effective in reducing the storage capacity required for disk-to-disk and backup applications. It can also reduce the amount of bandwidth consumed by replication and disaster recovery. We discuss specifics of how data dedupe works in detail in our full report.

For those on the fence, recent market events--notably EMC's and NetApp's bidding war for Data Domain--illustrate the importance of dedupe. Companies that don't at least investigate the benefits could be hamstrung by spiraling storage costs.

Lay The Groundwork

Before we go more into dedupe, we want to give a shout out to a cornerstone of any solid data life-cycle management strategy: tiered storage. By matching different types of data with their appropriate storage platforms and media based on requirements such as performance, frequency of access, and data protection levels, tiered storage lets CIOs save money by applying expensive technologies, including data deduplication and thin provisioning, to only the appropriate data.

In a tiered strategy, Tier 1 storage is reserved for demanding applications, such as databases and e-mail, that require the highest performance and can justify the cost of serial-attached SCSI, Fibre Channel SANs, high-performance RAID levels, and the fastest available spindles--or even solid-state drives.

Direct attached storage (DAS) is still the top choice of our survey participants for Tier 1 storage of applications such as databases and e-mail. Fibre Channel came in a close second, with 45% of respondents reporting use of Fibre Channel SANs, mainly for Tier 1 storage. Fibre Channel remains strong despite countless predictions of its rapid demise by most storage pundits--and the downright offensive "dead technology walking" label attached by a few.

One survey finding that's not completely unexpected--we were tipped off by similar findings in previous InformationWeek Analytics State of Storage reports--but is nonetheless puzzling is the poor showing by iSCSI SANs, which are the main Tier 1 storage platform for just 16% of respondents. That's less than half the number who report using NAS, our third-place response. Seems most IT pros didn't get the memo that iSCSI would force Fibre Channel into early retirement.

In all seriousness, the continued dearth of interest in iSCSI is mystifying given the current economic backdrop, the widespread availability of iSCSI initiators in recent versions of Windows (desktop and server) and Linux, and the declining cost of 1-Gbps and 10-Gbps connectivity options. We think the slower-than-predicted rate of iSCSI adoption--and the continued success of Fibre Channel--is attributable to a few factors. First, the declining cost of Fibre Channel switches and host bus adapters improves the economic case for the technology. Second, we're seeing slower-than-expected enterprise adoption of 10-Gbps Ethernet, leaving iSCSI at a performance disadvantage against 4-Gbps Fibre Channel.

However, iSCSI's performance future does look bright thanks to emerging Ethernet standards such as 40 Gbps and 100 Gbps that will not only increase the speed limit, but also accelerate adoption of 10-Gbps Ethernet in the short term. In our practice, we also see a reluctance among CIOs to mess with a tried-and-true technology such as Fibre Channel, particularly for critical applications like ERP, e-mail, and enterprise databases. Sometimes, peace of mind is worth a price premium.

Tier 2 comprises the less expensive storage, such as SATA drives, NAS, and low-cost SANs, suitable for apps like archives and backups, where high capacity and low cost are more important than blazing speed. Our survey shows that NAS is the Tier 2 architecture of choice, used by 41% of respondents. DAS is the main Tier 2 storage of 34% of respondents. Once again, iSCSI SAN finished last, with a mere 17% of respondents using the technology primarily for Tier 2 storage. This is an even more surprising result than for Tier 1--we expected iSCSI's low cost relative to Fibre Channel SANs to result in a healthy showing here.

Tier 3 storage typically consists of the lowest-cost media, such as recordable optical or WORM (write once, read many) disks, and is well suited for historical archival and long-term backups.

Applying a tiered strategy lets IT migrate older and less frequently accessed data to lower-cost storage--and in doing so significantly reduces both the growth rate of pricey Tier 1 capacity and overall data center costs. Sounds like a no-brainer, but data classification and planning are essential--including developing policies around retention, appropriate storage architecture, data backup and recovery, growth forecasting and management, and budgeting.
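To make the classification step concrete, here is a minimal sketch of a tiering rule in Python. Everything in it is an assumption for illustration: the `DataSet` fields, the 90-day and two-year thresholds, and the `assign_tier` function are hypothetical, and a real policy would come out of your own data classification and retention work.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    name: str
    days_since_access: int
    needs_high_performance: bool


# Hypothetical thresholds, chosen only to illustrate the idea.
TIER2_AFTER_DAYS = 90
TIER3_AFTER_DAYS = 365 * 2


def assign_tier(ds):
    """Return 1, 2, or 3 based on performance needs and access recency."""
    if ds.needs_high_performance and ds.days_since_access < TIER2_AFTER_DAYS:
        return 1  # active, demanding data stays on the fastest storage
    if ds.days_since_access < TIER3_AFTER_DAYS:
        return 2  # cooler or less demanding data moves to cheaper disk
    return 3      # long-untouched data goes to archival media


inventory = [
    DataSet("erp-database", 1, True),
    DataSet("last-quarter-backups", 120, False),
    DataSet("closed-project-archive", 1500, False),
]
placement = {ds.name: assign_tier(ds) for ds in inventory}
```

Even a rule this simple forces the questions the paragraph above raises: who decides what counts as demanding, and how long data can sit untouched before it moves.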

Policy is one area where many are falling behind. For example, one of the most eye-opening results of our survey was the response to our query about data retention periods. The percentage of participants reporting indefinite retention for application data ranged from 30% for Web (wikis and blogs) to a whopping 55% for enterprise database and data warehouse applications. And with the exception of wiki and blog applications and rich media, 50% or more of respondents report at least a five-year retention period--as high as 76% for enterprise databases and data warehouses.

We're clearly struggling to keep up with the complex records management needed to comply with requirements such as the Health Insurance Portability and Accountability Act, related privacy rules, and the Sarbanes-Oxley Act of 2002, just to name a few of the regs bedeviling enterprise IT.

We were also surprised to see that Tier 2 storage growth rates reported by our survey participants weren't dramatically different from Tier 1 growth rates. Twenty-nine percent of respondents reported growth in excess of 25% for Tier 2 storage, compared with 18% for Tier 1, and nearly twice the number of respondents are seeing growth rates of 51% to 75% in Tier 2 storage.

This represents a golden opportunity for IT to adopt more aggressive life-cycle management and shift more growth onto less costly Tier 2 storage. It may also indicate that more automation is needed. To that end, consider deploying information life-cycle management tools or archival systems with automated tiering features. More on those in our full report.

Data Dedupe 101

The idea of cramming more data into the same amount of space isn't new. For example, most of us use zip utilities every day, as files downloaded over the Internet are usually compressed using algorithms to eliminate recurring patterns, to conserve storage and bandwidth. Single instancing, another method to slash storage space consumption, is used on many e-mail systems. Single instancing operates at the file level and maintains only one copy of a given file within the storage system--for example, keeping a single copy of an attachment sent with an e-mail to multiple mailboxes. The benefits of single instancing are lost, however, when even small changes are made to previously identical files.
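The weakness of single instancing is easy to see in a small sketch. The Python below is a toy model, not any mail system's actual design: the `SingleInstanceStore` class and its methods are hypothetical names, and using a SHA-256 hash of the whole file as its identity is an assumption made for illustration.

```python
import hashlib


class SingleInstanceStore:
    """Toy file-level store: one physical copy per distinct file content."""

    def __init__(self):
        self.blobs = {}       # content hash -> file bytes (stored once)
        self.references = {}  # (mailbox, filename) -> content hash

    def put(self, mailbox, filename, content):
        digest = hashlib.sha256(content).hexdigest()
        # Store the bytes only if this exact content has not been seen before.
        self.blobs.setdefault(digest, content)
        self.references[(mailbox, filename)] = digest

    def physical_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = SingleInstanceStore()
attachment = b"quarterly report " * 1000

# The same attachment delivered to three mailboxes is stored once.
for mailbox in ("alice", "bob", "carol"):
    store.put(mailbox, "report.doc", attachment)
identical_copies_bytes = store.physical_bytes()

# A one-byte edit makes the file a different "instance", so the whole
# file is stored a second time.
store.put("dave", "report.doc", attachment + b"!")
after_small_edit_bytes = store.physical_bytes()
```

Three identical copies cost the space of one, but a single added byte roughly doubles the physical footprint, because the comparison happens only at the whole-file level.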

Enter data deduplication, whose features are available in both hardware and software. Data dedupe looks for repeating patterns of data at the block and bit levels (file-level deduplication is essentially single instancing). When multiple instances of the same pattern are discovered, the system stores a single copy of the data. Of course, it would be highly inefficient to continually look at all of an enterprise's data to find these repeating patterns, so deduplication systems create a hash value, or digital signature, for each string of data and compare these much smaller values to identify repeating strings that can be deduped. Hash values are derived using a one-way algorithm that can, in rare cases, result in the same value being derived from different strings of data--sometimes referred to as a hash collision. This can result in corrupted data. Deduplication vendors may deploy a secondary check when a hash value is matched and look at actual data strings to verify a true match. This approach adds some overhead, but it eliminates the possibility of data corruption. We recommend that you check into how the vendor you're considering handles hash collisions before buying.
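The hash-then-verify approach can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any vendor's product: the fixed 4,096-byte block size, the `BlockDedupStore` class, and the deliberately weak `weak_hash` function (used only to force a collision on demand) are all hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


def sha256_digest(block):
    return hashlib.sha256(block).digest()


class BlockDedupStore:
    """Toy block-level dedupe store with an optional byte-for-byte check."""

    def __init__(self, hash_fn=sha256_digest, verify=True):
        self.hash_fn = hash_fn
        self.verify = verify
        self.buckets = {}  # digest -> list of distinct blocks sharing it

    def _store_block(self, block):
        digest = self.hash_fn(block)
        bucket = self.buckets.setdefault(digest, [])
        if not self.verify:
            # Trust the hash alone: any digest match counts as a duplicate.
            if not bucket:
                bucket.append(block)
            return (digest, 0)
        # Secondary check: compare actual bytes before declaring a duplicate.
        for slot, existing in enumerate(bucket):
            if existing == block:
                return (digest, slot)
        bucket.append(block)  # new data (or a collision) gets its own copy
        return (digest, len(bucket) - 1)

    def write(self, data):
        """Split data into blocks and return a recipe of block references."""
        return [self._store_block(data[i:i + BLOCK_SIZE])
                for i in range(0, len(data), BLOCK_SIZE)]

    def read(self, recipe):
        return b"".join(self.buckets[d][slot] for d, slot in recipe)

    def stored_bytes(self):
        return sum(len(b) for bucket in self.buckets.values() for b in bucket)


def weak_hash(block):
    # Deliberately collision-prone stand-in, used only to force a collision.
    return block[:1]


# Three identical blocks plus one different block: two blocks are stored.
data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
store = BlockDedupStore()
recipe = store.write(data)

# Two different blocks that share a (weak) hash value.
plain = b"A" * BLOCK_SIZE
colliding = b"A" + b"x" * (BLOCK_SIZE - 1)
unsafe = BlockDedupStore(hash_fn=weak_hash, verify=False)
safe = BlockDedupStore(hash_fn=weak_hash, verify=True)
unsafe_recipes = [unsafe.write(plain), unsafe.write(colliding)]
safe_recipes = [safe.write(plain), safe.write(colliding)]
```

Reading back `unsafe_recipes[1]` returns the first block's contents rather than what was written, which is the silent corruption described above. The `safe` store pays for a byte comparison on every hash match but returns the right data.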

Some vendors perform deduplication on the data stream as it's sent to the appliance and before it's written to disk (in-line). Others perform post-process deduplication--after data is written to disk. As is the case with most architectural differences, each side claims superiority. In-line deduplication may have lower initial storage capacity requirements, since a full "undeduped" copy of the data is never written. Post-process deduplication, on the other hand, requires more initial space, but it may be more easily integrated with various storage systems and is the way to go if you want to apply deduplication to existing files and storage.
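The capacity trade-off between the two approaches can be modeled roughly as follows. This Python sketch is a simplification under assumed conditions: fixed 4,096-byte blocks, hash-only matching with no secondary check, and a post-process pass that releases each raw block as soon as it has been examined. The function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for illustration


def split_blocks(data):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def inline_ingest(blocks):
    """Dedupe in the data path: duplicates are dropped before any write."""
    unique = {}
    used = peak = 0
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        if digest not in unique:
            unique[digest] = block
            used += len(block)
            peak = max(peak, used)
    return used, peak


def post_process_ingest(blocks):
    """Land everything first, then dedupe the landed copy in a later pass."""
    landing = list(blocks)                  # full "undeduped" copy on disk
    landed = sum(len(b) for b in landing)
    unique = {}
    used = 0
    peak = landed
    for block in landing:                   # background dedupe pass
        landed -= len(block)                # the raw block is released...
        digest = hashlib.sha256(block).digest()
        if digest not in unique:
            unique[digest] = block          # ...and kept once if it is new
            used += len(block)
        peak = max(peak, landed + used)
    return used, peak


# Ten identical blocks plus two unique ones.
blocks = split_blocks(
    b"N" * BLOCK_SIZE * 10 + b"U" * BLOCK_SIZE + b"V" * BLOCK_SIZE
)
inline_final, inline_peak = inline_ingest(blocks)
post_final, post_peak = post_process_ingest(blocks)
```

Both paths end at the same final footprint. The difference is the peak: the post-process path must briefly hold the full raw copy, while the in-line path never does.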

The effectiveness of dedupe is highly dependent on the makeup of your data. For some apps, such as disk-to-disk backups of data sets that change slowly, including e-mail systems, dedupe can be very effective. Data reduction ratios of 30 to 1 or even higher aren't uncommon. Another advantage of deduplication is streamlined replication and disaster recovery capabilities. After an initial backup, only changed blocks are written to disk during subsequent jobs, consuming significantly less storage. Because every backup is still presented as a full backup (not a differential or incremental), recovery operations are simpler and quicker, improving recovery time objectives. Furthermore, off-site replication to another device requires significantly less bandwidth, opening up additional disaster recovery options.
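A rough simulation shows why slowly changing backup sets dedupe so well. The numbers here are invented for illustration and are not survey data: a hypothetical data set of 100 blocks, one block changed per night, and 30 nightly full backups sent to a store that keeps each distinct block once. The `backup` and `make_block` helpers are likewise hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for illustration


def make_block(tag):
    """Build a distinct block from an integer tag."""
    return tag.to_bytes(4, "big") * (BLOCK_SIZE // 4)


def backup(dataset, seen):
    """Run one full backup; return (logical_bytes, newly_written_bytes)."""
    logical = written = 0
    for i in range(0, len(dataset), BLOCK_SIZE):
        block = dataset[i:i + BLOCK_SIZE]
        logical += len(block)
        digest = hashlib.sha256(block).digest()
        if digest not in seen:      # only never-seen blocks reach disk
            seen.add(digest)
            written += len(block)
    return logical, written


dataset = [make_block(k) for k in range(100)]
seen = set()
total_logical = total_written = 0
nightly_written = []

for night in range(30):
    if night > 0:
        # One block changes between backups (a slowly changing data set).
        dataset[night] = make_block(1000 + night)
    logical, written = backup(b"".join(dataset), seen)
    total_logical += logical
    total_written += written
    nightly_written.append(written)

dedupe_ratio = total_logical / total_written
```

In this toy run the first night writes all 100 blocks and every later night writes just one, so 30 full backups occupy the space of 129 blocks. The ratio keeps climbing as more backups are retained, and it falls quickly if the nightly change rate rises, which is why results depend so heavily on the makeup of your data.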

It all sounds pretty good, particularly for apps such as backup, replication, and disaster recovery. All the same, only 24% of survey participants have data deduplication in use, and 44% of respondents have no plans for it or say they won't use it. For 37% of respondents, the main reason for not using data deduplication is lack of familiarity with the technology. If that sounds like you, download our full report to see what you're missing.

Behzad Behtash is an independent IT consultant and InformationWeek Analytics contributor.
