Deduplication Joins The Primary Storage Reduction Fray

Vendors are lining up to help you trim data bloat. Better choose carefully.

Howard Marks

April 9, 2009

5 Min Read

IT managers are swamped with data and new mandates to retain it. In fact, data is piling up so fast that just adding more capacity won't solve the problem. Enter storage reduction techniques, ranging from file compression and single-instance storage to data deduplication, which can help a beleaguered IT staff put the proverbial 10 pounds of data in a 5-pound bag.

Data deduplication is the latest reduction method to move from secondary storage applications to primary storage systems. There are good reasons for the move, although primary storage isn't always an easy fit for data reduction technology. Large enterprises' primary storage performance requirements are more stringent, especially when it comes to I/O response times and latency. Primary storage systems also have to meet substantially higher availability and reliability standards than backup stores. This makes them leaner, less-target-rich environments for data reduction, but they're also three to 10 times more expensive on a per-gigabyte basis than backup repositories. Small storage reductions can save significantly on space, power, and cooling. There's also a real possibility of performance boosts.

Vendors ranging from enterprise network-attached storage leader NetApp to startups like Ocarina Networks are readying data deduplicating tools to optimize primary storage capacity. With a range of options coming online in the next year or so, from software upgrades to complete NAS systems, now is the time to investigate deduping your primary storage. But keep in mind that your data reduction ratios may be closer to 2-to-1 than 20-to-1. We recommend using conservative data reduction ratios when developing budgets and ROI calculations.
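
To see how much the assumed ratio drives that budget math, here's a minimal Python sketch. It's an illustration, not a vendor calculator; the $100,000, 46-TB configuration (borrowed from the GreenBytes pricing cited later in this article) and the ratios are simply example inputs.

```python
# Illustrative ROI math only; the price, capacity, and ratios are assumptions.
def usable_tb(raw_tb: float, ratio: float) -> float:
    """Effective capacity after data reduction; ratio 2.0 means 2-to-1."""
    return raw_tb * ratio

def cost_per_usable_gb(total_cost: float, raw_tb: float, ratio: float) -> float:
    return total_cost / (usable_tb(raw_tb, ratio) * 1024)

raw_tb, total_cost = 46, 100_000     # example: a 46-TB system priced at $100,000
for ratio in (1.0, 2.0, 4.0, 20.0):
    print(f"{ratio:4.0f}-to-1: {usable_tb(raw_tb, ratio):6.0f} TB usable, "
          f"${cost_per_usable_gb(total_cost, raw_tb, ratio):5.2f} per usable GB")
```

At 2-to-1 the cost per usable gigabyte roughly halves; assuming 20-to-1 in the budget and getting 2-to-1 in production leaves a tenfold gap.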

Space And Time
There's more than one way to reduce the amount of space that files occupy on disk. Some data reduction technologies are built into the file system or operating system of a NAS appliance, while others are appliances that can be added to existing filers. These approaches operate either in real time or post-process. Real-time reduction needs the least disk space because it compresses and/or dedupes data as it's written to the share, but it's compute-intensive and can crimp performance. Post-process data reduction happens after data is written to disk. This approach requires enough space to hold both the inflated and deflated versions of the files, but it can be done during off hours, when it's less likely to affect user response time.
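
The distinction is easiest to see in a toy sketch. The Python below uses plain whole-file zlib compression as a stand-in for any vendor's reduction engine; it only illustrates where the work happens, at write time versus in a later batch pass that briefly needs room for both copies.

```python
import os
import zlib

# Real-time (inline) reduction: pay the CPU cost as the data is written.
def write_inline(path: str, data: bytes) -> None:
    with open(path + ".z", "wb") as f:
        f.write(zlib.compress(data))

# Post-process reduction: write the data untouched now, shrink it during off hours.
def write_raw(path: str, data: bytes) -> None:
    with open(path, "wb") as f:
        f.write(data)

def off_hours_pass(directory: str) -> None:
    """Batch job: compress plain files, skipping ones already reduced."""
    for name in os.listdir(directory):
        if name.endswith(".z"):
            continue
        full = os.path.join(directory, name)
        with open(full, "rb") as f:
            data = f.read()
        with open(full + ".z", "wb") as f:
            f.write(zlib.compress(data))
        os.remove(full)  # until here both copies exist -- the extra space the text mentions
```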

Deduplication systems find and eliminate duplicate data by dividing files into chunks and looking for chunks that contain the same data. The major difference among them is how they chunk data. The simplest method is to use fixed-size chunks, such as disk blocks.
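
Here's a minimal, illustrative sketch of the fixed-chunk approach. The SHA-256 digests, 4-KB chunk size, and in-memory dictionary are stand-ins for whatever index a shipping product actually uses.

```python
import hashlib

CHUNK = 4096  # fixed chunk size, comparable to a disk block

def dedupe_fixed(data: bytes):
    """Store each unique fixed-size chunk once; describe the file as a list of chunk hashes."""
    store: dict[str, bytes] = {}   # chunk digest -> chunk contents
    recipe: list[str] = []         # the file, expressed as references to chunks
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe

data = (b"A" * CHUNK) * 3 + b"trailing bytes"
store, recipe = dedupe_fixed(data)
print(len(recipe), "chunks referenced,", len(store), "stored")  # 4 referenced, 2 stored
```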

NetApp's Write Anywhere File Layout, or WAFL, builds files as lists of blocks and calculates a checksum for each block, storing it with the data. The deduplication process, which can run on a schedule or be triggered by an event such as a disk reaching a utilization threshold, compares blocks with matching checksums to see if they contain the same data. If they do, WAFL deletes one block and points the metadata of the file that held it at the surviving copy. The deduplication feature will be released as a free software upgrade later this year.
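
A toy model of that checksum-compare-then-share cycle appears below. It is not NetApp's implementation: MD5 checksums, Python dicts, and per-file block lists stand in for WAFL's block checksums and on-disk metadata. The detail worth noting is the byte-for-byte verification before any block is shared.

```python
import hashlib
from collections import defaultdict

def dedupe_pass(blocks: dict, file_maps: dict) -> None:
    """Group blocks by checksum, byte-compare the candidates, repoint file block
    maps at one surviving copy, and free the duplicates."""
    by_sum = defaultdict(list)
    for blkno, data in blocks.items():
        by_sum[hashlib.md5(data).hexdigest()].append(blkno)

    for candidates in by_sum.values():
        keeper = candidates[0]
        for dup in candidates[1:]:
            if blocks[dup] == blocks[keeper]:        # matching checksums still get verified
                for blkmap in file_maps.values():    # update each file's metadata
                    for i, blk in enumerate(blkmap):
                        if blk == dup:
                            blkmap[i] = keeper
                del blocks[dup]                      # reclaim the duplicate block

blocks = {0: b"x" * 4096, 1: b"x" * 4096, 2: b"y" * 4096}
file_maps = {"fileA": [0, 2], "fileB": [1, 2]}
dedupe_pass(blocks, file_maps)
print(file_maps, "->", len(blocks), "blocks remain")  # both files now share block 0
```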

NetApp's approach opts for low overhead instead of high data reduction ratios, so its performance impact should be minimal for the vast majority of applications.

Deduplication using variable block sizes is more complicated but can identify duplicate data in the body of files saved elsewhere. GreenBytes' ZFS+ adds real-time, variable-size-block deduplication to Sun's open source ZFS file system. GreenBytes' Cypress NAS appliance, based on Sun's X4540 storage server, uses variable block sizes to deliver 800-MBps performance -- in part through clever use of flash SSDs to store hash lookup tables and logs. GreenBytes' appliances, priced at $100,000 for 46 TB of raw space, are set for release this summer.
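
Content-defined chunking is the general technique behind variable block sizes: chunk boundaries are chosen from the data itself, so an insertion near the start of a file shifts only nearby chunks rather than every chunk after it. The sketch below is generic and illustrative, not GreenBytes' or ZFS+'s algorithm; a toy hash stands in for a real Rabin fingerprint, and the mask and size limits are arbitrary choices.

```python
import hashlib

MASK = 0x0FFF                       # boundary odds ~1 in 4,096 -> roughly 4-KB average chunks
MIN_CHUNK, MAX_CHUNK = 1024, 16384  # keep chunk sizes within sane limits

def chunks(data: bytes):
    """Yield variable-size chunks whose boundaries are chosen by the data itself."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) ^ byte) & 0xFFFFFFFF   # toy hash standing in for a Rabin fingerprint
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == MASK) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def dedupe_variable(data: bytes):
    store, recipe = {}, []               # unique chunks by digest; the file as a digest list
    for chunk in chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe
```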

The Essentials

Primary Storage Reduction

1. Consider the options your existing network-attached storage offers. You may get half the bang for none of the bucks.
2. Primary storage isn't as readily deduplicated as backups are. Plan for 3-to-1 or 4-to-1, not 20-to-1, reductions.
3. Deduplication and large read caches can speed up some apps, including hosting virtual servers.
4. Try deduping less-performance-critical data such as user home directories first.
5. Deleting or archiving obsolete data will free more space than deduplicating it.

Riverbed Technology's Atlas, due next year, puts an appliance or a redundant pair of appliances between the network and any CIFS/NFS file server to deduplicate data in real time. WAN bandwidth is even scarcer than disk space, so Riverbed uses small variable block sizes for high deduplication ratios. Its price has not been announced.

Deduplicating frequently accessed data, such as virtual machine images, changes the disk access pattern from reads spread across a volume to accesses of the one deduplicated copy. If the file server has sufficient cache, this replaces many disk I/Os with cache reads. Both NetApp and GreenBytes offer extended read cache options, with GreenBytes offering up to 600 GB of flash cache.
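
A back-of-the-envelope simulation shows the effect. The numbers below are invented for illustration (ten mostly identical "images" of 100 blocks each), not measurements from NetApp or GreenBytes gear; once the images deduplicate to a handful of unique blocks, a modest cache keyed by block digest absorbs nearly all the reads.

```python
import hashlib

BLOCK = 4096
golden = bytes(range(256)) * (BLOCK // 256)   # the block content the images share
# Ten "VM images" of 100 blocks each: 99 shared blocks plus 1 block unique to each VM.
images = [[golden] * 99 + [bytes([vm]) * BLOCK] for vm in range(10)]

unique = {}                                   # the deduplicated block store
recipes = []                                  # each image as a list of block digests
for img in images:
    recipe = [hashlib.sha256(b).hexdigest() for b in img]
    recipes.append(recipe)
    for digest, blk in zip(recipe, img):
        unique.setdefault(digest, blk)

cache, hits, disk_reads = {}, 0, 0
for recipe in recipes:                        # every VM reads its entire image
    for digest in recipe:
        if digest in cache:
            hits += 1
        else:
            disk_reads += 1                   # one disk I/O per block not yet cached
            cache[digest] = unique[digest]
print(len(unique), "unique blocks;", hits, "cache hits;", disk_reads, "disk reads out of 1,000")
```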

Where the other dedupe schemes look at files as sets of bits, Ocarina's Online Storage Optimization Solution takes another approach: it recognizes common file types and uses different techniques to space-optimize each. Ocarina breaks complex documents like ZIP files or PowerPoint presentations into their component objects. For example, a PowerPoint slide might be deconstructed into a text block, background, logo, photo, and graph, each of which is separately deduped and compressed with algorithms optimized for its data type. An optimizer replaces the files with a series of links to their constituent deduplicated objects. A reader sits between the user and the filer and reassembles data as users access it.
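
The sketch below is a hypothetical rendering of that idea, not Ocarina's actual pipeline: it cracks open a ZIP container (the same wrapper format that underlies .pptx and .docx files), dedupes the member objects by hash, compresses each stored object with zlib where a real type-aware system would pick a codec per object type, and rebuilds the file on read.

```python
import hashlib
import io
import zipfile
import zlib

def optimize_container(path: str, object_store: dict) -> list:
    """Replace a container file with a recipe of (member name, object digest) pairs;
    unique member objects land, compressed, in object_store."""
    recipe = []
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            member = zf.read(name)
            digest = hashlib.sha256(member).hexdigest()
            if digest not in object_store:
                # A real system would choose a codec per object type (text, JPEG, XML...).
                object_store[digest] = zlib.compress(member)
            recipe.append((name, digest))
    return recipe

def reassemble(recipe: list, object_store: dict) -> bytes:
    """The 'reader' side: rebuild the container from its stored objects on access."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, digest in recipe:
            zf.writestr(name, zlib.decompress(object_store[digest]))
    return buf.getvalue()
```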

Ocarina's road map calls for original equipment manufacturers to integrate its technology into NAS systems. Several large OEMs have committed to the project but none has gone public yet.

Howard Marks is chief scientist at Networks Are Our Lives, specializing in data storage, management, and protection. Write to us at [email protected].

Continue to the sidebar:
Know Your Options

About the Author(s)

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at Deepstorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems, and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino, and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real-world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop, and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
