A Rose May Be A Rose But Dedupe Is Not Necessarily Dedupe

Howard Marks

March 22, 2012


Today just about every vendor in the storage market has data deduplication as a feature in one or more of their products. While those of us who work with multiple deduplication products on a regular basis know there are big differences between the various technologies that vendors call data deduplication, many users don’t know how to pick the deduplication solution that will best fit their needs.

The breadth of the deduplication market was brought into focus for me by recent articles on StorageNewsletter.com that listed 96 companies selling products with deduplication and 24 deduplication patents issued or applied for from September through December of last year. As I was absorbing the amount of patent activity in deduplication, Sepaton CTO Jeff Tofano and I had a chat about a list of 10 questions users should ask when considering a deduplication solution.

Some of the questions are, as we would expect, targeted to the large enterprise market Sepaton addresses, and to some extent targeted to Sepaton’s strengths, but they do address issues any potential dedupe customer should pay attention to. Rather than create a slide show of the questions to generate a lot of page views, I’ll add a little commentary to each.

1-What impact will deduplication have on backup performance – both now and over time?

Performance is a key consideration for any dedupe system: if you can’t make your backup window, it doesn’t really matter how much data reduction you get. Post-process systems like Sepaton’s will deliver consistent backup performance over time, while some inline systems can get slower as they fill up, especially if they’re not given enough time to perform housekeeping between backup and restore jobs.
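To make the inline approach concrete, here’s a toy sketch of hash-based dedupe in Python. It’s an illustration only, not any vendor’s implementation: real products use content-defined chunking and on-disk fingerprint indexes rather than fixed 4-KB chunks and an in-memory dictionary, but the fingerprint lookup on every chunk is the work that can slow an inline system down as it fills.

```python
import hashlib

CHUNK_SIZE = 4096   # fixed-size chunks to keep the sketch simple
chunk_store = {}    # fingerprint -> chunk data; grows with every unique chunk

def ingest(stream: bytes):
    """Store a backup stream, keeping only chunks not already seen."""
    refs, new_bytes = [], 0
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in chunk_store:     # a fingerprint lookup on every chunk:
            chunk_store[fp] = chunk   # this is where a full, housekeeping-starved
            new_bytes += len(chunk)   # inline system loses speed
        refs.append(fp)
    return refs, new_bytes            # refs describe the backup; new_bytes is what hit disk
```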

2-Will deduplication degrade restore performance?

Post-process vendors argue that since they store the last backup set in its native form, restores are as fast as possible, while inline systems must reassemble all restores from data blocks that are scattered across the deduplication repository. While some early systems were much slower at restoring data than at backing it up, the difference is getting smaller and HP claims their StoreOnce systems are just as fast at restores as backups.
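Continuing the toy sketch from question 1, a restore just walks the backup’s ordered list of fingerprints and reads each chunk back. In a real repository those chunks are scattered across the disk pool, which is exactly why restore speed deserves its own test rather than being assumed from backup speed.

```python
def restore(refs, chunk_store) -> bytes:
    """Reassemble a backup image from its ordered list of chunk fingerprints.

    Every lookup can land on a different part of the repository, so a
    heavily deduplicated restore can run slower than the original backup.
    """
    return b"".join(chunk_store[fp] for fp in refs)

# Usage with the ingest() sketch above (the path is hypothetical):
# refs, _ = ingest(open("/backups/exchange.img", "rb").read())
# image = restore(refs, chunk_store)
```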

Before you select a system based on claims of restore speed, you should check to see how often you run large restore jobs from backups that are more than a day or two old, and you should bring the system(s) you’re considering into your environment for a proof-of-concept before signing on the dotted line.

3-How will capacity and performance scale as the environment grows?

Some vendors, like market leader Data Domain, have as many as seven different models and make you pick the one with the capacity you need. When you outgrow it, you’ll need to go through a painful upgrade/migration. Others, including NEC, Sepaton, ExaGrid and HP with its B6200, let you cluster multiple systems in a more scale-out fashion.

When you look at capacity and scalability, don’t just look at capacity and aggregate performance; also ask whether scaling the system up creates multiple deduplication realms, where data sent to one repository won’t deduplicate against data stored in another silo.
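As a back-of-the-envelope illustration of why that matters, compare one global dedupe realm with two independent silos holding largely overlapping data. The figures below are invented for the example, not measurements from any product:

```python
# Hypothetical figures: two nodes each holding 20 TB of backups, with 80%
# of that data (OS images, common databases) present on both nodes.
node_a_tb, node_b_tb = 20.0, 20.0
common_fraction = 0.80

# A single global realm stores the common data only once.
global_realm_tb = node_a_tb + node_b_tb * (1 - common_fraction)

# Two independent realms each keep their own copy of the common data.
two_silos_tb = node_a_tb + node_b_tb

print(f"global realm: {global_realm_tb:.0f} TB stored, separate silos: {two_silos_tb:.0f} TB stored")
# global realm: 24 TB stored, separate silos: 40 TB stored
```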

4-How efficient is the deduplication technology for large databases (e.g., Oracle, SAP, SQL Server)?

5-How efficient is the deduplication technology in progressive incremental backup environments such as Tivoli Storage Manager (TSM) and in NetBackup OST?

6-What are realistic expectations for capacity reduction given the high data change rate common in Big Backup environments?

Ok, questions 4-6 really ask, “How well will the system deduplicate my data?” There is no standard set of data that vendors agree to use to measure either performance or data reduction, and even if there were, it would just be the EPA rating: your mileage will vary. If you’re going to spend many tens of thousands of dollars on a solution, you’ll have to test with your data and your applications.
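To put some (made-up) numbers on the EPA-rating problem, here’s the kind of trivial sizing arithmetic that the ratio you measure in your own proof of concept, not the figure on the data sheet, should drive:

```python
# Invented numbers: 200 TB of logical (pre-dedupe) backups to retain.
logical_backups_tb = 200.0

for label, ratio in [("data-sheet claim", 20.0), ("your PoC result", 8.0)]:
    usable_tb = logical_backups_tb / ratio
    print(f"{label}: {ratio:.0f}:1 reduction -> {usable_tb:.0f} TB of usable capacity needed")
# data-sheet claim: 20:1 reduction -> 10 TB of usable capacity needed
# your PoC result: 8:1 reduction -> 25 TB of usable capacity needed
```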

7-Can administrators monitor backup, deduplication, replication, and restore processes enterprise-wide?

Management and reporting are a weakness of many deduplication systems. The information a good management dashboard can give you has a lot of value, and can show you that some datasets might be better sent to cheap bulk storage, as they’re not reducing to any significant extent.

8-Can deduplication help reduce replication bandwidth requirements for large enterprise data volumes without slowing backup performance?

As much as data reduction helps save disk space, the reduction in the amount of bandwidth needed to replicate data can have an even bigger impact on your budget and your data protection processes. Once again, you’ll have to test it yourself.
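A quick, made-up example of how big that impact can be: replicating a nightly backup over a modest WAN link, with and without sending only the deduplicated (unique) blocks. The link speed and change rate here are assumptions for illustration, not benchmarks:

```python
# Invented numbers: a 5 TB nightly backup, of which 3% is new unique data,
# replicated over a 200 Mbit/s WAN link.
backup_tb = 5.0
unique_fraction = 0.03
wan_mbit_per_s = 200.0

def hours_to_send(tb: float) -> float:
    bits = tb * 1e12 * 8                         # terabytes -> bits
    return bits / (wan_mbit_per_s * 1e6) / 3600  # seconds -> hours

print(f"full nightly copy:   {hours_to_send(backup_tb):.1f} hours")
print(f"deduped replication: {hours_to_send(backup_tb * unique_fraction):.1f} hours")
# full nightly copy:   55.6 hours
# deduped replication: 1.7 hours
```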

9-Can IT “tune” deduplication by data type to meet their specific needs?

See questions 4-6. “How does it work with my data?” is all that matters.

10-How much experience does the vendor have with large enterprise backup applications such as Symantec NetBackup/OST and TSM?

Does experience matter? Yeah, sure, I never want to be the first customer to try something if there’s a tested alternative. On the other hand, being the fifth or tenth user of something innovative can be a lot better than only buying from the vendors that have been in the market forever.

Do these 10 questions cover all the issues that divide deduplication systems? Clearly not, but they are a pretty good way to clarify the choices you have to make picking a product.

At the time of publication, I have no relationship with Sepaton.

About the Author(s)

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at Deepstorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real-world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
