A Data De-Duplication Survival Guide: Part 2
In the second installment of this series, we discuss when to de-duplicate data
August 9, 2008
Editor's note: This is the second installment of a four-part series that will examine the technology and implementation strategies deployed in data de-duplication solutions:
Part 1 looked at the basic location of de-duplication -- standalone device, VTL solution, or host software.
Part 2 discusses the timing of de-duplication. This refers to the inline versus post-processing debate.
Part 3 will cover unified versus siloed de-duplication, exploring the advantages of using a single supplier with the same solution covering all secondary data, versus deploying unique de-duplication for various types of data.
Part 4 will discuss performance issues. Many de-duplication suppliers claim incredibly high-speed ingestion rates for their systems, and we'll explore how to decipher the claims.
The Timing Issue
One of the hottest debates in data de-duplication is when the process should be done. Should it happen inline, as data is being ingested, or as a post-process after the initial send of the backup job completes?
Although a more detailed explanation of de-duplication is given in my first article, as a quick reminder, de-duplication is a process that compares an incoming data stream to previously stored data, identifies redundant sub-file segments of information, and only stores the unique segments. In a backup this is particularly valuable since much of the data is identical, especially from full backup to full backup.
There are basically three "whens" of de-duplicating data: inline, post-process, or a combination of the two.
When a product claims it is de-duplicating data inline, that typically means that as the appliance receives data, redundant data is identified, a pointer is established, and only the unique data is written to disk -- the duplicate data is never written to disk. Post-process de-duplication means that all of the data is first written to disk in its native form; a separate, sequential process then analyzes that data and eliminates the duplicates. Some vendors offer a variant of post-process de-duplication that employs buffers to allow the de-duplication process to start before the entire backup is completely ingested.
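To make the mechanics concrete, here is a minimal sketch in Python of the inline approach -- not any vendor's actual implementation. It assumes fixed-size segments and a SHA-256 fingerprint index, writes only segments it has not seen before, and keeps pointers for everything else.

```python
import hashlib
import io

SEGMENT_SIZE = 8 * 1024  # illustrative fixed segment size; real products vary

def ingest_inline(stream, store, index):
    """Write only unique segments as data arrives; return the pointer list for this stream.

    store: dict of fingerprint -> segment bytes (stands in for the disk pool)
    index: set of fingerprints already written
    """
    pointers = []
    while True:
        segment = stream.read(SEGMENT_SIZE)
        if not segment:
            break
        fp = hashlib.sha256(segment).hexdigest()
        if fp not in index:       # unique segment: write it once
            store[fp] = segment
            index.add(fp)
        pointers.append(fp)       # duplicates become pointers only
    return pointers

# Hypothetical example: two identical 8-Kbyte segments plus one unique segment
backup = io.BytesIO(b"A" * 16 * 1024 + b"B" * 8 * 1024)
store, index = {}, set()
pointers = ingest_inline(backup, store, index)
print(len(pointers), "segments referenced,", len(store), "actually written")  # 3 referenced, 2 written
```

A post-process system would instead land each segment on disk in native form and run the same fingerprint-and-eliminate pass later against the stored copy.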
Presentation
One of the strengths of an inline system is the simplicity of its presentation. You are dealing with the data in a single state. For better or for worse, it is always de-duplicated. Post-processing has a weakness in its presentation: it has to deal with data in both native and de-duplicated states. There has to be enough space in the raw area to land the inbound backup jobs.
Vendors have handled this by either requiring the user to manage the two backup pools or having the system manage that data in the background. Either way, there is some additional management work required to make sure that there is enough capacity in the landing area to handle the entire backup job. This is not to say that inline systems are immune to poor capacity planning or unexpected changes in the environment that could cause them to fill up. What we have seen and experienced is that users have an easier time managing an inline system.
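As a rough illustration of that capacity-planning step (the numbers and the headroom figure are hypothetical, not a vendor recommendation), a post-process landing area has to be sized against the largest native-format backup it will have to hold before de-duplication runs:

```python
def landing_area_ok(largest_backup_tb, landing_capacity_tb, headroom=0.2):
    """Post-process staging check: the raw landing area must hold the entire
    native-format backup job, plus some headroom, before de-duplication runs."""
    return landing_capacity_tb >= largest_backup_tb * (1 + headroom)

# Hypothetical example: a 10-Tbyte full backup landing on 13 Tbytes of raw capacity
print(landing_area_ok(largest_backup_tb=10, landing_capacity_tb=13))  # True -- 12 Tbytes needed
```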
Performance
For an inline system, performance is the potential Achilles' heel, because you may be sacrificing performance for the simplicity of interaction. Doing the de-duplication step in real time requires processing power, and underpowered or inefficient systems may hinder an inline system's ability to ingest data. A post-processing system does not have to worry about the ingest performance impact of de-duplication -- it does not process data as it is being received, so its bottlenecks are more likely to be disk or network I/O limitations. Inline systems count on processing power becoming cheaper and faster over time, in keeping with Moore's Law. This has resulted in continual growth in the speed at which inline systems can ingest data; today, a single mid-range to high-end inline system can handle about 750 Gbytes to 1 Tbyte per hour.
How much backup performance you need is a key factor in making a data de-duplication decision. If you can meet your backup window requirements by transferring 1 Tbyte per hour, or if your infrastructure can't sustain much more than 1 Tbyte per hour, then the ease-of-use gains of an inline system outweigh the unrealized performance capabilities of a post-processed system. Supporting several of these units is not out of the question if it will allow you to meet your backup window. It is important to note that, as of today, no systems support de-duplication across separate appliances, although we expect to see that capability delivered this year. Lastly, some of this performance challenge can be offset in environments that have a high data redundancy rate, because subsequent backups write less data. Fewer writes means not only less data actually written but also less data for which RAID parity needs to be calculated.
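As a quick sizing check using the rough figures above (about 1 Tbyte per hour per mid-range inline appliance, and no de-duplication across appliances today), the arithmetic looks something like this; the numbers are illustrative only:

```python
import math

def inline_appliances_needed(backup_set_tb, window_hours, ingest_tb_per_hour=1.0):
    """Sustained rate the environment must deliver, and how many inline appliances
    (each ingesting independently -- no cross-appliance de-dupe today) that implies."""
    required_rate = backup_set_tb / window_hours
    return required_rate, math.ceil(required_rate / ingest_tb_per_hour)

# Illustrative example: 8 Tbytes of full backups in an 8-hour window
rate, units = inline_appliances_needed(backup_set_tb=8, window_hours=8)
print(f"required rate: {rate:.1f} Tbytes/hour -> {units} appliance(s)")
```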
If your infrastructure can deliver more than 2 Tbytes per hour, and your backup window requirements demand more than 2 Tbytes per hour, this could be a situation for which the speed potential of a post-processed system would be more appropriate. Consider also that this typically means you have a very large data set and you are more than likely counting on tape to play an active role in your environment.
First, make sure that the entire disk backup solution -- backup, vault to tape, and data de-duplication -- can sustain these speeds for the daily backup policy. De-duplication is not the sole potential bottleneck. Also, if you are going to depend on tape, make sure that an integrated move to tape is high on your testing criteria. If electronic vaulting is also a required capability, include it in the testing criteria along with testing the complete daily backup policy.
Recovery Performance
Post-process vendors will make a case for recovery performance as well, stating that keeping data in its native state is critical for rapid recovery because the data does not have to be reconstructed from its de-duplicated form. Not all post-process solutions deal with the matter in the same way: some keep as much native data available as possible, while others keep only the last version of the backup job in native form. Regardless, there could indeed be performance issues with recovering de-duplicated data, but as with backup, make sure that there are not other bottlenecks in the environment that would be of greater concern: the network, the ability of the server to rapidly receive data, the requirement during a recovery to rewrite all of the RAID parity data, and, of course, the simple fact that writes are slower than reads.
If speed of recovery is that critical, then other options, like Continuous Data Protection (CDP) solutions, which store data in its truly native format, should be considered. Most of these solutions allow you to boot right off of the backup copy of data, eliminating the data transfer from the recovery step.
Disaster Recovery
As stated earlier, one of the potential strengths of post-processing is that its data de-duplication step can happen after the data is written and the backup is complete. Post-processing is less dependent on the processing capability available to it, but it also creates challenges with the disaster recovery (DR) process. The post-process step has to be completed before replication of the backup data can occur, and depending on the system configuration and the amount of data involved, this can be a considerable period of time. While few vendors will report what their post-process de-duplication times are, we have seen times from 1 hour to 3 hours per Tbyte. Your mileage can and will vary greatly depending on your data.
The important measurement here is post-processing's impact on the DR replication window. If there is a requirement to get data to an offsite facility within a set window, then you may not have enough time to complete the backup job, run the de-duplication process, and then replicate the data. If offsite protection is critical, the reduced time available for replication may force the user to obtain additional bandwidth.
Even if there is not a set window in which to have the DR copies made, internally you want it done well before the next backup job begins. If you take 7 hours to back up 10 Tbytes, and then 15 hours to analyze and de-dupe those 10 Tbytes (assuming a de-dupe rate of 1.5 hours per Tbyte), that only leaves you about 2 hours to replicate all of the data to the remote site before you have to start the next backup window. This also leaves no margin for error if, for some reason, a client failed to send data.
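Working that timeline out as a small calculation (the 1.5-hours-per-Tbyte figure is simply the midpoint of the 1-to-3-hour range mentioned above; substitute your own measurements):

```python
def post_process_replication_slack(backup_tb, backup_hours,
                                   dedupe_hours_per_tb=1.5, cycle_hours=24):
    """Hours left in a daily cycle for DR replication when de-dupe runs only
    after the backup job has landed in full."""
    dedupe_hours = backup_tb * dedupe_hours_per_tb
    return cycle_hours - backup_hours - dedupe_hours

# The example from the text: 10 Tbytes backed up in 7 hours, then de-duplicated
print(post_process_replication_slack(backup_tb=10, backup_hours=7))  # 2.0 hours left to replicate
```

An inline system, by contrast, can begin replicating de-duplicated data while the backup is still running, which is the point made next.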
With inline processing, replication can start as data is coming into the appliance, so even if the backup window is twice as long, because replication starts earlier, your net backup-plus-replication time may actually be shorter. While this may not be the sole factor in the decision-making process, it is certainly something to consider.
De-Dupe Not the Primary Requirement
Data de-duplication is not the primary focus of all solutions. Depending on your environment, there may be a capability that is more critical right now: power-managed storage, data retention, very tight tape integration, or boot from backup via iSCSI. Any of these can be important, and if they apply to your data center, they should be considered. At that point, a data de-duplication feature that is added later via a post-process may be acceptable.
Summary
When deciding between inline and post-processing, it is important to understand how much backup performance you need, how much backup performance your infrastructure can deliver, how soon you need to create a DR copy of your backup data, and whether there is a more important capability to consider than de-duplication.