A Data De-Duplication Survival Guide: Part 2

Editor's note: This is the second installment of a four-part series that will examine the technology and implementation strategies deployed in data de-duplication solutions:

  • Part 1 looked at the basic location of de-duplication -- standalone device, VTL solution, or host software.
  • Part 2 discusses the timing of de-duplication. This refers to the inline versus post-processing debate.
  • Part 3 will cover unified versus siloed de-duplication, exploring the advantages of using a single supplier with the same solution covering all secondary data, versus deploying unique de-duplication for various types of data.
  • Part 4 will discuss performance issues. Many de-duplication suppliers claim incredibly high-speed ingestion rates for their systems, and we'll explore how to decipher the claims.

The Timing Issue

One of the hottest debates in data de-duplication is when the process should be done. Should it be done inline, as data is being ingested, or as part of a post-process after the initial send of the backup job completes?

Although a more detailed explanation of de-duplication is given in my first article, as a quick reminder, de-duplication is a process that compares an incoming data stream to previously stored data, identifies redundant sub-file segments of information, and only stores the unique segments. In a backup this is particularly valuable since much of the data is identical, especially from full backup to full backup.
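To make that concrete, here is a minimal sketch of segment-level de-duplication using content hashing. The fixed 4 KB segment size, the in-memory segment_store dictionary standing in for the appliance's disk, and the function names are illustrative assumptions, not how any particular product is built.

```python
import hashlib

SEGMENT_SIZE = 4096  # illustrative fixed segment size; real products vary

# Hypothetical in-memory segment store standing in for the appliance's disk:
# maps a segment's hash to the unique segment bytes.
segment_store = {}

def dedupe_stream(data: bytes) -> list:
    """Split the incoming stream into fixed-size segments, store only the
    segments not seen before, and return the pointers (hashes) that
    reconstruct the stream."""
    pointers = []
    for offset in range(0, len(data), SEGMENT_SIZE):
        segment = data[offset:offset + SEGMENT_SIZE]
        digest = hashlib.sha256(segment).hexdigest()
        if digest not in segment_store:
            segment_store[digest] = segment   # unique segment: store it once
        pointers.append(digest)               # duplicates cost only a pointer
    return pointers

def restore(pointers: list) -> bytes:
    """Rebuild the original stream from the stored unique segments."""
    return b"".join(segment_store[p] for p in pointers)

# Two "full backups" with identical content add no new segments the second time.
first = dedupe_stream(b"A" * 8192 + b"B" * 4096)
second = dedupe_stream(b"A" * 8192 + b"B" * 4096)
assert restore(second) == b"A" * 8192 + b"B" * 4096
assert len(segment_store) == 2  # only the unique "A" and "B" segments are kept
```

In this toy example the second full backup consumes no new segment storage at all, which is why full-to-full backups benefit so much from de-duplication.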

There are basically three "whens" of de-duplicating data: inline, post-process, or a combination of the two.

When a product claims it is de-duplicating data inline, that typically means that as the appliance is receiving data, redundant data is identified, a pointer is established, and only the unique data is written to disk -- the duplicate data is never written to disk. Post-process de-duplication means that all of the data is first written to disk in its native form, then a separate, sequential process analyzes that data, and the duplicate data is eliminated. Some vendors offer a variant of post-process de-duplication that employs buffers to allow the de-duplication process to start before the entire backup is completely ingested.
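The timing difference can be sketched with the same segment-hash idea. This is a simplified illustration, with in-memory dictionaries standing in for the de-duplicated store, a plain list standing in for the native-format landing area, and function names that are my own rather than any vendor's.

```python
import hashlib

def segments(stream: bytes, size: int = 4096):
    """Yield fixed-size segments of an incoming backup stream."""
    for offset in range(0, len(stream), size):
        yield stream[offset:offset + size]

def inline_ingest(stream: bytes, store: dict) -> list:
    """Inline: duplicates are identified as data arrives, so they are
    never written to the store at all."""
    pointers = []
    for seg in segments(stream):
        digest = hashlib.sha256(seg).hexdigest()
        if digest not in store:
            store[digest] = seg            # only unique segments land on disk
        pointers.append(digest)
    return pointers

def post_process_ingest(stream: bytes, landing_area: list) -> None:
    """Post-process, step 1: the backup lands on disk in its native form."""
    landing_area.append(stream)

def post_process_dedupe(landing_area: list, store: dict) -> list:
    """Post-process, step 2: a separate sequential pass reads the landed
    data, keeps only the unique segments, and frees the native copies."""
    pointers = []
    for stream in landing_area:
        for seg in segments(stream):
            digest = hashlib.sha256(seg).hexdigest()
            store.setdefault(digest, seg)
            pointers.append(digest)
    landing_area.clear()                   # reclaim the temporary staging space
    return pointers
```

Both paths end up with the same unique segments and pointers; the debate is purely about when the work happens and what resources are consumed along the way.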
