As I wrote in my first blog post, three factors can affect the performance of any application's connection: the amount of bandwidth, the percentage of packet loss, and the latency. By addressing each of those factors, WAN optimization products can profoundly improve protocol and application performance. I'll examine each factor over several posts, starting with bandwidth.
How Much Optimized Bandwidth Is Enough?
What is the most important factor in determining the capacity of an optimization product? Most organizations would probably say the maximum amount of optimized bandwidth, and to some extent that's true. An optimization product that can optimize 1 Gbps of WAN bandwidth technically has greater capacity than one that can optimize 50 Mbps of WAN bandwidth.
But as we've seen, the constraints of latency and loss limit the amount of data an application can send over one IP connection. One application connection will probably not be able to consume the entire capacity of an optimization product. It's more likely that many simultaneous connections are needed to utilize the product's full capacity.
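To put rough numbers on that limit, here is a back-of-the-envelope sketch in Python. It assumes two textbook bounds on a single TCP connection, neither of which is spelled out in this post: the window size divided by the round-trip time, and the Mathis approximation for loss-limited throughput. The 64 KB window, 100 ms round trip, 1,460-byte segment and 1 percent loss are illustrative values, not measurements from any product.

```python
import math

def window_limited_bps(window_bytes, rtt_seconds):
    """Upper bound when only one window of data can be in flight per round trip."""
    return window_bytes * 8 / rtt_seconds

def loss_limited_bps(mss_bytes, rtt_seconds, loss_rate):
    """Mathis approximation: (MSS / RTT) * sqrt(3/2) / sqrt(loss rate)."""
    return (mss_bytes * 8 / rtt_seconds) * math.sqrt(1.5) / math.sqrt(loss_rate)

def connections_to_fill(link_bps, per_connection_bps):
    """How many such connections it takes to use the whole link."""
    return math.ceil(link_bps / per_connection_bps)

# Illustrative assumptions: 64 KB window, 100 ms RTT, 1,460-byte MSS, 1% loss.
per_conn = window_limited_bps(65535, 0.100)
lossy = loss_limited_bps(1460, 0.100, 0.01)

print(f"window-limited: {per_conn / 1e6:.2f} Mbps per connection")
print(f"loss-limited:   {lossy / 1e6:.2f} Mbps per connection")
print("connections to fill 1 Gbps:", connections_to_fill(1e9, per_conn))
```

Under these assumptions a single connection tops out at roughly 5 Mbps, so it would take on the order of 190 simultaneous connections to fill a 1-Gbps product, which is why the connection count matters as much as the headline bandwidth figure.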
This is particularly true when optimizing a branch office. Each user typically generates 10 to 15 connections over the WAN. (Just run "NETSTAT" at your Windows command prompt and see for yourself.) Multiplied across all branch office users, the number of simultaneous optimized connections is critical for realizing the peak capacity of optimization products. But how do you fully utilize each connection? That's where deduplication comes into play.
Shopping Lists and WAN Optimization
At a high level, deduplication works in much the same way my wife creates my weekly grocery shopping list. A while back, she became so fed up with my tendency to misread her handwriting that she printed a standard shopping list. Now she just ticks off the stuff I need to purchase, only scribbling in a few new items every once in a while. (A printed list is very '90s, I know; we've since graduated to Paprika, an iPhone/Android recipe-grocery list manager.)
Deduplication applies the same idea to network traffic. WAN optimization products inspect incoming data and create a unique "fingerprint" for each data pattern, usually with a hashing algorithm. These fingerprints are then shared with the other optimization systems in the network, creating a single, coherent dictionary. (At one time, this was not the case, but today it's fairly common.) On subsequent passes, the optimization systems detect repetitive patterns by comparing fingerprints, and they replace the matching outgoing data with small tokens or instructions. The original data is reinserted on the receiving side. In this way, effective bandwidth can be dramatically expanded, enabling a 10-Mbps connection to carry 200 Mbps of data (a 20:1 reduction in the bytes actually sent).
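The fingerprint-and-token exchange described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular product works: the `Appliance` class, the fixed 1,024-byte `CHUNK_SIZE`, the use of SHA-256 and the ("raw", ...) / ("ref", ...) message format are all hypothetical choices made for illustration.

```python
import hashlib

CHUNK_SIZE = 1024  # hypothetical; real products use their own granularity

def fingerprint(chunk):
    return hashlib.sha256(chunk).digest()

class Appliance:
    """One side of the WAN; holds a fingerprint -> data dictionary."""

    def __init__(self):
        self.dictionary = {}

    def encode(self, data):
        """Replace chunks already in the dictionary with their fingerprints."""
        messages = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            fp = fingerprint(chunk)
            if fp in self.dictionary:
                messages.append(("ref", fp))      # small token instead of data
            else:
                self.dictionary[fp] = chunk
                messages.append(("raw", chunk))   # first sighting: send the data
        return messages

    def decode(self, messages):
        """Rebuild the original bytes, learning new chunks along the way."""
        parts = []
        for kind, payload in messages:
            if kind == "raw":
                self.dictionary[fingerprint(payload)] = payload
                parts.append(payload)
            else:
                parts.append(self.dictionary[payload])
        return b"".join(parts)

def wire_bytes(messages):
    """Payload bytes that would cross the WAN (framing overhead ignored)."""
    return sum(len(payload) for _, payload in messages)

def sample_data(n_bytes):
    """Deterministic pseudo-random test data."""
    return b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest()
                    for i in range(n_bytes // 32))

sender, receiver = Appliance(), Appliance()
data = sample_data(4096)

first = sender.encode(data)    # nothing known yet, so everything goes as raw data
assert receiver.decode(first) == data
second = sender.encode(data)   # same data again, so only fingerprints are sent
assert receiver.decode(second) == data

print("first pass:", wire_bytes(first), "bytes; second pass:", wire_bytes(second), "bytes")
```

In this sketch the receiver's dictionary stays in step with the sender's simply because it sees every raw chunk once. The second pass sends only 32-byte fingerprints in place of 1,024-byte chunks, which is the shopping-list tick mark in code form.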
Deduplication, Compression and Caching
Of course, that description could apply to compression as well. The difference is one of horizon: compression works on data patterns over a short span, typically within a file, while deduplication algorithms identify data patterns across a much longer span, typically across files.
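The difference in horizon is easy to see with a small experiment. The sketch below uses Python's standard zlib module as a stand-in for "compression" (its DEFLATE format can only refer back about 32 KB) and a plain set of SHA-256 fingerprints as a stand-in for a deduplication dictionary. The 8 KB and 40 KB block sizes and the 1,024-byte chunk size are assumptions chosen to fall on either side of that window; real products will have different limits.

```python
import hashlib
import zlib

def sample_data(n_bytes):
    """Deterministic pseudo-random (incompressible) test data."""
    return b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest()
                    for i in range(n_bytes // 32))

def repeated_chunks(data, chunk_size=1024):
    """Count chunks whose fingerprint was already seen earlier in the stream."""
    seen, repeats = set(), 0
    for start in range(0, len(data), chunk_size):
        fp = hashlib.sha256(data[start:start + chunk_size]).digest()
        if fp in seen:
            repeats += 1
        seen.add(fp)
    return repeats

near = sample_data(8 * 1024)    # repeat falls inside zlib's look-back window
far = sample_data(40 * 1024)    # repeat falls outside it

near_ratio = len(zlib.compress(near + near)) / len(near + near)
far_ratio = len(zlib.compress(far + far)) / len(far + far)

print(f"zlib, repeat 8 KB back:  output is {near_ratio:.0%} of input")
print(f"zlib, repeat 40 KB back: output is {far_ratio:.0%} of input")
print("chunks a fingerprint dictionary would catch in the 40 KB case:",
      repeated_chunks(far + far), "of", len(far + far) // 1024)
```

When the repeat is close by, zlib removes it and the output shrinks to roughly half. When the same repeat sits 40 KB back, zlib no longer sees it and the output is about as large as the input, while the fingerprint set still recognizes every chunk of the second copy.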
Deduplication also sounds a lot like caching, but it's not. Like deduplication, a cache compares incoming traffic to a library of data and, if found, delivers data locally, saving the time and bandwidth of traversing the WAN. But that's where the similarity ends. Caches are typically specific to a particular environment--Web caches accelerate HTTP, file caches accelerate CIFS, NFS or some other file service, and object caches are specific to a given application. So organizations end up needing a separate cache for each application being accelerated. In contrast, deduplication can be protocol-agnostic.
There is also the matter of cache coherency. In general, caching is meant for static data, since dynamic data changes too frequently to be cached. With a cache, even slightly changed data requires the entire dataset to be retrieved from across the WAN. Deduplication, though, can detect more granular data patterns, so only the portion that changed has to cross the WAN.
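Here is a small Python sketch of that contrast. It makes some loud simplifications: the "cache" is modeled as a set of whole-object SHA-256 hashes (real caches usually key on a URL or file name and then validate freshness), the deduplication side uses fixed 1,024-byte chunks, and token overhead is ignored. All function names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 1024  # hypothetical granularity for the dedup side

def fingerprint(data):
    return hashlib.sha256(data).digest()

def split(data):
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def bytes_over_wan_with_cache(cache, obj):
    """Whole-object cache: any change to the object is a miss."""
    return 0 if fingerprint(obj) in cache else len(obj)

def bytes_over_wan_with_dedup(known_chunks, obj):
    """Chunk-level dedup: only chunks not seen before are sent."""
    return sum(len(c) for c in split(obj) if fingerprint(c) not in known_chunks)

# A 16 KB object, and a copy with a single byte changed.
original = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(512))
edited = bytearray(original)
edited[5000] ^= 0xFF
edited = bytes(edited)

# Both sides have already seen the original.
cache = {fingerprint(original)}
known_chunks = {fingerprint(c) for c in split(original)}

print("cache, unchanged object:", bytes_over_wan_with_cache(cache, original))
print("cache, one byte changed:", bytes_over_wan_with_cache(cache, edited))
print("dedup, one byte changed:", bytes_over_wan_with_dedup(known_chunks, edited))
```

The cache serves the unchanged object locally but fetches all 16 KB once a single byte differs. The chunk-level dictionary sends only the one 1,024-byte chunk that contains the change.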
Evaluating Deduplication Approaches
There are a number of differences between deduplication approaches. As one might imagine, the granularity at which the optimization product inspects the incoming data stream is very important. The fewer bytes that are needed to form a fingerprint, the higher the probability of a "hit" in the optimization system's database. Byte-level granularity is ideal, but vendors are typically limited to looking at 16-byte data patterns because they need to create a unique hash to avoid erroneously substituting the wrong data. An alternative approach avoids the problem by using indexes to indicate shifts in data patterns, allowing for true byte-level granularity.
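The link between pattern size and hit probability can be illustrated with a short Python experiment. It is only a sketch of the fixed-size fingerprint case, not of the index-based alternative, and every number in it is an assumption: a 64 KB pseudo-random buffer, one byte changed every 1,000 bytes, and pattern sizes of 16, 256 and 4,096 bytes.

```python
import hashlib

def sample_data(n_bytes):
    """Deterministic pseudo-random test data."""
    return b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest()
                    for i in range(n_bytes // 32))

def fingerprints(data, pattern_size):
    return [hashlib.sha256(data[i:i + pattern_size]).digest()
            for i in range(0, len(data), pattern_size)]

def hit_rate(old, new, pattern_size):
    """Fraction of patterns in `new` whose fingerprint is already known from `old`."""
    known = set(fingerprints(old, pattern_size))
    candidates = fingerprints(new, pattern_size)
    return sum(1 for fp in candidates if fp in known) / len(candidates)

old = sample_data(64 * 1024)

# Same data with one byte flipped every 1,000 bytes (hypothetical edit pattern).
edited = bytearray(old)
for position in range(500, len(edited), 1000):
    edited[position] ^= 0xFF
new = bytes(edited)

rates = {size: hit_rate(old, new, size) for size in (16, 256, 4096)}
for size, rate in rates.items():
    print(f"{size:>5}-byte patterns: {rate:.1%} hits")
```

With scattered small edits, the 16-byte patterns still match almost everywhere, the 256-byte patterns match about three-quarters of the time, and the 4,096-byte patterns never match, because every one of them contains at least one changed byte. The trade-off is that smaller patterns mean many more fingerprints to store and look up.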
The size of the dictionary is also critical since the more fingerprints that can be stored, the greater the likelihood of detecting a repetitive data pattern. It should be no surprise then that WAN optimization vendors talk about the amount of their disk space as a rough indication of the size and effectiveness of the dictionary.
Of course, dictionary size doesn't matter if the dictionary isn't populated with the right data patterns. The more applications and protocols a product inspects, the greater the likelihood that it detects additional data patterns and eliminates even more traffic from the WAN. This is a major reason why data that's already been deduplicated by data replication systems can be further deduplicated by WAN optimization products. They can detect data patterns in the other TCP-, UDP- and IP-based protocols running over the WAN, which are ignored by a system that deduplicates only the data being replicated between locations.
David Greenfield is a long-time technology analyst. He currently works in product marketing for Silver Peak.