Kadena Systems developed, patented, and distributed a block-based deduplication method that promises to improve space efficiency over traditional deduplication methods. Instead of the more common fixed-block or variable-block approaches, the Kadena method uses a 'fixed-length, sliding window' to parse each file and determine whether it contains new data, which may make it more efficient at finding redundant data. The technique is similar to what rsync does before synchronizing files. The size of the window can be adjusted, automatically or manually, based on the type of file, which improves the chance of finding redundant data within it. For example, image files, which typically contain little redundancy, use a large window, allowing a quick examination that is not very CPU-intensive. Database files, by contrast, use a much smaller examination window, because the repeating unit of data is generally the length of a table row. PowerPoint files would use a medium-size examination window.
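The basic scan can be sketched in a few lines of Python. The window sizes below are purely illustrative stand-ins for the per-file-type tuning the article describes, not Arkeia's actual values:

```python
# Hypothetical window sizes per file type (illustrative only): large for
# low-redundancy images, small for row-structured database files.
WINDOW_SIZES = {"image": 4096, "database": 64, "powerpoint": 512}

def sliding_windows(data: bytes, window: int):
    """Yield every fixed-length window of `data`, advancing one byte at a time."""
    for offset in range(len(data) - window + 1):
        yield offset, data[offset:offset + window]
```

For example, `list(sliding_windows(b"abcdef", 4))` produces the three overlapping windows starting at offsets 0, 1, and 2.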
Arkeia makes the case that its sliding-window technology offers the speed of fixed-block deduplication, because all the blocks are the same size, together with the space-saving benefits of variable-block deduplication, because it accommodates file changes caused by byte inserts. The traditional concern with the sliding-window approach is that it requires significantly more CPU horsepower than either fixed-block or variable-block deduplication, because each time the window slides one byte, a new fingerprint (e.g., a message digest) must be calculated. Arkeia addresses this concern with a technique it calls progressive matching.
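The standard way to make per-byte sliding affordable is a rolling checksum, as rsync uses: the fingerprint for the next window is derived from the previous one in constant time rather than recomputed from scratch. A sketch of that idea, modeled on rsync's weak checksum rather than on Arkeia's actual code:

```python
MOD = 1 << 16

def weak_checksum(block: bytes) -> int:
    """rsync-style weak checksum of a whole window (O(window) to compute)."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % MOD
    return (b << 16) | a

def roll(csum: int, out_byte: int, in_byte: int, window: int) -> int:
    """Slide the window one byte to the right in O(1):
    drop out_byte from the front, add in_byte at the back."""
    a, b = csum & 0xFFFF, csum >> 16
    a = (a - out_byte + in_byte) % MOD
    b = (b - window * out_byte + a) % MOD
    return (b << 16) | a
```

Rolling the checksum forward one byte yields exactly the value that recomputing from scratch would, so a byte-by-byte scan costs a handful of arithmetic operations per position instead of a full fingerprint.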
Progressive matching uses lightweight checksums to quickly determine whether a block is new or potentially a duplicate. Only in the latter case is a more CPU-intensive calculation (e.g., a message digest) run to confirm the redundancy. As a result, the heavier CPU cost is paid only when there is a greater likelihood of redundancy. Arkeia believes the net effect is greater deduplication efficiency without greater-than-normal CPU requirements.
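A minimal sketch of this two-tier idea, using a toy byte-sum as the cheap checksum and SHA-256 as the expensive digest (both are stand-ins; the article does not say which functions Arkeia actually uses):

```python
import hashlib

def progressive_dedup(blocks):
    """Return the non-redundant blocks, computing a strong digest only
    when a cheap-checksum collision suggests a possible duplicate."""
    buckets = {}                      # weak checksum -> list of [digest, block]
    strong_calls = 0
    unique = []

    def strong(entry):
        nonlocal strong_calls
        if entry[0] is None:          # digest computed lazily, on first collision
            entry[0] = hashlib.sha256(entry[1]).digest()
            strong_calls += 1
        return entry[0]

    for block in blocks:
        weak = sum(block) % 65536     # stand-in for a cheap rolling checksum
        bucket = buckets.setdefault(weak, [])
        if bucket:                    # weak match: possible duplicate, verify it
            entry = [None, block]
            digest = strong(entry)
            if any(strong(e) == digest for e in bucket):
                continue              # confirmed duplicate: drop the block
            bucket.append(entry)      # weak collision but genuinely new data
        else:
            bucket.append([None, block])  # no weak match: certainly new, no digest
        unique.append(block)

    return unique, strong_calls
```

On the input `[b"aaa", b"bbb", b"aaa", b"ccc"]`, only two strong digests are ever computed, both triggered by the repeated block; a naive scheme would digest all four blocks.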
Once the non-redundant data is identified, it is compressed and sent across the network to the backup server, where it is stored in its deduplicated state. As with other source-side deduplication products, this makes effective use of network bandwidth and reduces the amount of storage required. It also enables the use of disk as a longer-term repository for backups.
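That final step, compressing only the non-redundant blocks before they cross the wire, might look like the sketch below; zlib here is a stand-in for whatever codec the product actually uses:

```python
import zlib

def prepare_for_transfer(unique_blocks):
    """Compress each non-redundant block; the backup server stores
    the compressed, deduplicated form without re-expanding it."""
    return [zlib.compress(block) for block in unique_blocks]

# Repetitive data (the kind both dedup and compression favor) shrinks sharply.
blocks = [b"INSERT INTO users VALUES (1, 'alice');" * 100]
payload = prepare_for_transfer(blocks)
```

The payload sent over the network is a fraction of the raw size, which is where the bandwidth and storage savings described above come from.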