Data Storage Group has been awarded a patent for its core deduplication technology, which the company says is better suited for data distributed across multiple servers. Most deduplication technologies break data into chunks of 8K to 256K, then build index files to track those chunks and determine whether they have been seen before; DataStor's Adaptive Content Factoring technology instead creates one index entry per file version, which the company says results in two to three orders of magnitude fewer index entries. The technology is used in the enterprise version of the company's DataStor Shield product.
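The difference between the two indexing approaches can be sketched as follows. This is a minimal illustration, not DataStor's implementation; the chunk size, SHA-256 hashing, and function names are assumptions for demonstration only:

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # 8K, the low end of the chunk range cited above

def chunk_index(data: bytes) -> dict[str, int]:
    """Chunk-level dedup: one index entry per unique chunk hash."""
    index = {}
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        index.setdefault(hashlib.sha256(chunk).hexdigest(), offset)
    return index

def version_index(versions: list[bytes]) -> dict[str, int]:
    """Per-file-version dedup: one index entry per stored file version."""
    return {hashlib.sha256(v).hexdigest(): n for n, v in enumerate(versions)}

# A file made of 100 distinct 8K chunks: the chunk-level index carries
# 100 entries, while the per-version index carries one for the whole file.
data = b"".join(i.to_bytes(8, "big") + bytes(CHUNK_SIZE - 8) for i in range(100))
print(len(chunk_index(data)), len(version_index([data])))  # 100 1
```

The index growth is what matters at scale: the chunk index grows with file size, while the per-version index grows only with the number of versions retained.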
The patent helps validate a product that has met the needs of Darrin Tams, enterprise server administrator for the St. Vrain Valley School District in Longmont, Colo., who says he has been using it for more than four years. The district has 30,000 staff members and students, and just two staff members support its up to 200 servers, according to Tams.
With the software, the district backs up nearly a terabyte of raw data from about 60 servers into a baseline of about 600GBytes, and maintains 30 days of backups, Tams says. (The remaining servers are used primarily for applications rather than data, and don't need to be backed up.) The backed-up data is a mix of user documents, user profiles and some Microsoft SQL Server database data. The district also runs software with modules that interface with that data; students use it for performance testing, and the same software and data setup is duplicated among the schools.
The big challenge with the way other companies perform deduplication is that a 1GByte file, broken up into 8K chunks, results in 128,000 index entries, says Brian Dodd, president and chief executive officer of Data Storage Group, which is also based in Longmont. "If you have to manage that volume of index entries across a broad network, you very quickly get overwhelmed," he says.
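Dodd's figure holds up as back-of-envelope arithmetic. A quick sketch: the quoted 128,000 works out exactly if 1 GByte is taken as 1,024,000 KBytes; strict binary units would give 131,072 instead:

```python
CHUNK_KB = 8  # 8K chunks, the worst case Dodd describes

# One index entry per chunk of a 1GByte file:
entries_mixed_units = 1_024_000 // CHUNK_KB       # 1 GB as 1,024,000 KB -> 128,000
entries_binary_units = (1024 * 1024) // CHUNK_KB  # 1 GiB as 2^20 KiB   -> 131,072

# Either way, it is five orders of magnitude more index entries than the
# single per-file-version entry Adaptive Content Factoring would keep.
print(entries_mixed_units, entries_binary_units)  # 128000 131072
```

Multiply that per-file count across every file on every server in a network, and the index-management burden Dodd describes follows directly.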
The architecture works reasonably well for a single target-based system because organizations can beef up that server with more processing power and memory, he says. "But to try to distribute that load across a network is essentially impossible."