Long-Term Storage & Compliance: CAS Vs. Locked NAS

The answer to a barrage of laws and regulations requiring IT to store data for increasing periods of time? CAS and locked NAS.

February 25, 2008


How certain are you that the electronic data your team retrieves in response to discovery requests is complete and unaltered? Recent court rulings have framed electronic records as on par with audio recordings and digital photos in terms of reliability, as judges recognize that a clever cheat could modify an e-mail to remove a critical "not" before submitting it into evidence. IT groups that have yet to implement systems that store data in nonmodifiable form are behind the curve.

Long-term data retention mandates are a minefield as well. Organizations covered by OSHA regulations must keep pre-hiring physical exam records for 30 years after an employee's termination of employment, for example, while HIPAA requires that medical facilities retain records for 20 years or more. Simply keeping copies of end-of-month or end-of-year backup tapes doesn't cut it for long-term data retention. Even if the tape hasn't degraded physically, it's unlikely you'll still have a drive that can read it.

Because organizations facing the most restrictive regulations have had deep pockets -- at least until the sub-prime mortgage debacle -- storage vendors including Caringo, EMC, Hitachi Data Systems, Permabit Technology Corp. and Nexsan Technologies offer a variety of technologies to store fixed content data. These systems aren't cheap, but neither is litigation. And, as the space expands -- Gartner expects the e-mail archiving market to grow from $315 million in new license sales in 2007 to $1 billion by 2011 -- IT will have more to choose from. We asked vendors about the latest in tamperproof CAS (content addressable storage) and locked NAS storage gear, as well as services for those who don't want to maintain their own archives.

[Figure: CAS Vs. Locked NAS]

As for a business driver, if you can empower your counsel to say, "This message was intercepted before the user had access to it by our e-mail archiving system, which saved it to a non-modifiable archive at 4:02:03 p.m. on 13 February," you're a rock star. "This e-mail sat for nine months in the user's inbox, where he could have changed it at any time," not so much.

We Like WORMs
Highly regulated industries like securities brokers and pharmaceutical firms have long maintained records in a non-rewriteable and non-erasable format, called write once, read many, or WORM. We believe most organizations are well advised to go this route for fixed-content archives. In fact, until EMC's 2002 release of its Centera system, magneto-optical WORM disks were the only reliable and non-modifiable storage medium. These disks should provide dependable data storage for 30 years or more. Except for occasional complaints about balky robotics on some jukeboxes, reports from real-world users indicate no trouble reading data written 10, even 15 years ago.

Besides WORM storage, you'll also need e-mail and file archiving apps to identify which data should be saved. Of course, that's easier said than done, especially in e-mail. Vendors like EMC, Symantec and Zantaz can help separate ham from spam, but expect to store some grocery lists. Other applications, like medical and check imaging, write data directly to the fixed content store.

Plasmon's ultra-density optical WORM disks, with a capacity of up to 60GB each, are state of the art for organizations seeking long media life. Like all WORM disks, they need WORM-aware archiving software to write to them. Plasmon's current archiving system, the Enterprise Active Archive, uses a server running Nexsan's Assureon NAS software as a front end. Data is typically written to a RAID array when initially stored, then migrated after 60 to 90 days as the rate of access falls off and long-term storage becomes more important than access time.

All of today's popular tape formats, from LTO in the midrange to Sun's T10000 at the high end, have firmware in the drive that identifies special WORM cartridges, and once data is written to them, prevents overwriting or erasure. With capacities of 800GB per cartridge, WORM tape, especially if used behind a RAID cache, is the lowest cost, and greenest, solution for very large archives where IT can deal with file access times measured in minutes. RAID or even MAID uses power when not being accessed. Optical disks take lots of floor space. High density and no need for power when not being accessed make tape the new green.

Contents Under Pressure
Rather than use a file's name and location in a hierarchy of directories as the primary identifier, as conventional file systems do, CAS systems generate a globally unique identifier, or GUID, for each file as it's saved using a hash function like MD5 or SHA-1. The file is stored based on that GUID. If the CAS device provides a CIFS or NFS interface -- and most do -- it does a database lookup to find the GUID for the full file path, then uses the GUID to retrieve the file.

One advantage here is that CAS systems automatically provide single-instance storage. When someone, or some process, saves a file with exactly the same contents as a file already in the system, the new file will generate the same hash value. Because the hash value GUID is the primary key for storage, the system won't save two files with the same GUID; rather, it notes that one file has been referenced in the system multiple times. For data stores that contain multiple copies of a document, single-instance storage can slash space requirements.

Just as with hash-based data de-duplication, some CIOs have expressed concerns about hash collisions resulting in two different files being sent to their CAS systems, but only one being saved. The odds against this are astronomical -- 1 in 10 to the 25th for even the most basic hash functions -- but steps vendors are taking to ease our minds range from using hash functions that are much more resistant to collisions, like SHA-512, to employing byte-by-byte comparisons of files that generate the same hash values before declaring them identical.
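To make the hashing scheme above concrete, here is a minimal in-memory sketch in Python. It is not any vendor's implementation: the `ContentStore` class, its method names and the sample paths are all hypothetical, and SHA-512 is used simply because it is one of the collision-resistant hashes mentioned. It shows the three ideas from the text: the hash as the primary key, a path-to-hash lookup table like the one a CIFS/NFS front end would need, and a byte-by-byte comparison before two files are declared identical.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store (illustrative only, everything held in memory)."""

    def __init__(self):
        self.objects = {}     # hash digest -> file contents (one copy per digest)
        self.ref_counts = {}  # hash digest -> how many times that content was saved
        self.paths = {}       # full file path -> hash digest (NAS-style lookup table)

    def save(self, path, data):
        digest = hashlib.sha512(data).hexdigest()
        existing = self.objects.get(digest)
        if existing is None:
            # First time this content is seen: store it under its digest.
            self.objects[digest] = data
            self.ref_counts[digest] = 0
        elif existing != data:
            # Same digest, different bytes: refuse rather than silently drop a file.
            raise ValueError("hash collision: same digest, different contents")
        # Duplicate content only adds a reference, not a second copy.
        self.ref_counts[digest] += 1
        self.paths[path] = digest
        return digest

    def read(self, path):
        # Look up the digest for the path, then fetch the object by digest.
        return self.objects[self.paths[path]]


store = ContentStore()
first = store.save("/legal/contract.doc", b"final contract text")
second = store.save("/sales/copy-of-contract.doc", b"final contract text")
```

After the two saves, both paths resolve to the same digest and `store.objects` holds a single copy of the contents, which is the single-instance behavior described above.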

Real-world CAS implementations add the ability to store user metadata along with each object and provide a mechanism for enforcing data retention, preventing anyone, including the system administrator, from deleting files until their retention periods expire. EMC's Centera was the first commercially available CAS system and remains the market share leader. The Centera RAIN (redundant array of independent nodes) architecture uses access nodes, through which applications store and retrieve files, and storage nodes that include disks and additional processing power. Centera protects data by either storing a copy of each object on two storage nodes or in an object-based parity scheme, rather than relying on conventional RAID controllers. Centera clusters can also replicate data over an IP network locally or remotely.
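The retention enforcement described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Centera's or any other vendor's API: `RetentionArchive`, `RetentionError` and the sample object ID are hypothetical names, the 30-year period is approximated as 30 x 365 days, and the clock is passed in explicitly so the behavior is easy to follow. The point is that a delete is refused for every caller, administrator included, until the stored expiry time has passed.

```python
from datetime import datetime, timedelta, timezone


class RetentionError(Exception):
    """Raised when a delete is attempted before the retention period ends."""


class RetentionArchive:
    """Toy archive that stores metadata with each object and enforces retention."""

    def __init__(self):
        self._objects = {}  # object id -> (data, metadata, expiry time)

    def store(self, object_id, data, metadata, retention, now):
        if object_id in self._objects:
            # Objects are write-once: an existing id is never overwritten.
            raise ValueError("object already stored")
        self._objects[object_id] = (data, dict(metadata), now + retention)

    def delete(self, object_id, now):
        _, _, expires_at = self._objects[object_id]
        if now < expires_at:
            # No override path exists, so an administrator is refused too.
            raise RetentionError(
                f"{object_id} is retained until {expires_at.isoformat()}"
            )
        del self._objects[object_id]

    def __contains__(self, object_id):
        return object_id in self._objects


archive = RetentionArchive()
stored_at = datetime(2008, 2, 25, tzinfo=timezone.utc)
archive.store(
    "exam-0001",
    b"scanned exam record",
    {"type": "physical exam"},
    timedelta(days=30 * 365),
    stored_at,
)
```

Calling `archive.delete("exam-0001", now)` with any `now` earlier than the expiry raises `RetentionError`; the same call after the period has elapsed removes the object.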

Until recently, applications needed to use EMC proprietary APIs to store and retrieve data from the Centera. That's not an issue with most archiving and document management applications, but creating custom applications is more difficult with Centera than with systems that use a NAS or other standard interface. In response to customer demand, EMC released two add-ons: a backup-and-recovery module that runs on Windows or Solaris servers and provides a standard interface to Centera for tape backup, and Centera Universal Access, which also runs on a gateway server and provides CIFS, NFS and HTTP access. However, Centera's encryption capabilities are not as robust as some competitors'.

Hitachi Data Systems' Content Archive Platform, a product of Hitachi's acquisition of Archivas last year, takes a different approach to CAS, using a file's location as the primary identifier and generating hash tokens as a background process after data is stored. CAP uses three or more diskless front-end nodes to store files on attached Fibre Channel arrays, which can also be used for other data.

Organizations may add additional back-end storage or front-end compute nodes to boost capacity and speed indexing and/or data ingestion. Rather than rely on custom APIs, data can be written to or retrieved from CAP using HTTP, NFS, CIFS and WebDav. Archive applications can specify retention times, the number of copies of data to store and other metadata by writing simple text and/or XML files for each folder. Because Hitachi runs single-instance storage, indexing and data integrity checking as background tasks, data ingestion rates aren't controlled by how fast the system can hash and index. Data is encrypted at rest on archive disks, in flight across the SAN and when being replicated to another CAP cluster at a remote site. CAP directly supports NDMP (Network Data Management Protocol) to back up archives to tape in addition to having multiple replicas.

Permabit's CAS system, built from a RAIN of 1U servers in access node and storage node configurations, adds data de-duplication, full text indexing from FAST Search & Transfer on a dedicated node, and a flexible NAS interface that can automatically retain and track multiple versions of files as they're saved. Problem is, with just 1 TB of usable storage on each node, a large archive could end up taking a lot of rack space and power. Microsoft's purchase of FAST shouldn't affect FAST's many OEM deals -- at least not right away.

Nexsan's Assureon line allows an organization to add RAID arrays for simple storage or nodes with compute and storage capabilities at any time. Assureon also includes data de-duplication and MAID technologies to reduce the amount of storage needed and power consumption. Assureon can act as a RAID cache in front of optical disk or WORM tape libraries and includes a Windows file system watcher that will automatically copy files from any Windows file store when they're closed or reach an age that implies they're complete; if the system guesses wrong, you'll end up archiving several drafts.

Finally, Caringo sells its CAStor as software distributed on a USB thumb drive that turns from two to //TK// standard Intel-based PC servers into a CAS cluster. Unlike EMC Centera, CAStor uses HTTP rather than a proprietary API as its primary interface, with CIFS/NFS access available as an add-on.

CAStor has the basic set of CAS features most organizations are looking for, including local and wide area replication, data retention, and replication depth definable at the object level. Still, while the idea of building a large CAS cluster from standard servers and disk has a certain appeal, we don't think most enterprises will be comfortable rolling their own CAS systems.

Keep It Simpler
For all its inherent sexiness, CAS is a complicated solution to the problem of preventing users and administrators from deleting or modifying files. Several vendors, including Network Appliance through its optional SnapLock for filers running OnTap and Sun's StorageTek division through its StorEdge Compliance Archiving software, have added software-managed WORM to their NAS appliances. Organizations can use the same NAS architectures, even the same appliances, as their primary file stores and still have a WORM archive. One system for backup, replication and management saves money and complexity.

Locked NAS is also easy on your application developers. Rather than having to integrate a new XML-based API to store and retrieve images or other files, they can simply write to the locked NAS via CIFS or NFS. Data retention periods can be defined on a folder-by-folder, or even a file-by-file, basis by setting the "file last accessed" time attribute to the end of the retention period and then flagging the file as read-only.

Now that Network Appliance has rolled out its proprietary A-SIS (Advanced Single Instance Storage) sub-file data de-duplication technology, a NetApp filer running SnapLock can one-up the CAS vendors' single-instance storage, eliminating not just duplicate files but also duplicate data within files, ensuring that those five corporate positioning slides that appear in almost every PowerPoint presentation will be stored only once.
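From the application side, the retention convention described above (last-accessed time set to the end of the retention period, then the file flagged read-only) needs nothing more than ordinary file calls. The Python sketch below assumes the archive volume is mounted as a normal path; the function names are hypothetical, and on an ordinary file system these calls only set attributes. It is the locked NAS appliance that interprets them as a WORM commit and enforces the lock.

```python
import os
import stat
from datetime import datetime, timezone


def commit_to_locked_nas(path, retain_until):
    """Set atime to the retention expiry, then clear the write bits.

    Enforcement is the appliance's job; this code only sets the attributes.
    """
    info = os.stat(path)
    expiry_ns = int(retain_until.timestamp()) * 10**9
    # Keep the modification time unchanged; only the access time carries the expiry.
    os.utime(path, ns=(expiry_ns, info.st_mtime_ns))
    writable = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, stat.S_IMODE(info.st_mode) & ~writable)


def retention_expiry(path):
    """Read the retention expiry back from the file's last-accessed time."""
    return datetime.fromtimestamp(os.stat(path).st_atime, tz=timezone.utc)
```

A call such as `commit_to_locked_nas(path, datetime(2030, 1, 1, tzinfo=timezone.utc))` would mark a file for retention until the start of 2030. Expiry dates beyond January 2038 may not be representable on systems that still use 32-bit timestamps, which matters for multi-decade retention periods.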

Compared with CAS systems, locked NAS does lack a mechanism for storing metadata about objects. How big a problem that is depends on how good your archiving software is. CAS systems provide an XML interface for storing file metadata, but organizations selecting locked NAS as their compliance stores will need to look to their archiving software or enterprise content management systems as a metadata store.

SIDEBAR: Send It Out: Storage As A Service
It used to be that IT groups resisted using external services for their compliance archives, concerned that putting valuable data in someone else's hands heightens the risk of exposure, loss and availability lapses. Online archive providers including big players IBM Global Services, EDS, EMC, Sun and Network Appliance have gone to great lengths to address these fears by hosting their archive infrastructures in state of the art data centers with redundant power and Internet providers. The investment has paid off: In 2006, the worldwide storage service market was worth about $25 billion, according to Gartner, which says that should grow to $33 billion by 2010. Our take: When you must retrieve data fast -- and the Federal Rules of Civil Procedure now require that parties to a lawsuit produce evidence much more quickly than in the past -- having access to a staff that lives and breathes archiving day in and day out is worth its weight in subpoenas. Applications that generate revenue, or helpdesk calls, get in-house admins' attention first.

To get started with storage as a service, consider e-mail. Because Exchange provides a journal interface for archiving, and e-mail has well-defined metadata in the header that can be used to set retention policies, a variety of providers have gotten into the act. Companies can implement online e-mail archiving with little to no capital expense in just a few days. In comparison, installing a fixed content storage system and integrating it with e-mail archiving software is a substantial project. This makes online archiving especially attractive to smaller organizations.

Zantaz's First Archive on Demand service leverages its EAS e-mail archiving system to provide an online service, while MessageOne and Mimecast take a slightly different approach, building integrated e-mail management services that provide continuity and message management for Exchange as well as an archive.

SIDEBAR: How Green Is My Data Center
For those concerned about data center power consumption -- and who isn't nowadays -- using green storage technology in the archive can reduce consumption compared with storing the same data on high-performance primary storage arrays.

Most archives will consume somewhat less power per terabyte just by using higher-capacity SATA drives in larger RAID sets, so they store the same data on half as many drives, or fewer, as a primary storage system would. The SATABEAST and AMS arrays that Nexsan and HDS use as back ends for their CAS systems can spin down disk drives that aren't in use, and spin them back up again when they're accessed. The current version of HDS' Content Archive Platform doesn't use this feature, though it may in the not too distant future. This MAID (massive array of idle disks) technology can save 50% or more of the power needed to run arrays. Because compliance archives might be idle 16 or more hours a day, spinning down the drives should cut power consumption significantly.
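A back-of-the-envelope calculation shows how the idle hours translate into savings. The Python sketch below is only an estimate under a loudly hypothetical assumption: that a spun-down array draws 25% of its active power (controllers and fans keep running). That figure is illustrative, not a vendor specification, and the function name is made up for this example.

```python
def maid_power_savings(idle_hours_per_day, idle_power_fraction):
    """Fraction of daily array energy saved by spinning drives down while idle.

    idle_power_fraction is the power drawn while spun down, as a share of
    active power (a hypothetical input, not a measured value).
    """
    if not 0 <= idle_hours_per_day <= 24:
        raise ValueError("idle_hours_per_day must be between 0 and 24")
    if not 0 <= idle_power_fraction <= 1:
        raise ValueError("idle_power_fraction must be between 0 and 1")
    # Each idle hour saves (1 - idle_power_fraction) of an active hour's energy.
    return idle_hours_per_day * (1 - idle_power_fraction) / 24


savings = maid_power_savings(idle_hours_per_day=16, idle_power_fraction=0.25)
```

With 16 idle hours a day and the assumed 25% idle draw, the estimate comes to about half of the array's energy, in line with the 50% figure quoted above. Longer idle periods or a lower idle draw push the savings higher.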

Because optical disk and tape libraries use minimal power when they're not being accessed, they're even greener than MAID. The price is slower access times: A MAID system can take as long as two minutes to spin up its drives, though once they're up, retrieving any given data item takes a second or less. Every retrieval from an optical or tape library will take 5 seconds to a minute or more.
