All Archive Data is Not Alike

Just because we call all the data users aren't changing on a daily basis archival, that doesn't mean it's all the same. The 31-day-old email that's been replaced with a stub in the user's mailbox and the 15-year-old X-ray HIPAA requires you to keep don't need the same SLA and may be better served by different storage systems.

Howard Marks

May 30, 2009


Just because we call all the data users aren't changing on a daily basis archival, that doesn't mean it's all the same. Depending on its source and usage patterns, different types of archival data can be best served by different storage techniques and media.

The most common type of archival data is made up of email messages and files that have been put into the archive and may need to be accessed by users transparently. Realizing that object access rates fall off dramatically as objects age, system administrators set up policies to migrate objects over 30 or 60 days old to an archival storage system and replace them with some sort of pointer or stub so the user can still access the data from its current location in their mailbox or home directory. Organizations subject to strict regulatory requirements like SEC 17a-4, which require that some data be retained, will archive objects as they're created and/or published/finalized, then replace the original files with stubs once they reach that 30-to-60-day age.
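To make that kind of policy concrete, here's a minimal sketch in Python of an age-based selection rule. The object model, field names and 30-day threshold are illustrative assumptions, not any particular archiving product's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

STUB_AGE = timedelta(days=30)  # illustrative threshold; 30- or 60-day policies are common

@dataclass
class StoredObject:
    name: str
    last_modified: datetime
    stubbed: bool = False
    archive_location: Optional[str] = None

def select_for_archiving(objects: List[StoredObject],
                         now: Optional[datetime] = None) -> List[StoredObject]:
    """Return the objects old enough to migrate to the archive tier.

    A real archiver would then copy each selected object to archive storage,
    record its new location, and replace the original with a stub or pointer
    so the user can still open it from the mailbox or home directory.
    """
    now = now or datetime.utcnow()
    return [o for o in objects if not o.stubbed and now - o.last_modified > STUB_AGE]
```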

Since users are accessing this data transparently, the archive system has to be fast enough to retrieve data without the user noticing a substantial delay. The archive doesn't need to be as fast as the primary storage, since there's no random I/O to files on the archive, but it does have to be able to retrieve this kind of data with response times in the one-second range.

After some additional time -- SEC 17a-4, for example, says data must be easily retrievable for 2 years but retained for 7 -- transparent access becomes less important and archive solutions may delete the stubs, leaving the archive's UI and indexing system as the primary access method. Since we've moved from real-time access to a query/response model, the archive solution can take some tens of seconds to minutes to return the objects requested without serious impact on user productivity. This second tier of archive storage also fits the needs of data retained for potential e-discovery.
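As a rough illustration of how those windows translate into access models, here's a small sketch. The tier names and the simple 365-day years are shorthand for illustration, not anything defined by SEC 17a-4 or a specific archive product.

```python
from datetime import timedelta

# SEC 17a-4 windows as described above: easily retrievable for 2 years, retained for 7.
EASY_RETRIEVAL_WINDOW = timedelta(days=2 * 365)
RETENTION_WINDOW = timedelta(days=7 * 365)

def access_model_for(age: timedelta) -> str:
    """Map an object's age to the access model it is likely to need."""
    if age <= EASY_RETRIEVAL_WINDOW:
        return "transparent"     # stub still in place, retrieval in about a second
    if age <= RETENTION_WINDOW:
        return "query/response"  # stub deleted; search via the archive's UI and index
    return "past retention"      # eligible for disposition or a deeper archive
```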

Then there's the deep archive that's kept not 7 years but 70 or 700: all the footage shot for a TV show or movie, the as-built blueprints for each plane Boeing builds, the designs an architectural firm or a buildings department keeps on file, digital X-rays for patients that have been discharged from the hospital and all the other data that's traditionally stored in warehouses. Add in scanned images of historical documents and such that fill miles of shelves in real, as opposed to business record, archives. This data can be stored on a system that literally takes minutes (for those digital X-rays) to hours to retrieve.

Note this deep archive data is also typically made up of large objects, and these objects are frequently retrieved together. Boeing may need to call up all the tail sections of 737s built over a 3-year period to see which have the old-style parts the FAA says need to be replaced, or Lucasfilm may want to call up all the Star Wars Episode XXXIV footage for a new Blu-ray special edition.
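One way to picture that kind of grouped retrieval is a metadata catalog sitting in front of the deep archive. The catalog fields and tape locations below are hypothetical, just to show the access pattern.

```python
from datetime import date

# Hypothetical catalog entries; a real deep archive would keep this metadata
# in a database or asset-management system alongside the media locations.
CATALOG = [
    {"asset": "tail-section-drawings", "model": "737", "built": date(2006, 4, 2),  "location": "tape-0113"},
    {"asset": "tail-section-drawings", "model": "737", "built": date(2008, 9, 17), "location": "tape-0542"},
    {"asset": "wing-drawings",         "model": "747", "built": date(2007, 1, 5),  "location": "tape-0230"},
]

def bulk_retrieve(catalog, asset, model, start, end):
    """Return every catalog entry in the group a single request is likely to need."""
    return [e for e in catalog
            if e["asset"] == asset and e["model"] == model and start <= e["built"] <= end]

# For example, every 737 tail-section drawing from a three-year build window:
hits = bulk_retrieve(CATALOG, "tail-section-drawings", "737",
                     date(2006, 1, 1), date(2008, 12, 31))
```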
While all this data could be stored on magnetic disk systems, with or without MAID, the power, space and periodic migration required could make another medium more attractive. More on this subject next week.

About the Author

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at DeepStorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M. and concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
