Here are data management best practices and tools that help clear the clutter.
I remember, a few years ago, combing through all the files on a NAS box. I was amazed at the number of duplicate files, but a bit more investigation revealed a mix of near-duplicates in with the genuine replicas. All had the same names, so it was hard to tell the valid files from the trash. When I asked around, the responses I got were mostly along the lines of, “Why are we keeping that? No one uses it!”
This raises the question: Do we ever throw data away anymore? Laws and regulations like the Sarbanes-Oxley Act (SOX) and HIPAA stipulate that certain data must be kept safe and encrypted. The result is that data subject to those laws tends to be kept carefully forever -- but then, so does most of the rest of our data.
Storing all this data isn’t cheap. Even on Google Nearline or Amazon Glacier, there is a cost associated with all of that data, its backups and its replicas. In-house, we go through the ritual of moving cold data off primary storage onto bulk disk drives, and then on into the cloud, almost mindlessly.
The excuses range from “Storage is really cheap in the cloud” and “It’s better to keep everything, just in case” to “Cleaning up data is expensive” or too complicated. Organizations often invoke big data as another reason for their data stockpiling, since there may be nuggets of gold in all that data sludge. The reality, though, is that most cold, old data is just that: old and essentially useless.
As I found with the NAS server, analyzing a big pile of old files is not easy. Data owners are often no longer with the company, and even if they were, remembering what an old file is about is often impossible. The relationships between versions of files are hard to recreate, especially for desktop data from departmental users. In fact, the exercise is mainly a glorious waste of time. Old data is just a safety blanket!
So how can companies go about reducing their data storage footprint? Continue on to learn about some data management best practices and tools that can help.
Decide how much data is really rubbish
Some data is clearly eternal: the company financials, for starters, but also personnel records, since pension administrators and HR staff performing reference checks need records that go way back. But the value of data degrades rapidly as we move toward someone’s desktop or mobile device. While regulated data should be retained and erased according to SOX, HIPAA and other guidelines, most other data should be erased after, at most, a few years.
The view that such data might have residual value is debatable. The cost of storing the data, applying big data analysis to it and interpreting the results will probably exceed that value. An example: Knowing that I looked at white shirts in a store three years ago is not going to give much of a hint as to what I’m buying this year, so why store all my eye movements for years?
Data lifecycle management
Clearly, keeping most old data is neither hygienic nor cost-effective. Be honest! Most of it has no value and will never be accessed again. Admins should put a process in place to prevent the clutter in the first place.
Data lifecycle management needs to begin when data is created. Almost all data should be gone after some predetermined lifespan. Anything really important should be flagged for either earlier or later demise, or for disposition to a specific archive on a specific date. Fortunately, there are products and tools to help with this.
Object storage software has extensible metadata. This allows an admin to flag service actions including end-of-life deletion as part of the travelling history of any object, inheritable by any derivatives. Now this doesn’t actually do anything unless the object store code is primed to act. That means having a tool that detects a change of disposition and acts on it. This approach needs a policy management tool to work. Such a tool allows an admin to set policies for whole groups of objects, by type, user, and other attributes.
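To make the idea concrete, here is a minimal sketch of a policy engine acting on extensible object metadata. The in-memory store, the object names and the `expires_on` metadata key are all hypothetical; a real object store would expose the same pattern through its own metadata and deletion APIs.

```python
from datetime import date

# Hypothetical in-memory object store: each object carries extensible
# metadata alongside its payload, as object stores allow.
store = {
    "q3-report.pdf": {
        "data": b"...",
        "metadata": {"owner": "finance", "expires_on": "2030-01-01"},
    },
    "old-draft.docx": {
        "data": b"...",
        "metadata": {"owner": "marketing", "expires_on": "2020-06-30"},
    },
}

def apply_retention_policy(store, today=None):
    """Delete any object whose 'expires_on' metadata date has passed."""
    today = today or date.today()
    expired = [
        name
        for name, obj in store.items()
        if date.fromisoformat(obj["metadata"]["expires_on"]) < today
    ]
    for name in expired:
        del store[name]  # in a real system, this would be a delete API call
    return expired

removed = apply_retention_policy(store, today=date(2024, 1, 1))
```

A production policy tool would run a sweep like this on a schedule, with rules keyed to object type, owner and other attributes rather than a single date field.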
Products like Red Hat’s Ceph Storage, Caringo Swarm and Scality Ring have methods to add user-defined metadata attributes in the form of key/value pairs. As we’ll see later, tools to generate attributes by policy, and then to act on them, are entering the market.
Deduplication is another answer. It’s available in all-flash arrays and some other appliances, either inline or as a post-process, usually as data moves to secondary storage. By hashing objects as they are stored, a system can detect content that already exists and simply create a pointer rather than yet another copy.
Because the hash key is computed from an object’s content rather than its name, this approach also distinguishes genuine duplicates from same-named variants -- a very common trap. Deduplication is also amenable to policy management. Bear in mind that removing a deduped object can be easy -- one object and some pointers -- but also difficult when, for example, different owners want their copies to expire at different times.
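The hash-and-pointer mechanism can be sketched in a few lines. This toy content-addressed store (the class name and layout are illustrative, not any vendor’s implementation) shows how identical content is stored once while same-named variants with different content stay separate:

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: identical content is stored once,
    with per-name pointers to the shared blob."""

    def __init__(self):
        self.blobs = {}     # content hash -> bytes (unique copies)
        self.pointers = {}  # name -> content hash

    def put(self, name, data):
        # Hash the content, not the name, so same-named variants
        # with different content are kept apart.
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:
            self.blobs[key] = data  # first copy: store the blob
        self.pointers[name] = key   # duplicates just get a pointer

    def get(self, name):
        return self.blobs[self.pointers[name]]

store = DedupStore()
store.put("report-v1.doc", b"annual report draft")
store.put("backup/report-v1.doc", b"annual report draft")  # duplicate content
store.put("report-v1.doc (copy)", b"a different draft")    # new content

# Three names, but only two unique blobs are actually stored.
```

The expiration difficulty mentioned above is visible here too: deleting a name only removes a pointer, and a blob can be reclaimed only once no pointers reference it.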
Data lifecycle management tools
There are already data management tools aiming to handle the data lifecycle in the context not only of object storage, but also of block I/O and filer data. Actifio appears to be well attuned to copy management issues, while DataGravity is tied somewhat to dual-controller hardware platforms. Qumulo offers a “data-aware” NAS platform that uses metadata to help analyze and optimize data.
All of these companies are startups, and there’s clearly room for more vendors in this market. There are also tools from large vendors, such as IBM and SAP, that manage enterprise data assets using a database approach for metadata, but their approach is aimed at finding and managing content much more than files.
Policy-based management and SDS
Any of the data management products described above are amenable to virtualization, which brings us to the next wave of technology: software-defined storage. By breaking free of proprietary, inflexible storage software and fixed metadata, SDS can run copies of services as needed, letting performance self-regulate, while metadata extension provides a truly data-driven storage system and opens up a much richer service ecosystem.
Primary Data addresses this with metadata director software that uses extended metadata to move and process objects. Coho Data has a somewhat simpler product that's sold with a low-cost appliance.
We can expect much of the new effort in storage to home in on the extended metadata option. Policy management systems for hybrid clouds will also extend to support the approach this year.