Network Computing is part of the Informa Tech Division of Informa PLC


Active Data Demands Storage Rethink


Active data used to mean the data you were working on at that moment. Data in this category was typically active for only a short period.

Active data was created, collaborated on and distributed, but within a few days or weeks it became permanently dormant. The capacity requirement of this working set was relatively static. But active data is changing, and we need to rethink how that data is stored.

Active data is changing in two important ways. The first is probably no surprise to any IT professional: There is more of it than ever. The working set of data is now substantially larger than it was in the past. Part of this change is that users simply create more data per device, and everything is created on a device. They also have more devices on which to create that data. Simple note taking is a great example: fewer and fewer people take notes on paper anymore. Notes start out digital and remain digital.


The second area where active data is changing -- and the important one for our discussion -- is recall. The old profile of data going dormant and staying dormant no longer holds. More data than ever needs to be accessible, and often that access needs to be instant. This means the data can't be offline or on a tape drive or, in the case of real-time analytics, even on a hard disk. It is also difficult to predict just what data will be needed at a given moment.

Many vendors suggest using a solid-state drive (SSD) tier for this working set. But as I discussed in my recent article "Is SSD Enough To Stop Active Data Onslaught?", the working set is unpredictable, so it might be too large for a typical SSD cache. Many organizations are finding that much of their working set needs to be on a dedicated flash array to respond to real-time data requests.

Real-time environments need to assemble an answer, often from disparate data sources, at a moment's notice, even if the data being assembled or analyzed is relatively old. If that older set of data is not in the SSD cache, the real-time part of the user experience is lost.
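The cache-sizing concern can be illustrated with a rough simulation. The sketch below is not drawn from any vendor's product; it assumes a simple least-recently-used (LRU) cache and uniformly random requests, which is a deliberately pessimistic stand-in for "unpredictable" access. The function name and parameters are hypothetical.

```python
import random
from collections import OrderedDict


def lru_hit_rate(working_set_size, cache_size, requests=20_000, seed=0):
    """Return the fraction of requests served from an LRU cache.

    Assumption: every item in the working set is equally likely to be
    requested, modeling access that cannot be predicted in advance.
    """
    rng = random.Random(seed)
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for _ in range(requests):
        key = rng.randrange(working_set_size)
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / requests


# A cache holding 5% of the working set versus one holding all of it.
small_cache = lru_hit_rate(working_set_size=10_000, cache_size=500)
full_tier = lru_hit_rate(working_set_size=10_000, cache_size=10_000)
print(f"small cache hit rate: {small_cache:.2f}")
print(f"full flash tier hit rate: {full_tier:.2f}")
```

Under these assumptions the hit rate tracks roughly the ratio of cache size to working-set size, so a small cache misses most requests. Real workloads usually have some locality, so actual results would differ.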

For example, I saw an application recently where this need for real-time data is obvious. You hold your smartphone or tablet's camera so it is aimed at a restaurant you're passing by. Without taking a picture, it provides real-time information about that restaurant -- phone number, menus, ratings from various sites, health department reports, available slots in their reservation system, and even entertainment such as movies or sporting events happening nearby -- all simply by being pointed at the restaurant.

All of these pieces of information had to be accessed and processed instantly, yet each came from a different repository. This is an excellent example of how a large flash system can create a customized experience in real time for a user.

There Is A Line

The need for active data and real-time processing is increasing but there is a line. Not all businesses need that level of real-time processing. Even if you do need that type of active working set, there is a line where moving data to another form of storage is appropriate.

The key going forward is not to ask ourselves what data might become active but to assume that all data eventually will become active again. Then we need to ask which data, when it becomes active again, will be needed randomly and which will be needed sequentially. Randomly needed data should be on an SSD; it might make more sense to keep sequential data on tape.
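That placement rule can be written down as a small sketch. The names below (Access, choose_tier) are hypothetical and exist only for illustration; a real policy would also weigh cost, capacity and latency targets.

```python
from enum import Enum


class Access(Enum):
    """How data is expected to be read when it becomes active again."""
    RANDOM = "random"
    SEQUENTIAL = "sequential"


def choose_tier(expected_access: Access) -> str:
    """Pick a storage tier from the expected reactivation pattern.

    Assumption: only the two tiers named in the rule above are
    considered, so hard disk does not appear as an option here.
    """
    if expected_access is Access.RANDOM:
        return "ssd"
    return "tape"


# Hypothetical data sets, labeled by how they would be read on recall.
datasets = {
    "restaurant_ratings": Access.RANDOM,
    "archived_video": Access.SEQUENTIAL,
}
for name, pattern in datasets.items():
    print(name, "->", choose_tier(pattern))
```

The point of the sketch is that the question asked is about access pattern on recall, not about whether the data will ever be recalled.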

Hard disks might be the odd man out here. Data that has even the slightest potential to become randomly active because of an analytics need will have to go on disk. The remaining data set, and still the largest set, can be well served sequentially by tape. In my next column I will discuss the surprising new importance of tape in an increasingly active data world.