Big data presents challenges for enterprise storage. Here are the top things you should consider.
Defining big data is actually more of a challenge than you might think. The glib definition talks of masses of unstructured data, but the reality is that it’s a merging of many data sources, both structured and unstructured, to create a pool of stored data that can be analyzed for useful information.
We might ask, “How big is big data?” The answer from storage marketers is usually “Big, really big!” or “Petabytes!”, but again, there are many dimensions to sizing what will be stored. Much big data becomes junk within minutes of being analyzed, while some needs to stay around; this makes data lifecycle management crucial. Add to that globalization, which brings foreign customers to even small US retailers. The requirements for personal data lifecycle management under the European Union General Data Protection Regulation go into effect in May 2018, and penalties for non-compliance are draconian, even for foreign companies, at up to 4% of global annual revenue.
For an IT industry just getting used to the term terabyte, storing petabytes of new data seems expensive and daunting. This would most definitely be the case with RAID storage arrays; in the past, an EMC salesman could retire on the commissions from selling the first petabyte of storage. But today’s drives and storage appliances have changed all the rules about the cost of capacity, especially where open source software can be brought into play.
In fact, there was quite a bit of buzz at the Flash Memory Summit in August about appliances holding one petabyte in a single 1U rack unit. With 3D NAND and new form factors like Intel’s "Ruler" drives, we’ll reach the 1 PB goal within a few months. It’s a space, power, and cost game changer for big data storage capacity.
Concentrated capacity requires concentrated networking bandwidth. The first step is to connect those petabyte boxes with NVMe over Ethernet, running today at 100 Gbps; vendors are already in the early stages of 200 Gbps deployment. This is a major leap forward in network capability, but even that isn’t enough to keep up with drives designed with massive internal parallelism.
Compression of data helps in many big data storage use cases, from removing repetitive images of the same lobby to repeated chunks of Word files. New methods of compression using GPUs can handle tremendous data rates, giving those petabyte 1U boxes a way of quickly talking to the world.
The exciting part of big data storage is really a software story. Unstructured data is usually stored in a key/data format, on top of traditional block IO, which is an inefficient method that tries to mask several mismatches. Newer designs range from extended metadata tagging of objects to storing data in an open-ended key/data format on a drive or storage appliance. These are embryonic approaches, but the value proposition seems clear.
Finally, the public cloud offers a home for big data that is elastic and scalable to huge sizes. This has the obvious value of being always right-sized to enterprise needs, and AWS, Azure, and Google have all added a strong list of big data services to match. With huge instances and GPU support, cloud virtual machines can emulate an in-house server farm effectively, and make a compelling case for a hybrid or public cloud-based solution.
Suffice to say, enterprises have a lot to consider when they map out a plan for big data storage. Let's look at some of these factors in more detail.
Sizing up big data storage demand
Once you’ve created your quarterly requirement for big data storage, look at ways to reduce it. Much of the data is junk after a day or two, so count on aggressive end-of-life policies. Some is sacred, so it should be stored and encrypted, with a backup and archive.
Look at the spikiness of demand. The public cloud is ideal for storing short-life data, especially if it is bursty. Storage buckets can be created and deleted cheaply, and scale definitely isn’t an issue.
Finally, big data sometimes isn’t that big! I’ve worked with 100-petabyte farms. Yep, that’s big! For someone using 10 TB of structured data, 100 TB seems large, but it will fit easily in a minimum Ceph cluster. Don’t overstate your problem: today, solutions for 100 TB are straightforward.
The role of object storage
Big data is often conflated with object storage because object storage can handle odd object sizes easily, and provides metadata structures that allow tremendous control of data. This is all true. Moreover, object storage is much cheaper than traditional RAID arrays. In fact, the most common object storage uses open source software and COTS hardware. Unbundled licensed software is also available economically.
Object storage appliances come with six to 12 drives, a server board, and fast networks, and increasingly, the networking will be RDMA-based 100 GbE or 200 GbE. Even so, drives are getting so fast that these network rates may still struggle to keep up. We are on the edge of NVMe over Ethernet connectivity for object storage, which will bring a leap forward in latency and throughput.
There also are open source global file systems that have been used in financial systems and high-performance computing for years. These handle the scale needed, but don't have extended metadata and other flexible extensions.
Getting data in and out of your big data storage pool is a much bigger challenge than setting up the pool itself. Building end-of-life tagging into your storage software is one way to manage it: A policy sets the destruct tag value at data object creation time. Figuring out the policy takes time, though, and it gets more complex when disposition options are increased to include moving the data to very cheap archiving tiers in the cloud.
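The tagging approach described above can be sketched in a few lines. This is a minimal illustration, not any particular product's API; the data classes, retention periods, and function names are all hypothetical.

```python
import time

# Hypothetical retention policy, in seconds, keyed by data class.
# Class names and periods are illustrative only.
RETENTION = {
    "sensor_raw": 2 * 24 * 3600,          # junk after a couple of days
    "transactions": 7 * 365 * 24 * 3600,  # long-lived, archive-bound
}

def tag_object(data_class, created=None):
    """Attach a destruct timestamp to an object at creation time."""
    created = time.time() if created is None else created
    return {
        "class": data_class,
        "created": created,
        "destroy_after": created + RETENTION[data_class],
    }

def sweep(objects, now=None):
    """Split tagged objects into (keep, expired) lists.

    In practice the expired list would be deleted, or moved to a
    cheap cloud archive tier instead, per policy."""
    now = time.time() if now is None else now
    keep = [o for o in objects if o["destroy_after"] > now]
    expired = [o for o in objects if o["destroy_after"] <= now]
    return keep, expired
```

The hard part, as noted, is not the mechanism but deciding the policy table itself, and extending `sweep` with more dispositions than simple deletion.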
The data flow model for big data, especially IoT-generated big data, is often portrayed in storage marketing infographics as "a great river with many tributaries coming together.” From the storage farm perspective, however, all that joining together doesn’t really happen. Data has to be broken down into usable chunks and stored appropriately. Sensor data, the typical content generated by IoT, might be broken into timestamped chunks, making later disposal easy, while structured database entries may be stored directly into the master database, which has its own tools for tiering cold data.
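The timestamped-chunk idea for sensor data can be sketched as follows. This is an illustrative fragment, not a real ingest pipeline; the chunk-naming scheme is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timezone

def chunk_key(ts, granularity="day"):
    """Map a Unix timestamp to a chunk name such as 'sensor/2017-11-02'."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    if granularity == "hour":
        return dt.strftime("sensor/%Y-%m-%d-%H")
    return dt.strftime("sensor/%Y-%m-%d")

def chunk_readings(readings):
    """Group (timestamp, value) sensor readings into per-day chunks.

    Disposing of expired data later means deleting whole chunks by
    name, not scanning individual records."""
    chunks = defaultdict(list)
    for ts, value in readings:
        chunks[chunk_key(ts)].append((ts, value))
    return dict(chunks)
```

Structured entries would bypass this path and go straight to the master database, as described above.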
To complicate this, we know that some big data is much more active than the rest. This active data probably needs to be guided, by policy, to faster storage such as NVMe SSDs.
Data privacy laws
GDPR is close upon us. You might be forgiven for thinking this has nothing to do with the US or Asia, but the rules for handling EU personal data carry draconian penalties of up to 4% of global revenue, and apply worldwide. So if I sell a bottle of Napa wine in the US to someone in France, then let his or her personal data leak, I’d be in real trouble!
GDPR is, in the end, common sense for handling critical and personal data. Everyone should be encrypting data at rest properly and so on. The rules cover governance, lifecycle management, access and use as well as encryption.
You might heave a sigh of relief on learning your storage vendor is GDPR compliant, but the rules involve a major paradigm shift for the data owner (you!) as well as any data storers. If you haven’t gone through a realignment process, you’re not compliant!
A common misconception is that vendor-provided encryption solves your compliance requirements. Drive-based encryption, whether provided by a storage vendor or a cloud service provider, is not adequate for any of the data standards such as HIPAA, SOX, or GDPR. You as data owner must own the keys. Fortunately, there is encryption support in the cloud, but a better alternative altogether is to build it into workflows back in your servers or virtual machines.
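The owner-held-key pattern is simple to sketch. The fragment below, which assumes the third-party `cryptography` package is installed, shows the essential point: the key is generated and held by the data owner, and only ciphertext ever leaves your server. The record contents are, of course, invented for illustration.

```python
from cryptography.fernet import Fernet  # third-party 'cryptography' package

# The data owner generates and keeps the key. It is never handed to
# the storage vendor or the cloud provider.
key = Fernet.generate_key()
cipher = Fernet(key)

record = b"customer: Jean Dupont, order: 1 bottle Napa cabernet"

# Encrypt in the workflow, before the data leaves the server.
ciphertext = cipher.encrypt(record)   # this is what gets stored

# Only the key holder can rehydrate the record.
assert cipher.decrypt(ciphertext) == record
```

A production design would add key rotation and a key management service, but the ownership boundary stays the same: drive-level or provider-level encryption never puts the key in your hands.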
SSDs are changing all the rules in storage systems. From acting as caches between DRAM and persistent storage to bulk storage devices, SSDs improve storage performance by factors of around 1000x in random IO and 10x to 100x in bandwidth. This is essential with huge volumes of data, especially when using parallel processing such as Hadoop, or GPU acceleration.
With 100 TB SSDs just over the horizon, and all this performance, a few small storage appliances can work wonders. The minimum Ceph object store is four nodes, and even using a standard 1U server format, it could hold 1.2 PB of raw SSD capacity today. It would not be cheap, but it would be economical when performance is calculated in. Vendors have already announced plans for 1U petabyte appliances, including one from Intel using 32 Ruler drives -- long, narrow SSDs.
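The 1.2 PB figure is easy to reproduce with back-of-envelope arithmetic. The drive count and per-drive capacity below are illustrative assumptions, not a vendor configuration, and the usable figure assumes Ceph's default 3x replication.

```python
# Back-of-envelope sizing for a minimum four-node Ceph cluster.
# Drives per 1U node and drive capacity are assumed values.
nodes = 4
drives_per_node = 10
drive_capacity_tb = 30          # large SSDs, near-future pricing

raw_tb = nodes * drives_per_node * drive_capacity_tb
usable_tb = raw_tb / 3          # Ceph default 3x replication

print(raw_tb)                   # 1200 TB raw, i.e. 1.2 PB
print(usable_tb)                # 400 TB usable
```

Note how much the replication factor matters: erasure coding instead of 3x replication would roughly double the usable share of that raw capacity.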
The rapid development in this space is why you shouldn't invest too heavily in the short term. Price points and all other metrics will change over the next two years. Ensure that any future buys of appliances and drives will fit the cluster, so that otherwise useful gear isn’t scrapped.
Suppose I offered to turn that 1 PB appliance into 5 PB. That’s the average benefit you'll get from using compression software. SSDs have so much bandwidth that using some of it for compression of data written to an appliance in the background makes sense. Still, I’m strongly in favor of compression at data creation. This reduces network traffic throughout the data flow, saves on storage space, and reduces time-to-transmit by, you guessed it, 5X too! Source compression needs hardware support and that’s just beginning to appear in the market.
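The compress-at-creation argument can be demonstrated with the standard library. This is a sketch, not a recommendation of zlib specifically -- production systems would use hardware-assisted or GPU compression as described above -- and the actual ratio depends entirely on how repetitive the data is.

```python
import zlib

# Repetitive data (log lines, sensor readings) compresses well at the source.
payload = b"temp=21.5;unit=C;status=OK\n" * 10_000

# Compress once at creation: fewer bytes on the network, in the storage
# pool, and in every transfer downstream.
compressed = zlib.compress(payload, level=6)
ratio = len(payload) / len(compressed)

# "Rehydration" (decompression) is cheap relative to compression.
restored = zlib.decompress(compressed)
assert restored == payload
```

Running this on genuinely repetitive input yields a large ratio; on already-compressed media such as JPEG images, the same call would gain almost nothing, which is why policy should decide what gets compressed.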
"Rehydrating" data is a trivial process that uses few resources, so increasing storage capacity with compression quickly translates to savings. All-flash arrays usually include compression; the technology also is offered as software for appliances.
The cloud alternative
After all this talk about hardware, letting cloud providers do all the work might be an attractive option. In fact, the big three cloud service providers -- Amazon, Google, and Microsoft -- all lead when it comes to implementing new architectures and software orchestration. The cloud is economical and geared to paying for just the level of scale you need at any point. Cloud services can handle storage load spikes, which are common in some data classes such as retail sensor data. This reduces, or at least delays, in-house purchases of storage gear.
Getting performance levels comparable to in-house operations, though, is a challenge. Not all instances with the same CPU and memory combinations are equal. A highly tuned in-house cluster might even do much better.
Today, storage doesn’t stop with actually writing data to a drive. We are seeing value-added data storage services such as encryption, compression, indexing, tag servicing, and other features. The giant cloud providers, especially AWS, are even building data structures such as the Hadoop file system into the toolkit. This allows them to “invisibly” deploy gear such as key/data storage drives, similar to new Seagate and Huawei units, to accelerate specific data structures.