Remember the days of 9 GB hard disk drives, when storage farms were measured in acres and not rack units? It hasn’t been that long since we began the capacity expansion that we see today, perhaps just since 2001. We appear to be exceeding Moore’s Law in the rate of growth.
Today, the state-of-the-art hard disk drive holds 10 terabytes and, while hard drive capacities are somewhat stagnant due to the technical wall that must be climbed for the next generation (HAMR drives), Samsung unveiled a 16 TB solid-state drive last summer. I have every expectation that predictions of 30 TB SSDs in 2020 will be met.
What does this mean for the data center? It is already beset by the shrinking server farm, where containers and larger memories mean fewer servers are needed for a given workload, and big storage drives are currently more than capable of absorbing the growth of big data. This is a result of storage running a bit ahead of expectations and big data not being quite as big as predicted.
In and of itself, having a bit more expansion space in the storage drives is not going to change the world of the data center that much, but what these jumbo drives attach to will. We are seeing a rapid migration away from RAID arrays toward appliances that can handle the scale-out of large data pools. These storage units are beginning to add features such as compression and deduplication to the equation.
Compression typically is a natural consequence of all-flash arrays or solid-state drives. These have such high performance that it’s possible to use them as journal files for data that is compressed in the background without impacting server-side performance.
The consequence of compressing data between the all-flash array or SSD and secondary storage is that the raw capacity needed in the bulk tier drops dramatically for most use cases. We typically see 5x capacity shrinkage for many workloads and as much as 100x for virtual desktops. Scientific data is one area that doesn’t compress much, and audio and video, which make up the bulk of storage at companies like YouTube and Facebook, are already compressed.
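The gap between compressible and already-compressed data is easy to demonstrate. Here is a small sketch using Python's zlib; the function name and sample data are illustrative, with random bytes standing in for media that has already been compressed:

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data))

# Repetitive data, standing in for logs or duplicated OS images.
repetitive = b"GET /index.html HTTP/1.1 200 OK\n" * 10_000

# Random bytes, standing in for audio/video that is already compressed;
# zlib cannot shrink it and even adds a little overhead.
already_compressed = os.urandom(len(repetitive))

print(f"repetitive data:    {compression_ratio(repetitive):6.1f}x")
print(f"already compressed: {compression_ratio(already_compressed):6.2f}x")
```

Real arrays use faster, often hardware-assisted algorithms, but the asymmetry is the same: the ratio depends entirely on the workload.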
Deduplication, where identical files are aliased to a single master copy, complements compression. The two technologies do different things, though there can be some overlap. Depending on the environment, deduplication can save huge amounts of space. Just think of a web-server farm where every server uses an identical image of Apache, and the opportunity is easy to see.
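The mechanics can be sketched in a few lines: content is hashed, and identical blocks are stored once and referenced by hash. The class and names below are illustrative, not any particular product's design:

```python
import hashlib
import os

class DedupStore:
    """A toy content-addressed store: each unique block is kept once."""

    def __init__(self):
        self.blocks = {}   # hash -> data (single master copy)
        self.files = {}    # filename -> list of block hashes

    def write(self, name: str, data: bytes, block_size: int = 4096):
        hashes = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if new
            hashes.append(digest)
        self.files[name] = hashes

    def read(self, name: str) -> bytes:
        return b"".join(self.blocks[h] for h in self.files[name])

    def raw_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())

# Ten web servers with identical OS images: logical size is 10x the
# image, but raw storage holds just one copy.
store = DedupStore()
image = os.urandom(256 * 1024)        # a 256 KB stand-in for an image
for i in range(10):
    store.write(f"server-{i}", image)

logical = 10 * len(image)
print(f"logical: {logical} bytes, raw: {store.raw_bytes()} bytes")
```

Production systems add reference counting, collision handling, and persistence, but the space saving comes from exactly this aliasing.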
By putting the two approaches of compression and deduplication together, and using solid-state storage to make operations efficient, we are looking at a huge reduction in raw drive capacity needed. Even a 5x or 10x reduction reduces the space for drives spectacularly.
Let’s look at a hypothetical project. I have 2,000 500 GB 10K RPM drives in my imaginary data center. I can replace them with 100 1 TB SSDs, each costing less than one of those 500 GB hard disk drives. That gives me far more IOPS (10 million, versus roughly 400K for the HDD farm) and enough storage for my active data.
I’ll also need to make up the rest of the storage with some bulk 10 TB drives. A little testing shows that I can get 10x from compression and deduplication, so I need just 100 TB of raw bulk capacity to hold the data. That’s just ten 10 TB drives!
My 2,000-drive array farm, which needed 34 4U boxes plus SAN switches and other equipment, is supplanted by 100 2.5-inch SSDs and a box with 10 bulk drives, which together will fit into 10U of rack space. That’s right: 25 times faster in a 16th of the space!
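The arithmetic above checks out with a few lines of Python. The per-drive IOPS figures are the rough assumptions implied by the article's totals (about 200 IOPS per 10K RPM HDD, 100K per SSD), and the bulk tier is assumed to hold a 10x-reduced copy of the full 1 PB data set:

```python
import math

# Figures from the hypothetical project above.
HDD_COUNT, HDD_CAP_TB, HDD_IOPS = 2000, 0.5, 200
SSD_COUNT, SSD_CAP_TB, SSD_IOPS = 100, 1.0, 100_000
REDUCTION = 10                 # combined compression + deduplication
BULK_DRIVE_TB = 10

old_capacity = HDD_COUNT * HDD_CAP_TB      # 1,000 TB raw
old_iops = HDD_COUNT * HDD_IOPS            # ~400K IOPS
new_iops = SSD_COUNT * SSD_IOPS            # 10M IOPS

# Bulk tier: the full data set after 10x reduction.
bulk_raw_tb = old_capacity / REDUCTION
bulk_drives = math.ceil(bulk_raw_tb / BULK_DRIVE_TB)

print(f"IOPS: {old_iops:,} -> {new_iops:,} ({new_iops // old_iops}x)")
print(f"bulk raw needed: {bulk_raw_tb:.0f} TB -> {bulk_drives} drives")
```

Change the assumed per-drive IOPS or reduction ratio and the totals move, but the shape of the result does not.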
This is the future of storage. We’ll be well into this at the end of 2016. What’s more, we’ll be using much cheaper SSDs, and those of us who understand how the big cloud service providers operate will be buying our storage as white boxes that allow us to use generic drives from distribution rather than very expensive “approved” drives from the box vendor.
The result is that future storage farms will be much smaller physically and much cheaper. They’ll use less power -- a SATA SSD typically draws two-tenths of a watt, compared with an HDD’s 12 W -- and be more reliable. Access latency will improve, while compression and deduplication reduce the load on the LAN and, in the case of compression, also shorten the time to read a file.
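The power savings for the hypothetical farm above follow directly from those figures. This is drive-only arithmetic, ignoring controllers, switches, and cooling:

```python
# Power draw using the figures above: 0.2 W per SATA SSD, 12 W per HDD.
OLD_W = 2000 * 12              # 2,000 HDDs in the old farm
NEW_W = 100 * 0.2 + 10 * 12    # 100 SSDs plus 10 bulk HDDs

print(f"old: {OLD_W} W, new: {NEW_W:.0f} W "
      f"({OLD_W / NEW_W:.0f}x less drive power)")
```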
Now, the Internet of Things and big data may expand your storage needs a lot, but it will be some time before storage grows back to the scale of that 2,000-drive array farm. Storage has finally redeemed itself after years of stagnation in technical growth.