Petabyte-scale deployments in applications from high-performance computing to archival storage are pushing current storage systems to the brink. Organizations face a challenge when attempting to fine-tune their storage systems to achieve the desired balance of performance, efficiency (capacity), and resiliency. To continue to service the overwhelming growth of data, they are forced to make compromises due to legacy storage architecture limitations. Some sacrifice capacity for performance while others concede performance for capacity. In attempts to achieve a balance, some adopt poor resiliency schemes that put their data at risk. Clearly, today’s storage solutions are falling short in meeting customer needs. Because of these storage limitations, organizations are not realizing the full potential and value of their storage.
Causes of Today’s Performance, Resiliency, and Efficiency Problems
Limitations of RAID: Both on-premises and edge data storage suffer from the limitations of RAID. Traditional RAID environments limit the drive count and drive capacity in an array to maintain minimum acceptable performance and minimize the risk of data loss. Larger drive capacities could deliver a lower $/terabyte (TB) and watts/TB, but the long rebuild times that come with them carry too high a risk of data loss and extended periods of poor performance. As a result, most RAID-based solutions are limited to drives of 8TB or smaller, even though the industry now provides efficient 18TB+ drives.
Hierarchical RAID: With mounting pressure on storage performance, in some cases, basic RAID 5 and 6 have been abandoned in favor of more performant solutions such as RAID 50 and RAID 60. However, hierarchical RAID solutions come with a significant loss of efficiency as well as an additional expense incurred by wasting purchased capacity. In addition, they lack a deterministic failure domain. For example, in a RAID 50 array, two drives can fail with no data loss in some cases, but two drives failing in the wrong location can lose data.
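The non-deterministic failure domain can be made concrete by enumerating every possible two-drive failure. The sketch below is illustrative only: it assumes a hypothetical RAID 50 layout of two RAID 5 groups with four drives each (a layout not taken from any specific product) and applies the rule that a RAID 5 group survives one failed drive but not two.

```python
from itertools import combinations


def raid50_two_drive_failures(groups: int, drives_per_group: int) -> tuple[int, int]:
    """Count two-drive failure pairs that survive vs. lose data in RAID 50.

    Assumes RAID 50 is a stripe across `groups` RAID 5 sets, each able to
    tolerate exactly one failed drive.
    """
    # Label each drive by (group index, position within group).
    drives = [(g, d) for g in range(groups) for d in range(drives_per_group)]
    survive = lose = 0
    for a, b in combinations(drives, 2):
        if a[0] == b[0]:
            # Both failures land in the same RAID 5 group: data is lost.
            lose += 1
        else:
            # Failures land in different groups: each group can still rebuild.
            survive += 1
    return survive, lose


# Hypothetical layout: 2 groups x 4 drives (an assumption for illustration).
survive, lose = raid50_two_drive_failures(groups=2, drives_per_group=4)
total = survive + lose
print(f"{lose} of {total} two-drive failures lose data ({lose / total:.0%})")
```

Whether a second failure is survivable depends entirely on where it lands, which is what makes the failure domain hard to plan around.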
Data duplication: As organizations seek to maximize performance, traditional protection schemes are straining to keep up. The strain has been so great that, to reach the desired performance density, many organizations have been forced to keep multiple copies of their data. These solutions can enhance performance but are inefficient when it comes to usable capacity, operational costs, and carbon footprint.
Erasure encoding and the cost of providing efficiency and resiliency: Erasure encoding can improve storage efficiency and boost resiliency by tolerating many drive failures without losing data. However, any performance benefits tied to duplicates or simpler RAID architectures are lost when erasure code methods are adopted. In other words, organizations give up one benefit to gain another. In addition, the significant computational overhead of erasure encoding limits its practical use to file and object storage only. Outside inefficient replication, there is no answer for block storage. In short, the efficiency and resiliency gained from erasure encoding are offset by lost performance and added computational cost.
To overcome these limitations, many erasure code solutions resort to a hybrid approach. Large files are stored with erasure codes, and smaller files are stored as triple copies. This is done to alleviate some of the burdens on the erasure code engine. However, this approach leads to wasted capacity on small files and loss of performance on large files. This balancing act makes capacity management problematic as space consumed is now tied to file sizes and the method used to store them instead of a single predictable capacity efficiency for all storage.
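To see why capacity becomes hard to predict under a hybrid scheme, consider a rough model of usable capacity as a function of the file mix. This is a sketch under stated assumptions: the 8+2 erasure code layout and the function name `hybrid_usable_fraction` are hypothetical choices for illustration, and only the triple copy for small files comes from the text above.

```python
def hybrid_usable_fraction(
    small_fraction: float,
    data_shards: int = 8,    # assumed erasure code layout (hypothetical)
    parity_shards: int = 2,  # assumed erasure code layout (hypothetical)
    copies: int = 3,         # triple copies for small files, per the text
) -> float:
    """Fraction of raw capacity that holds logical data in a hybrid scheme.

    small_fraction is the share of logical bytes stored as replicated copies;
    the remainder is stored with the erasure code.
    """
    if not 0.0 <= small_fraction <= 1.0:
        raise ValueError("small_fraction must be between 0 and 1")
    # Raw bytes consumed per logical byte under the erasure code.
    ec_overhead = (data_shards + parity_shards) / data_shards
    # Weighted raw bytes consumed per logical byte across both methods.
    raw_per_logical = small_fraction * copies + (1 - small_fraction) * ec_overhead
    return 1 / raw_per_logical


for share in (0.0, 0.2, 0.5, 1.0):
    print(f"small-file share {share:.0%}: "
          f"{hybrid_usable_fraction(share):.1%} of raw capacity is usable")
```

Under these assumptions, usable capacity ranges from 80% of raw capacity with no small files down to about 33% when everything is stored as triple copies, so the same hardware yields very different capacity depending on the workload.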
Data storage operational costs: Efficiency extends beyond wasting money on unusable storage capacity. As an example, the need for a certain level of data performance could force an organization to mirror its data, allowing reads to be pulled from two or more sources. But that would mean the organization now has two devices holding one device’s worth of data. If each device uses 12 watts, with an additional 6 watts consumed in the data center to cool it, the organization is now consuming 36 watts or more for what could have been 18 watts if copies weren’t needed.
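The arithmetic behind the mirroring example can be written out directly. The sketch below uses the 12-watt drive and 6-watt cooling figures from the text; the annual energy figure additionally assumes the devices run continuously all year, which is an assumption made here for illustration.

```python
DRIVE_WATTS = 12    # per-device power draw, from the example above
COOLING_WATTS = 6   # data center cooling per device, from the example above
HOURS_PER_YEAR = 24 * 365  # assumes continuous operation (illustrative)


def total_watts(copies: int) -> int:
    """Power drawn to hold one device's worth of data with `copies` copies."""
    return copies * (DRIVE_WATTS + COOLING_WATTS)


single = total_watts(1)
mirrored = total_watts(2)
extra = mirrored - single
extra_kwh_per_year = extra * HOURS_PER_YEAR / 1000

print(f"single copy: {single} W, mirrored: {mirrored} W")
print(f"extra energy per mirrored device: {extra_kwh_per_year:.2f} kWh per year")
```

The per-device overhead looks small, but it scales linearly with the number of devices being mirrored.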
Doubling the drive count has a ripple effect throughout a system, including a doubling of both the enclosures needed to house the drives and the “top of rack” switches, as well as additional leaf and spine switches. All this increases power consumption and potentially increases software costs if the software is licensed per CPU core. Additionally, solutions with predictable and repeatable performance requirements must be over-designed so that the performance seen when drives fail still meets the minimum benchmark. This adds another layer of inefficient built-in margin.
Larger capacity drives play a part in operational costs as well. They reduce the drive count, the space required, and the power needed to run the system, because each device stores more data for the power it draws. A typical 8TB drive consumes approximately 9.5 watts, but a helium-filled 20TB drive consumes only about 7.3 watts. That works out to roughly 70% less power per TB. Significant operational costs could be saved if larger capacity drives could be used.
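The per-TB comparison follows from the two drive wattages quoted above. A minimal sketch of the calculation, using only those figures (the helper name `watts_per_tb` is introduced here for illustration):

```python
def watts_per_tb(watts: float, capacity_tb: float) -> float:
    """Power drawn per terabyte of raw capacity."""
    return watts / capacity_tb


small = watts_per_tb(9.5, 8)    # typical 8TB drive, per the text
large = watts_per_tb(7.3, 20)   # helium-filled 20TB drive, per the text
reduction = 1 - large / small   # fractional drop in watts per TB

print(f"8TB drive:  {small:.3f} W/TB")
print(f"20TB drive: {large:.3f} W/TB")
print(f"reduction:  {reduction:.1%}")
```

The reduction comes out at just under 70%, which is the same as saying the larger drive stores a little over three times as much data per watt.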
What’s Needed: A New Storage Architecture
What’s needed to overcome today’s storage performance, resiliency, and efficiency problems is a new foundation for storage systems, an alternative to RAID that eliminates the most painful limitations of RAID, including low-capacity drives, idle hot spares, urgent failed drive replacements, and degraded rebuild performance. This must be done without introducing slow CPU-intensive erasure codes or hybrid architectures that are meant to hide limitations while pushing costs into other hidden areas such as CPU and memory. The new foundation for storage also must eliminate the need to use performance-centric duplicate copies. Without such a breakthrough, organizations will continue to struggle with the cost of on-premises storage as their needs continue to grow with the ongoing explosion of data.
Adam Roberts is Field CTO at Nyriad.