Data resiliency is at a crossroads. Traditional SAN storage solutions that run on Redundant Array of Independent Disks (RAID) are creaking under the strain of new data demands. While striping, mirroring, and parity in RAID implementations provide varying degrees of protection, the cost of resiliency, recovery times, and RAID’s vulnerability during the recovery process are all paving the way for alternatives.
One option is erasure coding (EC), which is distinctly different from hardware-based systems. EC is an algorithm-based implementation that is not tied to any specific hardware. It breaks the data into fragments, augments and encodes them with redundant pieces of information, and then distributes the encoded fragments across disks, storage nodes, or locations. With erasure coding, data that becomes unreadable on a node can still be reconstructed using information about the data stored elsewhere.
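The fragment, encode, and reconstruct steps described above can be illustrated with the simplest possible erasure code: k data fragments plus a single XOR parity fragment. This is a minimal sketch, not a production scheme; the function names (`split`, `encode`, `reconstruct`) and the zero-padding convention are assumptions made for illustration, and a single parity fragment can repair only one lost fragment.

```python
from functools import reduce

def split(data, k):
    """Split data into k equal-size fragments, zero-padding the tail."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(k)]

def xor_fragments(fragments):
    """XOR equal-length fragments together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*fragments))

def encode(data, k):
    """Return k data fragments followed by one parity fragment."""
    fragments = split(data, k)
    return fragments + [xor_fragments(fragments)]

def reconstruct(stored):
    """Rebuild the single missing fragment (marked None) from the survivors."""
    missing = [i for i, fragment in enumerate(stored) if fragment is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost fragment")
    repaired = list(stored)
    if missing:
        survivors = [fragment for fragment in stored if fragment is not None]
        # XOR of all surviving fragments equals the missing one.
        repaired[missing[0]] = xor_fragments(survivors)
    return repaired

# Hypothetical usage: each fragment would live on a different disk or node.
original = b"erasure coding demo"
stored = encode(original, k=4)
stored[2] = None  # simulate an unreadable node
recovered = b"".join(reconstruct(stored)[:4]).rstrip(b"\0")
```

Stripping the padding with `rstrip` is a shortcut that works here only because the sample data does not end in zero bytes; a real system would record the original length instead.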
Unlike RAID, EC does not require a specialized hardware controller and provides better resiliency. More importantly, it provides protection during the recovery process. Depending on the degree of resiliency, complete recovery is possible even when only half of the data elements are available, which is a major advantage over RAID. Compared with mirroring, EC also consumes less storage. The downside, however, is that EC is CPU-intensive and can cause latency issues.
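To make the "recovery from half of the elements" claim concrete, here is a hedged sketch of a code with k data symbols and m parity symbols in which any k of the k + m symbols are enough to recover the data. It follows the polynomial idea behind Reed-Solomon codes but, for readability, works over the prime field GF(257) rather than the GF(2^8) arithmetic that practical implementations typically use. The names `interpolate`, `mds_encode`, and `mds_decode` are hypothetical, and each fragment is reduced to a single symbol; a real system would apply the same arithmetic at every byte position of much larger fragments.

```python
P = 257  # smallest prime above 255, so every byte value is a field element

def interpolate(points, x):
    """Evaluate at x the unique polynomial through `points` (Lagrange form, mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def mds_encode(data, m):
    """Return the k data symbols followed by m parity symbols.

    Parity symbols can take the value 256, so they do not always fit in a byte;
    that is one reason real implementations prefer GF(2^8).
    """
    k = len(data)
    points = list(enumerate(data))  # data symbol i sits at x = i
    return list(data) + [interpolate(points, x) for x in range(k, k + m)]

def mds_decode(survivors, k):
    """Recover the k data symbols from any k surviving (index, symbol) pairs."""
    if len(survivors) < k:
        raise ValueError("need at least k surviving symbols")
    points = survivors[:k]
    return [interpolate(points, x) for x in range(k)]

# Hypothetical usage: 4 data + 4 parity symbols, then lose half of them.
data = list(b"node")
codeword = mds_encode(data, m=4)
survivors = [(i, codeword[i]) for i in (1, 4, 6, 7)]  # any 4 of the 8 will do
restored = mds_decode(survivors, k=4)
```

With k equal to m, as in this usage, the data survives the loss of any half of the stored symbols, at the price of doubling the raw storage.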
Storage efficiency vs. fault tolerance
Erasure coding is most often implemented using Reed-Solomon (RS) codes. With RS codes, two performance metrics matter: storage efficiency and fault tolerance. EC involves a trade-off between the two. Storage efficiency is an indicator of the additional storage required to assure resiliency, whereas fault tolerance is an indicator of the possibility of recovery in the event of element failures.
These metrics pull in opposite directions: more fault tolerance reduces storage efficiency. Distribution carries a cost as well: the more widely the data is spread across systems or geographic locations, the more latency occurs, because recalling it from different locations or systems takes longer.
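The trade-off can be put into numbers. The sketch below assumes a code such as Reed-Solomon that stores k data fragments plus m parity fragments and tolerates the loss of any m of them; storage efficiency is then k / (k + m). The specific schemes listed are illustrative choices, not recommendations from this article, and mirroring is modeled as one data copy plus two extra copies.

```python
def storage_efficiency(k, m):
    """Fraction of raw capacity holding user data with k data + m parity fragments."""
    return k / (k + m)

def describe(name, k, m):
    """One-line summary; fault tolerance is m for a code that survives any m losses."""
    return (f"{name}: efficiency {storage_efficiency(k, m):.0%}, "
            f"survives {m} lost fragments")

# Illustrative schemes (assumed for this example).
schemes = [
    ("3-way mirroring", 1, 2),
    ("EC 4+2", 4, 2),
    ("EC 10+4", 10, 4),
    ("EC 4+4", 4, 4),
]

for name, k, m in schemes:
    print(describe(name, k, m))
```

Holding k fixed and raising m increases the number of failures tolerated while lowering the efficiency figure, which is the tension described above. The comparison with mirroring also shows why EC consumes less storage for the same tolerance of two failures.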
Hyperscale data centers pose fresh challenges for data resiliency in terms of node failures and degraded reads. Modern erasure code algorithms have evolved to include local regeneration codes, codes with availability, codes with sequential recovery, coupled-layer MSR codes, selectable recovery codes, and others that are highly customized.
Acceleration and off-loading
Erasure codes are compute-intensive, and it has become necessary to offload that computation from the main CPU. Research into optimizing various aspects is well underway in academia and in industry. Innovations in data center hardware are promising too. Whether the deployment is virtual or bare metal, accelerators such as GPUs and FPGAs increasingly offer a way to free up CPU resources.
One of the requirements of GPU-based acceleration is parallelization of the EC algorithms. Parallelization is based on the concept of parallel computing, in which multiple processes are executed concurrently, and some modern resiliency codes are vector codes that lend themselves to this approach. These vector approaches make it possible to leverage GPU cores and high-speed on-core memory (like texture memory) to achieve parallelism.
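The reason EC parallelizes well can be shown without a GPU: each output byte of a parity fragment depends only on the bytes at the same position in the input fragments, so the work can be cut into independent slices. The sketch below illustrates that decomposition on the CPU with a thread pool, using plain XOR parity as a stand-in for a full code. It is an analogy for how slices could be mapped to GPU cores, not GPU code, and in standard Python the threads show the structure rather than deliver a real speedup. The names `xor_parity` and `parallel_parity` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def xor_parity(fragments):
    """Parity of equal-length fragments; each output byte uses one column only."""
    out = bytearray(len(fragments[0]))
    for fragment in fragments:
        for i, byte in enumerate(fragment):
            out[i] ^= byte
    return bytes(out)

def parallel_parity(fragments, chunk_size, workers=4):
    """Compute the same parity by handing disjoint slices to separate workers."""
    length = len(fragments[0])
    # Cut every fragment at the same offsets so each task owns its own columns.
    tasks = [[fragment[start:start + chunk_size] for fragment in fragments]
             for start in range(0, length, chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in task order, so the slices join back correctly.
        return b"".join(pool.map(xor_parity, tasks))

# Hypothetical usage with three 100-byte fragments.
fragments = [bytes(range(100)), bytes(reversed(range(100))), bytes([7] * 100)]
parity = parallel_parity(fragments, chunk_size=16)
```

Because no slice needs data from any other slice, the same split applies whether the workers are threads, GPU cores, or FPGA pipelines.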
Fabric acceleration is another trend in EC off-loading. Next-generation host channel adapters (HCAs) offer calculation engines and make full use of features like RDMA and verbs. Encode and transfer operations are handled in the HCA. Combined with RDMA, this promises further acceleration for storage clusters.
Data resiliency, compression, and deduplication are evolving at breakneck speed. It is an exciting time for erasure coding: the extremely low latencies of NVMe technologies, tighter integration of storage with application characteristics, and newer virtualization options are opening up a myriad of use cases. As traditional RAID systems reach their data resiliency limits, data center and storage professionals can consider systems based on erasure coding as a strong option to provide resiliency, protect data during recovery, and minimize storage requirements.
Dinesh Kumar Bhaskaran, Director of Technology and Innovation at Aricent, has more than 15 years of experience in embedded and enterprise storage technologies. He also works for the innovation group leading efforts in the field of hyperconverged infrastructure. His areas of interest include erasure coding, heterogeneous computing, and operating systems.