Howard Marks explains modern scale-out storage systems and how they handle networking.
Until a few years ago, the storage market was dominated by two system architectures. Monolithic arrays served the high end of the performance and reliability spectrum while dual-controller “modular” arrays fed the larger mid-market. The few scale-out solutions were relegated to specialty uses like high-performance computing, archiving or the large file world of media and entertainment.
The rise of software-defined storage has led to a proliferation of scale-out solutions and scale-out architectures. Today, users can buy scale-out systems for just about any storage use case from all-flash like XtremIO and SolidFire to massively scalable object stores. Of course, there's also integrated scale-out storage and compute from a myriad of hyperconverged suppliers. Scale-out’s even becoming the way to go in the backup world with integrated backup/storage appliances from Cohesity and Rubrik.
This wide variety of scale-out products has relegated the old-school high-end array to the corner -- and admittedly profitable -- use cases that require extreme levels of reliability and/or connectivity. The relative simplicity of a dual-controller array means it will keep a place in the storage market as a solution for applications with more modest scale requirements, because distributed systems are hard, kids. Dual-controller arrays are also useful as larger, more resilient building blocks with which to scale out.
When most storage folks hear the term scale-out, their first thought is of a shared-nothing scale-out cluster. Within a shared-nothing cluster, each node -- almost always an x86 server -- has exclusive access to some set of persistent storage. Nodes in shared-nothing clusters don’t need fancy storage features like shared SAS backplanes, so any server, even a virtual one, can be a node in a shared-nothing cluster. Early scale-out players called this shared-nothing architecture redundant array of independent nodes (RAIN), but that term has fallen by the wayside.
The problem with shared-nothing clusters is that those commodity x86 servers are inherently unreliable devices. Sure, they have dual power supplies, but an x86 server still represents a swarm of single points of failure. To provide resiliency, shared-nothing clusters must either replicate or erasure code data across multiple nodes in the cluster.
The result is that a shared-nothing cluster is generally media inefficient. Replication requires twice as much media as data to provide any resilience and three times as much to ensure operations after both a controller and a device failure. Add in that a shared-nothing cluster should always reserve enough space to rebuild its data resiliency scheme in the event of a node failure, and an N-node shared-nothing cluster with three-way replication will only deliver (N-1)/3 times the capacity of each node.
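To put numbers on the (N-1)/3 figure, here is a minimal Python sketch. The function name and the example cluster size are illustrative assumptions, and the model holds back exactly one node's worth of space for rebuilds, as the paragraph above describes.

```python
def usable_capacity_replicated(nodes: int, node_capacity_tb: float, copies: int = 3) -> float:
    """Usable capacity of a shared-nothing cluster that replicates data.

    Assumes every node has the same raw capacity and that one node's worth
    of space is reserved so the cluster can re-protect data after a node failure.
    """
    if nodes < copies:
        raise ValueError("a cluster needs at least as many nodes as data copies")
    return (nodes - 1) * node_capacity_tb / copies


# Hypothetical cluster: 7 nodes of 12 TB each, three-way replication.
print(usable_capacity_replicated(7, 12.0))  # (7 - 1) * 12 / 3 = 24.0
```

In this example, 84 TB of raw media yields 24 TB of usable capacity.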
Erasure coding can be much more efficient, bringing the capacity of a cluster up to (N-1)/1.2 times the capacity of each node. But spreading data across a large number of nodes requires a larger cluster, with many solutions requiring six or more nodes to implement, and rebuild, the double-parity scheme needed to survive multiple device failures while rebuilding large disk drives. Erasure coding also has write-latency implications, because the slowest node involved in each write defines the application’s write latency.
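The arithmetic behind that efficiency claim can be sketched in a few lines of Python. This assumes a generic layout of k data fragments plus m parity fragments, which carries an overhead of (k+m)/k; the 1.2 figure would then correspond to something like 10 data plus 2 parity fragments. That mapping, along with the function name and cluster sizes, is an illustrative assumption rather than any particular product's scheme.

```python
def usable_capacity_erasure_coded(nodes: int, node_capacity_tb: float,
                                  data_fragments: int, parity_fragments: int) -> float:
    """Usable capacity with k+m erasure coding and one node's space held in reserve.

    Assumes equal-sized nodes and one fragment of each stripe per node.
    """
    width = data_fragments + parity_fragments
    if nodes < width:
        raise ValueError("need at least one node per fragment in a stripe")
    # Dividing by the overhead (k + m) / k is the same as multiplying by k / (k + m).
    return (nodes - 1) * node_capacity_tb * data_fragments / width


# Hypothetical 13-node cluster of 10 TB nodes with 10+2 coding (overhead 1.2).
print(usable_capacity_erasure_coded(13, 10.0, 10, 2))  # 12 * 10 / 1.2 = 100.0

# A smaller 7-node cluster of 12 TB nodes with 4+2 double parity (overhead 1.5).
print(usable_capacity_erasure_coded(7, 12.0, 4, 2))    # 6 * 12 / 1.5 = 48.0
```

The second example shows the trade-off: the narrower stripe fits a smaller cluster, but the overhead rises from 1.2 to 1.5.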
Scale-out storage and the network
By spreading the functions of a storage array across many independent -- or more accurately, interdependent -- nodes, scale-out storage systems are inherently network dependent. The scale-out system not only has to support a SAN- or NAS-like interface to the compute load, but also has to use the network to tie all those nodes together.
The designer of a scale-out storage system has two primary network problems. The first is moving data between nodes for data protection in order to rebuild after a failure and to balance the system as nodes are added to the cluster. The second is how to deal with requests to node A for data that’s stored on node D.
Most early scale-out systems used a dedicated back-end network to interconnect the nodes. Using a dedicated back-end network, and providing any switches that network required, freed storage suppliers from the burden of qualifying and supporting any network gear their customers used. More significantly, it let them use a low-latency interconnect like InfiniBand on the back end while providing IP storage over standard Ethernet and storage protocols to the hosts.
EMC XtremIO even manages to use its InfiniBand back end to provide Fibre Channel on a scale-out system. While IP systems can redirect a request to the node that holds the data requested by a host, Fibre Channel-attached hosts have to get a response from the same port they made their request on. An XtremIO node can fetch the requested data from another node in the cluster and reply in a reasonable time because of InfiniBand’s low latency.
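The difference between redirecting a request and answering it on the original port can be illustrated with a toy Python model. Everything here is hypothetical: the node names, the hash-based placement in owner_of and the handle_read function are assumptions made for the sketch, not a description of how XtremIO or any other product places or routes data.

```python
import hashlib

NODES = ["A", "B", "C", "D"]  # hypothetical four-node cluster


def owner_of(key: str) -> str:
    """Pick the node that holds a piece of data (simple hash placement, an assumption)."""
    digest = hashlib.sha256(key.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def handle_read(receiving_node: str, key: str, can_redirect: bool) -> tuple:
    """Decide how the node that received a read request should satisfy it.

    can_redirect=True models an IP protocol that lets the host be pointed at
    another node; False models a host that must be answered on the same port.
    """
    owner = owner_of(key)
    if owner == receiving_node:
        return ("serve_local", receiving_node)
    if can_redirect:
        return ("redirect", owner)
    # Fetch the data from the owner over the back-end network, then reply.
    return ("proxy_via_backend", owner)


key = "volume1/block42"
owner = owner_of(key)
other = next(n for n in NODES if n != owner)
print(handle_read(other, key, can_redirect=True))
print(handle_read(other, key, can_redirect=False))
```

In the proxy case every remote read adds a back-end round trip to the host's response time, which is why the latency of the interconnect matters so much.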
While dedicated back-end networks made a lot of sense in the days of 1 Gbps or slower Ethernet on the front end, today’s 10 Gbps networks provide plenty of low-latency bandwidth for both host access and node-to-node traffic.