A look at the shared-media model and how scale-out storage systems handle data distribution.
Many of my colleagues claim that the only true way to do scale-out storage is by using a shared-nothing model to spread your data across a pool of standard x86 servers. But other scale-out options exist. Rather than use a single commodity server as their building block, shared-media systems use that old standby, the dual-controller storage system, as their base unit.
As the name shared-media implies, storage systems such as Dell EqualLogic, EMC XtremIO and Nimble Storage arrays share their SSDs and/or HDDs between a pair of controllers. A dual-port SAS -- or in the latest systems, PCIe -- backplane allows both controllers to directly access the storage devices. Some dual-controller suppliers, most significantly Nimble, support both scaling up by adding JBODs, like any other dual-controller system, and scaling out by clustering multiple dual-controller bricks into a single storage system.
Since shared-media storage systems can use simple RAID across the drives in each brick, they require less media than shared-nothing systems, especially for small clusters. Back in the day, this was a substantial contributor to EqualLogic’s success over the competing LeftHand shared-nothing system. After all, a single EqualLogic box is a perfectly usable dual-controller array, but customers would have to buy three LeftHand nodes to reach the same level of resiliency.
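To put rough numbers on the media argument, here is a minimal sketch that compares usable capacity for one RAID-protected brick against a small replicated shared-nothing cluster. The RAID 6 layout, the two-copy replication, and the drive counts and sizes are all assumptions I picked for illustration; they are not the actual EqualLogic or LeftHand configurations.

```python
def raid6_usable(drives: int, drive_tb: float) -> float:
    """Usable TB of one RAID 6 group: two drives' worth of space goes to parity."""
    if drives < 4:
        raise ValueError("RAID 6 needs at least four drives")
    return (drives - 2) * drive_tb


def replicated_usable(nodes: int, drives_per_node: int,
                      drive_tb: float, copies: int) -> float:
    """Usable TB when every block is stored `copies` times on different nodes."""
    if nodes < copies:
        raise ValueError("need at least as many nodes as copies")
    return nodes * drives_per_node * drive_tb / copies


# Assumed example: both designs start from the same 48 TB of raw media.
brick_tb = raid6_usable(drives=12, drive_tb=4.0)              # 40.0
cluster_tb = replicated_usable(nodes=3, drives_per_node=4,
                               drive_tb=4.0, copies=2)        # 24.0
print(f"shared-media brick: {brick_tb} TB usable of 48 TB raw")
print(f"shared-nothing cluster: {cluster_tb} TB usable of 48 TB raw")
```

Under these assumptions the brick delivers 40 TB usable and the three-node cluster 24 TB from the same raw capacity. The gap narrows as clusters grow and erasure coding replaces replication, which is why the advantage matters most at the low end.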
This efficiency advantage at the low end is offset by an increase in the size, and cost, of an individual node. While a shared-nothing user can add a single x86 server node with a couple of drives, users of shared-media systems have to add to their clusters in much larger increments, sometimes as large as a pair of dual-controller bricks.
Shared-media clusters can take advantage of all the same hardware tricks to minimize latency as scale-up dual-controller arrays. Incoming data can be written to NVRAM in the first controller and replicated across a PCIe non-transparent bridge to that controller’s partner many microseconds faster than it can be written to an SSD, replicated across 10 Gbps Ethernet, and written to an SSD on a second shared-nothing node.
The "intelligent shelf"
Over the years, several storage vendors, including X-IO Technologies, Pillar Data Systems, and Whiptail, have tried another scale-out model I’ve dubbed the "intelligent shelf." In this model, the typical functions of a storage controller are delegated to resilient disk shelves and a separate layer of array controllers.
Rather than use simple JBODs to hold their media, these systems use more intelligent disk shelves that provide basic services like RAID and caching. Higher level data services like data reduction, snapshots, and replication are provided by the array controllers.
This intelligent-shelf model has frequently been adopted by vendors that provide some unique functionality, but don't scale or provide a competitive set of data services. By putting these smart shelves behind a pair of x86 servers running virtualization software from an OEM vendor like FalconStor, X-IO and others can present the self-healing goodness of their ISEs to a broader audience.
The last major difference between scale-out storage solutions is how they distribute data across their nodes and how tightly they couple nodes together. While it might not be immediately apparent to a user that one system distributes data across a tightly coupled cluster in 4 KB chunks while another distributes virtual machine images across a federation of semiautonomous nodes, these design decisions can have a significant impact on system management, performance, and behavior.
Tightly coupled systems, such as NetApp SolidFire and those from the hyperconverged crowd, present their storage as a single pool, abstracting the physical location of data from both users and storage administrators. Coupling the nodes of a scale-out storage system tightly together also allows the storage architect to distribute the data across nodes at a fine grain.
Fine-grained distribution spreads each workload’s data across more storage nodes as the cluster is expanded, improving performance for all workloads. Systems that distribute data as larger objects keep all of a workload’s data on a small number of nodes. On those systems, adding nodes to the cluster increases aggregate performance, but each workload is still limited to the performance provided by a small number of nodes (two to six).
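The difference is easy to see in a toy placement model. The sketch below is my own illustration, not any vendor's layout algorithm: `node_for`, the hash-based placement, and the "consecutive nodes" replica rule are all assumptions. It counts how many nodes serve one volume when data is placed chunk by chunk versus as a single large object.

```python
import hashlib


def node_for(key: str, nodes: int) -> int:
    """Map a key to a node with a stable hash (illustrative placement only)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % nodes


def nodes_touched_fine(volume: str, chunks: int, nodes: int) -> int:
    """Fine-grained: each chunk of the volume is placed independently."""
    return len({node_for(f"{volume}/{i}", nodes) for i in range(chunks)})


def nodes_touched_coarse(volume: str, copies: int, nodes: int) -> int:
    """Coarse: the whole volume lives on `copies` consecutive nodes."""
    first = node_for(volume, nodes)
    return len({(first + r) % nodes for r in range(min(copies, nodes))})


for cluster_size in (8, 16):
    fine = nodes_touched_fine("vm-01", chunks=10_000, nodes=cluster_size)
    coarse = nodes_touched_coarse("vm-01", copies=3, nodes=cluster_size)
    print(f"{cluster_size} nodes: fine-grained uses {fine}, coarse uses {coarse}")
```

With thousands of chunks, the fine-grained volume ends up on every node at either cluster size, while the coarse volume stays on three nodes no matter how far the cluster grows.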
At the other end of the spectrum are more loosely federated systems like NetApp Clustered ONTAP that combine multiple NAS systems to create a single name space, which can have one or more root directories. Each NAS system in the cluster stores the data for one or more branches of the directory tree.
Since all I/O to a given folder is serviced by one NAS system, loosely coupled clusters don’t add performance for individual workloads as the cluster expands. Because data resiliency is provided by each member using RAID or similar technologies, each node represents a failure domain for those portions of the name space it holds, but only for those directories. Data in other directories will continue to be serviced by the remaining nodes of the cluster.
By comparison, a fine-grained storage system is one failure domain. When the number of node and/or drive failures exceeds the system’s resiliency level -- which for some systems can be as low as two -- the entire cluster fails, making all of its data unavailable.
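The two failure models can be contrasted in a few lines. This is a simplified sketch under my own assumptions: the node names and directory layout are made up, and a federated member is treated as either fully up or fully down.

```python
from __future__ import annotations


def federated_unavailable(dirs_by_node: dict[str, list[str]],
                          failed: set[str]) -> list[str]:
    """Loosely coupled: only the directories held by failed members go offline."""
    return sorted(d for node, dirs in dirs_by_node.items()
                  if node in failed for d in dirs)


def fine_grained_available(failed_count: int, tolerated: int) -> bool:
    """Fine-grained: the whole pool stays up only while failures are within tolerance."""
    return failed_count <= tolerated


# Hypothetical name space spread over three NAS members.
layout = {"nas1": ["/eng", "/qa"], "nas2": ["/finance"], "nas3": ["/hr"]}
print(federated_unavailable(layout, failed={"nas2"}))        # ['/finance']
print(fine_grained_available(failed_count=2, tolerated=2))   # True
print(fine_grained_available(failed_count=3, tolerated=2))   # False
```

In the federated case the blast radius is whatever the failed member held. In the fine-grained case it is all or nothing.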
The latest example of a loosely federated scale-out storage system comes from VM storage specialist Tintri and relies on the virtualization platform as the data mover to migrate VMs from one Tintri appliance to another. I’ve said in the past that scale was Tintri’s greatest weakness since it didn’t support either scaling up by adding disk shelves or scaling out. Rather than tying multiple nodes into a cluster and having the nodes of the cluster communicate among themselves to present a single logical name space, Tintri continues to present each dual-controller array as an independent data store. It uses the analytic data the storage system collects to predict which storage systems will become overloaded, in terms of performance and/or capacity.
It then figures out which VMs it should move to solve problems or eventually just balance the load across the arrays in the cluster. The system is smart enough to know that VMs based on Tintri clones will expand when migrated to another storage system and factors that expansion into its recommendation formulas. Once it’s figured out which VMs to move, it gives the storage admin a recommendation and a one-click implementation button. To actually move the VMs, the system uses vSphere Storage vMotion or Hyper-V Storage Live Migration.
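To make the clone-expansion point concrete, here is a rough sketch of a capacity-only recommender. It is not Tintri's formula, which isn't public in this article; the `VM` and `Array` classes, the 80 percent threshold, and the greedy "largest VM first" rule are all assumptions of mine, and the sketch ignores performance entirely. It only recommends moves and never performs them.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    used_gb: float      # space the VM consumes on its current array
    expanded_gb: float  # space it needs after migration (clones lose shared blocks)


@dataclass
class Array:
    name: str
    capacity_gb: float
    vms: list = field(default_factory=list)

    def used(self) -> float:
        return sum(vm.used_gb for vm in self.vms)


def recommend(arrays, threshold=0.8):
    """Return (vm, source, target) name tuples, at most one per overloaded array."""
    recs = []
    for src in arrays:
        if src.used() / src.capacity_gb <= threshold:
            continue  # source array is not overloaded
        # Try the VM that frees the most space first.
        for vm in sorted(src.vms, key=lambda v: v.used_gb, reverse=True):
            # A target must stay under the threshold after the VM expands.
            targets = [a for a in arrays if a is not src
                       and (a.used() + vm.expanded_gb) / a.capacity_gb <= threshold]
            if targets:
                dst = min(targets, key=lambda a: a.used() / a.capacity_gb)
                recs.append((vm.name, src.name, dst.name))
                break
    return recs


a = Array("A", 1000, [VM("vm-clone", 500, 900), VM("vm-full", 400, 400)])
b = Array("B", 1000, [VM("vm-other", 300, 300)])
print(recommend([a, b]))  # [('vm-full', 'A', 'B')]
```

The clone frees the most space on array A, but it would balloon to 900 GB on array B and push it past the threshold, so the sketch recommends moving the smaller, fully provisioned VM instead.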
While this sounds like vSphere Storage DRS, the Tintri system has 30 days of analytical data about each VM on which to base its decisions. In contrast, Storage DRS moves data off a datastore that is showing 30 ms of latency, but it doesn’t know which other datastores have performance to spare. The other difference is that Tintri also moves snapshots and metadata for the migrated VMs directly from the source to destination arrays using its replication protocol before moving the VM.
While this loose federation still limits the performance of a single VM to the performance of one storage node, it has the advantage of preserving the individual storage array as a failure domain. If every VM’s data is spread across all the nodes in a cluster to maximize performance, then the entire cluster becomes a single failure domain. If the system can tolerate two node failures and three nodes go offline, all the VMs hosted on that system become unavailable. If one Tintri array goes offline, it only affects those VMs hosted on that array; VMs running on the other arrays in the federation continue unaffected.
This being version 1.0, Tintri still has a bit of work to do, including adding affinity and anti-affinity rules to keep my VDI images together where they dedupe really well while making sure the Exchange servers in a DAG are never on the same array. Tintri also needs to let admins schedule migrations to happen later and/or automatically.