The search for adequate bandwidth has us scrambling for ways to maximize box-to-box intercommunication. Just recently, EMC bought the startup DSSD, whose main claim to fame is a PCIe interface to the host, as a way to maximize speed and minimize latency for an all-flash array.
Line speed has increased from 1 Gigabit Ethernet to 10GbE and then 40GbE in maybe four years, but 40GbE is expensive. Hidden within all this improvement is a major issue: Ethernet is a collision system. It's designed to allow multiple senders to try to reach the same host, and when a collision occurs, the losers have to retry.
Many benchmarks show good, efficient 10GbE operation, but this is an artifact of traffic patterns with essentially synchronized workloads. In real-world operation, efficient throughput can be capped at less than 25% of nominal bandwidth.
Ethernet is like a traffic system where, if you arrive at an exit and it is blocked, you have to go around, get in the queue again, and hope it isn't blocked when you get back to the exit. Fundamentally, it isn't efficient.
Can we do better? One solution is to move to "Converged" Ethernet. This approach adds buffers in the switches so that any collisions result in data being temporarily stored until the receiving port clears. Buffer sizes are limited, so there has to be a throttling mechanism to make Converged Ethernet work. That mechanism lets the receiving end of the connection pause transmissions for a time, allowing the traffic jam to clear.
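The interplay between a limited buffer and a pause signal can be sketched with a toy model. This is a minimal illustration, not a model of any real switch: the function name, the per-tick frame counts, and the high and low watermarks are all assumptions made for the example.

```python
def simulate_pause(ticks, arrive_per_tick, drain_per_tick,
                   buffer_size, high_mark, low_mark):
    """Toy model of one switch port buffer with pause-style throttling.

    All quantities are in frames per tick and are illustrative assumptions.
    Returns (delivered, dropped, paused_ticks).
    """
    queued = delivered = dropped = paused_ticks = 0
    paused = False
    for _ in range(ticks):
        if paused:
            # The sender holds its traffic while the receiver asks it to pause.
            paused_ticks += 1
        else:
            space = buffer_size - queued
            accepted = min(arrive_per_tick, space)
            dropped += arrive_per_tick - accepted  # overflow is lost
            queued += accepted
        # The receiving port drains what it can this tick.
        sent = min(queued, drain_per_tick)
        queued -= sent
        delivered += sent
        # Pause when the buffer is nearly full, resume once it has drained.
        if queued >= high_mark:
            paused = True
        elif queued <= low_mark:
            paused = False
    return delivered, dropped, paused_ticks


# Sender offers 4 frames per tick, receiver drains 2, buffer holds 16.
with_pause = simulate_pause(100, 4, 2, 16, high_mark=12, low_mark=4)
# Setting the high watermark above the buffer size means pause never triggers.
without_pause = simulate_pause(100, 4, 2, 16, high_mark=17, low_mark=4)
print("with pause   :", with_pause)
print("without pause:", without_pause)
```

In this sketch both runs deliver the same number of frames, because the receiving port is the bottleneck either way. The difference is that the paused sender holds its traffic instead of having it dropped and retried.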
Remote Direct Memory Access (RDMA) doesn't much affect line performance, whether Converged or not, since it functions mainly to reduce system CPU overhead in working the links. A variety of streaming protocols have helped a bit, especially those that require no receipt confirmation for blocks.
Blade systems offer a way around the collision problem: There are direct connections between the blades, and each node has one or more dedicated links to the other nodes in a cluster. This allows a very simple (UDP) protocol and full duplex operation without any collisions, bringing realized bandwidth close to theoretical limits.
One downside of a blade star fabric system is that, when fully implemented, the number of links grows roughly as the square of the number of nodes. This has generally limited its use to small clusters, such as the 12 blades of a single blade server. Moving outside the box requires some refinement.
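To put numbers on that growth, here is a short sketch that assumes a full mesh with exactly one dedicated link per pair of nodes. The function names are made up for the example, and the 48-node case is just a hypothetical rack-sized cluster.

```python
def full_mesh_links(nodes: int) -> int:
    """Links needed so that every pair of nodes has its own dedicated link."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * (nodes - 1) // 2


def ports_per_node(nodes: int) -> int:
    """Each node needs one port for every other node in the mesh."""
    return max(nodes - 1, 0)


for n in (4, 12, 48):
    print(f"{n} nodes: {full_mesh_links(n)} links, "
          f"{ports_per_node(n)} ports per node")
```

Under that assumption, 12 blades need 66 links, while 48 nodes would need 1,128 links and 47 ports on every node, which is why a fully wired mesh stays inside the chassis.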
EMC's DSSD acquisition addresses the out-of-the-box need, though clumsily. Dedicated PCIe links connect the array to the host servers, but PCIe suffers from being an internal protocol, and it's quite possible that older-generation link speeds will be the norm. The interconnects also are dedicated point-to-point links, since PCIe switching is in its infancy. Ethernet appears to be racing ahead of any other link mechanism, with 56GbE shipping and 100GbE in the design labs.
I would postulate we have the way to resolve the loss of performance due to collisions already in hand. Standard Ethernet switches are smart enough that we can define two-node VLANs that essentially give us direct connection via a switch. Having the switch allows us a great deal of configuration flexibility, and the fabric can be software-defined.
We need a fast low-overhead protocol to take advantage of the high-quality dedicated connection. RoCE and iWARP are candidates, but RoCE implies a converged environment, while iWARP will run out of the box. There are protocols that don't require RDMA support, including Google's QUIC.
Because this is software-defined networking, we can build and tear down VLANs as needed to cope with load changes. A new in-memory instance can be given a lot of connections while it boots and then drop back to the number required for normal operation once booting completes.
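The bookkeeping side of building and tearing down two-node VLANs might look like the sketch below. It is purely illustrative: the `PairVlanPool` class, the node names, and the starting VLAN ID of 100 are hypothetical, and the code only tracks assignments in memory. Pushing the configuration to an actual switch is left out, since that depends on the switch or SDN controller in use.

```python
class PairVlanPool:
    """Hypothetical in-memory tracker for two-node VLANs (no switch I/O)."""

    def __init__(self, first_id=100, last_id=4094):
        # 802.1Q allows VLAN IDs 1 to 4094; the starting ID is an assumption.
        # Stored in reverse so pop() hands out the lowest unused ID at first;
        # after a tear-down, the most recently freed ID is reused next.
        self._free = list(range(last_id, first_id - 1, -1))
        self._by_pair = {}

    def build(self, a, b):
        """Reserve a VLAN ID for the pair (a, b), reusing an existing one."""
        if a == b:
            raise ValueError("a two-node VLAN needs two distinct nodes")
        pair = frozenset((a, b))
        if pair in self._by_pair:
            return self._by_pair[pair]
        if not self._free:
            raise RuntimeError("no VLAN IDs left in the pool")
        vlan_id = self._free.pop()
        self._by_pair[pair] = vlan_id
        return vlan_id

    def tear_down(self, a, b):
        """Release the pair's VLAN ID back to the pool."""
        vlan_id = self._by_pair.pop(frozenset((a, b)))
        self._free.append(vlan_id)
        return vlan_id

    def links_for(self, node):
        """VLAN IDs of every dedicated link that includes this node."""
        return sorted(v for pair, v in self._by_pair.items() if node in pair)


pool = PairVlanPool()
# While booting, the instance gets four dedicated links to storage.
boot_links = [pool.build("server-1", f"storage-{i}") for i in range(1, 5)]
# Once booted, it drops back to two.
pool.tear_down("server-1", "storage-3")
pool.tear_down("server-1", "storage-4")
steady_links = pool.links_for("server-1")
print(boot_links, steady_links)
```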
One downside of this approach is that the total number of connections increases, but in a real system, allowing the dedicated links to be configured on the fly by software permits enough flexibility to cope. Remember that this system supports ordinary shared Ethernet links, as well, though a protocol shift may be needed.
Using dedicated links means that servers will need more than the two Ethernet links typical of most designs. The tradeoff is that these servers won't need lots of storage interfaces, so the SAS/SATA port count can drop way down. I suspect six to eight 10GbE ports would be a typical buildout. The storage boxes would also need more channels.
Obviating collisions should allow a much faster storage connection, and running it via switches allows SDN flexibility of configuration. How this is structured needs a use-case analysis, but the impact on in-memory and VDI applications, in terms of efficient booting, could be dramatic.
Jim O'Reilly was Vice President of Engineering at Germane Systems, where he created ruggedized servers and storage for the US submarine fleet. He has also held senior management positions at SGI/Rackable and Verari; was CEO at startups Scalant and CDS; headed operations at PC ...