NETWORKING

10/27/2014 7:00 AM
Will RDMA Over Ethernet Eclipse Infiniband?

InfiniBand has dominated the RDMA market, but Ethernet is on the rise as a connectivity option.

Remote Direct Memory Access (RDMA) is a technology that allows data to be written from one machine directly into the memory of another system, bypassing much of the operating system and network stack overhead that slows down transfers. A good analogy is a courier service that delivers straight from my desk to yours, versus the postal system.
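As a rough illustration of what that means in practice (a minimal sketch, not code from this article), here is how an application might post a one-sided RDMA write using the standard libibverbs API. It assumes the queue pair has already been connected and that the peer's buffer address and rkey were exchanged out of band; the function and variable names are illustrative.

    /* Minimal sketch: one-sided RDMA WRITE via libibverbs.
     * Assumes qp is connected and the peer's remote_addr/rkey are known. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int rdma_write_example(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *local_buf, size_t len,
                           uint64_t remote_addr, uint32_t remote_rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,  /* local registered buffer */
            .length = (uint32_t)len,
            .lkey   = mr->lkey,              /* local protection key */
        };
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;  /* write into remote memory */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion */
        wr.wr.rdma.remote_addr = remote_addr;        /* peer's registered address */
        wr.wr.rdma.rkey        = remote_rkey;        /* peer's remote key */

        /* The NIC moves the data; the remote host's CPU and kernel stay out of it. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }

The key point is the last line: once the work request is posted, the adapter hardware performs the transfer without a system call or a copy on the receiving side.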

RDMA has long been associated with InfiniBand, and is, in many ways, the only reason for the existence of that link type. Where InfiniBand has been deployed, performance is higher and latencies lower. This makes RDMA attractive in the high-performance computing (HPC) market and in financial services trading systems where time equals money.

Ethernet also has RDMA capability, but InfiniBand currently surpasses it in sales. However, InfiniBand's RDMA dominance may be coming to an end.

To understand the situation, one has to look at how the market for RDMA has evolved. The two major vendors in the space initially were Mellanox, driving RDMA on InfiniBand, and Chelsio pushing RDMA on Ethernet. Mellanox brought the InfiniBand approach to market well before Chelsio had product, and established a monopolistic position delivering RDMA over InfiniBand.

InfiniBand effectively saturated the market, which at the time was limited, and this locked out the competing Ethernet product to a great extent. Customer inertia and good engineering have maintained the InfiniBand status quo for a long time.

Mellanox, however, hedged its bets some years ago by designing in the ability to run either InfiniBand or Ethernet protocols with the same ASIC. This strategy has matured to the point that there are now two alternatives for Ethernet RDMA: RoCE (RDMA over Converged Ethernet) from Mellanox and Chelsio's iWARP.
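One reason the choice of wire matters less than it used to: applications written to the verbs API see InfiniBand, RoCE, and iWARP adapters through the same interface. The short sketch below (illustrative, not from the article) simply lists the local RDMA devices and their reported transport type.

    /* Minimal sketch: enumerate RDMA devices and show their transport type. */
    #include <infiniband/verbs.h>
    #include <stdio.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs)
            return 1;

        for (int i = 0; i < num; i++) {
            /* RoCE adapters report the InfiniBand transport type; the
             * difference shows up at the link layer (Ethernet), which
             * ibv_query_port exposes per port. */
            const char *transport =
                devs[i]->transport_type == IBV_TRANSPORT_IWARP ? "iWARP" :
                devs[i]->transport_type == IBV_TRANSPORT_IB    ? "InfiniBand or RoCE" :
                                                                 "unknown";
            printf("%-16s %s\n", ibv_get_device_name(devs[i]), transport);
        }

        ibv_free_device_list(devs);
        return 0;
    }

Because the software stack is largely common, the competition between RoCE and iWARP is mostly about the underlying Ethernet plumbing, not about rewriting applications.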

Intel also entered the iWARP space with the purchase of NetEffect, but its latest "Fortville" NIC chips appear to have dropped iWARP support, leaving Chelsio as the sole supplier. RoCE, meanwhile, has gained Emulex as a second source, other vendors appear to be joining the bandwagon, and OEMs including Dell are adopting it.

From an industry standards viewpoint, recognized organizations are taking the design out of individual vendors' hands in both cases: RoCE is backed by the InfiniBand Trade Association, while iWARP is standardized by the IETF.

It's fair to say that RDMA over Ethernet is still in the early adopter stage, but to understand its future one has to look at some trends in the wider industry.

First, we are seeing Ethernet overtake all of the other protocols in terms of performance and options. There are 100 GbE products already available, and 10 GbE, the low-cost workhorse of the data center, is poised to move to 25 GbE without extensive rewiring. Even though InfiniBand has also achieved 40 Gbps and 56 Gbps rates (and in fact beat Ethernet to those speeds), the cost of maintaining a parallel fabric alongside Ethernet is significant.

Second, we are shifting to an era of software-defined networking. This implies a convergence to an Ethernet solution, and likely would marginalize InfiniBand in many situations.

AppliedMicro is adding RoCE to its X-Gene2 multi-core ARM CPU chip, which will make the technology much more affordable and mainstream. X-Gene2 is currently sampling, so we'll see it in volume production in 2015.

Demand for higher performance and low latency is, as Microsoft's commitment to RDMA shows, increasing fast. High-end database architectures are moving to in-memory clustered models, which require a close federation of memories across nodes to maintain speed. Likewise, flash acceleration of a variety of solutions needs a low-latency, low-overhead way of moving data between server nodes.

Most storage arrays, especially the all-flash variety, are offering InfiniBand connection options. But there is pressure to limit connectivity options and standardize on fewer port types rather than a variety of connectors, which tends to add to the interest in Ethernet RDMA.

InfiniBand has loyal supporters, and it won't disappear overnight, but Ethernet's ability to deliver almost the same performance, plus the pull of network convergence, makes RoCE the likely long-term winner. Mellanox, which provides the industry with InfiniBand switches, could facilitate the transition with a RoCE-to-InfiniBand router.


Comments

iWARP

Hi Jim -- What do you think will happen to iWARP? It doesn't sound like much of a contender.

Re: iWARP

iWARP has its fans. Its advantage is that it runs on standard Ethernet, so it is quite a bit cheaper. RoCE may have an edge in latency, but that matters mostly to financial services types, and it isn't clear it's true... benchmarks are thin and partisan.

The battle isn't settled, but RoCE is coming from the IB side, which has a strong following.

Re: iWARP

iWARP was one of those technologies that was going to change the way we work online, and yet once again it has become one of those things that could have been so good; for some reason the uptake has not been what it could have been.

Re: iWARP

For 8 years, it was Mellanox against Chelsio. InfiniBand won the early rounds, and just when Ethernet RDMA via iWARP threatened to grow, Mellanox dropped RoCE on the world.

Re: iWARP

iWARP ran over InfiniBand from several vendors. Note that Intel purchased QLogic's InfiniBand division and is another source for IB adapters and switches. Doesn't converged Ethernet require data center bridging (DCB), and thus additional expense, at least for RoCEv1? RoCEv2 uses UDP and perhaps doesn't require this.

I think that to really win, RoCE needs to do well in the high-performance computing market. That is where InfiniBand has shone.

Re: iWARP

RoCEv1 does require a converged switch solution. RoCEv2 gets around the routing problems, but still needs the Pause operation and buffered switches, so it's still expensive. iWARP doesn't need that.

Re: iWARP

Intel remains heavily committed to iWARP. We will integrate iWARP into future server chipsets and SoCs. We went public with this commitment in September at the Intel Developer Forum 2014. For an overview of what we did at IDF14, see the Intel website. Unfortunately, I'm prevented from posting the URL. However, it is easy to find. Search for "Intel Communities." From this page, search for "Wired Ethernet Community." From there, search for "With support from Microsoft, Intel demonstrates its commitment to iWARP technology at the Intel Development Forum."

Here's a short summary of our reasons:

RoCE requires setting up DCB traffic classes across the L2 subnet.  This is not a common datacenter practice.  It is exactly the kind of tedious network configuration task that Software Defined Infrastructure envisions eliminating.  iWARP does not depend on DCB.

RoCE has no congestion management support within an L2 subnet. DCB does define a scheme called Quantized Congestion Notification; however, it is rarely deployed. The only way RoCE can work at scale in an L2 subnet is with substantial overprovisioning (=$). iWARP doesn't have this problem.

No congestion management protocol or even algorithm is defined in the RoCEv2 specification. The only way RoCEv2 can work at scale is with substantial overprovisioning at both L2 and L3 (=$).  In contrast, iWARP uses all the congestion management technologies embedded in TCP/IP, the world's most provably scalable network.

Routable RoCEv2 needs to be able to find the correct DCB traffic class for RDMA traffic in a destination L2 subnet.  Commercially available switches do not appear to be able to do that, implying SDN-programmed switches are a requirement for RoCEv2.  iWARP works fine with the existing infrastructure.

In his nice overview, Jim O'Reilly correctly points out that Chelsio Communications is currently shipping iWARP adapters, as it has for years. But he missed that QLogic recently announced that its latest Ethernet adapter, now sampling, also supports iWARP.

Intel's long-term bet remains on iWARP as the better technology for delivering RDMA capabilities to the Ethernet world.

David Fair, Intel Networking Division

Re: Will RDMA Over Ethernet Eclipse Infiniband

Thanks for this overview, Jim. You really have a knack for making complicated technologies - and the complicated vendor politics that surround them - seem manageable and easy to understand. As you rightly point out, your level of need for these different options, and the greater speed and efficiency they deliver, is going to vary greatly depending on your business. Many people just want the quick and dirty version to make their decision without having to do too much research - but sometimes we need to take a look at the big picture to get a good idea of what the future is going to be like.

With all that considered, I'm inclined to agree with you that Ethernet will be the winner in the long run. People trust what they know; even if Ethernet is merely equivalent to InfiniBand (or even slightly behind) and no better, that's likely good enough to drive a lot of sales from people investing for the first time. These people will invest in Ethernet hedging on its inevitable future success, and it becomes a self-fulfilling prophecy: those sales will drive costs down, spur improvements, and push entrenched InfiniBand users to switch over. Thanks for laying it all out for us.

Re: Will RDMA Over Ethernet Eclipse Infiniband

I didn't address the impact of SDN on this issue, basically because SDN is still so new and ill-defined. My vision of SDN means that any product that doesn't comply with the near-bare-metal switch plus abstracted control plane model is in serious trouble, since it won't be possible to use automated orchestration or a whole set of virtualized services.

SDN fits the Ethernet model, but currently not InfiniBand. That's a potent argument over a five-year window. RoCE may run into similar problems, since converged Ethernet switches are much more expensive than standard merchant-silicon switches right now. iWARP doesn't have that problem.

Re: Will RDMA Over Ethernet Eclipse Infiniband

Very informative post.