06/12/2014 7:00 AM
Jim O'Reilly
Commentary

Switched Networks Vs. Dedicated Links

Direct links run much faster than traditional switched networks. Using software-defined networking with dedicated links can help in the quest for storage bandwidth.

The search for adequate bandwidth has us scrambling for ways to maximize box-to-box intercommunication. Just recently, EMC bought the startup DSSD, whose main claim to fame is a PCIe interface to the host, as a way to maximize speed and minimize latency for an all-flash array.

Line speed has increased from 1 Gigabit Ethernet to 10GbE and then 40GbE in maybe four years, but 40GbE is expensive. Hidden within all this improvement is a major issue: Ethernet is a collision system. It's designed to allow multiple senders to try to reach the same host, and when a collision occurs, the losers have to retry.

Many benchmarks show good, efficient 10GbE operation, but this is an artifact of traffic patterns with essentially synchronized workloads. Real-world operation can cap efficient throughput at less than 25% of nominal bandwidth.
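As a back-of-envelope illustration (the 25% figure is the article's estimate, not a measured constant), the gap between nominal and usable bandwidth works out as:

```python
# Rough effective-throughput estimate under heavy contention.
# The 25% efficiency cap is the article's estimate, not a measured constant.

def effective_gbps(nominal_gbps: float, efficiency: float = 0.25) -> float:
    """Usable bandwidth after contention losses."""
    return nominal_gbps * efficiency

# A nominal 10GbE link may deliver only ~2.5 Gbit/s under contended,
# unsynchronized workloads; a 40GbE link, ~10 Gbit/s.
print(effective_gbps(10))   # 2.5
print(effective_gbps(40))   # 10.0
```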

Ethernet is like a traffic system where, if you arrive at an exit and it is blocked, you have to go around, get in the queue again, and hope it isn't blocked when you get back to the exit. Fundamentally, it isn't efficient.

Can we do better? One solution is to move to "Converged" Ethernet. This approach adds buffers in the switches so that any collisions result in data being temporarily stored until the receiving port clears. Buffer sizes are limited, so there has to be a throttling mechanism to make Converged Ethernet work: the receiving end of the connection can pause transmissions for a time, allowing the traffic jam to clear.
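The pause-and-resume behavior can be sketched with a toy simulation. This is illustrative only; real Priority Flow Control operates per traffic class with hardware-tuned watermarks, and the buffer and watermark values below are assumed:

```python
# Toy model of Converged Ethernet pause-frame throttling (illustrative only;
# real PFC works per priority class, and these sizes are assumed values).

BUFFER_SIZE = 8       # frames the switch port can hold
HIGH_WATERMARK = 6    # send PAUSE at or above this occupancy
LOW_WATERMARK = 2     # allow sending again at or below this occupancy

def simulate(arrivals_per_tick, drains_per_tick, ticks):
    """Return peak buffer occupancy and frames held back by PAUSE."""
    buffered, paused = 0, False
    peak, held = 0, 0
    for _ in range(ticks):
        if paused:
            held += arrivals_per_tick        # sender waits instead of dropping
        else:
            buffered += arrivals_per_tick
        buffered = max(0, buffered - drains_per_tick)
        peak = max(peak, buffered)
        if buffered >= HIGH_WATERMARK:
            paused = True                    # receiver tells sender to stop
        elif buffered <= LOW_WATERMARK:
            paused = False                   # traffic jam cleared; resume
    return peak, held

# Offered load (3 frames/tick) exceeds drain rate (1 frame/tick), yet the
# pause mechanism keeps occupancy under the buffer limit: no frame is dropped.
peak, held = simulate(arrivals_per_tick=3, drains_per_tick=1, ticks=20)
assert peak <= BUFFER_SIZE
```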

Remote Direct Memory Access (RDMA) doesn't much affect line performance, whether Converged or not, since it functions mainly to reduce system CPU overhead in working the links. A variety of streaming protocols have helped a bit, especially those that require no receipt confirmation for blocks.

Blade systems offer an alternative to the collision problem: There are direct connections between the blades, and each node has one or more dedicated links to the other nodes in a cluster. This allows a very simple (UDP) protocol and full duplex operation without any collisions, bringing realized bandwidth close to theoretical limits.
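A dedicated link needs none of TCP's connection or retransmission machinery, which is why a simple UDP exchange suffices. Below is a minimal sketch; the loopback interface stands in for the point-to-point fabric link:

```python
import socket

# Minimal UDP exchange of the sort a dedicated blade-to-blade link permits:
# no connection setup, no retransmission machinery. Loopback stands in for
# the point-to-point fabric link between two blades.

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))            # OS picks a free port
addr = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"block-42", addr)           # fire-and-forget datagram

payload, _ = receiver.recvfrom(1500)       # one Ethernet-frame-sized read
print(payload)                             # b'block-42'

sender.close()
receiver.close()
```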

One downside of a blade star fabric system is that, when fully implemented, the number of links is roughly the square of the number of connections. This has generally limited its use to small clusters, such as the 12 blades of a single blade server. Moving outside the box requires some refinement.
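The scaling problem is easy to quantify: a full mesh of n nodes needs n(n-1)/2 links, which grows quadratically.

```python
# Link count for a full-mesh (star-fabric) cluster: every node pairs with
# every other node, so the wiring grows quadratically with cluster size.

def full_mesh_links(nodes: int) -> int:
    return nodes * (nodes - 1) // 2

print(full_mesh_links(12))   # 66   -- manageable inside a 12-blade chassis
print(full_mesh_links(48))   # 1128 -- why mesh wiring stops at small clusters
```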

EMC's DSSD acquisition addresses the out-of-the-box need, though clumsily. Dedicated PCIe links connect the array to the host servers, but PCIe suffers from being an internal protocol, and it's quite possible that older-generation link speeds will be the norm. The interconnects also are dedicated point-to-point links, since PCIe switching is in its infancy. Ethernet appears to be racing ahead of any other link mechanism, with 56GbE shipping and 100GbE in the design labs.

I would postulate that we already have the means to resolve the performance loss due to collisions. Standard Ethernet switches are smart enough that we can define two-node VLANs that essentially give us a direct connection via a switch. Having the switch allows a great deal of configuration flexibility, and the fabric can be software-defined.

We need a fast low-overhead protocol to take advantage of the high-quality dedicated connection. RoCE and iWARP are candidates, but RoCE implies a converged environment, while iWARP will run out of the box. There are protocols that don't require RDMA support, including Google's QUIC.

Because this is software-defined networking, we can build and tear down VLANs as needed to cope with load changes. Booting a new in-memory instance can get a lot of resources until it completes, then drop back to the number of connections required for normal operation.
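The build/tear-down lifecycle might look something like the sketch below. The `SdnController` class and its methods are hypothetical stand-ins for a real controller's API, not any actual product:

```python
import itertools

# Sketch of the build/tear-down lifecycle for two-node VLANs managed by an
# SDN controller. SdnController and its methods are hypothetical stand-ins,
# not a real controller API.

class SdnController:
    def __init__(self):
        self._next_id = itertools.count(100)   # VLAN IDs from an assumed pool
        self.vlans = {}                        # vlan_id -> (node_a, node_b)

    def create_link(self, node_a: str, node_b: str) -> int:
        """Define a two-node VLAN: a dedicated 'wire' through the switch."""
        vlan_id = next(self._next_id)
        self.vlans[vlan_id] = (node_a, node_b)
        return vlan_id

    def tear_down(self, vlan_id: int) -> None:
        """Release the VLAN once the burst (e.g. an instance boot) is done."""
        del self.vlans[vlan_id]

ctrl = SdnController()
# Boot burst: give the new in-memory instance several dedicated paths.
boot_links = [ctrl.create_link("storage-1", "server-7") for _ in range(4)]
# Steady state: drop back to a single dedicated connection.
for vlan_id in boot_links[1:]:
    ctrl.tear_down(vlan_id)
assert len(ctrl.vlans) == 1
```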

One downside of this approach is that the total number of connections increases, but in a real system, allowing the dedicated links to be configured on the fly by software permits enough flexibility to cope. Remember that this system supports ordinary shared Ethernet links, as well, though a protocol shift may be needed.

Using dedicated links means that servers will need more than the two Ethernet links typical of most designs. The tradeoff is that these servers won't need lots of storage interfaces, so the SAS/SATA port count can drop way down. I suspect six to eight 10GbE ports would be a typical buildout. The storage boxes would also need more channels.

Obviating collisions should allow a much faster storage connection, and running it via switches allows SDN configuration flexibility. How this all fits together needs a use-case analysis, but the impact on in-memory and VDI applications, in terms of efficient booting, could be dramatic.


Good observations about links

Interestingly, this is somewhat solved by some of the optical technologies. The ability to use L1 paths across infrastructure helps to get around some of the static nature of bandwidth. And if you combine dynamic pathing with an SDN controller, you can do things like load balancing, traffic engineering, and dynamic pathing to meet application requirements. 

I suspect that the real change that is happening (and has been for a while) is that traffic is far more east-west than north-south at this point. The interconnect is almost more important than the uplinks. We should expect to see this interconnect happen at the rack level before long (big optic pipes between racks of resources). In this world, the traditional architectures get disrupted, along with the ecosystem of suppliers around them.

Mike Bushong (@mbushong)


Re: Good observations about links

The change in infrastructure you describe would impact the design of storage appliances in a big way. They'll either need to be local in the racks, or have many really fast ports to move data to the inter-rack level.

Re: Good observations about links

I agree that would be the implication. I suspect we end up in a place with lots of compute and storage in a rack with high-speed interconnects within the rack and then fast pipes between racks. It's certainly what Intel would want.

Mike Bushong (@mbushong)


Re: Good observations about links

This may cause us to rethink the rack concept a bit. How about the double-rack (back-to-back) design Verari sold, or tying adjacent racks in a container together as if they were one entity? This would increase local cluster size to make room for networked storage.

Re: Good observations about links

So basically you look at something along the lines of "chaining" side-by-side racks, extending the backplane across each of them, then using SDN to manage/orchestrate it.

Re: Good observations about links

There are no sidewalls on these types of racks, so cabling can go around the front or through the cutouts on the frames. A cluster could be two, four, or even more racks in size, with just a single switch hop to connect. Direct server-server links could be added too, but server vendors need to face up to needing more than two Ethernet ports.

Re: Good observations about links

So, when talking on a FiOS home phone, is the conversation real-time or is it data? Not sure how to phrase this, but can a circuit-switched conversation be captured and stored as data, as opposed to recording it? I am thinking of that software that stores phone calls and then lets you search them for keywords; does it act the same for circuit-switched and packet-switched?

Re: Good observations about links

I assume you are thinking of speech to text conversion. It's possible to process any spoken words into text automatically, but the quality of translation for general speech from an occasional speaker is still not good.

However, for deaf people, there is a wealth of real-time speech to text software. Just Google "speech to text deaf"

Perhaps a job for statisticians?

I agree with Jim here; Ethernet is not the most efficient system, especially as traffic collisions increase. It seems as though a more scientific approach may be necessary. Such fancy terms as regression analysis, normally relegated to statisticians, may be needed in the search for a solution here.

Re: Perhaps a job for statisticians?

That's why Fibre Channel was introduced to carry SCSI commands. The lessons were learned from Ethernet.


Actually, Ethernet has pause frames, but once congestion occurs, all traffic is stopped regardless of its importance.

If you want to make this operation efficient, you can separate traffic into classes and treat them based on their importance. Thus Data Center Bridging (DCB) was invented. PFC (Priority Flow Control) gives us the ability to put traffic into classes and control the flows based on their priority.

Efficient queuing is also possible with ETS (Enhanced Transmission Selection), which can be thought of as a subcategory of DCB.

All these protocols help Ethernet become lossless, so storage traffic can be carried as SCSI over FC over Ethernet or IP networks.


These features can increase configuration complexity, and buffer management might also be a concern, but they are all design tools.


Re: Perhaps a job for statisticians?

Creating lossless "Converged" Ethernet is both expensive and difficult to do today. There are also limits to scale involved. Even loss-less operation involves delays, as buffers get filled up quickly. These delays can be substantial in a loaded system since the process is akin to turning off a tap then sending instructions to turn it on again.

Re: Perhaps a job for statisticians?

You stated ' Creating lossless "Converged" Ethernet is both expensive and difficult to do today. There are also limits to scale involved. Even loss-less operation involves delays, as buffers get filled up quickly. These delays can be substantial in a loaded system since the process is akin to turning off a tap then sending instructions to turn it on again.'


On the expensive and difficult part: as I also stated, configuration and management complexity might be an issue, and I also mentioned the buffer-related concerns. But these are all design tools, so if you want to use them, you should know the drawbacks. There are pros, and if your business and technical requirements match, there is no problem using them.

So nothing is the best; if one of them were the best, we would not need the others.

how about infiniband

InfiniBand supports reliable remote DMA with hardware segmentation/reassembly and hardware retries. It also has send/receive-style semantics and even an unreliable datagram mode. There are lots of protocols for network, storage, HPC, and even sockets, with support in multiple OSs. The cost per bps is much lower than 10 or 40 Gigabit Ethernet, especially when you factor in large switches.

Re: how about infiniband

When you say the cost is cheaper, you forget you need a specially-trained admin or two!

IB has its points, but even Mellanox is hedging toward RDMA over Ethernet (RoCE).

Markov chains post?

This post is so incoherent it might be generated by Markov chains.


It's as if the author took some information from the '80s, picked some recent technology buzzwords, and creatively combined them into a post that seemed to fit the context. Atrociously bad article.



Ethernet is almost invariably 97.5% efficient; no modern network experiences collisions, as all links are full-duplex, point-to-point links to a switch interface.

Efficiency problems date from the '80s and '90s, when hubs were in use.



The author tries to talk about 'microbursts' in 'converged' Ethernet, but does not understand the issue. This is a fundamental problem with any technology: if offered ingress capacity exceeds egress capacity, someone has to buffer, and if the issue persists, the buffers run out.



RDMA isn't a separate thing; this is normal DMA, where data is written directly to NIC memory instead of system memory. It is internal to a computer and has nothing to do with the network.



Blade systems solve nothing; blades are interconnected by Ethernet switches, like everything else.



There is no 56G Ethernet. 100GE has been shipping for years, is in production in most large networks, and has become cheaper than 10x10GE. 400GE is the next standard, actively being worked on.



QUIC has nothing to do with this. QUIC is an L4 protocol whose main benefit, compared to TCP, is the ability to multiplex many streams inside a single QUIC session without the streams creating head-of-line blocking (HOLB) for each other when one stream stalls (i.e., the problem Google has when SPDY multiplexes many sessions in a single TCP connection).

Another benefit QUIC delivers is the ability to roam between IP addresses: since the session is not bound to a particular IP but is cryptographically authenticated, the receiver can accept the next packet from any IP.

It offers FEC through the ability to send redundant parity packets: if packet loss is 1%, it can send 101 packets for every 100 data packets, and the receiver can reconstruct any dropped packet with no resending needed, offering much better potential capacity than TCP in lossy conditions.
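The parity idea can be illustrated with simple XOR coding (early QUIC experiments used an XOR of the packets in a group; the framing here is simplified and assumes equal-length packets):

```python
from functools import reduce

# Simplified XOR-parity FEC of the kind early QUIC experimented with:
# one parity packet per group lets the receiver rebuild any single loss.
# Assumes equal-length packets; real framing is more involved.

def xor_parity(packets):
    """XOR all equal-length packets together into one parity packet."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), packets)

group = [b"pkt1", b"pkt2", b"pkt3"]
parity = xor_parity(group)

# Suppose the second packet is lost in transit: XORing the survivors with
# the parity packet reconstructs it without any retransmission.
recovered = xor_parity([group[0], group[2], parity])
assert recovered == group[1]
```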

It has a 0-RTT penalty for establishing a session (the first packet is a payload packet), apart from the first session ever, where crypto keys need to be exchanged.

It has no packet amplification potential.

It's a great protocol, and we need a new L4 protocol in our toolbox that takes lessons from QUIC and MinimaLT, but it has nothing to do with SDN or Ethernet performance or anything in the article.




Re: Markov chains post?

@huittinen massive, Thank you for your comments. Perhaps a little clarification might help you:

1) You are right about the point-to-point nature of Ethernet. Vampire taps went out of fashion a long time ago! Each link is quite efficient.

However, switched Ethernet is NOT an end-point to end-point connection. When two sources both address the same target, it's a bit like airplanes trying to land on the same runway: one has to go around again. This occurs in the switch, and it reduces efficiency considerably in most cases.

Converged Ethernet buffers some number of messages, so the sending port doesn't need to go around. It also has a Pause operation to prevent the buffers from overfilling.

2) See 1)

3) You are incorrect in stating that RDMA writes to NIC memory. It writes to application space, bypassing buffering in and copying from kernel space. It also bypasses the complexity of protocol stacks.

4) Not all blade systems are equal! ATCA blades use a point-to-point fabric rather than switches for example.

5) 56GbE is not a standard, but Mellanox has a product based on their IB technology.

6) Performance is usually defined as doing the job as fast as possible. If QUIC has all the benefits you list, it probably speeds up Ethernet!