VMware's SDN Dilemma: VXLAN or Nicira?

VMware has invested in two overlay network approaches: the VXLAN standard, originally conceived by Cisco, and STT, drafted by SDN startup Nicira. VMware acquired Nicira for more than a billion dollars. Which will VMware choose? Here’s my take.

Greg Ferro

February 1, 2013

VMware has a technology problem: It's backing two competing standards for overlay networks, Nicira's STT and the IETF draft standard VXLAN. An overlay network enables network virtualization, which is a core component of VMware's software-defined data center initiative. Both STT and VXLAN have upsides and downsides. I'll look at each protocol and speculate on which direction VMware may go.

First, a little background. Before being acquired by VMware, Nicira developed the Stateless Transport Tunneling (STT) protocol for tunneling between open source software switches in the Openvswitch project.

VXLAN, which is now an IETF draft standard, was originally proposed by Cisco. Cisco sources say that the company then got VMware involved (although the IETF draft has a lot of names on it). The end result is VMware is telling everyone that it has this great VXLAN overlay network technology that removes any hypervisor dependency from physical network devices. Even better, it's configured and managed from vCenter.

The question is, which protocol will win?

Nicira and STT

Prior to acquisition, Nicira had a software controller for managing tunnels between virtual switches, and used OpenFlow-like commands to configure the vSwitch. STT is a tunneling protocol that connects the virtual switches, thus forming a virtual network.

STT performs this task well enough. It encapsulates frames behind a TCP-like header, without keeping TCP's connection state (hence "stateless"). Supposedly, this lets operating systems use the TCP segmentation offload function of modern network adapters for better performance.

However, STT also has several limitations. One problem is that the limited entropy in the STT header means it doesn't balance loads evenly over Ethernet port bundles in network backbones. Depending on your network design, this could be a significant limitation.
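To make the load-balancing point concrete, here is a minimal sketch of hash-based link selection on a port bundle. It is not how any particular switch works: the CRC32 hash, the addresses, and the port numbers are all assumptions chosen for illustration. It only shows that when the fields a switch hashes on barely vary between tunnelled flows, those flows pile onto one member link.

```python
import zlib
from collections import Counter


def pick_link(header_fields, n_links):
    """Choose a member link by hashing the fields a switch can see.

    CRC32 is a stand-in here; real switches use vendor-specific hashes.
    """
    key = "|".join(str(f) for f in header_fields).encode()
    return zlib.crc32(key) % n_links


def link_usage(flows, n_links):
    """Count how many flows land on each member link of the bundle."""
    return Counter(pick_link(f, n_links) for f in flows)


# Hypothetical tunnel endpoints and port numbers, for illustration only.
SRC, DST, DST_PORT = "10.0.0.1", "10.0.0.2", 4000

# Low entropy: every tunnelled flow presents the same outer fields.
low_entropy = [(SRC, DST, 5000, DST_PORT)] * 1000

# Higher entropy: the outer source port varies with the inner flow.
high_entropy = [(SRC, DST, 5000 + i, DST_PORT) for i in range(1000)]

print("low entropy :", dict(link_usage(low_entropy, 4)))
print("high entropy:", dict(link_usage(high_entropy, 4)))
```

In the low-entropy case all 1,000 flows hash to a single link of the four-link bundle, while the varied source port spreads them out.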

Second, STT currently works only with the Openvswitch software switch on Linux hypervisors such as Xen or KVM. That's not necessarily a problem for cloud providers and very large organizations; for instance, eBay is using Nicira in its OpenStack deployment. However, VMware is more common in enterprise data centers. It's possible VMware could add STT to the ESXi vSwitch, and thus deliver a multicloud network overlay strategy, but the VXLAN protocol already has a lot of momentum.

VXLAN's Multicast Issues

VXLAN depends heavily on a multicast-enabled underlay network to handle broadcast, unknown unicast and multicast Ethernet frames. (I use the term "underlay network" to describe the physical devices that pass Ethernet frames and IP packets.) What's not well understood is that IP multicast is complex and risky to operate.

Each VXLAN-enabled device is known as a VXLAN Tunnel End Point (VTEP). When a VTEP is configured with VXLANs, it is also configured to join an IP multicast group. Joining the multicast tree is how VTEPs discover the MAC address of each host in the VXLAN, in a self-configuring and autonomous way. Direct server-to-server data flows are transported through the VXLAN overlay in unicast packets.
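The learning behaviour described above can be sketched as a small table lookup. This is a simplified model, not VMware's or Cisco's implementation: the `Vtep` class, its method names, and the group addresses are hypothetical. A VTEP remembers which remote VTEP a source MAC arrived from, sends to that VTEP by unicast once the destination is known, and falls back to the VXLAN's multicast group when it is not.

```python
from dataclasses import dataclass, field


@dataclass
class Vtep:
    """Toy flood-and-learn VTEP (hypothetical model for illustration)."""
    ip: str
    # (VXLAN ID, inner MAC) -> IP address of the remote VTEP
    mac_table: dict = field(default_factory=dict)

    def learn(self, vni, inner_src_mac, outer_src_ip):
        """Record which remote VTEP a host's MAC was seen behind."""
        self.mac_table[(vni, inner_src_mac)] = outer_src_ip

    def next_hop(self, vni, inner_dst_mac, group_for_vni):
        """Unicast to a known VTEP, else flood to the VXLAN's group."""
        remote = self.mac_table.get((vni, inner_dst_mac))
        if remote is not None:
            return ("unicast", remote)
        return ("multicast", group_for_vni[vni])


# Assumed mapping of VXLAN IDs to multicast groups (made-up addresses).
groups = {5001: "239.1.1.1", 5002: "239.1.1.2"}

vtep = Vtep(ip="10.0.0.1")
before = vtep.next_hop(5001, "00:50:56:aa:bb:cc", groups)

# A frame from that host arrives via the remote VTEP at 10.0.0.2.
vtep.learn(5001, "00:50:56:aa:bb:cc", "10.0.0.2")
after = vtep.next_hop(5001, "00:50:56:aa:bb:cc", groups)

print(before)  # destination unknown, so the frame goes to the group
print(after)   # destination learned, so the frame goes by unicast
```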

IP multicast also provides an efficient way to broadcast Ethernet frames to all servers as is required--for example, for unknown MAC address flooding and IP ARP Requests.

VMware recommends a separate multicast group for each VXLAN; thus, 50 VXLANs would require 50 separate multicast trees in an attempt to control L2 Ethernet flooding problems. L2 loops remain a problem in VXLAN networks, but the failure domain is reduced to an individual VXLAN itself. The problem is that each of those multicast trees requires state to be held in the network layer, which consumes CPU, memory and TCAM space. TCAM size is a serious limitation on network diameter, and an overloaded TCAM is a serious network threat.
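A rough back-of-the-envelope sketch shows how that state can grow. It assumes one multicast group per VXLAN and a router that sits on every tree, and it counts one shared-tree (*,G) entry per group plus, when source trees are built, one (S,G) entry per sending VTEP per group. The function name and the figure of 20 VTEPs are assumptions for illustration; real numbers depend on the protocol, the topology and the platform.

```python
def multicast_state_entries(n_vxlans, n_vteps, source_trees=True):
    """Rough upper bound on multicast route entries on one router.

    Assumes one group per VXLAN and that every VTEP can be a source
    in every group. With source_trees=False only shared-tree (*,G)
    state is counted, as with a BiDir-style design.
    """
    star_g = n_vxlans                                  # one (*,G) per group
    s_g = n_vxlans * n_vteps if source_trees else 0    # one (S,G) per source
    return star_g + s_g


# 50 VXLANs as in the text; 20 VTEPs is a made-up figure.
print(multicast_state_entries(50, 20))                      # 1050
print(multicast_state_entries(50, 20, source_trees=False))  # 50
```

Under these assumptions the entry count scales with the product of VXLANs and VTEPs, which is why per-VXLAN groups put pressure on table space.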

A lesser performance problem is the frame replication silicon in the switches. At its core, multicast is a method for duplicating Ethernet frames inside the hardware of your network. One multicast frame must be sent out of every Ethernet port that needs to receive it. On a data center core switch, this could mean replicating one received frame to 300 ports (thus, 1 Gbps of inbound multicast packets results in 300 Gbps output). Network switches require dedicated silicon to handle the duplication process. For example, this is an approximation of silicon pathways inside a single M1-series line card from a Nexus 7000 showing the replication engines on the blade:

Internal Architecture of Single Line Card Nexus 7000. Source: Cisco Systems

There are a number of IP multicast routing protocols that maintain the multicast trees, including PIM-SM, PIM-DM, BiDir and ASM multicast. In general terms, PIM-SM will be the default choice because it's got the widest vendor support, but that isn't saying much. Most data center switches do not support multicast protocols today. This can make VXLAN hard to deploy in existing networks and usually requires new network hardware.

This isn't necessarily a bad thing. Companies are willing to replace servers every three years. Replacing data center switching every three years is uncommon today, but it may be more likely in the years ahead, particularly given the changes washing over data center switching, including SDN and 40 GbE.

You could make a sweeping statement that configuration of multicast trees is an "unknown known" for most network engineers and be perfectly correct. In practice, very few companies have a practical use for multicast. (Notable exceptions exist in niche areas of the financial trading market.)

To complicate matters, securing and operationally "de-risking" multicast is complex and expensive. Check out this presentation for more information.

Even worse, when a multicast protocol has problems, the entire VXLAN overlay can fail as a single failure domain. Fate sharing among multiple services is acceptable for enterprise networks and application developers, but it is not viable in hyper-scale cloud networks where hundreds of customers and services share the network. (Note that while this article was being published, Cisco announced enhancements to its Nexus 1000V virtual switch to remove the need for IP multicast in the network.)

VXLAN to Win--But Not As You Know It

So is VMware's network virtualization future based on STT or VXLAN? My guess is neither--instead, a new VXLAN will arise.

The case against STT is a lack of standards and market adoption. Ultimately, user data must leave the overlay network and reach the external world, and this means hardware support for tunnel termination. Network vendors already support VTEPs for VXLAN, and it's hard to imagine that STT support is worth their while. It's certainly possible for VMware to force a standard onto the market, but I don't think they have the appetite to upset major networking vendors, especially Cisco.

If not STT, then it must be VXLAN. I predict that VXLAN will be extended to support an SDN network controller design, and its dependence on multicast will be reduced or removed completely. An SDN controller that manages host and network configuration can replace the requirement for most frame flooding because the controller knows the MAC address and IP address of every device. Thus, ARP requests could be handled in the local vSwitch, and unknown unicasts are not required because there are no unknown addresses.
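Here is a minimal sketch of that idea, assuming a controller that is told about every host when it is provisioned. The `Controller` and `LocalVSwitch` classes and all addresses are hypothetical and do not represent any VMware or Nicira API. The point is only that ARP replies and forwarding decisions can come from a lookup rather than from flooding.

```python
class Controller:
    """Toy controller that knows every host (hypothetical model)."""

    def __init__(self):
        self.by_ip = {}   # (VXLAN ID, host IP) -> host MAC
        self.by_mac = {}  # (VXLAN ID, host MAC) -> IP of the host's VTEP

    def register(self, vni, host_ip, host_mac, vtep_ip):
        """Called when a host is provisioned, so nothing is ever unknown."""
        self.by_ip[(vni, host_ip)] = host_mac
        self.by_mac[(vni, host_mac)] = vtep_ip


class LocalVSwitch:
    """Toy vSwitch that answers from controller state instead of flooding."""

    def __init__(self, controller):
        self.controller = controller

    def answer_arp(self, vni, target_ip):
        """Reply to an ARP request locally; None means no such host."""
        return self.controller.by_ip.get((vni, target_ip))

    def tunnel_endpoint(self, vni, dst_mac):
        """Look up the remote VTEP for a destination MAC."""
        return self.controller.by_mac.get((vni, dst_mac))


controller = Controller()
controller.register(5001, "192.168.1.10", "00:50:56:aa:bb:cc", "10.0.0.2")

vswitch = LocalVSwitch(controller)
print(vswitch.answer_arp(5001, "192.168.1.10"))
print(vswitch.tunnel_endpoint(5001, "00:50:56:aa:bb:cc"))
```

In this model neither call sends anything to a multicast group, which is the property that would let the underlay drop its multicast dependency.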

Consider that VMware's vSwitch and vSphere Distributed Switch (vDS) are actually part of a controller network--vCenter is a "controller" that knows all of the hosts and their MAC addresses, and configures all of the vSwitches across the network.

It wouldn't take much to add a vmknic for VXLAN interfaces to the vSwitch code, and then set up some configuration in the controller to configure all the endpoints in the ESXi vSwitch. That's what Nicira was doing in Openvswitch and that, I think, is why VMware bought Nicira.

About the Author(s)

Greg Ferro

Network Architect & Blogger

Greg has nearly 30 years of experience as an IT infrastructure engineer and has been focused on data networking for about 20, including 12 years as Cisco CCIE. He has worked in Asia and Europe as a network engineer and architect for a wide range of large and small firms in many verticals. He has been writing about networking for more than 20 years and in the media since 2001.

You can email Greg or follow him on Twitter as @etherealmind. He also writes the technical blog Etherealmind.com and hosts a weekly podcast on data networking at Packet Pushers.
