Obviously Google operates a massive network. It's so large that a 2010 study by Arbor Networks concluded, "If Google were an ISP, it would be the fastest growing and third largest global carrier. Only two other providers (both of whom carry significant volumes of Google transit) contribute more inter-domain traffic."
At the Open Networking Summit, Google Distinguished Engineer Amin Vahdat presented "SDN@Google: Why and How." The talk shared some details about how Google uses a combination of the open source Quagga routing software and OpenFlow to optimize its data center interconnects. He also shared details about Google's use of OpenFlow within its own data centers. Google calls its SDN network "B4."
Vahdat noted that the growth of Google's back-end (east-west) network is quickly surpassing its front-end user-facing network. This growth is expensive because the network doesn't scale economically like storage and compute do. The operating expense of compute and storage becomes cheaper per unit as scale increases, but this is not the case with the network.
Vahdat laid out Google's rationale for software-defined networking. First, by separating hardware from software, the company can choose hardware based on required features while being able to innovate and deploy on software timelines. Second, it provides logically centralized control that will be more deterministic, more efficient and more fault-tolerant. Third, automation allows Google to separate monitoring, management and operation from individual boxes. All of these elements provide flexibility and an environment for innovation.
At the start of the project, Google built its own switches (see image) using merchant silicon. Google built its own hardware because there wasn't any hardware on the market to fulfill its needs at the time it kicked off the project.
Vahdat didn't mention whether any of the SDN-enabled switches now on the market, which typically involve a vendor firmware upgrade incorporating an OpenFlow agent, would satisfy Google's hardware needs.
However, Vahdat noted that while OpenFlow was not, and still is not, a perfect protocol, Google will continue to use it for flow instantiation because it will be supported by a variety of vendors. I think this implies Google isn't interested in continuing to roll its own hardware, and that OpenFlow support will be a requirement for future hardware purchases.
Vahdat then went on to discuss Google's SDN migration path, which moved in stages from a fully distributed monolithic control and data plane hardware architecture to a physically decentralized (though logically centralized) control plane architecture.
The hybrid migration for the Google B4 network proceeded in three general stages:
In the next image, you can see that Google has deployed sets of Network Controller Servers (NCSs) alongside the switches. The NCSs contain the extracted control plane for some number of network elements. The switches run an OpenFlow agent with a "thin level of control with all of the real smarts running on a set of controllers on an external server but still co-located," said Vahdat. The NCSs are 32-core servers.
On top of the NCSs are OpenFlow controllers running leader election for high-availability failover. The primary application Vahdat discussed was a traffic engineering application that instantiates policy into control protocols, including BGP, IS-IS and OpenFlow.
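To make the failover idea concrete, here is a minimal Python sketch of leader election among a set of controllers. The talk as summarized here does not say which election algorithm Google uses, so the rule below (the lowest-named controller with a recent heartbeat wins) is purely an assumption for illustration, as are the `ControllerCluster` class, the controller names and the timeout value.

```python
from typing import Dict, Optional


class ControllerCluster:
    """Toy leader election: lowest ID among controllers with a fresh heartbeat.

    This is an illustrative assumption, not the mechanism Google described.
    """

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_heartbeat: Dict[str, float] = {}

    def heartbeat(self, controller_id: str, now: float) -> None:
        # Record the most recent time this controller reported in.
        self.last_heartbeat[controller_id] = now

    def leader(self, now: float) -> Optional[str]:
        # A controller counts as alive if it reported within the timeout.
        alive = [
            cid
            for cid, seen in self.last_heartbeat.items()
            if now - seen <= self.timeout
        ]
        return min(alive) if alive else None


cluster = ControllerCluster(timeout=3.0)
cluster.heartbeat("ncs-a", now=0.0)
cluster.heartbeat("ncs-b", now=0.0)
first_leader = cluster.leader(now=1.0)

# ncs-a goes silent; only ncs-b keeps sending heartbeats.
cluster.heartbeat("ncs-b", now=4.0)
second_leader = cluster.leader(now=5.0)
```

Time is passed in explicitly rather than read from a clock, which keeps the sketch deterministic. A production system would also need to guard against split-brain, which this toy rule does not attempt.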
I was interested to hear about Google's hybrid implementation strategy, as I am involved with a 1,000-port production implementation in the enterprise.
It seems to me that Google is fairly limited in how it can react to the loss of network state and mechanically continue forwarding packets. Presumably the network falls back to low-priority, proactively instantiated flow rules. A packet that did not match a flow rule in the flow tables (a table-miss event) could be drained into either a normal forwarding pipeline or a set of pre-installed, catch-all flow rules to reach an egress gateway. Custom forwarding would fall back to a shortest path or a static path, but traffic would keep flowing until the control elements recovered.
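The fallback I'm speculating about can be sketched in a few lines of Python. This is a toy model under my own assumptions, not Google's implementation: the `FlowRule` and `FlowTable` classes, the `dst_site` match field and the action strings are all hypothetical. The catch-all rule is modeled on the OpenFlow convention of a priority-0 entry that wildcards every field.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FlowRule:
    priority: int           # higher value wins, as in OpenFlow
    match: Dict[str, str]   # an empty dict matches every packet (catch-all)
    action: str

    def matches(self, packet: Dict[str, str]) -> bool:
        return all(packet.get(k) == v for k, v in self.match.items())


class FlowTable:
    def __init__(self) -> None:
        self.rules: List[FlowRule] = []

    def install(self, rule: FlowRule) -> None:
        self.rules.append(rule)
        # Keep the highest priority first so lookup returns the best match.
        self.rules.sort(key=lambda r: r.priority, reverse=True)

    def expire_above(self, priority: int) -> None:
        # Hypothetical stand-in for controller-installed state being lost.
        self.rules = [r for r in self.rules if r.priority <= priority]

    def lookup(self, packet: Dict[str, str]) -> str:
        for rule in self.rules:
            if rule.matches(packet):
                return rule.action
        return "drop"  # table miss with no catch-all installed


table = FlowTable()
# Proactively installed, low-priority catch-all toward an egress gateway.
table.install(FlowRule(0, {}, "output:egress_gateway"))
# Traffic-engineered rule pushed by the controller.
table.install(FlowRule(100, {"dst_site": "site-b"}, "output:te_tunnel_1"))

packet = {"dst_site": "site-b"}
before = table.lookup(packet)   # uses the traffic-engineered path
table.expire_above(0)           # simulate losing the controller's rules
after = table.lookup(packet)    # falls back to the catch-all path
```

The point of the sketch is that forwarding never depends on the controller being reachable: the worst case is a less optimal path, not a black hole.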
The diagrams imply control elements arranged in a multilayer hierarchy. Hierarchy and modularity are how we scale a large network today. By chunking portions of the NIB (network information base) into modules, Google's approach resembles today's network architectures minus a dedicated control plane per data plane. These modules can then distribute the hashing tables and NIB throughout all the modules and provide global views at the aggregations.
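One way to picture that kind of modular NIB is hash-based sharding with an aggregated view on top. The Python sketch below is only my reading of the diagrams, not something Vahdat described: the `ShardedNib` class, the `shard_for` helper, the element names and the state fields are all hypothetical.

```python
import hashlib
from typing import Dict, List, Optional


def shard_for(element_id: str, num_modules: int) -> int:
    # Use a stable hash; Python's built-in hash() varies between runs for str.
    digest = hashlib.sha256(element_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_modules


class ShardedNib:
    """Toy NIB split across control modules, keyed by network element ID."""

    def __init__(self, num_modules: int) -> None:
        self.modules: List[Dict[str, dict]] = [{} for _ in range(num_modules)]

    def put(self, element_id: str, state: dict) -> None:
        self.modules[shard_for(element_id, len(self.modules))][element_id] = state

    def get(self, element_id: str) -> Optional[dict]:
        return self.modules[shard_for(element_id, len(self.modules))].get(element_id)

    def global_view(self) -> Dict[str, dict]:
        # The aggregation layer merges every module's slice of the NIB.
        view: Dict[str, dict] = {}
        for module in self.modules:
            view.update(module)
        return view


nib = ShardedNib(num_modules=4)
for i in range(10):
    nib.put(f"switch-{i}", {"ports_up": 32 - i})

view = nib.global_view()
```

Each module holds only its own slice, while `global_view` plays the role of the aggregation layer. Simple modulo hashing reshuffles most keys when the module count changes, so a real design would likely want something like consistent hashing instead.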
Unfortunately, Vahdat was short on time and did not go into detail about state distribution and redistribution between the control protocols. If you'd like to see the talk for yourself, it's available on the Open Networking Summit website. It's free to watch, but registration is required to view the recorded sessions.
Brent Salisbury, CCIE#11972, is a network architect at a state university. You can follow him on Twitter at @networkstatic.