Networking

07:00 AM

White-Box Switches: Are You Ready?

White-box switching promises flexibility and lower costs, so should you make the leap? It depends.

White-box switching is an idea that is starting to gain more traction in today's networking environments. Network administrators are beginning to see the value represented by decoupled hardware and software. But is it time for your network to make the change? Let's take a closer look at this technology trend and whether it can help your network.

White-box 101
White-box switching isn't a new idea. Original design manufacturers (ODMs) have been building hardware for well-known vendors for many years. These vendors take the ODM hardware, install their operating system, and sell the unit as a bundle, often attaching a support contract.

What's new about white-box switches is that the ODM will now sell the hardware directly to the customer without an operating system. Manufacturers such as Quanta Cloud Technology (QCT) and Accton offer a range of data center switches. Because they ship with no operating system installed, these can be purchased at a discount compared to similar switches from traditional networking vendors.

The second component of a white-box switch comes from vendors like Cumulus Networks and Big Switch Networks. These vendors offer operating systems that can run on a variety of hardware switching platforms, which allows you to install your own software on hardware that may come from different suppliers.

Purchasing the hardware and software independently of each other offers several advantages. The acquisition cost is generally lower than with a traditional vendor, and the platform is more flexible. Cumulus and Big Switch base their OSs on Linux, which gives programmers and developers the ability to customize the platform to their needs.

Making the switch
Are you ready to install white-box switches in your network? The answer to that question depends on what kind of networking needs you have.

Application-focused networking companies will quickly find the flexibility of white-box switches compelling. The ability to heavily customize the operating system to provide high performance is very important for some lines of business, such as financial trading. Developers can customize the system to limit unneeded processes and concentrate the processing power of the switch on the important features. This leads to a lean, custom switch platform that provides peak performance for a narrow range of uses.

Customers with unusual support needs also benefit from white-box switches. Because software and hardware are separated, customers can obtain different support levels for each. The lower acquisition cost of the hardware allows spare units to be held at the ready for quick replacement. Having a software platform that is independent of the hardware also allows support engineers to debug it easily and provide relevant output to the networking team.

Finally, organizations with very strict monitoring and availability requirements may benefit greatly from the customizable aspects of white-box switching. The most recent example of this customization is the Facebook Wedge open switching platform. One of the biggest reasons why Facebook chose to develop Wedge was to have the ability to integrate the monitoring of the platform into the existing system monitoring suite.

This isn’t something that can easily be accomplished with a traditional vendor product. With a white-box switch, the OS can be reconfigured to support an existing availability monitoring suite.
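As a rough illustration of what that integration can look like, the Python sketch below parses interface counters in the format Linux exposes at /proc/net/dev and flags ports reporting drops, the kind of check an existing server monitoring agent might already run. It assumes the switch OS presents its front-panel ports as ordinary Linux network interfaces; the port names (swp1, swp2) and the sample counters are made up for the example, and whether these counters reflect hardware-forwarded traffic depends on the switch OS.

```python
def parse_proc_net_dev(text):
    """Parse /proc/net/dev-style text into {interface: counters}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(value) for value in data.split()]
        counters[name.strip()] = {
            "rx_bytes": fields[0],
            "rx_packets": fields[1],
            "rx_errors": fields[2],
            "rx_dropped": fields[3],
            "tx_bytes": fields[8],
            "tx_packets": fields[9],
            "tx_errors": fields[10],
            "tx_dropped": fields[11],
        }
    return counters


def ports_with_drops(counters):
    """Return the sorted names of interfaces reporting any dropped packets."""
    return sorted(
        name for name, c in counters.items()
        if c["rx_dropped"] or c["tx_dropped"]
    )


# Hypothetical sample; on a real Linux system, read the text from /proc/net/dev.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  swp1: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0
  swp2: 5000 50 0 3 0 0 0 0 4000 40 0 0 0 0 0 0
"""

counters = parse_proc_net_dev(SAMPLE)
flagged = ports_with_drops(counters)
```

The point is less the parsing than the fact that nothing here is switch-specific: the same script could run unchanged on a server.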

Or not
Benefits aside, white-box switching isn’t a great fit for all networks. Organizations with extensive training on a specific vendor’s platform won’t see huge benefits. Customers that feel comfortable having a comprehensive support contract may feel better with the “one throat to choke” model that traditional vendors offer. It should be noted that Cumulus Networks does offer a similar support model for hardware on its hardware compatibility list.

Small- and medium-size networks won’t see the same advantages that large and hyper-scale organizations get from making the move to white-box switching. If you are buying hundreds of thousands of dollars’ worth of networking gear, saving 25% to 50% justifies any issues you might have retraining on a different user interface. If you are only purchasing three or four units, the cost savings won’t offset the learning curve of deploying new equipment.
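To make that trade-off concrete, here is a back-of-the-envelope sketch in Python. The unit price, discount, and retraining cost are hypothetical figures chosen only for illustration; they are not quotes from any vendor.

```python
def net_savings(units, unit_price, discount, retraining_cost):
    """Hardware savings minus a one-time retraining cost (all inputs hypothetical)."""
    return units * unit_price * discount - retraining_cost


def break_even_units(unit_price, discount, retraining_cost):
    """Smallest whole number of switches at which savings cover retraining."""
    per_unit_saving = unit_price * discount
    units = 1
    while units * per_unit_saving < retraining_cost:
        units += 1
    return units


# Assumed figures: $10,000 per traditional switch, a 25% discount,
# and $20,000 of engineer time to learn the new platform.
small = net_savings(units=4, unit_price=10_000, discount=0.25, retraining_cost=20_000)
large = net_savings(units=50, unit_price=10_000, discount=0.25, retraining_cost=20_000)
threshold = break_even_units(unit_price=10_000, discount=0.25, retraining_cost=20_000)
```

Under these assumptions a four-switch purchase ends up $10,000 behind, a 50-switch purchase comes out $105,000 ahead, and the break-even point is eight switches. Different prices or training costs will move that threshold, but the shape of the argument stays the same.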

While white-box software vendors have taken steps to reduce the time it takes to install their equipment, an engineer trained on traditional vendor equipment will still require time to install and configure an ODM switch with Cumulus or Big Switch software. For the SME looking to deploy white-box switches, a phased approach or pilot lab installation would be more practical today.

However, many networking shops that are looking forward to their next purchasing cycle in 12 to 18 months would do well to investigate white-box switch options along with quotes from traditional networking vendors.

While the advantages of current white-box platforms may not be enough to tip the scales of your purchasing decision in favor of the technology, investigating them may open your eyes to the possibilities that are available to you today and give you a roadmap to your next purchasing decision.

Tom Hollingsworth, CCIE #29213, is a former VAR network engineer with 10 years of experience working with primary education and the problems they face implementing technology solutions. He has worked with wireless, storage, and server virtualization in addition to routing and ...
Comments
Pablo Valerio,
User Rank: Author
7/28/2014 | 8:59:56 AM
Depends on IT resources
I can see why White-Box products could be attractive to large organizations who have the staff to properly install and support the switches. 

But for SMEs, lack of professional support could come at a high price, and they are better off buying from a brand-name vendor.
aditshar1,
User Rank: Ninja
7/28/2014 | 10:05:28 AM
Re: Depends on IT resources
I guess white box is still in an early stage of deployment, and the major challenge I see is product support, but its low cost and budget appeal are attracting a lot of people.
AbeG,
User Rank: Ninja
7/28/2014 | 7:24:30 PM
Re: Depends on IT resources
This somewhat reminds me of the DD-WRT software that you can use to flash your home router with.  A recent trend in the consumer market is to offer routers that embrace open source firmware like DD-WRT and some come with it pre-installed.
AbeG,
User Rank: Ninja
7/28/2014 | 7:22:29 PM
Re: Depends on IT resources
I agree with Pablo.  I can't imagine this sort of thing being attractive to large organizations.
Serrad,
User Rank: Strategist
7/28/2014 | 9:26:18 AM
Tom is talking about things he doesn't know about.
Tom, where are you coming up with "The ability to heavily customize the operating system to provide high performance is very important for some lines of business, such as financial trading"?  No HFT traders are customizing anything in the control plane.  The customization is happening at the hardware layer (read FPGA) if anything.  There is no "performance" gain to be had by programming the OS of the switch (aside from routing/switching convergence, maybe).  I think you need to go back and study more because your CCIE didn't teach you this.

 
NetworkingNerd,
User Rank: Strategist
7/28/2014 | 9:57:54 AM
Re: Tom is talking about things he doesn't know about.
Serrad,

You are correct that many HFT lines of business are customizing Field Programmable Gate Arrays (FPGA) to accelerate workloads.  This is a big part of the push of companies like Arista Networks.  However, saying that customizing the software load can't provide performance gains isn't correct either.

Removing unnecessary portions of software code can and does provide a performance boost to workloads.  Reducing the memory and CPU footprint of the OS leaves more power available for the data plane.  Because we aren't dealing with special-purpose CPUs here, every timeslice we give back to the system is one that can be used to push data that much faster.

You are right that HFT really wants to use FPGAs for acceleration.  But I think the narrow focus of FPGAs and the special knowledge required to program them for their purpose won't translate well into the wider enterprise and data center market.  It's much easier to adapt existing Linux OS experience and familiarity with languages like Python to whitebox than it is to find a VHDL programmer.  Making your staff learn to program your switches is the road most want to go, not making your network beholden to another specialized consultant.

And lastly, you are correct that the CCIE doesn't teach these concepts.  There is a lot of outside research that goes into new technologies like SDN and whitebox switching.  It's important for today's networking engineers to keep on top of the changing landscape.  A friend once told me that a CCIE doesn't mean you know everything, but instead means you can learn new things.  That's the kind of approach that networking needs today.
DavidS327,
User Rank: Apprentice
7/28/2014 | 10:25:41 AM
Re: Tom is talking about things he doesn't know about.
Tom, you said "Removing unnecessary portions of software code can and does provide a performance boost to workloads.  Reducing the memory and CPU footprint of the OS leaves more power available for the data plane.  Because we aren't dealing with special-purpose CPUs here, every timeslice we give back to the system is one that can be used to push data that much faster."

 

This is completely wrong.  The job of an ASIC is to forward packets.  The ASIC memory and processing of packets are independent of the control plane CPU and memory.  Packets should never be switched in the control plane or you will have real performance problems.  You really need to learn more about ASIC architecture and stop making statements about things you don't know about.  Nobody in HFT is customizing the operating system on a switch to get performance gains because packets don't go to the OS.  They get switched at the ASIC.  Please don't write an article and post your CCIE credentials at the bottom unless you know what you are talking about.
EtherealMind,
User Rank: Ninja
7/28/2014 | 10:48:18 AM
Re: Tom is talking about things he doesn't know about.
The assertion about code reduction for performance gains is quite correct. For example, Facebook's FBOSS operating system is specifically designed to maximise throughput by focussing on removing packet drops across the switching ASIC at very high loads. This has been a common and well-known problem in branded solutions at Facebook and led to the Wedge hardware/FBOSS software.

Pluribus Networks, and others, write their own device drivers for the silicon to improve performance for flow updates to the FIB in silicon.

The packet forwarding latency is only one measure of speed; consider TCAM table update speed, buffer management on the VOQ, total goodput at 90% sustained load, or many other areas besides.

The commenters highlighting the ASIC performance might be taking an overly simplistic model of the internal architecture of a switch and making errors in judgement. The ASIC performance is determined by the total sum of the software that it runs, network processors, internal buffer management, and much more.

Consider that Intel x86 server performance is determined by a combination of the operating system, bus speed, memory class and speed, network adapter, and much more. A switch is also a collection of components that determine the overall performance of the unit. 
DavidS327,
User Rank: Apprentice
7/28/2014 | 11:15:17 AM
Re: Tom is talking about things he doesn't know about.
My understanding of FBOSS (Wedge) was that Facebook wanted to put these switches into their server management platform and so they created a switch that is using an x86 architecture for the control plane.  I've not read anything on them needing to optimize the switch performance, though maybe I've just not come across the proper information on this topic.

Regarding Pluribus.  In today's world, we update ASIC tables with a FIB.  This FIB is based on per network rather than per flow.  Once routing has converged, there is no need to update the FIB and we have nanosecond switching.  If Tom was referring to updating the ASIC with a per-flow table (and I don't see that he was saying this) then you have to build state for each and every flow.  There is so much latency involved in this model and because of that, it is not a good fit for HFT.  They would have to optimize the hell out of it just to get today's performance.  If you are adding 100's of microseconds of latency for each flow setup then you will be out of business rather quickly in this business.

My comments are not simplistic but industry specific, and it is clear you are not in the HFT world.  And also note that I would not have piped up with my opinion had the author not made such a ludicrous industry-specific comment.

Though I am no server expert, your x86 architecture reference is a bad comparison.  You are mixing the complete application performance with the forwarding performance.  If we are just talking about forwarding performance, then a 10Gig NIC doing TCP offload is a function of the ASICs on the NIC and not of anything on the other side of the PCIe bus (read main CPU and memory). 
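
The difference between a per-network FIB and a per-flow table can be sketched in a few lines of Python. This is only an illustration of the two data structures: the class names, routes, and next-hop labels are hypothetical, and a real switch does the lookup in TCAM or a trie in hardware, not with a linear scan.

```python
import ipaddress


class PrefixFib:
    """Per-network forwarding table with longest-prefix-match lookup."""

    def __init__(self):
        self.routes = {}  # ip_network -> next-hop label

    def add_route(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        # The most specific (longest) matching prefix wins.
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


class FlowTable:
    """Per-flow table: one entry per (src, dst, protocol, sport, dport)."""

    def __init__(self, fib):
        self.fib = fib
        self.flows = {}
        self.setups = 0  # how many times per-flow state had to be built

    def forward(self, src, dst, proto, sport, dport):
        key = (src, dst, proto, sport, dport)
        if key not in self.flows:
            self.setups += 1  # first packet of a new flow pays the setup cost
            self.flows[key] = self.fib.lookup(dst)
        return self.flows[key]


fib = PrefixFib()
fib.add_route("10.0.0.0/8", "spine1")
fib.add_route("10.1.0.0/16", "leaf7")

flows = FlowTable(fib)
for sport in (40001, 40002, 40003):
    flows.forward("10.2.0.1", "10.1.2.3", "tcp", sport, 443)
```

Here the FIB holds two entries no matter how many connections cross the switch, while the flow table needed three separate setups for three connections to the same destination. That per-flow setup step is where the extra latency described above comes from.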

 
EtherealMind,
User Rank: Ninja
7/28/2014 | 3:12:15 PM
Re: Tom is talking about things he doesn't know about.
The initial purpose for FBOSS / Wedge was to address packet loss issues in a vendor switch at high utilisation. The value proposition was likely extended to include support for a BGP/SDN solution and removal of unnecessary code for reliability. 

Arista EOS software was extensively customised to support HFT systems and provide low-latency paths through the hardware architecture. 

Cisco NX-OS software on the Nexus 3064PQ series was extensively customised to provide low-latency features. You can find a great deal of information on Cisco's website on the modification and functions.

The performance of the ASIC is rigidly determined by the VOQ algorithm, fabric arbitration, and a number of other functions that are controlled by software. For example, changes to the FIB in some software implementations cause packet loss during FIB table updates to the ASIC.

Switch performance is determined by many factors and like an x86 CPU, the ASIC fabric is the most important aspect but the software and supporting hardware are critical in platform performance. The metaphor is a solid comparison when you go deeper on the microcode in the ASIC itself, the firmware and algorithms that drive many of the functions. 

It's not obvious to most networking people since we have never needed to know this before. I certainly only learned this information in the last year or so. 
DavidS327,
User Rank: Apprentice
7/28/2014 | 6:42:19 PM
Re: Tom is talking about things he doesn't know about.
Very interesting stuff with FBOSS.  I'll have to read more about that.

I'm quite familiar with the Arista boxes.  In my world packets need to be switched at line rate (nanoseconds).  If packets are being queued then it is time to upgrade your links to 40 or 100Gig.  VOQs introduce jitter and this is unacceptable.  My developers will call me about the jitter faster than I can get any reading from LANZ.  We joke of replacing our SNMP trap collectors because they are too slow ;)

I still stand by what I originally stated in that removing software processes does NOT create resources for switching.  I agree that driver-type functionality of writing ACL, FIB, QoS, etc. tables to TCAM can be optimized but these events do not happen in steady state.  The goal is to run in a steady converged state during market hours and after market do your maintenance (add ACLs, routes, etc.).  Yes, there are failures during the day and as a result FIBs will update but it is understood that you will have packet loss and jitter when these events happen.

I read Tom's statement about customizing the Network OS for financial trading as the owner of the switch doing this (like a bank) in a devops model.  I did not read this as a vendor like Arista having performed this optimization.  His article is pitching the pros and cons of white boxes.  Not vendor hardware/software solutions.

I will also say it again.  While Tom is a smart guy, he should stick with what he knows.  CCIE stands for Cisco Certified Internetwork Expert.  If you are going to call yourself the Expert and on top of that, publish articles as an Expert, then you should be sure you are reporting facts and not fantasy.    

 
NetworkingNerd,
User Rank: Strategist
7/28/2014 | 6:32:33 PM
Re: Tom is talking about things he doesn't know about.
David,

You are absolutely right.

High Frequency Trading (HFT) is a very specialized application/workload that prizes performance over extensibility.  Being able to push a packet out a few nanoseconds faster could mean the difference between making money and losing big.  The appeal of whitebox switching to those customers is negligible at best.  They have lots of money and are willing to sacrifice as much as it takes to get the performance they desire.  The Nexus 3548 "Warp Mode" is a perfect example.

I wasn't thinking of HFT when I talked about financial trading.  Instead, I was looking more at the discussions coming out of the Spring Open Networking User Group meetings in NYC, which are heavily attended by financial firms and have a great emphasis on solving problems through the use of open networking.  The appeal of whitebox switching to these companies is high, given the sponsorship of the event by Big Switch Networks and Cumulus Networks.  The advantages of extensible, programmable platforms would appeal greatly to these financial customers.

More specifically, Lucera (https://lucerahq.com) is working with Pluribus Networks (http://www.pluribusnetworks.com/news-and-events/press-releases/detail/20140624-pluribus-networks-brings-the-server-switch-paradigm-into-the-broader-commercial-market-at-white-box-economics/) to use open networking and commodity switching hardware in financial trading.  Their ONUG presentation was very interesting indeed.
Serrad,
User Rank: Strategist
7/29/2014 | 7:50:56 AM
Re: Tom is talking about things he doesn't know about.
@EtherealMind - After a conversation with my Arista rep, MOST Arista platforms don't even support VOQs.  Only the 7500E and the 7280 do, which are not the platforms used in most trade plants.  Your argument doesn't have much relevance.

 

@Tom - How can you not think of HFT when you think of financial trading?  The majority of trades in the market today are by HFT and Algo traders.  You attended a convention with some people from financial firms and this is what you base your "facts" on?  Really?  Listen, I know from the outside looking in that this sounds reasonable, but I can tell you from the perspective of working for 3 banks in the past 14 years, all of which are doing HFT and Algo trading, that no one I've seen or heard of at the bank is dedicating resources to removing "unneeded processes" in the switch to increase performance.  Yes, there are problems that banks are looking to solve programmatically but few of those will be in the actual trading plant.  I personally am working on building a new agile environment for our developers, and building a private cloud with white boxes is very appealing, but these are not trading apps (though some may be used for RISK).  And none of these off-the-shelf SDN OSs (like Big Switch, etc.) support multicast (PIM), which is a requirement to get your market data.  Keep going to those conferences but don't think that they give you a clear indication as to what banks are doing.
Jason Lackey,
User Rank: Apprentice
7/29/2014 | 2:41:42 PM
Re: Tom is talking about things he doesn't know about.
Full Disclosure - I work at Pluribus Networks

 

Great to see the world of networking awakening to outside-the-box alternatives, even in the world of financial trading. Interesting to see mention of FPGAs. One of our customers specifically looked at FPGAs but did not go down that path because the time to market, in terms of having someone reprogram when some rules/algorithms changed, was too long: good performance but not really agile. They were, however, able to get perfectly acceptable performance out of our boxes.

Here's the Lucera Case Study on this very topic for those interested and here's a perspective on Pluribus/Wedge and for those interested in more on our take on a network hypervisor/operating system, here's our CTO deconstructing Netvisor, the Pluribus network hypervisor.
jgherbert,
User Rank: Ninja
7/31/2014 | 11:55:54 PM
Play The Ball, Not The Man
Whether you happen to agree with every point raised, I do feel that Tom has raised some interesting issues that have generated some truly interesting discussion in these comments, and the information and opinions have enhanced the value of the article as a result as it has given everybody a chance to dive deeper than the (undoubtedly word-limited) article could in the first place. So thanks all for the conversation and experience sharing.

As for whether Tom should put his CCIE in his bio, well why not? He earned it, and it demonstrates that he has a capability with Cisco products. *shrug*
Susan Fogarty,
User Rank: Strategist
8/1/2014 | 9:29:07 AM
Re: Play The Ball, Not The Man
Well said, jg! Our goals are for the discussion to focus on the actual material posted, rather than the person who wrote it. Of course, everyone's background influences their point of view, but we expect members to take the high road. They generally do, but sometimes go a bit astray.
toddmcraw,
User Rank: Apprentice
8/20/2014 | 11:52:39 AM
Tom and the financial guys are right.
Disclaimer: I work for Cumulus Networks and have 14 years+ working for hw vendors

 

Interesting discussion... I wanted to add that both Tom H. and the financial HFT guys seem correct but they are talking about different things. I work directly with many of these customers and bare-metal is very popular in finance, web-scale and enterprise for capex reasons and for opex related to having an open source customizable OS.

HFT/Algo or trading generally needs the Intel Fulcrum chipset or something like the Cisco 3000 with Warp because it has the lowest latency and the features they need like multicast and NAT. Trading in general is very multicast heavy and needs protocols like PIM. A customizable open source OS may be interesting so that they can add certain custom monitoring or analysis tools or provide some special methods of failure detection. This is probably extremely rare though. Almost all of the functions are provided by hw and all that matters from the sw is that it be stable and update the hw as quickly as possible. There may be control plane optimizations here but I doubt it would be much different than on any other switch (why would you improve protocol convergence on one platform and not do it for all?). The only unique sw would be related to the special ASIC features (i.e. Warp, which is really a hw feature you turn on) and how they have to manage any sw structures for those features. Any ASIC is a combination of hw, firmware microcode and software working together.  In finance today, you generally see bare-metal in modeling and compute farms or application/web services.

The most common use cases for a bare-metal open source OS: run only the processes you need for a more stable platform. Use proven open source code that you control and can modify if you have the desire/skills. Use your existing automation tools for servers and your server boot environment to provision switches in the same manner as servers (ZTP). A rarer use case is to customize the OS to perform functions that cannot be done on regular switches easily. Support and function of server automation tools on vendor switches is usually very poor compared to doing it on a Linux OS but there is a lot of effort to improve this. There can be substantial capex and opex savings.
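
As a minimal sketch of what "provision switches in the same manner as servers" can mean in practice, the Python below renders per-switch interface configuration from a small inventory, the way a server automation tool might template a file. Everything specific here is an assumption for illustration: the switch names, port names, addresses, and the Debian-style stanza layout should all be checked against the documentation of whichever switch OS you actually run.

```python
def render_interfaces(ports):
    """Render Debian-style interface stanzas from {port_name: cidr_address}."""
    stanzas = []
    for name in sorted(ports):
        stanzas.append(
            f"auto {name}\n"
            f"iface {name} inet static\n"
            f"    address {ports[name]}\n"
        )
    return "\n".join(stanzas)


# Hypothetical inventory: two leaf switches with routed uplinks.
INVENTORY = {
    "leaf1": {"swp1": "10.0.0.1/31", "swp2": "10.0.0.3/31"},
    "leaf2": {"swp1": "10.0.0.5/31"},
}

# One rendered config per switch, ready to hand to whatever tool pushes files.
configs = {switch: render_interfaces(ports) for switch, ports in INVENTORY.items()}
```

The design point is that the inventory, not the switch CLI, is the source of truth; the same pattern already used for server network config applies to the switch.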

Controversial part: This is all being done by progressive organizations but the tools and skills are filtering down to everyone and it is becoming very common. The majority of data centers will be a highly automated commodity environment within the next 5 years or your CFO will use a highly automated commodity cloud like AWS instead. It doesn't take incredible vision to see this coming. It has already happened in servers and it destroyed some great companies. The days of making the network smart and expensive and the application dumb are over. The network should be cheap and simple (preferably L3 Clos) and the application smart.

 

Furthermore, my limited understanding of Facebook, Google, etc. is they are doing special customizations of the switch OS related to how they manage their data and workloads, provide security, etc. This requires lots of expertise and only is valuable at large scale and with very smart developers and operators.

 

AbeG/Pablo: This is SUPER attractive to large organizations, btw.

 

 
Susan Fogarty,
User Rank: Strategist
8/20/2014 | 1:17:56 PM
Re: Tom and the financial guys are right.
Todd, thanks for weighing in here. We really appreciate you taking the time to share all these technical details. The point that there are two different arguments going on here is important -- I know we try to be concise, since we are after all "commenting," but it's very easy to lose both the big picture and the specifics. Especially with topics that require a good deal of explanation. This info is helpful.