Crash Course: Router Redundancy Protocols

Routing protocols add redundancy and reliability to increase uptime. Here's what you need to know.

April 18, 2006

As applications like voice over IP raise the bar for network uptime, it may be time to consider new ways to keep your network humming along without interruption. Redundancy protocols such as VRRP (Virtual Router Redundancy Protocol), IEEE 802.1w, OSPF-ECMP (Open Shortest Path First-Equal Cost Multipath) and others let you remove single points of failure by adding redundancy--and reliability--to your multivendor network equipment and connections to help keep those voice calls coming.

Implementing these protocols takes planning and training, and may require software upgrades or even new switches and routers. But, before you go there, be sure your documentation and change-control procedures are sufficient. Good documentation lets you more easily resolve network problems and solid change-control procedures help prevent one of the biggest causes of network downtime--human error.

Carefully review the impact of any changes to your network, update your documentation, then have your staff test major changes in router and switch configurations and software upgrades in the lab.

Start in the Middle

Adding redundancy is the most common way to increase your uptime, and the best approach is to start in the middle and work your way toward the edge. First, make sure there's redundancy within your core router--redundant CPU cards, power supplies and fans usually can be added to chassis-based routers and switches, and some router and switch vendors have equipment with dual backplanes. Each vendor does this differently, and in some cases, an outage occurs when the backup card takes over, but usually only new routes are affected while the new card comes up. With redundant CPU cards, you can force a failover to one card while you upgrade the second one, instead of having to bring the whole router down for the upgrade.

The core or backbone of a network usually handles the most traffic so, if it goes down, it will likely affect the most users. If your redundant core router or switch equipment is connected and ready to kick in automatically when a problem occurs, you can reduce an outage from hours of manual labor to an automated process that takes just a few seconds. This is called high availability (HA): an identical core router must be ready to take over should the primary fail (see "Three Tiers for HA"). It also means that each switch in the next layer out, the aggregator switches, must have a connection to each core router. Those dual connections provide some redundancy for the links themselves and let you put the core routers in different geographic locations.
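
To put rough numbers on what a second core router buys you, here is a minimal Python sketch. It assumes the two routers fail independently and that failover is instantaneous, neither of which holds perfectly in practice, and the 99 percent figure is purely illustrative rather than a measurement from any product.

```python
HOURS_PER_YEAR = 24 * 365


def parallel_availability(single: float, copies: int = 2) -> float:
    """Availability of `copies` independent devices when any one is enough.

    Assumption: failures are independent and failover takes no time.
    """
    if not 0.0 <= single <= 1.0 or copies < 1:
        raise ValueError("availability must be in [0, 1] and copies >= 1")
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    one_router = 0.99  # illustrative value, not a vendor figure
    pair = parallel_availability(one_router)
    print(f"single router: {downtime_hours_per_year(one_router):.1f} h/year down")
    print(f"redundant pair: {downtime_hours_per_year(pair):.2f} h/year down")
```

The gap between the two results is the theoretical ceiling. Failover time, shared power or cabling, and human error all eat into it.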

Proper Protocols

Now that your network has redundant links, you must decide how packets on the network will select their paths and avoid loops. This isn't a new problem--redundant paths have been addressed by protocols like STP (Spanning Tree Protocol) at Layer 2 and routing protocols like the IETF's OSPF at Layer 3. But with default timers these protocols can take 30 seconds or more to converge. OSPF can take up to 40 seconds when it has to rely on missed hello messages, and STP can take up to 50 seconds. This is unacceptable for critical networks, especially those with real-time applications like VoIP and video.

An upgraded version of IEEE's STP called RSTP (Rapid Spanning Tree Protocol, IEEE 802.1w) cuts the convergence time of STP to about one second. One disadvantage of RSTP (and STP) is that only one of the redundant links can be active at a time, in an active/standby configuration. A spanning-tree failover can also move the active path to the other router, so the clients' gateway address would have to change as well. To avoid these problems, run VRRP alongside STP or RSTP on your routers. VRRP presents one virtual router address for both core routers and takes about three seconds to fail over.
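
The three-second VRRP figure falls out of the protocol's timers. The sketch below uses the Master_Down_Interval formula from the VRRP version 2 specification (RFC 3768) with its defaults of a one-second advertisement interval and a backup priority of 100. The function name is made up for illustration, and your vendor's defaults may differ.

```python
def vrrp_master_down_interval(priority: int = 100,
                              advertisement_interval: float = 1.0) -> float:
    """Seconds a VRRPv2 backup waits without advertisements before taking over.

    RFC 3768: Skew_Time = (256 - Priority) / 256
              Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time
    """
    if not 1 <= priority <= 254:
        raise ValueError("a backup router's priority must be between 1 and 254")
    if advertisement_interval <= 0:
        raise ValueError("advertisement interval must be positive")
    skew_time = (256 - priority) / 256
    return 3 * advertisement_interval + skew_time


if __name__ == "__main__":
    # Default backup priority and 1-second advertisements: about 3.6 seconds.
    print(f"{vrrp_master_down_interval():.3f} s")
    # A higher-priority backup waits slightly less, so it wins the election.
    print(f"{vrrp_master_down_interval(priority=254):.3f} s")
```

The skew term is what lets the highest-priority backup time out first and claim the virtual address before its peers.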


[Figure: Standards]

But because VRRP and RSTP work independently, it's possible that VRRP will designate one router as master while RSTP selects the path to the backup router as the preferred path. In the worst case, traffic arrives at the backup VRRP router, which immediately forwards it to the master router for processing, adding a router hop.

Another router redundancy option is to run OSPF in the core router as well as on the aggregator switches. OSPF is a link state protocol, so if one of the links goes down, it usually fails over in less than one second. You don't need VRRP with OSPF if you don't have redundant aggregator switches, because the clients would use the single aggregator switch as their gateway address. Most OSPF router and switch implementations now support ECMP (Equal Cost Multipath), an OSPF feature that load balances traffic across two equal-cost links. Both links are always active in an active/active configuration and, if there is a failure, only half the traffic will be affected.
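
One common way to split traffic across equal-cost links is to hash each flow onto a link. The Python sketch below illustrates that idea only. The CRC32 hash, the flow fields and the `core-a`/`core-b` names are assumptions for the example, and real routers use their own vendor-specific hashing.

```python
import zlib


def pick_next_hop(flow, next_hops):
    """Map a flow to one of the equal-cost next hops.

    `flow` is a tuple such as (src_ip, dst_ip, protocol, src_port, dst_port).
    Hashing the whole tuple keeps every packet of a flow on the same link,
    which avoids reordering packets within a call or session.
    """
    if not next_hops:
        raise ValueError("no next hop available")
    key = "|".join(str(field) for field in flow).encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


if __name__ == "__main__":
    links = ["core-a", "core-b"]  # hypothetical link names
    flows = [(f"10.1.1.{i}", "10.9.9.9", "udp", 16384 + i, 5060)
             for i in range(8)]

    for flow in flows:
        print(flow[0], "->", pick_next_hop(flow, links))

    # Simulate losing core-a: every flow now lands on the surviving link.
    surviving = [name for name in links if name != "core-a"]
    for flow in flows:
        print(flow[0], "->", pick_next_hop(flow, surviving))
```

Only the flows that were hashed to the failed link have to move, which is why a failure in an active/active pair disturbs roughly half the traffic.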

Load balancing also means that, theoretically, you have the total bandwidth of both links available. But, if you're depending upon both links for your bandwidth requirements, you don't get full redundancy. If a failure occurs, the traffic will oversubscribe the remaining link with unpredictable results. You can mitigate this to some extent with QoS but, given the low cost of LAN bandwidth, it's better to upgrade the link speeds and get true redundancy.

Another trade-off is that, though an OSPF-based device can immediately reroute traffic upon detecting a downed link, it may not always get the information. If you have a firewall or another switch between your core router and the aggregator switch, for example, and the router goes down, the link state will not change on the aggregator switch because it is not directly connected. Then you're dependent on OSPF's "hello" protocol, which checks the status of its neighbors. By default, a hello message is sent every 10 seconds, and when four consecutive hellos are missed (40 seconds), OSPF considers the neighboring device down.

It's best to have direct connections to avoid this; you can even change the default timer settings to provide faster convergence. You may also have to pay extra to add OSPF to aggregator switches, as some vendors only provide it as an option.
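
To see what tuning the timers buys you, the sketch below computes how long OSPF waits before declaring a neighbor down when it has to rely on hellos. The defaults match the 10-second hello and four missed hellos described above. The one-second value is only an example of a tuned setting, not a recommendation, and the function name is made up for illustration.

```python
def ospf_dead_interval(hello_interval: float = 10.0,
                       missed_hellos: int = 4) -> float:
    """Seconds without a hello before OSPF declares the neighbor down."""
    if hello_interval <= 0 or missed_hellos < 1:
        raise ValueError("hello interval must be positive, missed hellos >= 1")
    return hello_interval * missed_hellos


if __name__ == "__main__":
    print("default timers:", ospf_dead_interval(), "seconds")
    # Hypothetical tuned setting: 1-second hellos, still four misses.
    print("tuned timers:", ospf_dead_interval(hello_interval=1.0), "seconds")
```

Keep in mind that OSPF neighbors must agree on the hello and dead intervals to form an adjacency, so change the timers on both ends of a link together.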

You can remove more single points of failure as you work toward the edge of the network. You can, for instance, connect each workgroup switch to dual aggregator switches, so if one switch crashes, you'll have a backup. This also means each workgroup switch will need dual connections--one to each switch--which adds further redundancy. You should have four cables, usually fiber, depending upon the distance, going from the core to each pair of aggregator switches.

A simple way to add redundancy between any two switches is to use the IEEE 802.3ad protocol. This trunking protocol takes multiple connections and combines them into one virtual pipe, to increase bandwidth. Packets are load-balanced across the connections so, if one of them goes down, traffic is directed to the remaining connection or connections.
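
Here is a rough Python model of that behavior, offered as a sketch rather than a description of any vendor's implementation. The class, the port names and the CRC32 hash over the Ethernet address pair are all assumptions. Real switches choose their own distribution algorithm.

```python
import zlib


class LinkAggregationGroup:
    """Toy model of an aggregated trunk between two switches."""

    def __init__(self, links):
        # `links` maps a port name to its speed in Mbps.
        self.speed = dict(links)
        self.up = set(links)

    def fail(self, name):
        """Mark a member link as down."""
        self.up.discard(name)

    def restore(self, name):
        """Bring a known member link back into service."""
        if name in self.speed:
            self.up.add(name)

    def capacity_mbps(self):
        """Total bandwidth of the links that are currently up."""
        return sum(self.speed[name] for name in self.up)

    def link_for(self, src_mac, dst_mac):
        """Pick a member link for a conversation, using only links that are up."""
        active = sorted(self.up)
        if not active:
            raise RuntimeError("all member links are down")
        digest = zlib.crc32(f"{src_mac}>{dst_mac}".encode())
        return active[digest % len(active)]


if __name__ == "__main__":
    lag = LinkAggregationGroup({"port1": 1000, "port2": 1000})  # hypothetical ports
    pair = ("00:11:22:33:44:55", "66:77:88:99:aa:bb")
    print(lag.capacity_mbps(), "Mbps via", lag.link_for(*pair))
    lag.fail("port1")
    print(lag.capacity_mbps(), "Mbps via", lag.link_for(*pair))
```

After the failure the trunk stays up, but with half the bandwidth, so size the links for the traffic you expect to carry when one of them is down.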

The downside of 802.3ad is that all of the aggregated connections must run between the same two switches--you cannot have two connections on one switch going to two different switches. You can, however, have two connections from an aggregator switch to two different cards on the core switch, which at least gives you redundancy at both the card and port levels. Some vendors let you use 802.3ad at the aggregator switch and plug into two different core switches that emulate one switch.

Best Practices

These router redundancy protocols operate in a multi-vendor environment and most vendors support them, but you won't know if they play well together until you test them. Implementing the standards-based approach gives you more product choices and, even if you use only one vendor, standards give you the freedom to switch to another vendor someday. We don't recommend you use equipment from more than two switch and router vendors as this could add too much overhead in vendor management. But you may want one vendor's gear at the core and another's further out.

Test everything thoroughly before deploying router redundancy protocols, whether you run a multivendor or single-vendor environment. Redundancy increases uptime but adds complexity, which can work against uptime. Make sure you exercise all failure mechanisms and observe how they work, so you know what to expect and what your management software can tell you about the status of the components.

In the end, it's all a matter of how much downtime your applications can handle and how it will impact the business. In some cases, you may be able to handle a network outage of a few hours while you replace a switch with a spare; in other cases, a few seconds could spell disaster.

Peter Morrissey is a faculty member of Syracuse University's School of Information Studies, and a contributing editor and columnist for Network Computing. Write to him at [email protected].

Single-Vendor Advantages

Router vendors offer proprietary solutions that may provide quicker and more efficient failover schemes than standard protocols. And a single-vendor solution can help avoid finger-pointing after an outage.

Extreme Networks' EAPS (Ethernet Automatic Protection Switching) protocol is the only one designed to fail over within 50 ms. Individual rings can be configured for each aggregator-to-core connection. Foundry Networks also has a proprietary redundancy solution and is working to implement the IETF's BFD (Bidirectional Forwarding Detection) draft protocol, to detect failures more quickly than existing standards when link detection isn't possible. It's not too early to ask your equipment vendors if and when they plan to provide support for BFD.

Nortel Networks' VRRP implementation lets both core routers be active, rather than having just one active and the other on standby--plus, Nortel says it will operate in a multivendor environment. Nortel also uses its SMLT (Split Multilink Trunking) protocol so you can deploy 802.3ad between the aggregator and multiple core routers--SMLT tricks the switches into thinking they are connecting to one switch. Nortel says the protocol works across multivendor environments and the company hopes to make it an IETF standard. Hewlett-Packard, meanwhile, uses its mesh technology in its ProCurve switches to balance traffic dynamically across redundant links using flows based on Ethernet addresses. Cisco Systems uses HSRP (Hot Standby Router Protocol), the precursor to VRRP. All these vendors support the standards-based protocols as well.

Still, you should weigh the cost of losing leverage with a vendor by locking into its solution. And remember: Even single-vendor solutions can have problems, so test carefully before deploying.
