Network Computing

Fault-Tolerant Networking

By Brian Walsh
Introduction

You will succeed in building fault-tolerant networked systems only if you have a certain attitude. The incident that gave me that attitude happened when I was eight years younger, about fifteen pounds lighter, with about 150 feet of air between my feet and the ground.

I had worked for a few years as an engineer with Tandem and Stratus fault-tolerant transaction processing systems. They were successful projects. We coded up a storm of checkpoints and rollbacks, put in procedures for storing database logs on mirrored media and installed dial backup links to all our sites. I thought that I had successfully mastered the concepts of redundancy, recoverable transactions and fail-over systems. But until I started rock climbing, and slipped, I didn't really realize how important the concepts were.

Attitude makes the difference

Rock climbing and building fault-tolerant systems have a lot in common. All the gear in the world does not make a safe mountaineer. All the hardware and software you can buy won't make your systems and data free from downtime unless you approach the task with the right perspective.

When you climb rocks, your life is constantly at stake until you are back on the ground. The minute you lose sight of that fact, all the ropes, chocks, harnesses and training are worthless. It's similar with networked systems. If you can't respect the data flowing through the net and your servers as if every transaction represents your money or your life, your designs, implementations and procedures will ultimately be ineffective.

Keep that in mind the next time someone forgets to tell users that their services are going down.

"Bombproof"

"Bombproof" is a wonderful word. It doesn't mean "things will be all right as long as the disk doesn't fail," or "if a certain hub dies we will only lose the fourth floor." It means that you can drop a bomb on the network, and users will keep working. It means that you can depend on it.

A leader, in climbing terms, is the one who assumes the lion's share of the risk by climbing first and setting protection for others to follow. I sometimes wonder what would happen if other fields applied the same definition of a leader. The first thing that a leader does when climbing is to establish a bombproof anchor: a set of redundant, simultaneous, load-bearing devices attached to the cliff. The leader starts the climb from there.

At this point, the leader has taken a recoverable risk. Falling means dropping twice the height of the climb. Then the rope goes taut, and force factors several times the weight of the climber pull on the anchor. If the anchor fails, the leader gets hurt or dies. Hence the importance of the anchor.

As the leader climbs higher, he or she places intermediate anchors. Falls will then stop at the highest intermediate anchor. If that anchor fails, the fall continues until the next highest anchor. The fewer anchors placed, the nastier the fall.

When the lead climber exhausts the length of the rope, the leader sets another bombproof anchor, relieves the second climber of the responsibility for holding the rope and assumes responsibility for the second climber, who follows the leader's route. This is repeated for the rest of the climb.

This simple system has been used to scale walls from the old rock quarry outside your home town to the cliffs of Yosemite. For veteran explorers on Baffin Island and flabby weekend warriors out for the afternoon, the stakes remain the same.

Your role

Your role is analogous to the climbing leader's. You must guide management and users through the design, selection, purchase and implementation of fault-tolerant networks. Even though you may fade into the background once the system is established, you are responsible for their safety en route. Their data and therefore their livelihoods depend on your design and decisions.

Hardware and software vendors can't remove you, the engineer and decision maker, from that role. A hardware vendor can make a hub or server fault resilient or even fault tolerant. But if you don't specify and design the correct cable plant or implement a realistic backup schedule, disaster looms.

It doesn't matter where the outage comes from. An outage is an outage. Too many technical support staff and managers wash their hands of problems, saying, "The network is up, it must be the server," or "My server is up, it must be the network." Different disciplines need to work together to achieve fault tolerance: workstation and server vendors, in-house developers, cable designers and installers, router and hub designers, WAN architects, database vendors, everybody.

Basic Concepts

Fault Tolerance, Fault Resilience and Disaster Recovery
Distinguishing fault tolerance from fault resilience can be tricky. As with performance claims, vendors have a way of leading the public astray. Fault tolerance is the stronger term, indicating that every component in the chain supporting the system has redundant features or is duplicated.

Fault tolerance means the system will not fail because any one component fails. The system also should provide recovery from multiple failures. Components are often overengineered or purposely underutilized to ensure that while performance may be affected during an outage the system will perform within predictable, acceptable bounds.
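The point about purposely underutilized components can be made concrete with a little arithmetic. The sketch below is only an illustration under stated assumptions: identical components share the load evenly, the load of a failed component is spread evenly over the survivors, and the 80 percent ceiling is a made-up figure, not one from any vendor or standard. The function names are hypothetical.

```python
def utilization_after_failures(n_components, utilization, failures=1):
    """Per-component utilization once `failures` components have dropped out.

    Assumes identical components that share load evenly (an idealization).
    Returns None when no component survives.
    """
    survivors = n_components - failures
    if survivors <= 0:
        return None
    total_load = n_components * utilization  # load the survivors must absorb
    return total_load / survivors


def within_bounds(n_components, utilization, failures=1, ceiling=0.8):
    """True if the survivors stay at or below the assumed acceptable ceiling."""
    after = utilization_after_failures(n_components, utilization, failures)
    return after is not None and after <= ceiling


# Two links each running at 40 percent: one failure leaves the other at 80 percent.
print(within_bounds(2, 0.4))
# Two links each running at 50 percent: one failure saturates the survivor.
print(within_bounds(2, 0.5))
```

Read this way, "underutilized" has a number attached: with two components and an 80 percent ceiling, normal load has to stay at or below 40 percent each for a single failure to remain within bounds.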

Fault resilience usually indicates that at least one of the modules within a component, say, a power supply in a hub, is backed up with a spare. Not all modules within that component are necessarily redundant, however. The hub may have two power supplies but only one CPU. Performance of the system during an outage is therefore undefined. One fault-resilient component does not make the entire system fault tolerant.
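One way to keep the distinction honest when reading a spec sheet is to list every module in a component and count the spares. The sketch below is a minimal illustration, not a vendor tool: the `Component` class, the labels it returns and the inventory figures are all hypothetical, and it only looks at redundancy inside a component (duplicating the whole component is another route that it does not model).

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    modules: dict  # module name -> number of installed units (hypothetical inventory)

    def single_points_of_failure(self):
        """Modules with no spare: losing one takes the component down."""
        return sorted(m for m, count in self.modules.items() if count < 2)

    def classification(self):
        """Illustrative label based only on module counts."""
        spofs = self.single_points_of_failure()
        if not spofs:
            return "fully redundant"
        if len(spofs) < len(self.modules):
            return "fault resilient"
        return "no redundancy"


def system_weak_points(components):
    """Every (component, module) pair that lacks a spare, across the system."""
    return [(c.name, m) for c in components for m in c.single_points_of_failure()]


# The hub from the text: two power supplies but only one CPU.
hub = Component("hub", {"power_supply": 2, "cpu": 1})
server = Component("server", {"power_supply": 2, "disk": 2, "nic": 2})

print(hub.classification())               # fault resilient
print(system_weak_points([hub, server]))  # [('hub', 'cpu')]
```

As long as the list of weak points is not empty, the system as a whole cannot be called fault tolerant, however many of its parts carry a "fault resilient" label.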

While fault-tolerant LANs are impossible without fault-resilient components, a system cannot truly be fault tolerant if there is no way to re-establish it in the event of disaster. You need a disaster recovery, or business resumption, plan to address the types of outages in your environment. It should specify a step-by-step plan for each IS group. It should address the fault-tolerant capabilities of specific sites or subsystems, and it should include rehearsals of mock disasters and recoveries on a regular basis.

Fault tolerance, fault resilience and disaster recovery are intimately interrelated. You need to understand how they work together to design a fault-tolerant network. Simply put, our goal is to keep running no matter what, to maximize the number of failures the system can cope with, and to minimize any potential weaknesses.


Management Involvement

Before design and implementation work can start, we need to educate management and make decisions.

Before decisions are made, it is necessary to lay out alternative designs and component selections. Management will usually jump right to the bottom line when presented with alternatives. Considerable skill is required when presenting information that combines graduated cost, graduated design complexity and graduated exposure to failure.

Any plan for network fault resilience should be a chapter in the organization's disaster recovery plan. Therein, management should state its goals for all operations, staff and systems. At a minimum, management should provide information and set expectations for costs of outages, escalating penalties for length of outages and the hierarchy of importance of different activities, from key operations to support functions (for example, the assembly line versus human resources).
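To show how outage costs and escalating penalties might be written down once management supplies them, here is a minimal sketch. Everything in it is an assumption made for illustration: the tier boundaries, the dollar rates, the two activities and the function name `outage_cost` are hypothetical, and a real plan would use figures set by management.

```python
def outage_cost(duration_hours, tiers):
    """Cost of one outage under an escalating hourly penalty.

    `tiers` is a list of (start_hour, cost_per_hour) pairs sorted by
    start_hour, beginning at hour 0. Each rate applies from its start
    hour until the next tier begins.
    """
    cost = 0.0
    for i, (start, rate) in enumerate(tiers):
        if duration_hours <= start:
            break
        end = tiers[i + 1][0] if i + 1 < len(tiers) else float("inf")
        cost += (min(duration_hours, end) - start) * rate
    return cost


# Hypothetical penalty schedules, in dollars per hour of outage.
schedules = {
    "assembly line": [(0, 1000), (1, 5000), (4, 20000)],
    "human resources": [(0, 50), (8, 200)],
}

# Rank activities by what a six-hour outage would cost under these figures.
ranked = sorted(schedules, key=lambda name: outage_cost(6, schedules[name]), reverse=True)
for name in ranked:
    print(name, outage_cost(6, schedules[name]))
```

A table like this gives the hierarchy of importance a concrete basis: the activities whose cost climbs fastest are the ones that justify the most redundancy.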



November 15, 1996
