You will succeed in building fault-tolerant networked systems only if you
have a certain attitude. The incident that gave me that attitude happened
when I was eight years younger, about fifteen pounds lighter, with about
150 feet of air between my feet and the ground.
I had worked for a few years as an engineer with Tandem and Stratus fault
tolerant transaction processing systems. They were successful projects.
We coded up a storm of checkpoints and rollbacks, put in procedures for
storing database logs on mirrored media and installed dial backup links
to all our sites. I thought that I had successfully mastered the concepts
of redundancy, recoverable transactions and fail-over systems. But until
I started rock climbing, and slipped, I didn't really realize how important
the concepts were.
Attitude makes the difference
Rock climbing and building fault tolerant systems have a lot in common.
All the gear in the world does not make a safe mountaineer. All the hardware
and software you can buy won't make your systems and data free from downtime unless
you approach the task with the right perspective.
When you climb rocks, your life is constantly at stake until you are back
on the ground. The minute you lose sight of that fact, all the ropes, chocks,
harnesses and training are worthless. It's similar with networked systems.
If you can't respect the data flowing through the net and your servers as
if every transaction represents your money or your life, your
designs, implementations and procedures will ultimately be ineffective.
Keep that in mind the next time someone forgets to tell users that their
services are going down.
"Bombproof"
"Bombproof" is a wonderful word. It doesn't mean "things
will be all right as long as the disk doesn't fail," or "if a
certain hub dies we will only lose the fourth floor." It means that
you can drop a bomb on the network, and users will keep working. It means
that you can depend on it.
A leader, in climbing terms, is the one who assumes the lion's share of
the risk, by climbing first and setting protection for others to follow.
I sometimes wonder what would happen if other fields applied the same definition
of a leader. The first thing that a leader does when climbing is to establish
a bombproof anchor, a set of redundant, simultaneous, load bearing devices
attached to the cliff. The leader starts the climb from there.
At this point, the leader has taken a recoverable risk. Falling means dropping
twice the height of the climb. Then the rope goes taut, and force factors
several times the weight of the climber pull on the anchor. If the anchor
fails, the leader gets hurt or dies. Hence the importance of the anchor.
As the leader climbs higher, he or she places intermediate anchors. Falls
will then stop at the highest intermediate anchor. If that anchor fails,
the fall continues until the next highest anchor. The fewer anchors placed,
the nastier the fall.
When the lead climber exhausts the length of the rope, the leader sets another
bombproof anchor, relieves the second climber of the responsibility for
holding the rope and assumes responsibility for the second climber, who
follows the leader's route. This is repeated for the rest of the climb.
This simple system has been used to scale walls from the old rock quarry
outside your home town to the cliffs of Yosemite. For veteran explorers
on Baffin Island and flabby weekend warriors out for the afternoon, the
stakes remain the same.
Your role
Your role is analogous to the climbing leader's. You must guide management
and users through the design, selection, purchase and implementation of
fault-tolerant networks. Even though you may fade into the background once
the system is established, you are responsible for their safety en route.
Their data and therefore their livelihoods depend on your design and decisions.
Hardware and software vendors can't remove you, the engineer and decision
maker, from that role. A hardware vendor can make a hub or server fault
resilient or even fault tolerant. But if you don't specify and design the
correct cable plant or implement a realistic backup schedule, disaster looms.
It doesn't matter where the outage comes from. An outage is an outage. Too
many technical support staff and managers wash their hands of problems,
saying, "The network is up, it must be the server," or "My
server is up, it must be the network." Different disciplines need to
work together to achieve fault tolerance: workstation and server vendors,
in-house developers, cable designers and installers, router and hub designers,
WAN architects, database vendors, everybody.
Basic Concepts
Fault Tolerance, Fault Resilience and Disaster Recovery
Distinguishing fault tolerance from fault resilience can be tricky.
As with performance claims, vendors have a way of leading the public astray.
Fault tolerance is the stronger term, indicating that every component in
the chain supporting the system has redundant features or is duplicated.
Fault tolerance means the system will not fail because any one component
fails. The system also should provide recovery from multiple failures. Components
are often overengineered or purposely underutilized to ensure that while
performance may be affected during an outage the system will perform within
predictable, acceptable bounds.
Fault resilience usually indicates that at least one of the modules within
a component, say, a power supply in a hub, is backed up with a spare. Not
all modules within that component are necessarily redundant, however. The
hub may have two power supplies but only one CPU. Performance of the system
during an outage is therefore undefined. One fault-resilient component does
not make the entire system fault tolerant.
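The distinction can be made concrete with a small sketch. The class and function names below are hypothetical, and the model is deliberately simplified: it counts only the modules inside each component and ignores the case where a whole component is duplicated.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A network component and how many of each module kind it holds."""
    name: str
    modules: dict[str, int] = field(default_factory=dict)  # kind -> count

    def single_points_of_failure(self) -> list[str]:
        # A module kind with fewer than two instances has no spare.
        return sorted(k for k, n in self.modules.items() if n < 2)

    def is_fault_resilient(self) -> bool:
        # At least one module kind is backed up with a spare.
        return any(n >= 2 for n in self.modules.values())

    def is_fully_redundant(self) -> bool:
        # Every module kind is backed up with a spare.
        return bool(self.modules) and not self.single_points_of_failure()


def system_weak_points(components: list[Component]) -> list[str]:
    """List every unprotected module in the chain as 'component:module'."""
    return [
        f"{c.name}:{kind}"
        for c in components
        for kind in c.single_points_of_failure()
    ]


# The hub from the text: two power supplies but only one CPU.
hub = Component("hub", {"power_supply": 2, "cpu": 1})
print(hub.is_fault_resilient())    # True
print(hub.is_fully_redundant())    # False
print(system_weak_points([hub]))   # ['hub:cpu']
```

In this simplified model, a chain of components only approaches fault tolerance when system_weak_points returns an empty list for every component supporting the system.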
While fault-tolerant LANs are impossible without fault-resilient components,
a system cannot truly be fault tolerant if there is no way to re-establish
it in the event of disaster. You need a disaster recovery, or business resumption,
plan to address the types of outages in your environment. It should specify a
step-by-step plan for each IS group. It should address the fault-tolerant capabilities
of specific sites or subsystems, and it should include regular rehearsals of mock
disasters and recoveries.
Fault tolerance, fault resilience and disaster recovery are intimately interrelated.
You need to understand how they work together to design a fault-tolerant
network. Simply put, our goal is to keep running no matter what, to maximize
the number of failures the system can cope with, and to minimize any potential
weaknesses.
Management Involvement
Before design and implementation work can start, we need to educate management
and make decisions.
Before decisions are made, it is necessary to lay out alternative designs
and component selections. Management will usually jump right to the bottom
line when presented with alternatives. Considerable skill is required when
presenting information that combines graduated cost, graduated design complexity
and graduated exposure to failure.
Any plan for network fault resilience should be a chapter in the organization's
disaster recovery plan. Therein, management should state their goals for
all operations, staff and systems. At a minimum, management should provide
information and set expectations for costs of outages, escalating penalties
for length of outages and the hierarchy of importance of different activities,
from key operations to support functions (for example, the assembly line versus
human resources).
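One way to present escalating outage penalties to management is as a tiered rate table per activity. The sketch below is illustrative only: the function name, the tier structure and every dollar figure are assumptions, not data from any real organization.

```python
def outage_cost(minutes: float, tiers: list[tuple[float, float]]) -> float:
    """Cost of an outage under escalating per-minute rates.

    tiers is a list of (start_minute, rate_per_minute) pairs sorted by
    start_minute; each rate applies from its start until the next tier.
    """
    total = 0.0
    for i, (start, rate) in enumerate(tiers):
        if minutes <= start:
            break  # the outage ended before this tier began
        end = tiers[i + 1][0] if i + 1 < len(tiers) else float("inf")
        total += (min(minutes, end) - start) * rate
    return total


# Hypothetical rate tables: a key operation versus a support function.
ASSEMBLY_LINE = [(0, 100.0), (30, 500.0), (120, 2000.0)]
HUMAN_RESOURCES = [(0, 5.0), (240, 20.0)]

print(outage_cost(60, ASSEMBLY_LINE))    # 18000.0
print(outage_cost(60, HUMAN_RESOURCES))  # 300.0
```

With numbers like these in hand, the same one-hour outage can be compared across activities, which gives management a basis for ranking where redundancy spending matters most.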