The principles that guide the design of practical fault-tolerant networks
are simply put:
Design and implement all important components in a fully redundant
fashion with the capacity to continue processing in the event of failure.
Use fault resilient components to minimize component failure.
Deploy network nodes in a matrix topology with robust recovery, with no
dependence on single-point-of-failure links.
Insist on industry standards for all components in order to ensure
protection of investment and interoperability.
Instrument all components so they can be managed.
Client Workstations
Who cares if one client crashes? Obviously, the person sitting in front
of it does. The workstation support staff bears the brunt of the calls from
these users. Too often, these problems
are ignored by the rest of the team.
Since the network, servers and databases were all up, they don't count as
outages, right? Wrong. Numerically, these outages are by far the most frequent.
They are insidious because they are distant from the support staff, hard
to debug and difficult to track. The client desktop also has a way of aging
and altering itself over time. Unless rigor and control are applied, the
number of desktop outages always increases.
Statistics from the Client's Viewpoint
On one project several years ago, we were called upon to develop a communications
gateway for a Hotel Property Management System that resided at the end of
a very low-speed leased line in each hotel. The remote gateway was central
to all transactions that happened at the site. Uptime and response time
were imperative. Therefore, we decided to add code to the gateway that recorded
response time, all re-boots, and any line or protocol errors.
All of these were collected and reported on centrally so that we could
examine statistics on a daily basis, deal effectively with trouble calls,
document actual versus planned uptime, and engage in general problem solving.
The key point here is that we instrumented all of our uptime and performance
statistics from the client's point of view. If one of those clients had
a bad day we knew about it and were able to react to it.
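The gateway code from that project is not reproduced here, but the idea can be sketched in a few lines of Python. Everything in this sketch (the class name, the field names, the two-second "slow" threshold) is an assumption made for illustration, not the original implementation.

```python
import time
from dataclasses import dataclass, field


@dataclass
class GatewayStats:
    """Uptime and performance counters kept at the client end (hypothetical)."""
    response_times: list = field(default_factory=list)  # seconds per transaction
    reboots: int = 0
    line_errors: int = 0

    def timed(self, transaction):
        """Run one transaction and record how long the client waited."""
        start = time.monotonic()
        try:
            return transaction()
        except OSError:
            # Count line or protocol failures, then let the caller see them.
            self.line_errors += 1
            raise
        finally:
            self.response_times.append(time.monotonic() - start)

    def daily_report(self, slow_threshold=2.0):
        """Summarize the day for central collection (threshold is an assumption)."""
        n = len(self.response_times)
        return {
            "transactions": n,
            "avg_response": sum(self.response_times) / n if n else 0.0,
            "slow": sum(1 for t in self.response_times if t > slow_threshold),
            "reboots": self.reboots,
            "line_errors": self.line_errors,
        }
```

The point of the design is that the measurement wraps the transaction where the user experiences it, so a slow leased line or a failing protocol shows up in the numbers even when every central component reports itself healthy.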
Gain Control of the Desktop
Most client failures are software related. This is not the time to take
a free-market attitude about the software that users install. If uptime
is an issue, the responsibility for desktops should be centralized. This
is not to say that IS should act as a dictator with approved-for-purchase lists. But
IS should establish certification and burn-in processes.
A certification process is an organizational procedure for determining which
software will be allowed on the network.
A certification process should include:
Classification of supported desktops, each with its own capacity,
performance, and redundancy characteristics. For example,
Power User
Executive workstation
Secretarial
Analyst
Clerical
Establishment of a support matrix for the given classes of desktop.
For example, analyst workstations will support
OS version: W
Spreadsheet: X
Word processor: Y
In house developed applications: A,B,C
Third party vertical applications: D,E,F
A supported list of middleware and drivers
A defined set of network interface cards
Don't let fads and marketing hype drive the client software set. The watchword
should be slow and steady.
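A support matrix is easiest to enforce when it is kept as data rather than as a memo. The following Python sketch assumes a hypothetical inventory feed that reports what is installed on a desktop; the product letters are the placeholders used in the list above, not real products.

```python
# Hypothetical support matrix: desktop class -> certified software per category.
# The letters are placeholders, not real product names.
SUPPORT_MATRIX = {
    "analyst": {
        "os": {"W"},
        "spreadsheet": {"X"},
        "word_processor": {"Y"},
        "in_house": {"A", "B", "C"},
        "third_party": {"D", "E", "F"},
    },
}


def uncertified(desktop_class, installed):
    """Return (category, product) pairs that are installed but not certified."""
    allowed = SUPPORT_MATRIX.get(desktop_class, {})
    return sorted(
        (category, product)
        for category, products in installed.items()
        for product in products
        if product not in allowed.get(category, set())
    )
```

A desktop class with no entry in the matrix has nothing certified, so everything installed on it is reported. That is a deliberate choice in this sketch: unknown is treated as unsupported.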
Burn-in processes are the actual physical testing procedures each piece of
software should be put through before it is allowed, or at least supported,
on the network.
The purpose of a burn-in process is to ensure that there will be no negative
impact from adding the software.
Once desktops have been classified and a matrix of supported software established,
we can move to the burn-in stage. For every mission-critical application
certified for use on any client, the development and QA teams should develop
a regression test script, that is, a manual or automated process to exercise
the software and record the results. This can be as simple as a set of instructions
to follow as transactions are entered against a test database. More significant
applications will require the use of scripting software.
Fault tolerance is impossible without full control over the upgrades made
to the system and its components. The burn-in and certification of any component,
hardware or software, should include the changes to design, their development
and testing, integration and final installation and acceptance.
Ideally, the same configuration rules and processes should apply during
development and production to all the different units in the company dealing
with the software. For example, the reports and data generated by the developers
should be available to the testing team, and they should be in the same
formats. Everybody involved should have access to and be able to understand
the regression test scripts, the version control tools, the project management
software, performance statistics, etc. This provides clear traceability and
coherence of all documentation, hardware and software components. Understanding
and control over the certification process should be clearly mentioned in
the service level agreement.
Preventative Measures
Here are two specific steps for eliminating client outages. These provide
essentially the same level of redundancy usually reserved for servers. They
can be enormously expensive to install and maintain, but if the stakes are
high, consider both.
Use an uninterruptible power supply for client machines. Contact an engineer
to discuss providing conditioned and backed-up power services to all workstations.
In critical areas, such as hospital medical equipment, this is already done.
Install dual network interface cards in each workstation. Make sure that
each NIC is connected to a separate LAN segment. The installation process
for this varies depending on operating system. Take care to ensure that
the client can simultaneously communicate on both segments. In the case
of a LAN failure the client can communicate on the remaining NIC. However,
this does little good if the process for determining failure does inelegant
things like crashing the machine or forcing the user to log on again or reboot.
Finally, don't overlook the obvious.
Keep spares around to replace blown monitors, systems and keyboards.
All critical peripherals should be networked devices under central control.
For example, a modem for dial out may be critical for a given job
function.
Think twice about placing it on someone's desk where it is sure to fail
at some point and prove impossible to back up. Instead, place it on the
network and physically locate it in a nice rack in the computer room where
your fault tolerant LAN and redundant local exchange carrier's service can
get at it.
Forget about diskettes! Get a real backup scheme. What happens if your users
indeed do the backup? Who is responsible for locating the diskettes and
restoring from them? So, investigate backup software systems, make a selection,
make backup network based, use off-site storage and ... back up and test!!
Make sure that all users know their responsibilities, whether that means
not introducing un-certified software or not storing their data outside
the backup scheme.
Network
Wiring
Everything starts with those cables behind your desk and in the walls.
You can think of a cable plan in stages:
What type of cable to buy
How much cable to buy
Where to put the cable
How to connect the cable
What to plug it into
How to remember where you put all that cable
We will look at the considerations that apply to fault tolerance in terms
of the components that make up a structured cable system. What's a structured
cable system? The keywords are structured and system. If you are afraid
to go into the wiring closet at your site, it's probably not a good idea
to build a system you will have to bet the farm on.
For typical corporate networks we can make several safe recommendations
as they apply to fault tolerance.
Cabling should outlast the system you are installing. Wiring vendors are
even offering 10 year warranties on sites that implement their products
from end to end. This means that your cables, connectors, outlets and
patch panels all comply with a given standard. Testing a circuit should
include testing through the patch panel or other connections since this
is how your cable will be used in production. It means that the resulting
cable plant responds predictably to cable testing and that jacks and terminations
have the same pinouts.
Please forget about any of that coax stuff you have lying around. Use a
standard hub design. Instantly, you eliminate segment outages caused by
a single device beaconing or jabbering.
Wire all the runs from the wiring closets to the desktop with Category 5
unshielded twisted pair wire, for both voice and data. You may save a little
using Category 3 for voice, but the cost of smashing out walls and re-cabling
later will dwarf the short term savings.
Don't cut corners. Cable problems introduce insidiously hard problems to
track on the network. Hard problems take longer to fix. Problems that take
a long time to fix can run up astronomical outage expenses.
Use more cable than you think you will need. A quick rule of thumb for moves
and changes is to plan on moving people, not network equipment and certainly
not wiring. Therefore, the number of data and voice jacks to place at every
cubicle is equal to what the most demanding group in the facility needs.
For example, if 90 percent of the users need a single PBX phone line and
a single LAN connection and 10 percent need two LAN connections, a PBX line
and an analog line, make sure that everyone gets outlets and cables for
all four connections, not two.
You can often add wiring between floors, but you rarely can add horizontal
cable behind the walls on a floor, so don't skimp. This extra cable gives
you flexibility to move equipment and people if a disaster affects part of
the building. In the event of a cable problem, the uniformity of using Category
5 horizontal for both voice and data will pay off since you can use a patch
panel to move usage from data to voice and back. A modular system helps,
too. For example, AMP allows you to exchange RJ11 jacks for RJ45 jacks
at the wall plate without tearing the plate apart and re-wiring.
Stagger or stripe cable across departments. This is similar in concept to
RAID striping. RAID devices take a byte of data and spread it across multiple
disk drives. What we are recommending is taking the floor plan and distributing
adjacent desktops among different, separate cable components. In this way
a given department will not be completely wiped out by a LAN segment outage.
This means that people sitting next to each other should not be on the same
LAN segment.
This implies that no matter how small the installation, we plan on establishing
at least two LAN segments. If we go for the maximum LAN redundancy - two
LAN connections for every workstation - the same principle applies. Each
of the two drops is connected to a different LAN segment.
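The striping rule is simple enough to express as a round-robin assignment. Here is a minimal Python sketch, assuming the desks are listed in physical seating order; the desk and segment names in any real use would come from your own floor plan.

```python
def stripe_desks(desks, segments):
    """Assign desks, listed in physical seating order, round-robin to segments.

    Adjacent desks land on different segments as long as there are at
    least two segments.
    """
    if len(segments) < 2:
        raise ValueError("striping needs at least two LAN segments")
    return {desk: segments[i % len(segments)] for i, desk in enumerate(desks)}


def survivors(assignment, failed_segment):
    """Desks still connected after one segment fails."""
    return [d for d, seg in assignment.items() if seg != failed_segment]
```

With four desks and two segments, a failure of either segment leaves every other desk in the row working, which is exactly the property we want a department to have.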
This strategy is impossible without accurate records. You must ensure that
each jack and patch panel has labels for each of its outlets, whether they
are connected to horizontal or vertical cabling. You must have a list of
which jacks are on which LAN segments. Consider getting cable management
software with which to maintain the configuration of your cable routing,
termination and labeling. If you do, make sure that it is accessible to
everyone who maintains the cable and patch panels. The cable management
system should contain your entire original design and should be updated
with all changes.
Use fiber to extend your data runs from the closets to your main distribution
frame, e.g. your highest level hub. Copper just can't beat the reliability
and resilience of fiber. Use twice the number of fiber strands you think
you will need today. Tomorrow's segmentation needs will lower the number
of nodes per segment, and you will need those strands for more hubs.
So many outages are caused by technicians mislabeling or misusing cable,
so remember where you pulled it. Consider putting the cable plant data in
a database and insisting on updates as changes are made.
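Even a very small database beats a binder in the wiring closet. The sketch below uses Python's built-in sqlite3 module; the table layout and the label formats are assumptions for illustration, and a commercial cable management package would track far more.

```python
import sqlite3


def open_cable_db(path=":memory:"):
    """Create a tiny cable-plant database (hypothetical schema)."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS drops (
               jack        TEXT PRIMARY KEY,  -- label on the wall plate
               patch_port  TEXT NOT NULL,     -- label on the patch panel
               hub_port    TEXT,              -- hub and port it is patched to
               lan_segment TEXT               -- segment that hub port belongs to
           )"""
    )
    return db


def record_drop(db, jack, patch_port, hub_port, lan_segment):
    """Insert or update one drop so the records follow every change."""
    db.execute(
        "INSERT OR REPLACE INTO drops VALUES (?, ?, ?, ?)",
        (jack, patch_port, hub_port, lan_segment),
    )
    db.commit()


def jacks_on_segment(db, lan_segment):
    """List the jacks affected if this segment goes down."""
    rows = db.execute(
        "SELECT jack FROM drops WHERE lan_segment = ? ORDER BY jack",
        (lan_segment,),
    )
    return [jack for (jack,) in rows]
```

The query that matters during an outage is the last one: given a failed segment, which jacks, and therefore which people, are affected.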
Put some time into selecting a good cabling contractor. They should work
hand in glove with your facility engineers at the planning stage. This way,
the building drawings will reflect where the cubicles or desktops go and
the electricians and plumbers will know where the data lines are to go and
the cable installers will know about the plumbing and electrical. This improves
the quality of the finished product, and improved quality reduces outages.
Insist on a hard copy of the end-to-end testing that was done on the cable
installation.
Leverage your standard wiring to locate specialized devices. If you have
any devices not on the LAN which require special wiring, try to put them
on the LAN. For example, instead of running special wiring to standalone
modems, use a modem pool. This will improve reliability for those devices,
and streamline moves and changes. It also will make the modems a shared
resource. If a device can't be made into a LAN service, try at least to
use the same cable for it that you are using for the LAN. For example, AS/400
devices typically use twinax cable, but by using baluns they can be run
over category 5 unshielded twisted pair wire.
Hubs
Hubs play several important roles in ensuring a fault tolerant network:
The best physical wiring plan is useless unless it is connected to an equally
well thought out hub deployment.
Hubs prevent errors from a device from impacting the rest of the network.
You need to design your network as a series of defenses against the problems
of the different devices on it. Hubs are your first line of defense. All
your hubs should be manageable and ideally support RMON, so you know what's
happening at that first line.
Hubs can be fault resilient. Recognize, however, that no matter how robust,
a hub is a single point of failure. If it crashes, everyone connected to
it is down. There are three ways to minimize this:
Connect every workstation with dual cable to separate hubs on
separate segments.
Use small stackable hubs to keep the number of "eggs in the
basket" (devices connected to a hub) small, perhaps between 16 and
32.
Stripe users in each department across multiple hubs.
The quality of your wiring plan extends to the connection to the hub. Ensure
that your contractor has bid quality, certified components in the patch
panel and the patch cords. Insist that testing include the patch panel itself.
Label or otherwise document the connections between patch panel and hub
port. Ideally, a central database would give a remote support technician
the confidence to know which patch panel number is connected to what hub
port number.
Overengineer the number of hubs, keeping down the number of connections
per hub. I would recommend thinking about hubs in the 32 to 48 port range
today, less tomorrow. Besides minimizing the impact of hub failure, they
ensure adequate bandwidth and room for growth. Think seriously about smaller
stackable hubs, especially in the wiring closets. They have a smaller impact
if they fail. They are easier to swap out. They are easier to re-use or
re-deploy.
Select hubs that are fault resilient. The larger central hubs in the MDF
should at a minimum have backup power supplies. Ideally, they have redundant,
load-sharing power supplies and fans.
You can configure redundant fiber links such as FDDI between the central
hubs and the hubs in the wiring closets.
As your first line of defense, hubs should support, and you should enable,
defensive measures such as retiming, beacon recovery, wrong speed detection,
and jitter elimination for Token Ring, and autosegmentation based on collision,
streaming node, runt and corrupt frame thresholds for Ethernet.
All the hub management features in the world won't make a difference, however,
unless you enable them and monitor them. Given today's client workstation
environment, it is not always practical for networking folks to insist on
active monitoring of every client. But that is not so in regard to hubs.
All of the leading hub vendors support defensive measures and monitoring,
but it's up to you to manage it. This is the time to start planning your
network management strategy. Make sure that you include the costs to procure
the right management application for your hubs.
Today's networks generate thorny, difficult to analyze, distributed problems
from hell. There is some debate on the matter, but a hub that supports RMON,
and therefore supports the remote capture of LAN packets for analysis by
a network management system or protocol analyser, is certainly better than
no visibility to the segment at all. The ability to diagnose problems centrally
means that you will solve them earlier, thereby minimizing transient
outages.
Routers and Protocols
Routers mean protocols, and protocols need consistency to succeed.
Consider protocol choice first. As a practical matter, it's impossible to
support all the protocols everybody may want on your network, much less
your backbone or WAN, and still provide them with the same level of support.
Different protocols have different levels of resilience. Our goal for high-stake,
fault-tolerant networking is to eliminate unsupported protocols at the outset
and to create uniform local and backbone protocols. By eliminating unsupported
protocols, we mean just that. Kick them off the LAN. Do not allow systems
onto your network that use protocols you can't support. Twist, or break,
arms until in-house or third-party applications comply. If they can't or
won't comply, make it clear they will not get the full protection against
failure that your standards support.
Having achieved that, you will need to create a robust manageable scheme
for the protocols you will support. This means consistency. For TCP/IP,
it means a single Internet address scheme system wide. For IPX, it means
a central assignment of network numbers. For source-route bridging, it means
central assignment of ring numbers.
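For TCP/IP, a single address scheme can be generated and checked mechanically. This sketch uses Python's standard ipaddress module; the private 10.1.0.0/16 block and the segment names used with it are assumptions chosen for illustration only.

```python
import ipaddress


def allocate_subnets(site_block, segment_names, new_prefix):
    """Hand out one subnet per LAN segment from a single site-wide block."""
    block = ipaddress.ip_network(site_block)
    subnets = block.subnets(new_prefix=new_prefix)
    plan = {}
    for name in segment_names:
        try:
            plan[name] = next(subnets)
        except StopIteration:
            raise ValueError("address block too small for all segments") from None
    return plan


def overlapping(plan):
    """Return pairs of segments whose subnets overlap (should be empty)."""
    names = sorted(plan)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if plan[a].overlaps(plan[b])
    ]
```

For example, `allocate_subnets("10.1.0.0/16", ["seg-A", "seg-B"], 24)` hands out consecutive /24 subnets, and `overlapping` flags any address range that someone later assigned by hand inside an existing one.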
For the routers themselves, it means consistency in the way routers are
configured. For example, if IPX SAP broadcasts are filtered, filter them
the same way in all routers. The router is your second line of defense,
so use it to defend the backbone and WAN against outages by blocking inconvenient
protocols such as NetBIOS, AppleTalk and IPX broadcasts.
Consistency applies to the hardware as well. Inventory your routers. As
much as is practical, use a limited subset of cards and features. Upgrade
or replace all routers to make sure all support flash memory and can support
a uniform set of firmware.
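Consistency is easy to promise and hard to verify by eye. As a rough sketch, assume each router's relevant settings have been exported into a simple name-to-value table (the setting names and values here are invented for illustration and do not correspond to any vendor's configuration syntax). A few lines of Python can then report every setting that is not identical everywhere.

```python
def inconsistent_settings(router_configs):
    """Return settings whose values differ between routers.

    `router_configs` maps a router name to a {setting: value} dict.
    A setting is reported if routers disagree on its value or if some
    router does not define it at all.
    """
    by_setting = {}
    for router, config in router_configs.items():
        for setting, value in config.items():
            by_setting.setdefault(setting, {})[router] = value
    routers = set(router_configs)
    return {
        setting: values
        for setting, values in by_setting.items()
        if len(set(values.values())) > 1 or set(values) != routers
    }
```

An empty result means every router carries the same value for every setting, which is the state the paragraphs above argue for.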
When selecting routers, consider the features that apply to fault resilience.
On-line software reconfiguration and hardware service. The router
should support major changes to its configuration without requiring an outage.
This includes software configuration changes, adding new LAN and serial
ports, and hot swapping of router cards and power supplies.
Passive backplane and dual load-sharing power supplies. Dual power
supplies should be used. Either one should be able to support the entire
router independently. A passive backplane means that any failures should
occur in cards that you can swap. This should eliminate most foreseeable
disasters for the actual router cage.
Boot capabilities. The router should get on line quickly and should support
an intelligent hierarchy of boot paths. Boot information is typically stored
at several locations in the network: locally in flash memory, on a local
server, or on a more distant server at least another router away. The router
should be configurable to pick the next boot alternative in the hierarchy if
one is unavailable. (If it's booting from a server, make sure there is an
alternate, and back up your router scripts!)
Deploying routers raises several considerations. Cabling and hubs cascade
data into the routers. Depending how you have striped the users across the
hubs, remember to stripe LAN segments across the routers. For example, make
sure that a single router outage cannot take out an entire section of the
building. If you choose to deploy two LAN segments to each desktop, make
sure that each of those segments terminates in a different router.
A more robust alternative to striping is available. A second router per
segment can transparently mask the effect of primary router failure. After
a router failure, robust protocol implementations in both the desktop and
router will kick in. The second router will take over for the first. As
long as we ensure a second path into the backbone, connectivity should not
be affected. Remember to factor in the impacts to load balancing when routers
housing WAN serial links go down. Correct use of routing metrics ensures
load balancing and switchover. Please test this before you depend on it.
Backbones
For fault tolerant facilities we need a reliable, self-healing backbone.
The factors involved in local backbone design are isolation, alternate
paths and robust media.
Isolation means dedicating a LAN segment exclusively to communications among
routers. Only routers or equivalent switches can participate in these links.
This means that end-user segments, servers and gateways are deliberately
excluded. The local backbone is the final line of defense, the citadel. You
want to defend it from problematic devices and protocols.
Alternate paths, as applied to the local backbone, means at least two links.
For example, the primary local backbone may be an FDDI ring, while an Ethernet
or Token Ring segment provides an alternate. Whenever a link failure occurs,
the alternate provides backup.
Robust media means overengineering. FDDI is a mainstream solution for local
backbones. Each FDDI ring is actually two separate rings. The protocol allows
traffic to alternate to the backup ring to bypass failures. Even with a
robust link between the routers, a second backup is advisable.
Some situations call for a more aggressive stand-in approach. Specialized
fail-over schemes, such as that in Digital's Virtual Router Cluster product,
may allow multiple routers to share a single address. When the primary router
fails, the backup router can step in. This does not require the use of specialized
discovery protocols.
Wide Area Links
No man is an island. No site is self contained. Wide area links are the
outer skin of the onion that forms the organization's backbone. We can apply
the same principles to them that we did to the local backbone.
If we designed the local backbone correctly we should not have any problems
with the data carrier on our WAN. Correct deployment of routing tables will
isolate traffic and filter miscellaneous broadcasts.
Alternate paths are applied the same way. If a link is important, make sure
there are two. One can approach the problem by using redundant parallel
circuits between important sites. A second approach to parallel circuits
is to use temporary dial-on-demand backup using Switched 56 or ISDN. Alternatively,
you can use a meshed topology where traffic is triangulated between sites.
This design will protect against link failure only if you correctly understand
your carrier's infrastructure, however. Most people ignore what happens to
the line after it leaves their building, but that's really like ignoring
another LAN segment. What you don't know can hurt you. If robust media links
are important on LANs, they are important on WANs. Ignorance is not bliss.
In too many situations the circuits you order from your local exchange carrier
(LEC) and interexchange carrier (IEC) share facilities. That means that
your dedicated 56kb line and your Switched 56 circuit can be affected by
the same telecommunications outage. The only way to address this is first
to become familiar with the carriers' facilities in the area and second
to diversify.
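Once you have learned which facilities each circuit actually passes through, finding the hidden single points of failure is a bookkeeping exercise. The Python sketch below assumes you have collected that information from your carriers; the circuit and facility names used with it are made up for illustration.

```python
from collections import defaultdict


def shared_facilities(circuits):
    """Map each carrier facility to the circuits that depend on it.

    `circuits` maps a circuit name to the set of facilities (central
    offices, conduits, building entrances) it passes through. Only
    facilities used by more than one circuit are returned.
    """
    users = defaultdict(set)
    for circuit, facilities in circuits.items():
        for facility in facilities:
            users[facility].add(circuit)
    return {f: sorted(c) for f, c in users.items() if len(c) > 1}
```

If a dedicated line and its supposed dial backup both show up under the same building entrance or central office, the backup protects you against much less than you are paying for.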
Here are some questions to ask yourself and your carriers:
For ISDN and Frame Relay especially, does the carrier's network certify
and support the devices you want to connect? Lack of certification leaves
you open for poor support and therefore outages. If you need to use multiple
carriers, are their networks compatible? Check references and pilot recovery
scenarios before you need to use them.
We have talked about your recovery; what about your vendor's?
What can you expect if there is an outage in its central office? How will
it re-establish your network? Remember, it is subject to the same software
glitches, power outages and cable cuts that you are.
What type of diverse routing do they use? What facilities do they
share or lease from other carriers?
For frame relay, is the switch you will be using located in the local
central office? What happens when it crashes?
Check to make sure your LEC and IEC are coordinated about the recovery of
your network.
You may elect to create a separate entrance facility into your site (usually
a new trench or overhead wire) where new fiber or copper facilities go
to a separate central office. Ideally, the second facility entrance should
be on the opposite side of the building from the original. That way no single
backhoe can take both out. The ultimate in WAN robustness diversifies across
two carriers for both local and long distance access. In some ways, this
remains only in the realm of the large users. As regulatory barriers drop,
however, alternate carriers increasingly will compete for your network dollar.
If you have value added network links such as credit card verification,
EDI or Internet access, the same rules apply. If they are critical to your
business, you need to know their recovery plans as well.
Servers
You can approach server availability in two ways: lost data or lost time.
The approach to take is make it right, then make it fast.
Step one is to make your servers fault resilient. Several vendors sell fault
resilient servers with basic levels of fault tolerance including ECC memory,
RAID arrays and multiple NIC cards. Every vendor from AST to Zenith has
some type of resilient server. Be careful with the distinction between fault
tolerant and fault resilient, however. Very few can claim no single point
of failure.
Other pertinent features to look for in a fault tolerant system are:
Hardware-based reboot and power on/off for remote support of hung
servers.
OS support for changes in configuration and drivers without requiring
a re-boot.
A UPS interface for a graceful shutdown in the event of a major power
outage.
Hot-swappable drives, both RAID and otherwise.
Realistic backup and restore. Will the system be usable during backups
and restores, or will performance or file locks reduce usability?
Remote management of everything.
The way to prevent lost data is to record it simultaneously to multiple
media. Use either disk mirroring or RAID. We won't spend time describing
or debating that here, with one exception: Make sure the drive farm is accessible
by two SCSI adapters. Often a fault-resilient server talks to a fault-tolerant
RAID device across a single SCSI controller card. What if the card fails?
The next question is time. If a server fails, your data may be safe, but
can you get the standby server up in time? Test to see how long it will
take to load a cold, supposedly identical, standby. You may be shocked to
see how long it takes. Experiment with more tape drives, extra disks, and
more memory to see if the time to restore can be reduced. A cold standby
can be an effective alternative if you have the time.
A variety of products on the market, including Novell's SFTIII and Vinca
Corp.'s StandbyServer, further extend the reliability factor by providing
hot standby. The two systems exchange an "I'm alive" message.
When one stops sending, the standby system comes on line. The key point
here is that these types of solution capture disk writes and replicate them
to two live servers. Every piece of data written to the volume is captured
on the backup server. The backup server is always on-line ready to step
in. For the Unix market, IBM's HACMP/6000 software and HP's SwitchOver/UX
fill similar requirements.
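The "I'm alive" exchange itself is not complicated. The following Python sketch shows only the standby's side of the decision and is not how any of the products named above is implemented; the timeout value and the injectable clock are assumptions added so the logic can be exercised without waiting.

```python
import time


class HeartbeatMonitor:
    """Standby-side view of the primary's "I'm alive" messages (a sketch)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence tolerated
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = clock()
        self.active = False         # True once the standby has taken over

    def heartbeat(self):
        """Call whenever an "I'm alive" message arrives from the primary."""
        self.last_seen = self.clock()

    def check(self):
        """Take over if the primary has been silent longer than the timeout."""
        if not self.active and self.clock() - self.last_seen > self.timeout:
            self.active = True
        return self.active
```

The hard part in practice is not this loop but everything around it: choosing a timeout that rides out a busy primary without delaying a real takeover, and making sure the two servers never both believe they are in charge.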
If you don't have the time, there are two alternatives to hot standbys: applications-based
redundancy and peripheral switching.
Applications-based redundancy refers to applications-level routines designed
to protect data integrity, such as two-phase commit or various data replication
processes. Server availability is only as good as the availability of applications
running on it. As we move from simple file and print services to server-based
applications such as messaging, database management and image search engines,
we can't ignore the recoverability and resilience features of those applications.
Database management systems which support two-phase commit typically are
more robust when it comes to recovery and ensuring no data is lost, for
example.
Peripheral switching places an intelligent switch on the SCSI chain between
the server and the disk farm (typically a RAID device). This is a logical
extension of the matrix switches used for years in large mainframe shops
to move huge numbers of peripherals from one mainframe channel to another.
A typical example of peripheral switching would unfold like this. Two RAID
devices are connected to the switch along with two servers. One server is
primary, the other an active backup. Each server owns one of the RAID devices.
In the standby server, a background application runs which periodically
polls the primary and performs small disk reads to ensure operation. If
the test fails, it waits a configurable period to retry. After a second
failure it signals the switch to move the failed server's RAID device to
the secondary. The secondary then mounts the volumes and starts the appropriate
applications. The SCSISwitch from ApCon provides a 4x2 configuration of
stackable SCSI hubs that can be extended with fiber strands.
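The poll, wait, retry and switch sequence just described can be sketched as one polling cycle in Python. The `probe` and `switch_to_standby` callables are hypothetical stand-ins for the real health check and the command sent to the switch; nothing here reflects an actual vendor interface.

```python
import time


def watch_primary(probe, switch_to_standby, retry_delay=5.0, sleep=time.sleep):
    """One polling cycle run by the standby server.

    `probe` returns True if the primary answers and a small disk read
    succeeds. After a failed probe, wait `retry_delay` seconds and try
    once more; a second failure triggers `switch_to_standby`.
    Returns True if the switch was signalled.
    """
    if probe():
        return False
    sleep(retry_delay)          # configurable wait before the retry
    if probe():
        return False            # transient problem, primary recovered
    switch_to_standby()         # move the RAID device to the secondary
    return True
```

Requiring two consecutive failures is what keeps a momentary hiccup on the primary from triggering an unnecessary, and disruptive, switchover.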
Finally, don't forget about strategic server deployment. Following the cascading
redundancy in our cabling, hub and router deployment, there are a few configuration
issues:
All servers should have at least two NIC cards.
Those interfaces should be connected to two different LAN segments.
Each of those segments should connect to separate interfaces on different
routers.
Application developers and third-party applications
You can't exempt applications on either the client or server from their
responsibility in keeping things running. Insist that application testing
includes taking part in your failure simulations. At a minimum, look for
the following features in applications:
Intelligent resumption of broken connections
Support for two phase commit or data replication
Hot standby or alternate servers
How much processing can continue when parts of the network are down
In addition to maximizing the stability of the software environment, we
should encourage developers to assume that failures will happen. Development
staff should consider automatic detection of failures, and automatic logons
to alternative servers. Such actions can minimize the impact of an outage.
Sometimes a component can fail, but the system can be useful without it.
Take the average in-house developed system. As a network representative
on the development team, ask these evaluation questions: How does it perform
on the LAN? How well does it perform on a dial in line? How well does it
perform, if at all, when it is disconnected? Deal with these questions before
the application is developed and deployed. The majority of client server
apps simply die if they cannot find their server. Is there any useful work
that can be accomplished without the central database? If so, how will the
work accomplished during downtime processing be synched up with the server?
Is there an alternate server that is available? Can the application be coded
to try an alternate and log onto it? How will the client know that the server
is back up and they need to log onto it? How many of these functions can
be automated?
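Several of these questions have straightforward answers in code. As a minimal Python sketch, assume a hypothetical `send(server, item)` transport call that raises ConnectionError when a server is unreachable. The client tries each server in turn, queues work locally when none answers, and replays the queue once a server is back.

```python
def submit(work_item, servers, send, offline_queue):
    """Try each server in order; queue the work locally if none answers.

    `send(server, item)` is a hypothetical transport call that raises
    ConnectionError when the server cannot be reached.
    Returns the server that accepted the item, or None if it was queued.
    """
    for server in servers:
        try:
            send(server, work_item)
            return server
        except ConnectionError:
            continue            # fall through to the next alternate
    offline_queue.append(work_item)
    return None


def resync(servers, send, offline_queue):
    """Replay queued work, oldest first, once a server is reachable again."""
    while offline_queue:
        item = offline_queue[0]
        if submit(item, servers, send, []) is None:
            break               # still down; keep the rest queued
        offline_queue.pop(0)
    return len(offline_queue)
```

This sketch ignores the difficult part, which is deciding whether work done during the outage can be replayed safely or conflicts with what happened on the server in the meantime. That question belongs in the application design, not in the network.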
Network management
In one sense, network management doesn't have a large role in fault tolerance
design. As we have seen, surviving a failure depends on every link of the
chain: client, cabling, hub, router, server and application. It's really
too automatic a process for network management. Avoidance of outages, minor
and major, however, requires management of the network: clients, servers and
network components alike.
Summary
This article attempted to instill a certain attitude, a certain urgency,
regarding the design and support of fault-tolerant networks. You may not
find all the answers here, but hopefully you have found a few good questions.