Managers often find it nearly impossible to devote time to adequately understanding
issues that do not contribute directly to the bottom line. Consequently,
it is usually difficult to justify the expense associated with the network,
and as we will see later on, the price tag goes up with the level of redundancy.
One useful technique for securing adequate funding, however, is to lay out
in black and white the cost of downtime.
The Cost of Downtime
In this simplistic example, we examine the cost of downtime for a mythical
consumer-oriented business, such as an airline's or hotel's reservation
center. The customers have a choice. If they cannot reach the reservation
center, they will call a competitor and place their order there. Lost business
is really gone for good.
Our hypothetical customer service center has a staff of 500 people, each
of whom carries a burdened cost of $25 an hour. They make an average of
60 transactions per hour and an average of three high-priced sales per hour.
Hours of operation are 24 hours a day, seven days a week, 365 days a year.
In actuality, line managers of the site should calculate the costs of downtime,
not the IS staff. This information often is not forthcoming, however, so
you present your own estimate to give a general sense of the impact that downtime has on
the bottom line. The goal is to open some eyes and generate some debate.
Use this example as a guideline for how to estimate the cost of outages
in your environment.
In our hypothetical network, with an availability rate of 99.9 percent, the
cost of outages is about half a million dollars a year. We have already
bought the hardware and software necessary to do the job, so we can treat
this estimate as a guideline for the additional budget to spend on providing
redundancy. This is separate and apart from the funds required to provide
a base level of network functionality.
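The arithmetic behind an estimate of this kind can be sketched in a few lines of Python. This is only a rough illustration, not the calculation used for the figure quoted here. The staff size, burdened rate, sales rate, and availability come from the example; the revenue per sale (REVENUE_PER_SALE) is a hypothetical placeholder, since the example does not state one. The sketch also assumes that the three sales per hour are per staff member and that every outage idles the whole staff and loses every sale.

```python
# Rough cost-of-downtime estimate for the hypothetical reservation center.
HOURS_PER_YEAR = 24 * 365  # open around the clock, all year


def annual_downtime_hours(availability):
    """Expected hours of outage per year for an availability such as 0.999."""
    return HOURS_PER_YEAR * (1.0 - availability)


def annual_downtime_cost(availability, staff, burdened_rate,
                         sales_per_staff_hour, revenue_per_sale):
    """Idle labor plus lost sales, assuming every outage is a total outage."""
    hours = annual_downtime_hours(availability)
    idle_labor = hours * staff * burdened_rate
    lost_sales = hours * staff * sales_per_staff_hour * revenue_per_sale
    return idle_labor + lost_sales


if __name__ == "__main__":
    REVENUE_PER_SALE = 100.0  # ASSUMPTION: placeholder, not from the example
    cost = annual_downtime_cost(0.999, staff=500, burdened_rate=25.0,
                                sales_per_staff_hour=3,
                                revenue_per_sale=REVENUE_PER_SALE)
    print(f"Downtime: {annual_downtime_hours(0.999):.2f} hours/year")
    print(f"Estimated cost: ${cost:,.0f}/year")
```

At 99.9 percent availability this works out to about 8.76 hours of outage a year, and idle labor alone comes to roughly $109,500. The rest of the estimate depends entirely on the value you assign to a lost sale, which is why line managers are the right people to supply that number.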
Some additional industry statistics may help. The September 1994 issue
of HP Professional carried an article, "Down but not out," which said:
The average company loses two to three percent of its gross sales within
10 days after losing its data processing, and critical business functions
cannot continue for more than 4.8 days without a recovery plan in progress.
Half of the companies that do not restore their data center to operation
within 10 business days never fully recover. Ninety-three percent of the
companies lacking a recovery plan are out of business within five years
of a major disaster.
It is really not worth rushing headlong into designing a fault-tolerant
network unless all parties agree on all the implications that downtime has
for the operation. This is the time to seek an executive sponsor to champion
the process. Assuming there is a consensus on the real cost of downtime,
now we can move on to crafting a plan of action.
The Service Level Agreement
That plan should start with a service level agreement. A service level agreement
is simply a contract between your corporate customer and the IS department.
Basically, the service level agreement formalizes the relationship on a
customer/supplier basis and documents the understanding between the two
parties. Some IS departments will view the process with skepticism.
It can be unnerving to relinquish the upper hand if users have been viewed
as mere consumers - not valued customers.
However, as a way to secure funding and, more importantly, to document the
responsibilities and expectations of all parties, this process fits many
medium-to-large organizations. It should be a win-win for all involved. We should strive
to maximize results, concentrate effort, and recommend organizational change
where appropriate.
At the outset of the plan, note that fault tolerance is not simply a response
to failure. It involves an ongoing cycle of planning, design, daily monitoring,
long-term trend analysis, and regular re-evaluation. We should include all
assumptions and forecasts as part of the plan and update it as assumptions
or growth change.
Measurement of progress against the plan should not be cast in terms of
how long a particular switch or server has been up. Tracking individual
components and subsystems is obviously important, but component uptime does
not by itself reflect the level of customer service. Rather, progress should
reflect the ability of the system to meet the users' expectations as
documented in the service level agreement.
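One way to express progress in the users' terms is to compute the availability of the service as users experienced it over a reporting period and compare it with the target in the agreement. The following Python sketch is illustrative only. The function names, the outage records, and the 99.9 percent target are assumptions, and it assumes that the recorded outages do not overlap one another.

```python
from datetime import datetime, timedelta


def achieved_availability(outages, period_start, period_end):
    """Fraction of the period during which users could reach the service.

    outages is a list of (start, end) datetime pairs describing outages
    as seen by users. Pairs are assumed not to overlap each other.
    """
    period_seconds = (period_end - period_start).total_seconds()
    down_seconds = 0.0
    for start, end in outages:
        # Count only the part of each outage that falls inside the period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down_seconds += (clipped_end - clipped_start).total_seconds()
    return 1.0 - down_seconds / period_seconds


def meets_target(outages, period_start, period_end, target):
    """True if the achieved availability is at least the agreed target."""
    return achieved_availability(outages, period_start, period_end) >= target


if __name__ == "__main__":
    start = datetime(1995, 6, 1)
    end = start + timedelta(days=30)
    # Hypothetical outage log: one 20-minute and one 15-minute outage.
    log = [
        (datetime(1995, 6, 3, 9, 0), datetime(1995, 6, 3, 9, 20)),
        (datetime(1995, 6, 17, 14, 0), datetime(1995, 6, 17, 14, 15)),
    ]
    print(f"Achieved: {achieved_availability(log, start, end):.5f}")
    print("Meets 99.9% target:", meets_target(log, start, end, 0.999))
```

The important design choice is the input: the log records when users could not do their work, regardless of which component failed, so the result maps directly onto the commitments in the agreement.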
The service level agreement should document the understanding between the
parties about:
The priority that systems or groups receive in a triage situation.
Mandatory or core functions that need extra protection versus desirable
or support functions: In some situations, core functions can comprise as
little as 20%-30% of the total number of features.
User responsibilities. For example, only approved software will be
used.
The understanding that no unauthorized software will be installed
The responsibilities of all IS parties: development, support, database,
network, operations, and vendors
Time frames for response and repair
Expected levels of unplanned outage
Expected levels of planned outage
Expected performance characteristics during normal conditions
Expected performance characteristics during failure conditions
Certification process for new systems
Standards and guidelines for all components
Resolution of inadequate performance
Costs for different alternatives
Process for changes in forecasts
Exceptions, if any
Escalation procedures
Re-evaluation processes
Ideally, this should be applied to all parts of the system including central
and remote sites. Don't forget to consider your partners and customers external
to the organization.
Hold everybody's feet to the fire until you get participation. Draw up a
set of assumptions based on your own experience with the applications and
groups involved. Then, on an individual basis, set up interviews, meetings,
surveys or whatever it takes to get buy-in.
If this sounds like a lot of extra effort, it is. But the IS business is
about service. Excellence in customer service is the only real difference
between you and the competition. It makes sense to apply a certain amount
of rigor to the process. If yours is the type of shop that flies by the
seat of the pants, now is a good time to re-evaluate that position. Again,
it goes back to attitude. Organizational determination needs to exist in
order to truly provide for fault tolerance. Remember that the responsibility
for avoiding failures, recovering from them, and providing backup and restore
falls entirely on your shoulders. No vendor can relieve you of this responsibility.
Conforming to the Service Level Agreement
The point is to ensure consistency between design, implementation, and
our goals, and we need to put methods in place to do so. The first
step in ensuring the long-term quality of the effort is to determine which
statistics will be tracked. At this stage, we will need a plan to track
conformance to the service level agreement. The plan should include:
recommended methods and tools
change control
configuration management
daily and long-term statistics
documentation plan for service level reporting
Be practical about how much data to store in your database of statistics.
A roll-up, or summary, of statistics, if done with foresight, may be sufficient.
Plan on keeping it to a reasonable size. Break it into sections which can
be managed independently by different groups.
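As a rough illustration of such a roll-up, the Python sketch below condenses daily availability figures into one summary record per month, so that the daily detail can be archived or discarded. The function name, the date format, and the sample figures are all assumptions chosen for the example, not a prescribed layout.

```python
from collections import defaultdict
from statistics import mean


def roll_up_monthly(daily):
    """Summarize daily statistics by month.

    daily is an iterable of (date, value) pairs, where date is a string
    in YYYY-MM-DD form (an assumed format) and value is the daily figure.
    Returns a dict mapping YYYY-MM to a small summary record.
    """
    buckets = defaultdict(list)
    for day, value in daily:
        buckets[day[:7]].append(value)  # the first 7 characters are YYYY-MM
    return {
        month: {
            "days": len(values),
            "min": min(values),
            "mean": mean(values),
            "max": max(values),
        }
        for month, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    # Hypothetical daily availability percentages.
    samples = [
        ("1995-01-30", 99.95),
        ("1995-01-31", 100.0),
        ("1995-02-01", 99.20),
        ("1995-02-02", 100.0),
    ]
    for month, summary in roll_up_monthly(samples).items():
        print(month, summary)
```

Keeping the minimum alongside the mean matters here: a month with a single bad day can look healthy on average, and the worst day is often what users remember.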
Scope
Unfortunately, in the real world, there is no way to limit scope. Disasters
can occur anywhere in the network, at any level of the ISO/OSI communications
model, from the physical layer to the application layer. It can be useful, even necessary,
however, to separate the problem into logical groups. For example, most
corporate IS staff are divided into something like the following groups:
workstation support
network
server support
database
development, both client and server
In addition, two functions are shared across all groups:
planning
operations
In the following sections we will address the issues and responsibilities
regarding fault tolerance as they apply to those groups. Emphasis will be
placed on network considerations. However, remember that responsibility
is shared across all groups.