4 Steps To Prevent Infrastructure Outages

The busy lives we lead make us all very impatient people; personal tasks requiring multiple steps can feel insurmountable. Case in point: planning ahead to buy airline tickets for a family trip.

Say you’ve finagled 30 quiet minutes at the computer, coordinated everyone’s availability for a fall trip to Grandma’s house, and done so in time for a great fare sale. Now imagine your dismay when you can’t access your favorite airline’s website, or it’s so slow you can’t complete the purchase. Do you say “oh well” and try again later, or do you check competitors’ offers? Most likely it’s the latter, and you might be irked enough to share your bad experience on Twitter and Facebook.

Southwest Airlines experienced a multi-day slowdown in early June just as its fall fare sale got underway. Heavier-than-expected website traffic slowed the site so much that pages timed out for most customers. To make matters worse, the toll-free phone number was overwhelmed as well. Southwest anticipated increased demand, as usual for its annual fall fare sale, but the additional capacity brought online beforehand wasn’t sufficient. This explanation isn’t likely to satisfy inconvenienced customers, or executives bemoaning revenue loss.

In our always-on, Wi-Fi world, customer expectations for ecommerce escalate rapidly and continuously. Even a few seconds of poor website performance (let alone a few days!) can be enough to send loyal customers elsewhere. IT proves its value by enabling the business to meet these elevated expectations through mature, optimized capacity management.

What can organizations do to avoid such customer service disasters? It comes down to four essential components of capacity management: predict and prevent; analyze meaningful metrics; plan in alignment with business needs; test and test again.

Predict and prevent

The best way to avoid losing revenue, reputation, and customers is to prevent outages, especially the type of routine failures that can’t be blamed on a major disaster. Collect and analyze machine, power, log, usage, and cost data, with particular emphasis on performance and resource consumption. Inventory and assess your current capacity in detail. Work with customer-facing business units to identify usage trends based on historical data as well as planned future initiatives.

Once you've collected and correlated meaningful data sets, you can apply predictive analytics. Running scenarios against the data (based on current and/or planned capacity) allows IT to foresee the point at which outages are likely to occur due to heavy use or machine failure. By using detailed data to drill down to the root of any unexpected results, IT can pinpoint weaknesses, then deliberately and permanently fix and test them before they turn into public embarrassments.

Intelligent, data-driven projections (and simulations when possible) reveal the cascading effects of forecasted growth or surges in traffic. Accurate predictions make it possible to carefully and cost-effectively provision ample resources to meet demand as needed, rather than adding them haphazardly after a capacity shortage has impacted end users.
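
As a rough illustration of this kind of projection, the Python sketch below fits a straight-line trend to weekly peak traffic and estimates how long current capacity will last, then checks whether a sale-sized surge would fit. It is a minimal sketch, not a full predictive model: the function names, the sample traffic figures, and the 1,000 requests-per-second capacity are all hypothetical, and real demand is rarely this linear.

```python
def fit_trend(values):
    """Least-squares line through the points (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until_exhaustion(weekly_peaks, capacity):
    """Weeks after the last observation until the trend line reaches capacity.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = fit_trend(weekly_peaks)
    if slope <= 0:
        return None
    crossing_week = (capacity - intercept) / slope
    return max(0.0, crossing_week - (len(weekly_peaks) - 1))


def survives_surge(current_peak, surge_multiplier, capacity):
    """What-if check: does a traffic surge of the given size fit in capacity?"""
    return current_peak * surge_multiplier <= capacity


# Hypothetical data: peak requests per second over the last six weeks.
peaks = [400, 450, 500, 550, 600, 650]
capacity_rps = 1000  # assumed provisioned capacity

print("weeks of headroom:", weeks_until_exhaustion(peaks, capacity_rps))
print("survives 2x sale surge:", survives_surge(peaks[-1], 2.0, capacity_rps))
```

Even a crude model like this makes the conversation with the business concrete: organic growth leaves weeks of headroom, but a promotion that doubles traffic would exceed capacity immediately.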

Analyze response, not utilization

With so much data being generated in the data center, it can be hard to know what to analyze, and IT is often sidetracked by metrics that only tell part of the story. Focus on performance, not machine utilization; understand how, when, and why your customers visit your site and how they expect it to perform. Investigate the actual end user expectations and experience. If planning future business initiatives, be sure to understand the business objective and what end user needs the change is supposed to address.

On the technical side, analyze latency and response time carefully to build an accurate statistical picture of the end user experience. Examine workloads, transactions, application performance, and VMs to see how much of the measured utilization is actual service activity and how much is time spent waiting for resources. Make the necessary adjustments to optimize for end user-oriented workloads and transactions.
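
To make the distinction concrete, here is a small Python sketch that summarizes response times with percentiles rather than an average, and measures how much of the total response time was spent waiting for resources. The request samples and function names are hypothetical, and the sketch assumes your monitoring can split each request into service time and wait time.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def wait_fraction(requests):
    """Share of total response time spent waiting for resources.

    requests is a list of (service_ms, wait_ms) pairs.
    """
    total_service = sum(service for service, _ in requests)
    total_wait = sum(wait for _, wait in requests)
    total = total_service + total_wait
    return total_wait / total if total else 0.0


# Hypothetical samples: (service time, wait time) in milliseconds.
requests = [(80, 20), (90, 10), (85, 15), (100, 400), (95, 5),
            (80, 20), (90, 10), (85, 15), (100, 900), (95, 5)]
response_times = [service + wait for service, wait in requests]

print("mean ms:", sum(response_times) / len(response_times))
print("p50 ms:", percentile(response_times, 50))
print("p95 ms:", percentile(response_times, 95))
print("wait share:", round(wait_fraction(requests), 2))
```

In this made-up sample the median looks healthy while the slowest requests take ten times as long, and most of the total response time is waiting rather than work. A utilization graph alone would not show either fact.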

Plan

If an ounce of prevention is worth a pound of cure, a good plan saves a ton of remediation. Anticipate the impact of sales (ahem, Southwest), promotions, new application and site rollouts, and seasonal or time-of-day surges. I can’t emphasize enough how important it is to collaborate with business units (marketing, sales, operations, etc.) when doing this type of planning. When planning architecture upgrades, include time and resources to run simulations so you have an accurate idea of how the new architecture will perform for the end user under various scenarios.

You can’t predict every possible outcome or control every factor, so it’s imperative to plan your response to an outage or slowdown. The speed and effectiveness of the response make the difference between hiccup and havoc. The Southwest outage stretched to two days, a virtual eon in ecommerce time, and call volume quickly maxed out the only alternative. Incident response planning is a vital component of your core offering and your ability to compete.

Test

Testing, like proper planning, requires time and resources that can be challenging to justify. It’s important to remember that thorough, targeted testing can reveal unforeseen incompatibilities, glitches, and capacity issues. Earlier this year, iTunes and Apple Store outages caused by an internal DNS configuration mistake cost Apple an estimated $25 million in lost revenue in just 12 hours.

Test both before and after each change or upgrade. All too often, testers skip either the before test or the after test, or move ahead without explaining every difference between the two sets of results. Testing repeatedly for various scenarios can help prevent the customer and revenue loss that often follows a service failure. It also avoids public embarrassment and subsequent brand or reputation damage, which can have a long-lasting impact.
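
One way to keep the before-and-after comparison honest is to automate it. The Python sketch below compares two sets of test results, flags any metric that got worse by more than a chosen tolerance, and lists metrics that appear in only one run so that no difference goes unexplained. The metric names, values, and the 10 percent tolerance are hypothetical, and the sketch assumes every metric is "lower is better" (latency, error rate).

```python
def compare_runs(before, after, tolerance=0.10):
    """Compare lower-is-better metrics from two test runs.

    Returns (regressions, unmatched):
      regressions maps metric name to relative increase, for metrics that
        got worse by more than the tolerance
      unmatched is a sorted list of metrics present in only one run
    """
    regressions = {}
    for name in before.keys() & after.keys():
        old, new = before[name], after[name]
        if old > 0:
            change = (new - old) / old
            if change > tolerance:
                regressions[name] = change
        elif new > 0:
            # Went from zero to something: always worth a look.
            regressions[name] = float("inf")
    unmatched = sorted(set(before) ^ set(after))
    return regressions, unmatched


# Hypothetical results captured before and after an upgrade.
before = {"p95_ms": 400, "error_rate": 0.01, "checkout_ms": 800}
after = {"p95_ms": 520, "error_rate": 0.01, "login_ms": 300}

regressions, unmatched = compare_runs(before, after)
print("regressions:", regressions)
print("metrics missing from one run:", unmatched)
```

A check like this does not replace judgment, but it makes a skipped test or an unexplained difference visible before the change reaches customers.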

The healthcare.gov debacle is a perfect cautionary tale; the widespread service failures and delays resulting from a lack of planning and testing had far-reaching political, financial, and public welfare effects that are still reverberating years later.

In the vast and complex ecosystem of websites and online services, there are too many factors at play to guarantee flawless performance. Sustaining competitive advantage depends on winning and keeping happy customers, which is hard to do without optimized IT service delivery. Capacity planning and testing based on a thorough, data-driven understanding of your systems and how your customers interact with them are key to resiliency and growth.