Network Computing is part of the Informa Tech Division of Informa PLC

Building Reliability Into IT

There's a lot to like about Twitter; for me, the interesting discussions that crop up are the main draw. That was especially true of a recent conversation about the foundations and premises of cloud computing. The discussion (sorry, I have yet to find a way to retrieve an entire Twitter thread) started with Sam Johnston saying, "Legacy: unreliable software on reliable hardware. Cloud: reliable software on unreliable hardware." That's the kind of loaded statement Johnston likes to throw out to the world, but the salient point is that software should not only fail gracefully, it should survive gracefully. That's my kind of sentiment.

Building reliable computing systems takes more than cobbling together a bunch of components that have a long mean time between failures (MTBF) or a short mean time to repair (MTTR). I know, it's what we did, and still do, and we pay a premium for these supposedly highly available systems. It was the best way at the time to build reliable systems. The problem is that the more components you put in series, even ones with 99.999 percent uptime each, the less reliable the system becomes. Each component depends on all the other components in the system. If one component fails, the system fails.
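
To put rough numbers on the series problem, here is a minimal Python sketch. It assumes every component fails independently and that the system is up only when all of its components are up, so the availabilities simply multiply. The function names and the component counts are mine, chosen for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def series_availability(availabilities):
    """Availability of components in series: all of them must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Chain together more and more five-nines components and watch uptime erode.
for n in (1, 5, 10, 50):
    a = series_availability([0.99999] * n)
    print(f"{n:>2} components in series: {a:.6%} available, "
          f"~{downtime_minutes_per_year(a):.1f} min/year down")
```

Under those assumptions, ten five-nines components in series land at about four nines, which is a bit under an hour of expected downtime a year instead of about five minutes.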

Increasing reliability involves adding parallel redundancy. With parallel redundancy, there are multiple, independent execution paths, so that if one path fails, another can pick up the work. Pixelbeat has a nice description of series vs. parallel reliability, and their example is a RAID 1 array with two disks, each with 99.999% reliability. Since the two disks are in parallel, the array fails only if both disks fail at once. Assuming the disks fail independently, the chance of that is 0.00001 x 0.00001, so the combination yields ten nines, or 99.99999999% reliability.
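
The parallel case can be sketched the same way. This is an illustration under the assumption that the redundant components fail independently (real disks in the same chassis often don't), and the helper names are hypothetical.

```python
import math


def parallel_availability(availabilities):
    """Availability of redundant components: down only if all are down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


def nines(availability):
    """Express availability as a count of nines (0.999 -> 3.0)."""
    return -math.log10(1.0 - availability)


# Two mirrored disks, each assumed to be 99.999% available.
mirror = parallel_availability([0.99999, 0.99999])
print(f"single disk: {nines(0.99999):.1f} nines")
print(f"mirrored pair: {nines(mirror):.1f} nines")
```

The number of nines doubles with the second disk because the failure probabilities multiply. Correlated failures, such as a shared controller or a bad firmware batch, would pull the real figure down.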

Systems built with better quality components and with components in parallel, like RAID, dual power supplies, N+1 fans, etc., had higher reliability, but also cost more than consumer-grade systems. The hardware still failed, but a higher MTBF meant that, on average, it failed less often. In addition, many of the parts were easily replaceable without taking the system out of the rack, reducing MTTR. That was part of the premium of buying "enterprise class" servers from well-known names like Dell, HP, IBM and Sun versus white box vendors. The premium servers ought to be more reliable than white boxes, and often were.

Building reliable hardware worked well when IT bought monolithic software applications that ran on single servers, until the inevitable happened. Applications like Microsoft's Exchange 2003 Mailbox server, for example, were prone to outages because the Mailbox server was a single point of failure: if it failed, it took down Exchange. Granted, Exchange could be clustered, though Microsoft's clustering at the time caused as many problems as it solved, but I digress. The Mailbox server and the mailboxes it contained were tightly coupled to the hardware they ran on. Even virtualizing the Exchange Mailbox server doesn't improve uptime by much. VMware's HA, for example, can detect a failure and restart the failed VM automatically, which is good, but do that a few times a year and you drop below five nines.
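
The arithmetic behind that last point is simple enough to sketch. Five nines leaves a budget of roughly five minutes of downtime a year. The three-minute outage per restart used below (failure detection, VM boot and service start) is purely my assumption for illustration, not a VMware figure.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def availability_after_outages(outages_per_year, minutes_per_outage):
    """Yearly availability given a number of outages of fixed length."""
    downtime = outages_per_year * minutes_per_outage
    return 1.0 - downtime / MINUTES_PER_YEAR


# Downtime budget that five nines allows in a year.
budget = (1.0 - 0.99999) * MINUTES_PER_YEAR
print(f"five-nines budget: {budget:.2f} minutes/year")

# Assumed: each HA restart costs 3 minutes of service downtime.
for restarts in (1, 2, 3, 4):
    a = availability_after_outages(restarts, minutes_per_outage=3)
    verdict = "meets" if a >= 0.99999 else "misses"
    print(f"{restarts} restart(s): {a:.5%} available, {verdict} five nines")
```

With that assumed restart time, one restart a year still fits inside the budget and the second one blows it.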

Taking advantage of that agile hardware means having software that is designed to be resilient. A three-tier web application is a perfect example of a design that can be resilient. HTTP is a stateless protocol, and web applications are often composed of two or three tiers. For example, you might have a web server tier that handles session functions, an application tier that actually crunches data and performs actions, and a database tier that stores data and maybe does some processing. An application delivery controller (ADC) can be inserted between any of these tiers to provide automatic resiliency by decoupling the server layers from each other. But that isn't sufficient to provide real reliability. When a server fails at the web server layer, the one that interacts with the client, both the client's session and the web server's sessions with the application servers are disconnected. You could solve that with an ADC that supports stateful HTTP fail-over, or you could write code that maintains HTTP session state at the application layer. Either way is fine, but if you are using an ADC, then it becomes part of your application. But what about other applications that aren't as easily supported?
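
To make the application-layer option concrete, here is a toy Python sketch of one common way to do it: the web servers keep no session state of their own and read and write it in a shared store, so any surviving server can pick up a client's session. Everything here is hypothetical and simplified. `SessionStore` is an in-memory stand-in for a shared database or cache, and `ToyADC` only mimics the routing role of a real ADC.

```python
class SessionStore:
    """In-memory stand-in for a shared session store (assumption)."""

    def __init__(self):
        self._data = {}

    def load(self, session_id):
        return dict(self._data.get(session_id, {}))

    def save(self, session_id, state):
        self._data[session_id] = dict(state)


class WebServer:
    """Web tier node that keeps no session state locally."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.healthy = True

    def handle(self, session_id, item):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        state = self.store.load(session_id)
        cart = state.get("cart", [])
        cart.append(item)
        state["cart"] = cart
        self.store.save(session_id, state)
        return self.name, list(cart)


class ToyADC:
    """Routes each request to the first server that answers."""

    def __init__(self, servers):
        self.servers = servers

    def route(self, session_id, item):
        for server in self.servers:
            try:
                return server.handle(session_id, item)
            except ConnectionError:
                continue  # try the next server in the pool
        raise RuntimeError("no healthy web servers")


store = SessionStore()
web1, web2 = WebServer("web1", store), WebServer("web2", store)
adc = ToyADC([web1, web2])

first = adc.route("s1", "book")   # served by web1
web1.healthy = False              # simulate a web server failure
second = adc.route("s1", "pen")   # web2 picks up the same session
print(first, second)
```

The client's cart survives the failure because the state never lived on web1. The trade-off is that the shared store now has to be reliable too, which puts you back to adding parallel redundancy behind it.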
