Building Reliability Into IT

Mike Fratto

September 27, 2010

5 Min Read

There's a lot to like about Twitter; for me, the interesting discussions that crop up are a main draw. This was especially true recently with a conversation about the foundations and premises of cloud computing. The discussion (sorry, I have yet to find a way to retrieve an entire Twitter thread) started with Sam Johnston saying "Legacy: unreliable software on reliable hardware. Cloud: reliable software on unreliable hardware." That's the kind of loaded statement Johnston likes to throw out to the world, but the salient point is that software should not only fail gracefully, it should survive gracefully. That's my kind of sentiment.
 
Building reliable computing systems is more than cobbling together a bunch of components that have a long mean time between failure (MTBF) or a short mean time to repair (MTTR). I know, it's what we did, and still do, and we pay a premium for these supposedly highly available systems. It was the best way at the time to build reliable systems. The problem is that the more components a system has in series, even components with 99.999 percent uptime, the less reliable the system becomes. Each component depends on all the other components in the system; if one component fails, the system fails.
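
To see how quickly series reliability erodes, here's a quick back-of-the-envelope calculation (a sketch; the component counts are arbitrary):

    # Series reliability: every component must work for the system to work,
    # so per-component availabilities multiply and can only shrink.
    def series_availability(component_availability, count):
        return component_availability ** count

    five_nines = 0.99999  # 99.999% per component
    for n in (1, 5, 10, 20):
        print(f"{n:2d} components in series: {series_availability(five_nines, n):.7%}")

    # 20 components in series land around 99.98%, already short of five nines.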

Increasing reliability involves adding parallel redundancy. With parallel redundancy, there are multiple, independent execution paths so that if one path fails, the other can pick up the work. Pixelbeat has a nice description of series vs. parallel reliability. Consider a RAID 1 array with two disks, each with 99.999% reliability: because the disks are in parallel, the array fails only when both disks fail, and the combination yields ten nines, or 99.99999999%, reliability.
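
Run the same kind of arithmetic for the mirrored pair and you can see where the extra nines come from (this sketch assumes the disk failures are independent):

    # Parallel reliability: the pair fails only when *both* disks fail,
    # so the unavailabilities multiply.
    def parallel_availability(component_availability, count):
        return 1 - (1 - component_availability) ** count

    disk = 0.99999  # 99.999% per disk
    print(f"Single disk: {disk:.5%}")                             # 99.99900%
    print(f"RAID 1 pair: {parallel_availability(disk, 2):.10%}")  # ten nines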

Systems built with better-quality components, and with components in parallel like RAID, dual power supplies, N+1 fans, and so on, had higher reliability, but they also cost more than consumer-grade systems. The hardware still failed, but a higher MTBF meant that on average it failed less often. In addition, many of the parts could be replaced without pulling the system from the rack, reducing MTTR. That was part of the premium of buying "enterprise class" servers from well-known names like Dell, HP, IBM and Sun versus white-box vendors. The premium servers ought to be, and often were, more reliable than white boxes.
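
The MTBF/MTTR relationship is easy to put in numbers: steady-state availability is MTBF / (MTBF + MTTR), so cutting MTTR with hot-swappable parts buys nines just as surely as raising MTBF. A rough sketch (the hours below are made up for illustration, not vendor figures):

    # Availability = MTBF / (MTBF + MTTR)
    def availability(mtbf_hours, mttr_hours):
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Hypothetical figures:
    print(f"{availability(50_000, 24):.5%}")   # next-day field repair: ~99.95%
    print(f"{availability(50_000, 1):.5%}")    # hot-swap spare on hand: ~99.998%
    print(f"{availability(200_000, 1):.5%}")   # better MTBF and MTTR: ~99.9995%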

Building reliable hardware worked well when IT bought monolithic software applications that ran on single servers until the inevitable happened. Applications like Microsoft's Exchange 2003 mailbox server, for example, were prone to outages because if the mailbox server failed (a single point of failure), it took down Exchange. Granted, Exchange could be clustered, but Microsoft's clustering at the time caused as many problems as it solved. But I digress. The mailbox server, and the mailboxes it contained, was tightly coupled to the hardware it ran on. Even if the Exchange mailbox server was virtualized, uptime doesn't improve by much. VMware's HA, for example, can detect a failure and restart the failed VM automatically, which is good, but a few of those restarts a year and you drop from five nines to something less.
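
To put that in perspective, five nines allows only about 5.3 minutes of downtime a year, so a handful of automatic restarts blows the budget. A quick sketch (the three-minute restart time is my assumption, not a VMware figure):

    MINUTES_PER_YEAR = 365.25 * 24 * 60

    # Five nines leaves roughly 5.26 minutes of downtime per year.
    budget = MINUTES_PER_YEAR * (1 - 0.99999)

    # Assume each HA restart costs about three minutes of outage
    # (failure detection plus guest OS and application boot).
    downtime = 3 * 3  # three restarts a year

    print(f"Five-nines budget: {budget:.2f} minutes/year")  # ~5.26
    print(f"Three restarts:    {downtime} minutes/year")    # 9, over budget
    print(f"Resulting availability: {1 - downtime / MINUTES_PER_YEAR:.5%}")  # ~99.998%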

Taking advantage of that agile hardware means having software that is designed to be resilient. A three-tier web application is a perfect example of a design that can be resilient. HTTP is a stateless protocol, and web applications are often composed of two or three tiers. For example, you might have a web server tier that handles session functions, an application tier that actually crunches data and performs actions, and a database tier that stores data and maybe does some processing. An application delivery controller (ADC) can be inserted between any of these tiers to provide automatic resiliency by decoupling the server layers from each other. But that isn't sufficient to provide real reliability. A failure at the web server layer--the one that interacts with the client--disconnects both the client's session and the web server's sessions to the application servers. You could solve that with an ADC that supports stateful HTTP fail-over, or you could write code that supports HTTP statefulness at the application layer, as sketched below. Either way is fine, but if you are using an ADC, then it becomes part of your application.

But what about other applications that aren't as easily supported? Many custom-built applications are not good candidates for parallel reliability because the original specifications were modest and relied on expensive hardware for reliability. InformationWeek Analytics contributor Mike Davis pointed out in a conversation with me that retrofitting existing applications may not be as hard as you think. Making the software smarter may not require a complete rebuild of the application. Any modern software application should be written in a modular enough fashion that separating the tiers, for example, and adopting a service broker model can make the application more amenable to parallel reliability.
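
Going back to the web-tier example, one common way to get HTTP statefulness at the application layer is to keep session state out of any single web server's memory, so a surviving server can pick up where a failed one left off. A minimal sketch in Python (the dictionary stands in for an external shared store such as a database or memcached; the names are illustrative, not from any particular product):

    # Session state lives in a shared store, not in the web server process.
    # If the server handling a session dies, any peer can service the next
    # request because nothing essential was held in local memory.
    SESSION_STORE = {}  # stand-in for an external, shared store

    def handle_request(session_id, item):
        """Stateless handler: load the session, change it, write it back."""
        session = SESSION_STORE.get(session_id, {"cart": []})
        session["cart"].append(item)
        SESSION_STORE[session_id] = session
        return session

    # Any web server calling handle_request() with the same session_id sees
    # the same cart, so fail-over is just load balancing to a healthy peer.
    print(handle_request("abc123", "book"))
    print(handle_request("abc123", "lamp"))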

Regardless of the method used to modularize applications, there will be an impact on network design and operation. More transactions will require lower-latency networks to maintain application performance. More network equipment, like ADCs between application layers, is going to have to be factored into overall application designs and data center operations. The network layer is going to have to be more aware of application location. The separation between application architecture and network architecture is going to get very blurry.

It's not going to be enough for the network team to design the network in a vacuum. Similarly, the application development team is going to have to learn more about the capabilities built into the network. Both teams are going to have to work together to ensure your organization runs reliable applications. You can look at this additional teamwork as a chore to be completed or as an opportunity to build better systems. As IT professionals, I'd like to think the latter is the preferred choice.

A private cloud can play a key role in your disaster recovery strategy. We dig into the storage, LAN, and WAN requirements to build a cloud for DR. That and more--including articles on automated data centers and SaaS Web security--in the new all-digital issue of Network Computing. (Registration required.)

