With the tools available today, organizations have few excuses -- not even budgetary ones -- for relying on the hours-long process of manually restoring mission-critical apps from backup.
Application failover approaches run the gamut, from basic clustered server "ping and a prayer" software to complete virtualized systems and application-specific schemes. Finding the one that's right for you will involve more than a glance at the price tag, which runs from $1,500 to $10,000-plus per protected server. You'll also need to consider ease of use, speed of failover, bandwidth consumption, and how much data is at risk.
When most system administrators look to improve application availability, they start with server clusters. Failover clustering has been available in Windows Server's Enterprise Editions since Windows NT 4 was state of the art in the mid-1990s, but it developed a well-deserved reputation for being finicky.
Until Windows Server 2008 was released, Windows clusters used shared storage, which of course made the storage subsystem a single point of failure. Microsoft also insisted on integrated server and storage solutions, so users who had Hewlett-Packard servers and EqualLogic storage, for example, were out of luck when it came to support. Most significantly, applications had to be cluster-aware to smoothly fail over from one server node to another.
Even before Microsoft added clustering to Windows itself, vendors like Double-Take Software released solutions that combined data replication, which eliminates storage as a single point of failure, with automatic failover. Early versions of these products required a lot of setup and tweaking, including installing the OS and applications on both servers. However, the current crop, such as SteelEye's LifeKeeper, CA/XOsoft's WANsync, NeverFail's Continuous Protection Suite, and, of course, Double-Take, can clone a production server to the standby server, both speeding setup and ensuring the servers are similarly configured. And some of these offerings support Linux clustering as well as traditional Windows clusters.
In a generic cluster or high-availability system, the failover server, or servers, monitor the primary host by exchanging heartbeat messages across the network (see diagram, "Two Ways To Keep Apps At Your Service"). If the primary host doesn't respond within a given period of time, the standby server assumes the primary host's identity and starts processing data in its place.
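To make the heartbeat timeout concrete, here is a minimal Python sketch of the decision a standby server makes. It is illustrative only and not taken from any product named in this article: the class name HeartbeatMonitor, its methods, and the timeout_seconds parameter are all hypothetical, and a real product would also handle the network transport and the identity takeover itself.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Hypothetical standby-side tracker for heartbeats from a primary host."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_seconds = timeout_seconds
        self.clock = clock                  # injectable so the logic can be tested
        self.last_heartbeat = clock()       # treat startup as the first heartbeat
        self.standby_active = False

    def record_heartbeat(self) -> None:
        """Call whenever a heartbeat message arrives from the primary."""
        self.last_heartbeat = self.clock()

    def check(self) -> bool:
        """Return True once the standby should take over for the primary."""
        silent_for = self.clock() - self.last_heartbeat
        if not self.standby_active and silent_for > self.timeout_seconds:
            # The primary has been silent too long: the standby takes over.
            self.standby_active = True
        return self.standby_active
```

In this sketch the standby stays active once it has taken over, even if heartbeats resume, so that two servers never claim the same identity at once. Failing back to the original primary is assumed to be a deliberate, separate step.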
This method can prevent data loss due to a complete failure of the primary host and allows manual failovers for patching and other server maintenance, but it can't detect more subtle failures of services and daemon processes. Vendors including SonaSoft and Marathon sell more app-aware offerings, which check the state of services or connect directly to applications to ensure they're running.
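The difference between a bare heartbeat and an app-aware check can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not how SonaSoft, Marathon, or any other vendor implements it: the names tcp_port_open and ServiceWatcher are hypothetical, and the probes here only test whether a service answers, where a commercial product might issue a real query against the application.

```python
import socket
from typing import Callable, Dict, List


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Simple probe: can we open a TCP connection to the service's port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class ServiceWatcher:
    """Hypothetical watcher that polls per-service probes on a live host."""

    def __init__(self, probes: Dict[str, Callable[[], bool]],
                 max_consecutive_failures: int = 3) -> None:
        self.probes = dict(probes)
        self.max_consecutive_failures = max_consecutive_failures
        self.failures = {name: 0 for name in self.probes}

    def poll(self) -> List[str]:
        """Run every probe once; return the services considered failed."""
        failed = []
        for name, probe in self.probes.items():
            try:
                healthy = bool(probe())
            except Exception:
                healthy = False             # a crashing probe counts as a failure
            # Reset the counter on success, otherwise count one more miss.
            self.failures[name] = 0 if healthy else self.failures[name] + 1
            if self.failures[name] >= self.max_consecutive_failures:
                failed.append(name)
        return failed
```

A watcher might be built with a probe such as lambda: tcp_port_open("db.example.com", 1433), where the host name and port are placeholders. Requiring several consecutive misses before declaring a failure is a design choice in this sketch, meant to keep a single slow response from triggering a failover.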
Products also use different methods to allow a standby server to assume the identity of a production server in the event of a failure. The simplest way is to assume the production server's IP address and start appropriate services. A more sophisticated approach used by NeverFail and others is to hide the standby server behind an internal firewall to prevent users from accessing it until it's called on to take on the primary server role. At the top end of the product spectrum, Marathon's EverRun runs the primary and standby servers in lockstep in a virtual environment. Each server processes all data, but users access only the primary one. The backup server waits in the wings until something goes wrong.