EMC Merges High Availability, Disaster Recovery into One

EMC is combining products and services to turn high availability and disaster recovery into a single concept it calls Continuous Availability.

David Hill

February 15, 2013


Service providers and large enterprises have a goal of delivering "24x7 forever" availability, particularly for mission-critical services and applications. EMC wants to help customers meet that goal with the concept of continuous availability (CA), which marries high availability (HA) and disaster recovery (DR). The CA approach is built around EMC's VPLEX product, as well as a new service offering to perform assessments and analyze costs.

The first step in delivering 24x7 forever availability is to provide enough extra server and storage capacity to create an HA system. The HA system is the first line of defense against problems that threaten "five nines" application availability. Different services and applications require different levels of redundancy. For an enterprise database application, servers are typically replicated 100% for redundancy. EMC estimates, though, that a Web farm needs only about 20% more servers, so only 20% redundancy is necessary.

The second step is to create a DR capability at a site geographically separate from the original data center. This typically requires 100% redundancy in both servers and storage. Note that the 100% figure holds for both enterprise databases and Web farms, because a disaster by definition takes down the whole site.

Notice that the redundancy required to fully protect availability is an extra 200% in the case of databases and 120% in the case of Web farms.
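To make the arithmetic concrete, here is a back-of-envelope sketch (in Python, purely for illustration) of the conventional capacity math using the figures above; actual redundancy targets will vary by application and organization.

```python
# Back-of-envelope sketch of the conventional HA + DR capacity math described
# above. The percentages are the figures quoted in this article; real
# redundancy targets vary by application and by organization.

def conventional_extra_capacity(ha_redundancy_pct: float,
                                dr_redundancy_pct: float = 100.0) -> float:
    """Total extra capacity (as % of baseline) for single-site HA plus a full DR site."""
    return ha_redundancy_pct + dr_redundancy_pct

# Enterprise database: ~100% HA redundancy + 100% DR site = 200% extra capacity.
print(conventional_extra_capacity(100.0))   # 200.0

# Web farm: ~20% HA redundancy + 100% DR site = 120% extra capacity.
print(conventional_extra_capacity(20.0))    # 120.0
```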

An Alternative to the Conventional Architecture

With its new Continuous Availability Advisory Services offering, EMC proposes an alternative to traditional scenarios--a merger of traditional single-site HA with dual-site DR to create a continuous availability system. In a full CA architecture, transactions from the same application are processed at both sites simultaneously, with global load balancing distributing transactions between them. Web and application farms are stretched between sites, creating active-active applications.
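To illustrate the idea, here is a minimal, vendor-neutral sketch of active-active transaction distribution; the site names and health-check stub are hypothetical, and a real deployment would rely on a global load balancer rather than application code.

```python
import itertools

# Minimal, vendor-neutral sketch of active-active transaction distribution.
# Site names and the health-check stub are hypothetical; a real deployment
# would use a global load balancer (DNS- or network-based) rather than
# application code.

SITES = ["site-a", "site-b"]

def site_is_healthy(site: str) -> bool:
    # Placeholder health check; a real GSLB probes each site's endpoints.
    return True

_round_robin = itertools.cycle(SITES)

def pick_site() -> str:
    """Send each transaction to the next healthy site (round robin)."""
    for _ in range(len(SITES)):
        site = next(_round_robin)
        if site_is_healthy(site):
            return site
    raise RuntimeError("no healthy site available")

for txn_id in range(4):
    print(txn_id, pick_site())   # alternates site-a, site-b while both are healthy
```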

At the data layer, for example, a local Oracle RAC cluster can be stretched between the sites to provide a locking mechanism over the databases. The storage layer is then connected via EMC's VPLEX, which provides a data coherency mechanism that keeps the storage arrays at the two sites in sync.

The final piece of the architecture is the use of active-active data center infrastructure components, such as a shared name space and common IP addressing, which are deployed so that applications can run seamlessly in either site. Probably the most interesting thing about EMC's approach is that the company claims the architecture can be provisioned with off-the-shelf components and most applications can be adapted without code changes.


Where an application does not fit neatly into the mold of an active-active application architecture over Oracle RAC, a near-CA architecture can be deployed in which application and database clusters run normally in one site and fail over to the other. In this near-CA architecture, the storage layer still uses VPLEX, while the applications and databases are set up in two-site HA mode. The paradigm EMC is rolling out thus allows many different combinations of CA and two-site HA modes at the Web (presentation), application, data and storage layers, providing a level of resiliency above what was previously achievable.

In this architecture, EMC argues, each of the two sites needs only about 60% of the original performance capability, for a total of 120%--that is, 20% redundancy. What magic does the company use to achieve this? EMC employs an approach it calls "fractional provisioning" of the server count. Under normal circumstances, 100% of capacity is enough by definition, and day-to-day CPU utilization typically averages somewhere in the 50% to 70% range; the capacity above that mark is headroom used during peak hours or heavy business periods. So, the logic goes, provision the average compute requirement--about 60%--at each site, for a total of 120%.
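As an illustration, the fractional-provisioning arithmetic works out as follows, using the utilization figures cited above; real sizing depends on measured peak and average utilization for each workload.

```python
# Illustrative sketch of the fractional-provisioning arithmetic described above.
# The figures are the ones cited in this article; real sizing depends on
# measured peak and average utilization for each workload.

baseline_capacity = 100.0        # capacity of a single conventional site (%)
average_utilization = 0.60       # day-to-day average in the 50%-70% range

per_site_capacity = baseline_capacity * average_utilization   # ~60% at each site
total_capacity = 2 * per_site_capacity                        # 120% across both sites
redundancy = total_capacity - baseline_capacity               # 20% extra vs. one site

print(per_site_capacity, total_capacity, redundancy)          # 60.0 120.0 20.0

# If one site is lost, the surviving site still provides the average compute
# requirement (~60%); a prolonged outage may require deferring workloads or
# adding capacity, as noted below.
```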

The 20% extra server capacity can accommodate processing needs if a few servers fail, or if demand fluctuates. If one site goes down, you immediately have the average compute available. If the outage is prolonged, then you may need to run some triage and defer workloads or add capacity, which is getting easier to do in the virtual world.

But how likely is this to happen? EMC reports that a study shows that, of all the events that affect availability (including not only natural disasters but also business mergers, acquisitions and data center relocations), less than 1% require a failover from one site to another. In other words, DR is a necessary but expensive insurance policy. By contrast, a CA approach provides the DR shield while allowing companies to put those "insurance premiums" to work elsewhere, saving on the overall IT business continuity investment.

Note that the 20% redundancy offered in CA is much less than the redundancy in a standard HA + DR combination. Because of this, EMC claims its approach offers a potential 28% to 50% reduction in overall compute costs.
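For a rough sense of where such savings come from, the capacity figures quoted in this article compare as follows; note that EMC's 28% to 50% figure is its own estimate of overall compute costs and presumably reflects a broader cost model than this simple capacity count.

```python
# Rough comparison of total deployed capacity (as a multiple of baseline),
# using only the figures quoted in this article. EMC's 28%-50% cost-reduction
# claim is its own estimate and presumably reflects a broader cost model than
# this raw capacity count.

ca_total = 1.2                    # CA: ~60% of baseline at each of two sites

conventional = {
    "enterprise database": 1.0 + 1.0 + 1.0,   # baseline + 100% HA + 100% DR
    "web farm":            1.0 + 0.2 + 1.0,   # baseline + 20% HA + 100% DR
}

for workload, total in conventional.items():
    reduction = (total - ca_total) / total
    print(f"{workload}: {total:.1f}x -> {ca_total:.1f}x ({reduction:.0%} less capacity)")
```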

The CA model requires that applications be run in parallel (or failover in near-CA versions) between two geographically separated sites. The synchronization of workloads at two distributed sites has to occur with low enough latencies (less than or equal to 5 milliseconds) that no one will be able to discern any difference in performance.
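A quick propagation-delay estimate shows why synchronous distances top out at roughly 100 km; the fiber-speed figure below is a standard approximation, and real latency budgets also include equipment and protocol overhead.

```python
# Propagation-delay estimate showing why synchronous replication distances are
# usually limited to roughly 100 km. Light in fiber travels at about
# 200,000 km/s (roughly two-thirds of c); real latency budgets also include
# switching, protocol and array overhead.

LIGHT_IN_FIBER_KM_PER_MS = 200.0   # ~200,000 km/s expressed per millisecond

def round_trip_ms(distance_km: float) -> float:
    """Fiber propagation delay (ms) for one round trip over the given distance."""
    return 2 * distance_km / LIGHT_IN_FIBER_KM_PER_MS

print(round_trip_ms(100))   # ~1.0 ms of raw propagation at 100 km,
                            # leaving headroom within a 5 ms synchronous budget
```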

EMC's VPLEX Metro enables users to have read and write access to the exact same information at the same time (for all practical purposes) in two separate locations. It can support that performance level at distances of up to roughly 100 km, and work with common third-party clustering products--notably, VMware, Oracle RAC, HACMP/PowerHA, MC/Service Guard, MSCS and VCS. In other words, IT doesn't necessarily have to change how it works to accommodate the applications in the EMC CA world.

But what about the distance factor? Typically, 100 km is not considered enough separation for true DR. Some organizations may be willing to take the risk of having a second site within the 100 km range, with the understanding that a major regional disaster (think Hurricane Katrina) may still bring down a CA architecture. Others may have to take additional steps, such as bringing in a third-party site for worst-case scenarios.

A good economic case can be made for a CA approach based on a reduced server count and power savings. But from an IT perspective, the simplification inherent in EMC's CA Services is a big benefit, because maintaining both HA and DR environments is a significant challenge. Moreover, with CA, the IT personnel at the remote site are no longer an afterthought; they are fully involved in normal operations.

Still, IT organizations will have to perform thorough evaluations to justify a move to CA.

Mesabi Musings

The consolidation of HA and DR into one CA system seems to be a logical evolutionary step. The same benefit (24x7 forever availability) at lower cost, with greater IT simplification, makes a CA approach attractive. However, it is also a major decision that IT organizations have to think through carefully. So even though the technologies are available to do the job, enterprises may still have to be convinced to go forward.

EMC is a client of David Hill and the Mesabi Group.
