Network Computing is part of the Informa Tech Division of Informa PLC

Disaster Recovery in the Public Cloud

I’ve had the opportunity to speak with many users about their plans for public cloud adoption, and these discussions frequently turn to how to avoid being affected by cloud outages. The questions come up because public cloud outages do occur, even if less frequently than in the past, and customers want to mitigate the risk of disruption.

Thankfully, every major public cloud vendor offers options for building highly available environments that can survive an outage. AWS, for example, suggests four options that leverage multiple geographic regions. These options, which have equivalents at the other public cloud vendors, come at different price points and deliver different recovery point objectives (RPO) and recovery time objectives (RTO).

Companies can choose the option that best meets their RPO/RTO requirements and budget. The key takeaway is that public cloud providers enable customers to build highly available solutions on their global infrastructure.

Let’s take a brief look at these options and review some basic principles for building highly available environments using the public cloud. I’ll use AWS for my examples, but the principles apply across all public cloud providers.

First, understand the recovery point objective (RPO) and recovery time objective (RTO) for each of your applications so you can design the right solution for each use case. Second, there's no one-size-fits-all solution for leveraging multiple geographic regions. The right approach depends on your RPO, your RTO, the cost you are willing and able to incur, and the trade-offs you are willing to make. Some of these approaches, using AWS as the example, include:

  • Recovering to another region from backups – Back up your environment, including EBS snapshots, RDS snapshots, AMIs, and regular file backups stored in S3. By default, S3 stores data redundantly only across availability zones within a single region, so you’ll need to enable cross-region replication to your DR region for data in S3; snapshots and AMIs can likewise be copied to the DR region. You’ll incur the cost of transferring and storing data in a second region but won’t incur compute, EBS, or database costs until you need to go live in your DR region. The trade-off is the time required to launch your applications.
  • Warm standby in another region – Replicate data to a second region where you’ll run a scaled-down version of your production environment. The scaled-down environment is always live and sized to run the minimal capacity needed to resume business. Use Route 53 to switch over to your DR region as needed. Scale up the environment to full capacity as needed. With this option, you get faster recovery, but incur higher costs.
  • Hot standby in another region – Replicate data to a second region where you run a full version of your production environment. The environment is always live, and invoking full DR involves switching traffic over using Route 53. You get even faster recovery, but also incur even higher costs.
  • Multi-region active/active solution – Data is synchronized between both regions and both regions are used to service requests. This is the most complex to set up and the most expensive. However, little or no downtime is suffered even when an entire region fails. While the approaches above are really DR solutions, this one is about building a true highly available solution.
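
The trade-off among these four options can be sketched as a simple selection: pick the cheapest option whose estimated RPO and RTO both meet the application's requirements. The Python below is only an illustration. The `DrOption` type, the `cheapest_option` function, and every number in `OPTIONS` are hypothetical placeholders, not figures from AWS or any other provider; substitute estimates from your own environment.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DrOption:
    """One DR approach with caller-supplied estimates (all hypothetical)."""
    name: str
    rpo_minutes: float   # estimated worst-case window of lost data
    rto_minutes: float   # estimated time to resume service
    monthly_cost: float  # estimated steady-state cost of the DR region


def cheapest_option(
    options: Sequence[DrOption],
    max_rpo_minutes: float,
    max_rto_minutes: float,
) -> Optional[DrOption]:
    """Return the lowest-cost option meeting both objectives, or None."""
    candidates = [
        o for o in options
        if o.rpo_minutes <= max_rpo_minutes and o.rto_minutes <= max_rto_minutes
    ]
    return min(candidates, key=lambda o: o.monthly_cost, default=None)


# Illustrative estimates only; the ordering mirrors the list above
# (faster recovery costs more), but the values are made up.
OPTIONS = [
    DrOption("backup and restore", rpo_minutes=24 * 60, rto_minutes=8 * 60, monthly_cost=500),
    DrOption("warm standby", rpo_minutes=15, rto_minutes=60, monthly_cost=3000),
    DrOption("hot standby", rpo_minutes=5, rto_minutes=10, monthly_cost=9000),
    DrOption("active/active", rpo_minutes=1, rto_minutes=1, monthly_cost=20000),
]

if __name__ == "__main__":
    choice = cheapest_option(OPTIONS, max_rpo_minutes=30, max_rto_minutes=120)
    print(choice.name if choice else "no option meets the objectives")
```

Running the selection per application, rather than once for the whole estate, reflects the point that each application may have its own RPO and RTO.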

One of the keys to a successful multi-region setup and DR process is to automate as much as possible. This includes backups, replication, and launching your applications. Leverage automation tools such as Ansible and Terraform to capture the state of your environment and to automate the launching of resources. Also, test repeatedly to ensure that you're able to successfully recover from an availability zone or region failure. Test not only your tools, but your processes.
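
One way to make repeated testing measurable is to record three timestamps during each DR drill and compare them against the objectives. The following Python is a minimal sketch under that assumption. `DrillResult` and `evaluate_drill` are hypothetical names introduced for illustration, and how the timestamps are collected (from backup tooling, monitoring, or a runbook) is left to your own process.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DrillResult:
    """Timestamps captured during one DR drill (hypothetical record)."""
    last_backup_at: datetime  # newest backup or replica available in the DR region
    failure_at: datetime      # moment the simulated outage began
    recovered_at: datetime    # moment service was restored in the DR region


def evaluate_drill(result: DrillResult, rpo: timedelta, rto: timedelta) -> dict:
    """Compare measured data loss and downtime against the RPO and RTO."""
    if not (result.last_backup_at <= result.failure_at <= result.recovered_at):
        raise ValueError("timestamps must be ordered: backup, failure, recovery")
    data_loss = result.failure_at - result.last_backup_at
    downtime = result.recovered_at - result.failure_at
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


if __name__ == "__main__":
    start = datetime(2018, 1, 1, 12, 0, tzinfo=timezone.utc)
    drill = DrillResult(
        last_backup_at=start - timedelta(minutes=10),
        failure_at=start,
        recovered_at=start + timedelta(minutes=45),
    )
    print(evaluate_drill(drill, rpo=timedelta(minutes=15), rto=timedelta(minutes=30)))
```

Tracking these results across drills shows whether changes to tooling or process are actually shortening recovery.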

Obviously, much more can be said on this topic. If you are interested in learning more about disaster recovery in the cloud, you can see me in person at the upcoming Interop ITX 2018 in Las Vegas, where I will present, "Saving Your Bacon with the Cloud When Your Data Center Is on Fire." 
