When running active-active data centers (where both data centers can service an application at any given time), it’s easy to rest comfortably, thinking everything is just fine. Both data centers are set to run your business critical applications and act as the front-end for certain transactions. Databases are synced. Storage is synced. Security policies are synced. All seems to be well.
But is it really? At the risk of sounding alarmist, there are some myths I’ve debunked while running active-active data centers. Here are three risky assumptions to make about the data center design:
Myth No. 1: A DNS record change is a reliable way to move an application.
Domain name record changes are the most straightforward way to move an application between data centers. Let’s say address space 100.100.1.0/24 is homed to DC-1, and 100.100.2.0/24 is homed to DC-2. Your application myapp.mycorp.com is set with a DNS A record of 100.100.1.100 when processing in DC-1, and 100.100.2.100 when processing in DC-2. Therefore, to move the front-end of the application from one DC to the other, the easiest way to do it is to update the DNS record from one IP address to the other.
This works -- well, kind of. There are a couple of gotchas to watch out for:
- Time-to-live. DNS records have a TTL value assigned when they are created. The TTL value tells caching DNS servers and querying endpoints how long the record is good for. If the TTL is one day, DNS caches will hold onto the record for up to 24 hours. Cached entries are not affected by your DNS change. Therefore, you must maintain a low TTL, say 30 seconds, if you expect to switch an application over to another DC quickly.
- Some client-side applications do not honor TTL settings. This issue shows up with HTTP clients, where a client might hold onto a DNS response until some indeterminate time that could be interpreted as “when it feels like it.”
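To make the TTL gotcha concrete, here is a minimal sketch of a toy caching resolver. It is not how any particular DNS server is implemented; the `TtlCache` class, the `zone` dictionary, and the fake clock are all hypothetical names invented for illustration. The addresses and the 30-second TTL come from the example above.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Toy caching resolver: holds each answer until its TTL expires."""

    def __init__(self, lookup: Callable[[str], Tuple[str, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup   # returns (ip_address, ttl_seconds)
        self._clock = clock     # injectable so the example can fake time
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (ip, expiry)

    def resolve(self, name: str) -> str:
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            # Still fresh: a change at the authoritative source is invisible.
            return cached[0]
        ip, ttl = self._lookup(name)
        self._cache[name] = (ip, now + ttl)
        return ip


# Hypothetical authoritative data: name -> (A record, TTL in seconds).
zone = {"myapp.mycorp.com": ("100.100.1.100", 30.0)}
fake_now = [0.0]  # mutable cell standing in for a clock

resolver = TtlCache(lambda name: zone[name], clock=lambda: fake_now[0])

before = resolver.resolve("myapp.mycorp.com")        # answer cached at t=0
zone["myapp.mycorp.com"] = ("100.100.2.100", 30.0)   # "move" the app to DC-2
fake_now[0] = 10.0
during = resolver.resolve("myapp.mycorp.com")        # TTL has not expired yet
fake_now[0] = 31.0
after = resolver.resolve("myapp.mycorp.com")         # TTL expired, re-queried
print(before, during, after)
```

A client that ignores TTLs behaves like this cache with an effectively unbounded expiry, which is why a low TTL alone does not guarantee a quick move.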
Myth No. 2: Your failover will work.
Applications that live in multiple data centers are complicated beasts. In some ways, they are small-scale examples of distributed computing, with many of the caveats and concerns related to data synchronization. This isn’t a problem when the application is initially set up. Everyone in IT is on their best behavior to make sure firewall policy functions well, routing behaves as expected, and data synchronization is both functional and timely. Failover is tested and even works.
Inevitably, time goes by, and as it does, infrastructure changes. Applications get new features. Perhaps ADCs get upgraded with fancy new clustering. A new inter-data center network link is installed. And for whatever reason, testing application failover after each seemingly straightforward change gets overlooked. It's too hard to schedule, there are too many people involved in the change control process, or no one thinks such a simple change could break failover.
The reality is that unless you’re testing your failover regularly, your active-active data center probably isn’t active-active at all. Your failover might just fail.
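One way to keep failover testing from slipping is to treat "has failover been tested since the last change?" as something you can compute. The sketch below is only an illustration: the function name, the idea of feeding it change dates from a change-control log, and the 90-day default are all assumptions, not a recommendation from any standard.

```python
from datetime import date, timedelta
from typing import Iterable


def failover_test_is_stale(last_test: date,
                           changes: Iterable[date],
                           today: date,
                           max_age: timedelta = timedelta(days=90)) -> bool:
    """Return True if failover should be retested.

    The last test counts as stale when any infrastructure change landed
    on or after the test date, or when the test is older than max_age
    (an assumed policy value).
    """
    changed_since_test = any(change >= last_test for change in changes)
    too_old = today - last_test > max_age
    return changed_since_test or too_old


# Hypothetical dates: failover was tested in January, an ADC upgrade
# went in during February, and nobody has retested since.
needs_retest = failover_test_is_stale(
    last_test=date(2024, 1, 15),
    changes=[date(2023, 12, 1), date(2024, 2, 20)],
    today=date(2024, 3, 1),
)
print(needs_retest)
```

Counting a same-day change as "after the test" is a deliberately conservative choice; it errs toward retesting.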
Myth No. 3: Using 80%+ of computing capacity is efficient.
Organizations with very high application availability requirements rely on a strategy of disaster avoidance as opposed to disaster recovery. A DR strategy may still be important for an organization, but in the scenario of active/active DCs, one of the goals is to avoid significant downtime in the face of a disaster. In other words, if DC-1 is no longer able to process application transactions, then DC-2 must be able to pick up the entirety of the load.
This has an important implication for those monitoring capacity. Let’s say both DCs are running a mix of applications, and are running at a reasonably efficient 80% of utilization. That utilization could be of any particular resource (CPU, memory, storage I/O, network capacity, etc.) or of some aggregate metric intended to track overall system capacity and utilization.
What happens when a single DC is called upon to do the work that two DCs used to do? If both were running at 80% and are similarly sized, the survivor would need to carry roughly 160% of its capacity. Eighty percent utilization leaves no headroom to avoid a disaster. Therefore, in the case of active/active DCs, it might not be possible to meet business goals of uptime and application availability while at the same time maximizing the utilization of all the shiny metal and costly cables -- something to ponder.
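The arithmetic behind that headroom problem can be sketched in a few lines. This assumes identically sized data centers and that all of a failed DC's load lands on the survivors, which is a simplification; the function names are made up for illustration.

```python
def load_after_failover(util_dc1: float, util_dc2: float) -> float:
    """Utilization of the surviving DC once it absorbs the other's load.

    Assumes both DCs have equal capacity; utilizations are fractions
    (0.8 means 80%). A result above 1.0 means the survivor is overloaded.
    """
    return util_dc1 + util_dc2


def max_safe_utilization(dc_count: int) -> float:
    """Highest per-DC utilization that still fits after losing one DC.

    Assumes identical DCs and that the failed DC's load is spread
    evenly across the remaining ones.
    """
    if dc_count < 2:
        raise ValueError("need at least two data centers to fail over")
    return (dc_count - 1) / dc_count


survivor = load_after_failover(0.8, 0.8)   # both DCs at 80%
ceiling = max_safe_utilization(2)          # two-DC active/active pair
print(f"survivor load: {survivor:.0%}, safe ceiling per DC: {ceiling:.0%}")
```

Under these assumptions, a two-DC pair cannot safely run either site above half of its capacity, which is the tension with "efficient" utilization described above.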
I will discuss these and several other points in my “school of hard knocks” presentation, “Lessons Learned Operating Active/Active Data Centers,” at Interop Las Vegas. I’ll talk about these points in more detail, as well as share practical thoughts on Internet-facing BGP, web proxies, stateful firewalls, data synchronization, ADC monitoring, managing network latency, elephant vs. mice flows, and more.
Register now for Interop, April 27 to May 1, and receive $200 off.