When running active-active data centers (where both data centers can service an application at any given time), it’s easy to rest comfortably, thinking everything is just fine. Both data centers are set up to run your business-critical applications and act as the front end for certain transactions. Databases are synced. Storage is synced. Security policies are synced. All seems to be well.
But is it really? At the risk of sounding alarmist, there are some myths I’ve debunked while running active-active data centers. Here are three risky assumptions about this design:
Myth No. 1: A DNS record change is a reliable way to move an application.
Domain name record changes are the most straightforward way to move an application between data centers. Let’s say address space 100.100.1.0/24 is homed to DC-1, and 100.100.2.0/24 is homed to DC-2. Your application myapp.mycorp.com is set with a DNS A record of 100.100.1.100 when processing in DC-1, and 100.100.2.100 when processing in DC-2. Therefore, to move the front-end of the application from one DC to the other, the easiest way to do it is to update the DNS record from one IP address to the other.
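As a rough illustration, here is what that cutover might look like as an RFC 2136 dynamic update using Python’s dnspython library. This is a minimal sketch, not a recommendation: the DNS server address is hypothetical, and it assumes your authoritative server accepts dynamic updates (a production setup would typically also require TSIG authentication).

```python
import dns.query
import dns.rcode
import dns.update

# Hypothetical values -- substitute your own zone, record, and server.
ZONE = "mycorp.com"
RECORD = "myapp"            # myapp.mycorp.com
DC2_ADDRESS = "100.100.2.100"
DNS_SERVER = "10.0.0.53"    # authoritative server that accepts dynamic updates

# Build an update that swaps the A record over to the DC-2 address.
update = dns.update.Update(ZONE)
update.replace(RECORD, 30, "A", DC2_ADDRESS)  # 30-second TTL (see the gotchas below)

# Send the update over TCP and check the response code.
response = dns.query.tcp(update, DNS_SERVER, timeout=10)
print(f"Update response: {dns.rcode.to_text(response.rcode())}")
```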
This works -- well, kind of. There are a couple of gotchas to watch out for:
- Time-to-live. DNS records have a TTL value assigned when they are created. The TTL tells caching DNS servers and querying endpoints how long the record is good for. If the TTL is one day, DNS caches will hold onto the record for up to 24 hours, and cached entries are not affected by your DNS change. Therefore, you must maintain a low TTL, say 30 seconds, if you expect to switch the application over to another DC quickly (see the resolver sketch after this list).
- Some client-side applications do not honor TTL settings. This issue shows up with HTTP clients, where a client might hold onto a DNS response until some indeterminate time that could be interpreted as “when it feels like it.”
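One way to sanity-check the first gotcha is to ask the caching resolver your clients actually use what TTL it is handing out for your record. A minimal sketch with dnspython; the resolver address is hypothetical, and the hostname is carried over from the example above:

```python
import dns.resolver

# Point at the caching resolver your clients actually use (address is hypothetical).
resolver = dns.resolver.Resolver()
resolver.nameservers = ["10.0.0.2"]

answer = resolver.resolve("myapp.mycorp.com", "A")
for rdata in answer:
    # rrset.ttl is the remaining TTL from the cache's point of view;
    # if it reads as hours when you configured 30 seconds, a cache is ignoring you.
    print(f"{rdata.address}  TTL remaining: {answer.rrset.ttl}s")
```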
Myth No. 2: Your failover will work.
Applications that live in multiple data centers are complicated beasts. In some ways, they are small-scale examples of distributed computing, with many of the caveats and concerns related to data synchronization. This isn’t a problem when the application is initially set up. Everyone in IT is on their best behavior to make sure firewall policy functions well, routing behaves as expected, and data synchronization is both functional and timely. Failover is tested and even works.
Inevitably, time goes by, and as it does, infrastructure changes. Applications get new features. Perhaps ADCs get upgraded with fancy new clustering. A new inter-data center network link is installed. And for whatever reason, testing application failover after each seemingly straightforward change gets overlooked: it's too hard to schedule, there are too many people involved in the change control process, or no one believes such a simple change could break failover.
The reality is that unless you’re testing your failover regularly, your active-active data center probably isn’t active-active at all. Your failover might just fail.
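One low-effort way to keep yourself honest between full failover tests is to exercise both front doors continuously. Here is a minimal sketch using Python's requests library, assuming each DC exposes the application on its own VIP; the health-check path and both addresses are hypothetical:

```python
import requests

# Hypothetical per-DC entry points for the same application.
DC_ENDPOINTS = {
    "DC-1": "https://100.100.1.100/healthz",
    "DC-2": "https://100.100.2.100/healthz",
}

def check_both_sides() -> dict:
    """Hit the application through each data center's front door."""
    results = {}
    for dc, url in DC_ENDPOINTS.items():
        try:
            # Host header so the VIP/ADC routes to the right application.
            # verify=False because connecting by IP breaks certificate name checks;
            # a real monitor should pin or validate the cert properly.
            resp = requests.get(url, headers={"Host": "myapp.mycorp.com"},
                                timeout=5, verify=False)
            results[dc] = resp.status_code == 200
        except requests.RequestException:
            results[dc] = False
    return results

if __name__ == "__main__":
    for dc, healthy in check_both_sides().items():
        print(f"{dc}: {'OK' if healthy else 'FAILING -- your failover may be broken'}")
```

Run from a scheduler every few minutes, a probe like this catches the quiet breakage (a firewall rule, a routing change) long before a disaster forces the question.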
Myth No. 3: Using 80%+ of computing capacity is efficient.
Organizations with very high application availability requirements rely on a strategy of disaster avoidance as opposed to disaster recovery. A DR strategy may still be important for an organization, but in the scenario of active/active DCs, one of the goals is to avoid significant downtime in the face of a disaster. In other words, if DC-1 is no longer able to process application transactions, then DC-2 must be able to pick up the entirety of the load.
This has an important implication for those monitoring capacity. Let’s say both DCs are running a mix of applications, and are running at a reasonably efficient 80% utilization. That utilization could be of any particular resource (CPU, memory, storage I/O, network capacity, etc.) or of some aggregate metric intended to track overall system capacity and utilization.
What happens when a single DC is called upon to do the work that two DCs used to do? Eighty percent utilization leaves little headroom to avoid a disaster. Therefore, in the case of active/active DCs, it might not be possible to meet business goals of uptime and application availability while at the same time maximizing the utilization of all the shiny metal and costly cables -- something to ponder.
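The arithmetic is worth making explicit: with two DCs each at 80%, the survivor would be asked to absorb 160% of its own capacity. In general, to survive the loss of a site, per-DC utilization has to stay at or below (N - 1)/N of capacity. A quick sketch of that math:

```python
def max_safe_utilization(total_dcs: int, dcs_lost: int = 1) -> float:
    """Highest per-DC utilization that still fits on the surviving DCs."""
    return (total_dcs - dcs_lost) / total_dcs

# Two active-active DCs: each must stay at or below 50% to survive losing one.
print(f"2 DCs: {max_safe_utilization(2):.0%} max per-DC utilization")

# Running both at 80% means the survivor would need 160% of its capacity.
load_on_survivor = 2 * 0.80
print(f"Load on survivor at 80% each: {load_on_survivor:.0%}")
```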
I’ll discuss these and several other points in more detail in my “school of hard knocks” presentation, “Lessons Learned Operating Active/Active Data Centers,” at Interop Las Vegas, and share practical thoughts on Internet-facing BGP, web proxies, stateful firewalls, data synchronization, ADC monitoring, managing network latency, elephant vs. mice flows, and more.
Register now for Interop, April 27 to May 1, and receive $200 off.


Comments
Network Computing
Wed, 03/11/2015 - 11:38
Enterprises have a lot of choices today when it comes to how to host their applications. They can go for a fully cloud approach (which breaks down further into several options), a hybrid cloud approach (where they can either use the cloud for failover like in active-active or simply serve different applications in their own data center vs the cloud), and of course for some companies a more traditional approach still works just fine. In that case, they still have tons of decisions to make about which hardware, from which vendors, in what configuration, and what to virtualize. Phew. So, talking about active-active, what you're hinting at here seems to be this: before you even decide to go with it in the first place, take it back to square one and ask yourself not only if you're really ready for these hurdles, but if you can maximize its specific benefits -- which sound like redundancy & low downtime.
Breaking these myths down actually leads to a mix of quick fixes (changing the DNS TTL settings), general enterprise complacency problems (not testing and updating anything after initial implementation), and larger Active-Active-specific problems (capacity vs uptime vs utilization) that get at the core of how the technology works. All that sounds like a fair bit of effort to get the desired result when there are plenty of easier options out there (again, cloud), so the lesson seems to be 'don't try it unless you're going to do it right'. To that end, is there anyone you think is in the best position to take advantage of active-active? Are there certain sectors where you see it being common (financial makes sense to me), or certain kinds of applications that you think are well-suited to this kind of environment? Conversely, who's in the need-not-apply camp?
Network Computing
Thu, 03/12/2015 - 10:47
Financials, absolutely. Any financial services group likes to host its own data, and/or might be obligated to due to regulatory concerns. But really, any organization *can* take advantage of active/active if they like. The constructs required are reasonably well known. Many application vendors can assist with reference models for how to build the network, storage, and database infrastructure. But then the question becomes one of in-house expertise. Do you have the know-how to cope with the additional complexity of an active/active environment? My point being, active/active is a commitment beyond the initial build. You have to keep up with it to make sure the benefits remain constant.
The "need not apply" folks might be those who are (a) unwilling to make the investment or (b) are unwilling to hire the in-house expertise. Perhaps (c) might be those organizations with dysfunctional IT teams. They'll likely struggle to be successful with A/A deployments. Internal IT teams with their different areas of expertise need to work closely together to successfully build & deploy applications for A/A. When there are problems, these same folks need to be able to work closely together to resolve the issue. With a culture of blame or highly siloed mentality, A/A problems can be particularly sticky to resolve. Only with a culture of trust and mutual respect do A/A problems get sorted out in a timely fashion.