A Near Miss and a Total Loss: Lessons from 2021 in Data Center Resiliency
Companies need to operate production systems and backup systems in different regions to ensure a power outage in one region won’t interrupt operations indefinitely.
May 14, 2021
In the past few months, we’ve seen two disasters jeopardize the stability of data centers, with very different outcomes. The February ice storm in Texas and the March 10 fire at OVHcloud’s facilities in Strasbourg were unrelated, unprecedented events, yet both caused operational disruptions whose effects extended well beyond their regions.
For many of us in the data center and enterprise IT industries, these events have been a wake-up call and a stark reminder that simply having a disaster recovery (DR) plan is not enough. Having a robust plan that meets your business requirements when a real disaster strikes is just the first step; you must also test it regularly. A disaster recovery plan without regular testing and updates as your business evolves is no better than having no plan at all. It's just that simple.
How an ice storm in Texas highlighted the importance of geographic diversity
In mid-February, Texas saw an unusual cold snap turn into a devastating ice storm that took out critical utilities across the state. The initial storm hit on February 13, 2021, but the damage and repairs are ongoing. Widespread power outages and shutdowns of water services could have easily spelled disaster for the state’s data centers. As bad as the storm was and as tenuous things got, however, most data centers made it out okay.
How? While some data centers were getting close to running out of fuel by the time power was restored, many shared fuel with competitors to maintain uptime and avert disaster. For everyone who relies on Texas-based data center facilities to serve their applications, data, content, and customers, it was an eye-opener.
There has been wide speculation that the massive Texas power outage could have been mitigated if the state had invested more aggressively in winterizing critical infrastructure, if the power grid had been more evenly distributed across the state, and if known, trusted disaster protocols had been in place. This points to a common concern facing the data center industry: it's crucial to have comprehensive processes for environmental challenges, as well as geographic diversity in the data centers you rely on, lest a regional storm take your entire IT infrastructure down in a matter of seconds.
What the OVH fire teaches us about DR planning
Nearly one month after the storm hit Texas, on March 10, 2021, a fire engulfed OVHcloud's SBG2 data center in Strasbourg. The rapid destruction of entire buildings at a center run by a trusted data center provider surprised industry professionals around the world, as did the response we saw from their leadership in real time on Twitter.
The service provider's founder, Octave Klaba, tweeted frankly about the extent of the damage, writing, "We recommend [you] activate your Disaster Recovery Plan."
Too many clients found that their DR plans did not take into account the possibility that a fire could take down multiple buildings at the same facility. Many of those whose DR plan was based on a copy of their data located on a different server on the same campus found themselves without backups.
If there are two takeaways from this incident, they should be:
Disasters can happen anywhere and to anyone. Not if, but when.
If your DR plan is limited to keeping a backup copy of your data on another system in the same location, the only thing it achieves is satisfying your auditors.
A true business continuity and disaster recovery (BCDR) plan requires time, effort, geographic diversity, and investment, and it's not worth the risk to skirt any of these requirements.
What to learn from recent data center disasters
In light of these recent disasters, it may be best to think of the buildings that house our data centers as big pieces of equipment: they can fail, and they are susceptible to risk just like any other computing system.
So, how can companies that rely on data centers, cloud storage, and cloud computing move forward and be better prepared for the next disaster?
Regularly test your DR plan. Having a DR plan isn’t enough; you must test it regularly and run true failovers to uncover its weaknesses (see the drill sketch after these recommendations).
Do not assume that having a copy of your data is enough. It’s easy to say that you can handle a week or two of downtime before a disaster strikes, but your tolerance for that lag will go out the window when a real event occurs. The other issue with copy-based plans is that they often fail to account for how many external connections you have to data processors and intermingled applications. Instead of asking, “Do you have a backup copy of your data?” ask, “How will you fail back and get your data into the production environment with the least impact on your operations?”
Prepare for the last disaster (at a minimum). The weather event we just saw in Texas is not likely to be unique. This may sound dramatic, but if you consider how close many Texas data centers came to shutting down during that storm for lack of fuel, you’ll see that, at minimum, your DR plan updates must ensure you could survive that same storm again.
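To make that testing advice concrete, here is a minimal sketch of what a scheduled failover drill could look like. Everything in it is hypothetical: the health-check URLs, region names, and the promote_standby and repoint_traffic placeholders stand in for whatever monitoring, replication, and DNS or load-balancer tooling your environment actually uses. It simply mirrors the steps a real drill should exercise: verify the standby, promote it, move traffic, verify again, and fail back.

```python
#!/usr/bin/env python3
"""Minimal DR failover drill sketch (illustrative only).

The URLs, region names, and the promote/repoint helpers below are
hypothetical placeholders -- swap in your own monitoring, replication,
and traffic-management tooling.
"""
import sys
import urllib.request

PRIMARY_HEALTH = "https://app.primary-region.example.com/healthz"  # hypothetical
STANDBY_HEALTH = "https://app.standby-region.example.com/healthz"  # hypothetical


def is_healthy(url: str, timeout: int = 5) -> bool:
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


def promote_standby() -> None:
    """Placeholder: promote replicas / restore backups in the standby region."""
    # Hypothetical step -- replace with your replication or backup tooling.
    print("DRY RUN: promoting standby-region data stores")


def repoint_traffic(target: str) -> None:
    """Placeholder: update DNS or load-balancer config to route traffic."""
    # Hypothetical step -- replace with your DNS / traffic-management tooling.
    print(f"DRY RUN: repointing traffic to {target}")


def run_drill() -> int:
    # 1. Confirm the standby region is actually able to take over.
    if not is_healthy(STANDBY_HEALTH):
        print("FAIL: standby region is not healthy; fix this before a real disaster")
        return 1

    # 2. Practice the real steps: promote data, then move traffic.
    promote_standby()
    repoint_traffic("standby")

    # 3. Verify the business actually runs from the standby region.
    if not is_healthy(STANDBY_HEALTH):
        print("FAIL: traffic moved but the standby stack does not serve requests")
        return 1

    # 4. Fail back -- a drill that never returns to production proves little.
    repoint_traffic("primary")
    print("OK: drill completed; record timings against your RTO/RPO targets")
    return 0


if __name__ == "__main__":
    sys.exit(run_drill())
```

Run on a schedule, a script like this treats any non-zero exit as a DR-plan defect to be fixed, and the recorded timings become the evidence that your recovery objectives are actually achievable.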
Fires, floods, and ice storms will continue to increase in frequency and severity around the globe. Any organization that wants to maintain uptime must prepare for that future and do true failover tests before the next weather-induced disaster happens.
Data center geodiversity is the key to 100% uptime
To guarantee 100% uptime, we recommend true geodiversity. Your production systems and your backup systems should reside in different regions, so that a power outage in one state (or a facility loss in another) won’t interrupt your entire operation indefinitely.
Ensuring that your data is stored in data centers that rely on different branches of the power grid, or in areas subject to distinct disaster profiles, may sound radical, but when the next major disaster hits, this is the kind of thinking that ensures you’re prepared to weather whatever happens.
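One way to keep those placement rules honest is to encode them as a check in your regular infrastructure reviews. The sketch below is purely illustrative: the site inventory, region names, and grid/disaster-profile labels are assumptions standing in for whatever asset data you actually track.

```python
#!/usr/bin/env python3
"""Geodiversity sanity check sketch (illustrative only).

The inventory and the "grid" / "disaster_profile" labels are
hypothetical -- populate them from your own asset records.
"""

# Hypothetical inventory of where production and backup workloads live.
SITES = {
    "production": {"region": "us-south", "grid": "ERCOT", "disaster_profile": "ice-storm"},
    "backup":     {"region": "us-south", "grid": "ERCOT", "disaster_profile": "ice-storm"},
}


def geodiversity_issues(sites: dict) -> list[str]:
    """Flag backup placements that share fate with production."""
    prod, backup = sites["production"], sites["backup"]
    issues = []
    if prod["region"] == backup["region"]:
        issues.append("backup is in the same region as production")
    if prod["grid"] == backup["grid"]:
        issues.append("backup depends on the same power grid as production")
    if prod["disaster_profile"] == backup["disaster_profile"]:
        issues.append("backup shares production's disaster profile")
    return issues


if __name__ == "__main__":
    problems = geodiversity_issues(SITES)
    if problems:
        print("Geodiversity gaps found:")
        for p in problems:
            print(f" - {p}")
    else:
        print("Production and backup sites are geographically diverse.")
```

The example inventory deliberately fails all three checks, which is exactly the single-region, single-grid posture that left so many organizations exposed in Texas and Strasbourg.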
Tom Kiblin is the vice president of cloud and managed services at ServerCentral Turing Group (SCTG).