News

04:23 PM

Google Vs. Zombies -- And Worse

Take an inside look at how Google prepares for catastrophes, from flesh-eating zombies to earthquakes. This disaster prep team keeps it real and keeps it interesting.

After the zombies took over Google's data center, the heroic action of a few selfless individuals saved the day. Never underestimate what a site reliability engineer can do with an axe.

"If you look at zombies in the data center, they're after the people," explained Kripa Krishnan, technical program manager at Google. "So it becomes less of a machine's problem and becomes more of a people problem..."

The zombie invasion occurred back in 2007. It was one of the first Disaster Recovery Testing (DiRT) events created to evaluate Google's operational resilience in a crisis. This was before the Centers for Disease Control and Prevention began warning about zombies because storms, pandemics and earthquakes don't get people's attention anymore.

Although heroism has played a central role in saving Google more than once -- another scenario involved an executive wielding the teleportation gun from Valve's Portal -- it can't be relied on when disaster strikes, any more than an IT system or business process can in a crisis. Google as a company promotes the perception that its employees are exceptionally talented. But when it comes to preparing for the worst, the company can't simply assume that exceptional skills will save the day.


"We find that people are people, and they burn out if they work insane hours and long shifts," said Krishnan. "Heroic tactics are not a sustainable model if you're in a disaster."

The DiRT program was created seven years ago and Krishnan began managing DiRT events a year after that. Genial and sharp, with a penchant for using the word "goodness" to emphasize a point, her background recalls the famously overachieving Buckaroo Banzai, depicted in the 1984 movie that bears his name as a neurosurgeon, physicist, rock musician and test pilot.

Hyperbole perhaps, but it's a necessary element in a story about heroism. Krishnan was studying medicine over a decade ago when her interests took her to music and theater. Three years in, she decided to study performance arts, and eventually came to the U.S. to focus on theater. Then a professor convinced her to take a computer science course. Having left science for the arts, Krishnan finally emerged from graduate school with a degree in Management Information Systems. Thereafter, she became involved with telemedicine networking in Kosovo and later landed at Google.

Now her job is to break things, as Krishnan explained in an interview at Google's Mountain View, Calif., headquarters.

"Sometimes we will bring in someone to write something that will cause a failure in some underneath layer and it will manifest itself as cascading failures in some front-end facing product," Krishnan said. Other times, she said, her team might direct someone to introduce corrupt data into a system, to see how long it takes to find the problem.

DiRT is an annual exercise. Although various Google product groups conduct their own internal stress tests, DiRT's scope is companywide. DiRT scenarios challenge both technical infrastructure and organizational dynamics. Initially, the tests were restricted to user-facing systems, but they have been expanded to cover the full range of Google operations. Beyond data centers, DiRT testing might include systems used by facilities, finance, human resources and security, among other business groups. More recently, as the company's enterprise business has become more successful, customer support systems were added to the tests.

DiRT exercises require the work of hundreds of engineering and operations employees for several days, which means they're not inexpensive to run. They can affect live systems and have even resulted in revenue loss. But the price is deemed to be worth it.

Sanjay Jain, associate industry professor in the department of decision sciences at George Washington University, said in an email that the apparent increase in manmade and natural disasters around the globe demands more active continuity planning.

"Recently, companies have had to face major issues due to disasters including the loss of operations in New York and New Jersey area following Hurricane Sandy a few months ago, and the major impact on supply chains following the tsunami in Japan in 2011," he said. "Companies need to be more thorough in planning for safety of their personnel and maintaining business continuity in face of such eventualities. Such efforts have to go beyond duplicating data servers (that is of course needed) to employing live and computer simulations of potential disaster scenarios and their impact on companies' personnel, operations, and assets, and testing of measures to eliminate or substantially reduce the negative impacts."

In case of emergency, Google has a war room. DiRT tests are run from a simulated war room, which can be one of the company's many conference rooms.

Comments
Cara Latham,
User Rank: Apprentice
3/13/2013 | 1:09:22 PM
re: Google Vs. Zombies -- And Worse
I think it is great that Google runs such well-organized tests and constantly makes adjustments to its plans. In times of disaster, it is still good to know that if you still have the ability to reach the Internet, you could possibly search for life-saving, up-to-date information. I think other companies should be taking a page out of Google's book in their own plans. However, for them, they should also be addressing how to communicate with employees who may not have power and cannot access the Internet for work or communication. In those cases, not even Google is useful. A well-tested and thorough planning system modeled after Google's approach, and plans for these scenarios, would be helpful, though.
FritzNelson,
User Rank: Apprentice
3/12/2013 | 11:28:23 PM
re: Google Vs. Zombies -- And Worse
I hate to say it because it sounds at this point a bit like a cliche, but there's so much to learn from Google. I wonder how many organizations take this sort of thing as seriously as Google does.
Deirdre Blake,
User Rank: Apprentice
3/12/2013 | 8:39:12 PM
re: Google Vs. Zombies -- And Worse
"We find that people are people, and they burn out if they work insane hours and long shifts" Words of wisdom, especially in this industry.