How Does Google Prepare for Disaster? They Create Their Own


Google is a juggernaut. It’s tough to imagine that in a mere fourteen years Google rocketed from a humble search engine into a leader in information technology. We talk a lot about big data, data storage, data centers, data, data, data, and Google has some seriously big data. Millions rely on Gmail for home and business email or YouTube for pleasure every day, and it takes massive data centers with hundreds of thousands of servers to run Google search and its various services.

With so many things to plan for, data spread across the world, and millions of users to serve, what’s Google’s approach to disaster recovery? They attack themselves.

According to a recent wired.com article, Google employs a team of Site Reliability Engineers (SREs) whose main focus is to keep Google search and other services running. The team, which wears super-cool leather jackets with military-inspired patches, runs a simulated war on Google’s infrastructure that they call DiRT (Disaster Recovery Testing). This “war” involves everything from causing leaks in water pipes to staging protests to attempting to steal disks from the servers: whatever it takes to bring down the infrastructure. The data center attacks aren’t real, but they are hard to distinguish from an actual event, even though the SRE team has a little fun by attributing each one to a fictional cause like a zombie, alien, or supernatural attack.

Each annual attack is headed by an engineer named Kripa Krishnan. Before the attack begins, Krishnan tells the SRE team not to fix anything and reminds them that the people on the job in the data center don’t realize the team is there. Once the attack begins, the team monitors Google’s incident managers and measures their response times and their ability to handle the issues, a neat and fun way to test a disaster recovery (DR) plan.

Incident managers don’t realize these issues aren’t real and must handle them as though they are actually happening, sometimes even dealing with actual service failures. If the incident managers in charge of a particular site can’t stop the SRE team’s attack, however, the team can abort the attack before real users are affected. As Krishnan explains, the team has “become braver in how much we’re willing to disrupt in order to make sure everything works.”

Krishnan explains that her role is “to come up with big tests that really expose weaknesses.” From the information gained in a fake attack, the team learns what is working and what needs improvement. Google realizes the importance of a disaster recovery plan, and it tests its plan regularly in a realistic but fun way to expose weaknesses and analyze the plan’s effectiveness in real-life scenarios.

Casey Morgan

Casey Morgan is the marketing content specialist at StorageCraft. U of U graduate and lover of words, his experience lies in construction and writing, but his approach to both is the same: start with a firm foundation, build a quality structure, and then throw in some style. If he’s not arguing about comma usage or reading, you'll likely find him and his Labrador hiking, biking, or playing outdoors -- he's even known to strum a few chords by the campfire.
