The Recent Amazon Outage

July 10, 2012 by David Krider

Amazon’s Elastic Compute Cloud recently suffered an outage. In summary, Amazon’s explanation was that power to the facility failed, their generators failed to provide “stable” power, and the energy stored in their UPSs was exhausted before power could be restored. It took only about 20 minutes to fix the generators and get power back online, but as anyone who has supported large distributed systems knows, it’s not that easy. Getting services back online involves more than just restoring power, and the recovery made for serious interruptions at several popular web sites, such as Pinterest and Instagram. Netflix, for instance, was down for three hours.

The Hacker News community discussed the event at length. There are a couple of takeaways that I would like to point out.

First, it’s not common for a data center, even a large one, to have fully redundant power equipment. Data Cave does. We have two power feeds and two generators, all of which feed into electrical switchgear that can choose either utility power or generator power for either the “A” or “B” side. In addition, both sides are sized to run the entire demand alone, should the need arise, and the switchgear can feed both sides from any one of the four sources. Each side then feeds dedicated flywheel UPS systems, which in turn supply the A/B PDUs, which break power down by rack through breakers.

The main weakness in this setup is customer equipment. Most “enterprise”-level equipment has redundant power supplies, but they are not always both connected. It’s also critical that the load is split evenly across the A and B sides of power, and that no more than 50% of the breaker capacity is used on either side. That way, if one side fails and the load falls fully to the other, the surviving breaker doesn’t trip and cut power entirely. At Data Cave, we help monitor this situation for our customers.
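To make the 50% rule concrete, here is a minimal sketch in Python. It is only an illustration of the arithmetic, not Data Cave’s actual monitoring; the names `RackFeed`, `within_half_capacity`, and `failover_load`, and all of the amperage figures, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RackFeed:
    """One side (A or B) of a rack's power. All values are hypothetical, in amps."""
    load_amps: float      # measured load currently drawn on this side
    breaker_amps: float   # rating of the breaker protecting this side


def within_half_capacity(a: RackFeed, b: RackFeed, max_fraction: float = 0.5) -> bool:
    """Return True if both sides draw no more than max_fraction of their breaker rating."""
    return (a.load_amps <= max_fraction * a.breaker_amps
            and b.load_amps <= max_fraction * b.breaker_amps)


def failover_load(a: RackFeed, b: RackFeed) -> float:
    """Load the surviving side must carry if the other side is lost."""
    return a.load_amps + b.load_amps


# Illustrative racks on 20 A breakers (made-up numbers).
balanced = (RackFeed(8.0, 20.0), RackFeed(9.0, 20.0))
overloaded = (RackFeed(12.0, 20.0), RackFeed(10.0, 20.0))

for name, (side_a, side_b) in (("balanced", balanced), ("overloaded", overloaded)):
    print(name, within_half_capacity(side_a, side_b), failover_load(side_a, side_b))
```

In the second rack, side A draws 12 A on a 20 A breaker. If either side fails, the survivor would have to carry 22 A and its breaker would trip, which is exactly the situation the 50% limit is meant to prevent.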

Second, it’s not common for data centers, even large ones, to have people on staff who really understand generators. Data Cave does. Our affiliate company, located right next door, is a world leader in high-end, large-displacement diesel engine testing. As such, we have at our disposal experts in both large diesel engines and the large electrical motors they drive. If something goes wrong with either one, someone can fix it immediately. We don’t need to call a service company to come fix “their” equipment; these are not “black boxes” to us.

Data Cave was designed from the ground up to be a world-class data center, and the implementation of our systems would make a textbook of best practices. Our staff is filled with subject matter experts in everything related to building and running a facility like this. If you’re in the market for a colocation or disaster recovery site, you owe it to yourself to come see the features of Data Cave in person.
