Amazon Web Services had an outage last week that again got the media into a frenzy. There is some good analysis at Hacker News.
Rather than point fingers, I wanted to spend a minute writing about the particular failure mode that AWS experienced. From Amazon's summary of the incident:

"Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit."
The overall gist here is that a primary generator stopped due to overheating, and when the secondary generator had to carry the load, a breaker tripped, causing a total loss of power. That breaker trip, in particular, is what I want to go into more detail on.
From the time the power enters the data center facility until it gets to the server, it travels through a number of circuit breakers. Entering the facility, it likely passes through a large main breaker, much larger than anything you'd have in your home.
From there it may be distributed to subpanels, with progressively smaller breakers, until it gets to, say, a standard 20 or 30 amp breaker that connects to your equipment rack. These breakers each have one main purpose: protect equipment and people. If too much current attempts to flow, the breaker is designed to trip. Plain and simple.
A big issue, however, comes with what's known as "breaker coordination." If you have a chain of circuit breakers, the idea is that you want the one closest to the fault to be the one that trips. If I accidentally overload my computer rack, then we want the circuit breaker closest to my rack, the 20 or 30 amp breaker noted above, to trip. We most certainly DON'T want the main circuit breaker to the building to trip. In fact, for a small overload it likely couldn't: it's rated for far more current than the 20 or 30 amp breaker in my cabinet, so it would never trip, even though there was a problem.
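To make the coordination idea concrete, here is a minimal sketch. The chain and the amp ratings are hypothetical, not AWS's actual setup; the point is simply that every breaker in a series path sees the same current, and only the ones whose ratings are exceeded are candidates to trip.

```python
# Hypothetical chain of series breakers, from the utility entrance down
# to the rack; the amp ratings are illustrative, not AWS's actual values.
BREAKER_CHAIN = [
    ("main", 4000),
    ("subpanel", 800),
    ("rack", 30),
]

def breakers_exceeded(fault_current_amps):
    """Every breaker in a series path sees the same fault current; return
    the names of all breakers whose rating that current exceeds."""
    return [name for name, rating in BREAKER_CHAIN
            if fault_current_amps > rating]

# A 50 A overload at the rack exceeds only the 30 A rack breaker, so the
# breaker closest to the fault opens and nothing upstream even notices:
print(breakers_exceeded(50))    # ['rack']

# A 5000 A fault exceeds every rating; now the breakers' trip-time
# curves, not just their ratings, decide which one opens first:
print(breakers_exceeded(5000))  # ['main', 'subpanel', 'rack']
```

In the second case, coordination is exactly the job of tuning each breaker so the rack breaker always wins the race.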
However, these big circuit breakers have a number of custom settings that allow you to change the dynamics of their "trip curves." That is, you can set values for instantaneous trip, long-term current trip values, and slopes of amps versus time for tripping. Why do you want to do this? For two reasons: one, so you can precisely ensure that the larger breakers don't trip before the smaller ones do, and two, to ensure your current trip values make sense. For example, if you have a 4000A breaker but your systems are designed so that at most 3000A will ever be pulled, you can dial the breaker's trip setting down to 3200A, so it trips at a more reasonable level. It's the same reason you wouldn't put a 100 amp breaker on a 20 amp wall outlet in your living room: you want the breaker size to be as close to the actual load as possible.
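A trip curve can be sketched as a function from current to time-until-trip. The model below is grossly simplified and the numbers are assumptions for illustration; real breakers follow manufacturer-published time-current curves.

```python
def trip_time_seconds(current, long_time_pickup, inst_pickup, k=100.0):
    """Grossly simplified inverse-time trip curve, for illustration only
    (real breakers follow manufacturer-published curves). Returns seconds
    until trip, or None if the breaker holds indefinitely."""
    if current >= inst_pickup:
        return 0.05  # instantaneous element: opens within a few cycles
    if current <= long_time_pickup:
        return None  # at or below pickup: holds indefinitely
    # The further above pickup, the faster the trip (inverse-time behavior)
    return k / ((current / long_time_pickup - 1.0) ** 2)

# A 4000 A-frame breaker dialed down to a 3200 A long-time pickup,
# with the instantaneous element at 8000 A:
print(trip_time_seconds(3000, 3200, 8000))  # None -- normal load, holds
print(trip_time_seconds(6400, 3200, 8000))  # 100.0 -- 2x overload, trips in minutes
print(trip_time_seconds(9000, 3200, 8000))  # 0.05 -- fault-level current, instant
```

Coordination then amounts to choosing these settings so that, at any plausible fault current, the downstream breaker's trip time is comfortably shorter than the upstream breaker's.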
In addition, there is a programmable instantaneous trip value, and it's usually set much higher than the rating of the breaker, perhaps 2-3x higher. When transitioning between power systems (from grid power to generator power, for example), as the current turns off momentarily and then back on again, there is a quick inrush of current that may very rapidly exceed the normal steady-state value for a few moments. We need to make sure our breaker settings account for this and don't prematurely trip the breaker.
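That margin check is simple arithmetic. The 2.5x inrush multiple and the amp figures below are assumptions for illustration, not measured values.

```python
def rides_through_transfer(inst_pickup_amps, steady_amps, inrush_multiple=2.5):
    """During a source transfer, current briefly spikes to some multiple
    of its steady-state value (2.5x here is an assumed figure). The
    instantaneous pickup must sit above that spike, or the breaker will
    open on every single transfer."""
    return steady_amps * inrush_multiple < inst_pickup_amps

# 1000 A steady load, instantaneous pickup set to 3x a 1200 A frame:
print(rides_through_transfer(3600, 1000))  # True  -- 2500 A inrush passes

# Same load, but the pickup mis-set far too low at 1800 A:
print(rides_through_transfer(1800, 1000))  # False -- trips on transfer
```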
In Amazon's case, I don't know exactly what happened, but for this example I will speculate. The timing in the RCA indicates that secondary generator power was provided at 8:53pm and the breaker tripped at 8:57pm, so we can likely conclude it was not an instantaneous trip; otherwise it would have happened within moments.
My best guess is that Amazon's normal running setup relies on multiple power feeds, and potentially multiple generators when the main power feeds go out. In the normal scenario, the load to the end servers is balanced across a number of feeds. In this scenario, with the failures, the number of power paths to the servers was reduced, which means the current through the remaining path(s) increased. In this case, it increased enough to trip a breaker whose threshold was configured too low.
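The arithmetic behind that guess is worth spelling out. The feed counts and the 6000 A total load below are made-up numbers, chosen only to show how current per path climbs as paths drop out.

```python
# Illustrative numbers only: a fixed 6000 A total load spread evenly
# over however many feeds remain in service.
TOTAL_LOAD_AMPS = 6000

def amps_per_feed(feeds_in_service):
    """Current each remaining feed must carry when load is split evenly."""
    return TOTAL_LOAD_AMPS / feeds_in_service

print(amps_per_feed(4))  # 1500.0 -- normal operation, four feeds
print(amps_per_feed(2))  # 3000.0 -- two feeds lost, current per feed doubles

# A breaker mistakenly set to open near 2000 A would hold all day in
# normal operation and trip only when the redundancy is actually needed,
# which is exactly the kind of latent misconfiguration described above.
```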
Could this have been caught earlier? Absolutely. When new electrical distribution systems are added to the fold, a re-coordination of the entire system should be done. More thorough testing methodologies probably could have caught the issue sooner, too. However, such a test might well have caught the problem by tripping the breaker and causing the very same outage, which may be why Amazon doesn't include it in its test schedule. In most colocation data centers, customers are notified during test windows in case something like this happens (tests are, in general, designed to catch exactly this), but since AWS doesn't notify its end customers of its test windows, there's little added customer value in testing some of these scenarios.
I think the biggest takeaway is that you need to ensure your data center provider has a solid grasp of its testing schedule and the reasons the tests are conducted. If that's outsourced, or automated, it may be a sign that the operator doesn't have the technical resources to understand the overall electrical system and respond to issues quickly.
The number of Cisco and Microsoft certified guys on staff doesn't matter a bit if nobody can troubleshoot why the power goes out.
That's why we keep technically knowledgeable staff onsite to maintain and test our equipment. We don't have to wait on someone else to troubleshoot for us. We are data center people; this is what we do, and it's why we've maintained 100% uptime.
Data Cave is a privately owned and operated Tier IV Midwest data center located in Columbus, Indiana convenient to Indianapolis, Louisville and Cincinnati. Please contact us for more information at 866-514-2283.