AWS Outage – Circuit Breaker Testing
June 18, 2012 by Caleb Tennis
Amazon Web Services had an outage last week that again got the media into a frenzy. There is some good analysis at Hacker News.
Rather than point fingers, I wanted to spend a minute writing about the particular failure mode that AWS experienced:
“Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit.”
The overall gist here is that a primary generator stopped due to overheating, and when the secondary generator had to carry the load, the breaker tripped, causing a total loss of power. This, in particular, is what I want to go into in more detail.
From the time the power enters the data center facility until it gets to the server, it travels through a number of circuit breakers. Entering the facility, it likely goes through a large main breaker, much larger than anything you’d have in your home.

From there it may be distributed to subpanels, with progressively smaller breakers, until it gets to, say, a standard 20 or 30 amp breaker that connects to your equipment rack. These breakers each have one main purpose: to protect equipment and people. If too much current attempts to flow, the breaker is designed to trip. Plain and simple.
A big issue, however, comes with what’s known as “breaker coordination.” If you have a chain of circuit breakers, the idea is that you want the one closest to the “fault” to be the one that trips. If I accidentally overload my computer rack, then we want the circuit breaker that is closest to my computer rack, the 20 or 30 amp breaker noted above, to trip. We most certainly DON’T want the main circuit breaker to the building to trip. In fact, it’s very unlikely that it would: it’s rated for a much higher current than the 20 or 30 amp breaker in my cabinet, so an overload at my rack would never come close to its trip level, even though there was a problem.
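To make the idea concrete, here is a minimal Python sketch of a breaker chain. The `Breaker` class, the `overloaded_breakers` helper, and every rating and load in it are made up for illustration; real breakers follow time-dependent trip curves rather than a single threshold.

```python
from dataclasses import dataclass


@dataclass
class Breaker:
    name: str
    rating_amps: float  # current above which we expect the breaker to trip


def overloaded_breakers(chain, currents):
    """Return the names of breakers carrying more than their rating.

    chain is ordered from the building main down to the rack, and
    currents lists the current through each breaker in the same order.
    """
    return [b.name for b, amps in zip(chain, currents) if amps > b.rating_amps]


# Hypothetical three-level chain; all ratings are illustrative only.
chain = [
    Breaker("main", 4000.0),
    Breaker("subpanel", 400.0),
    Breaker("rack", 30.0),
]

# The rack draws 45 A. The upstream breakers also carry other loads,
# but their totals stay well inside their ratings.
currents = [2500.0, 250.0, 45.0]

print(overloaded_breakers(chain, currents))  # prints ['rack']
```

Only the breaker closest to the overload sees more than its rating, which is the outcome coordination is meant to guarantee.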
However, these big circuit breakers have a number of custom settings that allow you to change the dynamics of their “trip curves.” That is, you can set values for the instantaneous trip, the long-time trip current, and the slopes of amps versus time for tripping. Why do you want to do this? For two reasons: one, so you can precisely ensure that the larger breakers don’t trip before the smaller ones do, and two, to ensure your current trip values make sense. For example, if you have a 4000A breaker but are only going to be able to pull, at most, 3000A of current due to the way your systems are designed, then you can set the breaker’s trip level down to 3200A, so it trips at a more reasonable level. It’s the same reason you wouldn’t put a 100 amp breaker on a 20 amp wall outlet in your living room: you want the breaker size to be as close to the actual electrical load as possible.
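A rough Python sketch of how such settings interact is below. It assumes a simplified inverse-time curve (trip delay shrinking with the square of the overload ratio), which is not any manufacturer’s actual curve, and the `TripSettings` fields, the `is_coordinated` check, and all of the numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class TripSettings:
    long_time_pickup: float      # amps; sustained current above this eventually trips
    time_constant: float         # seconds; larger values shift the curve later
    instantaneous_pickup: float  # amps; at or above this the trip is immediate


def trip_time(s, amps):
    """Seconds until trip at a constant current, or math.inf if it never trips."""
    if amps >= s.instantaneous_pickup:
        return 0.0
    if amps <= s.long_time_pickup:
        return math.inf
    ratio = amps / s.long_time_pickup
    # Simplified inverse-time shape: a bigger overload trips sooner.
    return s.time_constant / (ratio ** 2 - 1.0)


def is_coordinated(downstream, upstream, fault_currents):
    """True if, whenever the upstream breaker would trip, the downstream one trips first."""
    for amps in fault_currents:
        t_down = trip_time(downstream, amps)
        t_up = trip_time(upstream, amps)
        if t_up < math.inf and t_up <= t_down:
            return False
    return True


# Hypothetical settings: a 4000A main dialed down to a 3200A pickup,
# feeding a 400A feeder breaker.
main = TripSettings(long_time_pickup=3200.0, time_constant=60.0, instantaneous_pickup=9600.0)
feeder = TripSettings(long_time_pickup=400.0, time_constant=20.0, instantaneous_pickup=4000.0)

# A main whose settings were mistakenly entered far too low.
bad_main = TripSettings(long_time_pickup=800.0, time_constant=1.0, instantaneous_pickup=900.0)

faults = [500.0, 1000.0, 3500.0, 5000.0]  # fault currents on the feeder, in amps
print(is_coordinated(feeder, main, faults))
print(is_coordinated(feeder, bad_main, faults))
```

With these made-up values the properly set main always waits longer than the feeder, while the mis-set main trips instantly on a 1000A feeder fault and takes the whole building down with it.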
In addition, there is a programmable instantaneous trip value, and it’s usually set much higher than the rating of the breaker, perhaps 2-3x higher. When transitioning between power systems (from grid power to generator power, for example), as the current turns off momentarily and then back on again, there is a quick inrush of current that may very rapidly exceed the normal steady-state current value for a few moments. We need to make sure our breaker settings account for this and don’t prematurely trip the breaker.
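As a quick worked example of that check in Python: the `survives_inrush` helper and the inrush multiple are assumptions for illustration, and a real study would use measured inrush data and the breaker’s published curve.

```python
def survives_inrush(rating_amps, instantaneous_multiple, steady_amps, inrush_multiple):
    """True if the transfer inrush peak stays below the instantaneous pickup."""
    instantaneous_pickup = rating_amps * instantaneous_multiple
    inrush_peak = steady_amps * inrush_multiple
    return inrush_peak < instantaneous_pickup


# Hypothetical numbers: a breaker set to 3200 A carrying a 2500 A steady load,
# with a momentary inrush of 2.5x the steady load during a transfer.
print(survives_inrush(3200.0, 3.0, 2500.0, 2.5))  # pickup at 9600 A vs 6250 A inrush
print(survives_inrush(3200.0, 1.5, 2500.0, 2.5))  # pickup at 4800 A vs 6250 A inrush
```

With the instantaneous setting at 3x the rating the inrush passes; at 1.5x the same transfer would trip the breaker.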

Breaker Coordination Curve Example
In Amazon’s case, I don’t know exactly what happened, but for this example I will speculate. The timing from the RCA indicates that secondary generator power was provided at 8:53pm and that the breaker tripped at 8:57pm, so we can likely conclude it was not an instantaneous trip; otherwise it would have happened within moments.
My best guess is that Amazon’s normal running setup relies on multiple power feeds, and potentially multiple generators when the main power feeds go out. In the normal scenario, the load to the end servers is balanced through a number of feeds. In this scenario, with the failures, the number of power paths to the server was reduced, which means that the amount of current through the remaining path(s) increased. In this case, it increased enough to cause an incorrectly configured breaker to trip too aggressively.
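To illustrate that guess with numbers, here is a small Python sketch. The total load, the feed counts, and both pickup values are invented, and it assumes the load splits evenly across whatever feeds remain; none of it comes from Amazon’s report.

```python
def per_feed_current(total_amps, active_feeds):
    """Current on each remaining feed, assuming the load splits evenly."""
    if active_feeds < 1:
        raise ValueError("at least one feed must remain active")
    return total_amps / active_feeds


def exceeds_pickup(total_amps, active_feeds, pickup_amps):
    """True if each remaining feed carries more than the breaker's pickup."""
    return per_feed_current(total_amps, active_feeds) > pickup_amps


# Hypothetical values for illustration only.
TOTAL_LOAD = 3000.0            # amps drawn by the servers in total
INTENDED_PICKUP = 2000.0       # what the breaker should have been set to
MISCONFIGURED_PICKUP = 1200.0  # a setting that is too low

for feeds in (4, 2):
    amps = per_feed_current(TOTAL_LOAD, feeds)
    print(
        f"{feeds} feeds: {amps:.0f} A each, "
        f"intended setting trips: {exceeds_pickup(TOTAL_LOAD, feeds, INTENDED_PICKUP)}, "
        f"misconfigured setting trips: {exceeds_pickup(TOTAL_LOAD, feeds, MISCONFIGURED_PICKUP)}"
    )
```

With four feeds each path carries 750 A and neither setting is exceeded. With two feeds each path carries 1500 A, which the intended setting tolerates but the too-low setting does not. A sustained overload of that size would produce a delayed trip rather than an instantaneous one, which fits a trip a few minutes after the transfer.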
Could this have been caught earlier? Absolutely. When new electrical distribution systems are added to the fold, a re-coordination of the entire system should be done. More thorough testing probably could have caught the issue sooner, too. However, a test that catches this problem may do so by tripping the breaker, with the same impact as the real event, which may be why Amazon doesn’t include it on their test schedule. In most colocation data centers, customers are notified of test windows in case something like this happens (which is, in general, what the tests are designed to catch), but since AWS doesn’t notify its end customers of its test windows, there’s no added customer value in testing some of these scenarios.
I think the biggest takeaway is that you need to ensure your data center provider has a solid grasp of what the testing schedule looks like, and the reasons why the tests are conducted. If that’s outsourced, or automated, it may be a sign that the operator doesn’t have the technical resources to understand the overall electrical system and respond to issues quickly.
The number of Cisco and Microsoft certified guys on staff doesn’t matter a bit if nobody can troubleshoot why the power goes out.
That’s why we keep technically knowledgeable staff onsite to maintain and test our equipment. We don’t have to wait on someone to troubleshoot for us. We are data center people. This is what we do, and it is why we’ve experienced 100% uptime.
_______
Data Cave is a privately owned and operated Tier IV Midwest data center located in Columbus, Indiana convenient to Indianapolis, Louisville and Cincinnati. Please contact us for more information at 866-514-2283.
Indiana Data Center Disaster Recovery
June 1, 2012 by Kara Manon
As a data center located in Indiana, outside of metropolitan areas for added security, Data Cave is the logical choice for an Indiana disaster recovery service provider. Close to Indianapolis, Louisville and Cincinnati, we house customers of all sizes and with any power, managed services and connectivity needs. In fact, we have customers from all over the country. Whether you are in Michigan, Oregon, New Jersey or wherever, we can protect your data and business. If you are looking for Colocation, Disaster Recovery, a Private Cloud, Offsite Backup or Business Continuity seats, you can bet that your equipment will remain safe, secure, always available and private with Data Cave, a Tier IV facility.
Data Cave is an SSAE 16 compliant data center and we provide healthcare, financial services and other regulated organizations the opportunity to exceed compliance with HIPAA, HITECH, PCI, GLBA, SOX and any other regulations. In fact, we have developed whitepapers to help you better understand these regulations and how to comply with them.
Solving the Mystery of HIPAA and HITECH
A Guide to Financial Services Regulations
A Tier IV data center means that we are truly fully redundant and provide our clients with maximum uptime. In fact, since we opened, we have provided all of our customers with 100% uptime. That is a remarkable feat for any data center, especially a Midwest data center in Indiana that has to deal with the blows that Mother Nature brings. With our purpose-built structure, we are able to provide protection from EF5 tornadoes. That’s 200+ mph winds!
Stop losing money to downtime! If you are looking for an Indiana data center, look no further. We can provide you with the best security and uptime (not to mention a knowledgeable and friendly staff) that your company needs to survive in this technology-driven world.
Call us at 866-514-2283 or use our Contact Us page to request more information.
The Data Cave Advantage Part 2
April 10, 2012 by Kara Manon
No Raised Flooring: Back in the day, data centers began using raised flooring so they could install large cables under the floor to run to equipment. Then, someone thought it was a good idea to use that space to run chilled water and under-floor air distribution for cooling server racks. These solutions are no longer necessary. We have found that you can get much better cooling without a raised floor. Our data suites feature overhead hot-aisle and cold-aisle containment. Our redundant chilled water pipes run OUTSIDE of the rooms containing customer equipment. There is no need for water piping to be in the same room as equipment in the first place. Cables are small and can be installed in overhead racks above the customer cabinets and around the data center. By doing away with raised floors, we achieve the cooling capacity we need, the facility stays a lot cleaner and we don’t have to worry about possible issues with water piping sharing space with customer equipment. Data center owners with raised floors also have to worry about buckling when tiles are removed. Raised flooring also reduces a data center’s seismic rating and could have catastrophic effects if an earthquake occurred.
Redundancy: We really do have two of every piece of critical equipment. Two utility lines, generators, cooling towers, water pumps, chillers, Uninterruptible Power Supply (UPS) systems, chilled water pipes, etc. We have two Power Distribution Units (PDUs) and two Computer Room Air Conditioner (CRAC) units per suite. Our data center is split into four quadrants, each of which runs independently of the others. That’s like four independent data centers within one building. Even our bathrooms are redundant.
Uptime: Data Cave has experienced 100% uptime since we opened. Don’t get us wrong: we have lost utility power, but because we constantly monitor and test our critical (top of the line) equipment and maintain a high level of redundancy, the switch to generator power, when necessary, is seamless. Our UPS flywheels run more than long enough for our 2MW generators to start, which takes about eight seconds.
Staff: We have an amazingly knowledgeable staff that includes programmers, technicians, electricians, plumbers, engineers, etc., who are able to handle facility issues without reliance on outside contractors. This is an important point because a lot of data centers don’t have this technical base to pull from when an issue arises. We don’t have to wait for a contractor to arrive; we can start working on a problem as soon as it happens. Most of our staff have been here since the planning stages of Data Cave. Working at Data Cave isn’t just a job. A lot of what we do in our jobs, we do as hobbies too. We love what we do and it shows.
Come see for yourself the advantages of choosing Data Cave. Call us at 866.514.2283.




