The Recent Amazon Outage

July 10, 2012 by David Krider

Amazon’s Elastic Compute Cloud recently suffered an outage. In summary, Amazon’s explanation was that power to the facility failed, their generators failed to provide “stable” power, and the energy stored in their UPSs was exhausted before power could be restored. It took only about 20 minutes to fix the generators and get power back online, but as anyone who has supported large distributed systems knows, recovery is not that easy. Getting services back online involves more than just restoring power, and the outage caused serious interruptions for several popular web sites, such as Pinterest and Instagram. Netflix, for instance, was down for three hours.

The Hacker News community discussed the event at length. There are a couple of takeaways that I would like to point out.

First, it’s not common for a data center, even a large one, to have fully redundant power equipment. Data Cave does. We have two power feeds and two generators, all of which feed into electrical switchgear that can select either utility power or generator power for either the “A” or “B” side. In addition, both sides are sized to carry the entire demand alone, should the need arise, and the switchgear can feed both sides from any one of the four sources. Each side then feeds dedicated flywheel UPS systems, which in turn supply the A/B PDUs, which distribute power to each rack through breakers.

The main weakness in this setup is customer equipment. Most “enterprise”-level equipment has redundant power supplies, but they are not always both connected. It’s also critical that the load is split evenly across the A and B sides, and that no more than 50% of the breaker capacity is used on either side. Otherwise, when one side fails and its load falls entirely onto the other, the surviving breaker can trip and power is lost anyway. At Data Cave, we help monitor this situation for our customers.
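The 50% rule above is simple arithmetic, and a short sketch may make it concrete. The example below is only an illustration under stated assumptions: the `RackFeed` class, the `check_rack` function, and the sample amperages are hypothetical and do not describe Data Cave’s actual monitoring tools. It assumes both sides of a rack use breakers of the same rating.

```python
from dataclasses import dataclass


@dataclass
class RackFeed:
    """Measured load on one rack's A and B circuits (hypothetical model)."""
    name: str
    load_a_amps: float
    load_b_amps: float
    breaker_amps: float  # rating of each side's breaker, assumed identical


def check_rack(rack: RackFeed, max_fraction: float = 0.5) -> list[str]:
    """Return warnings for a rack that might not survive losing one side."""
    warnings = []
    limit = rack.breaker_amps * max_fraction

    # Each side should stay at or below 50% of its breaker capacity.
    for side, load in (("A", rack.load_a_amps), ("B", rack.load_b_amps)):
        if load > limit:
            warnings.append(
                f"{rack.name}: side {side} draws {load:.1f} A, "
                f"over the {limit:.1f} A limit"
            )

    # If one side fails, the surviving side carries the combined load.
    combined = rack.load_a_amps + rack.load_b_amps
    if combined > rack.breaker_amps:
        warnings.append(
            f"{rack.name}: combined load {combined:.1f} A would trip "
            f"a {rack.breaker_amps:.1f} A breaker on failover"
        )

    # A side with no load at all suggests a redundant supply is not connected.
    if min(rack.load_a_amps, rack.load_b_amps) == 0 and combined > 0:
        warnings.append(f"{rack.name}: one side carries no load")

    return warnings


if __name__ == "__main__":
    for rack in (
        RackFeed("rack-01", 8.0, 9.0, 20.0),
        RackFeed("rack-02", 11.0, 11.0, 20.0),
    ):
        for line in check_rack(rack) or [f"{rack.name}: OK"]:
            print(line)
```

With these sample numbers, the first rack passes, while the second is flagged on both sides and on the combined failover load, since 22 A would land on a single 20 A breaker.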

Second, it’s not common for data centers, even large ones, to have people on staff who really understand generators. Data Cave does. Our affiliate company, located right next door, is a world leader in high-end, large-displacement diesel engine testing. As such, we have at our disposal experts in both large diesel engines and the large electrical motors they drive. If something goes wrong with either one, someone can fix it immediately. We don’t need to call a service company to come fix “their” equipment; these are not “black boxes” to us.

Data Cave was designed, from the ground up, to be a world-class data center, and the implementation of our systems could serve as a textbook of best practices. Our staff is filled with subject matter experts in everything relating to building and running a facility like this. If you’re in the market for a colocation or disaster recovery site, you owe it to yourself to come see the features of Data Cave in person.

Infrastructure 101: Flywheel UPS

December 27, 2010 by Caleb Tennis

It’s no secret at Data Cave that our critical UPS infrastructure is built from flywheel UPS units. Our initial build consists of multiple Active Power-based CS1200Z units supplying critical power to the customer and internal IT loads.

Conceptually, the system works almost exactly the same as …

More on the UPSs

October 2, 2009

The UPS manufacturer was onsite this week to bring the additional UPS capacity online.

Additional UPS Capacity

September 15, 2009

A few weeks ago, we ordered additional flywheels for our UPS systems.

Blogging Hiatus

September 15, 2009

I apologize for the lack of blog posts this past month. Although I haven’t been blogging about the progress being made at Data Cave, our crew has been hard at work. I plan to elaborate on most of these items in the next week or two, but here are a few things we’ve been working on:

  • increasing the capacity of the UPS systems,
  • data center cooling (processed water, chillers, etc.),
  • completion of the fence around the perimeter,
  • installation of additional fire suppression systems for individual data suites,
  • additional network configuration,
  • testing of monitoring systems, and
  • installation and configuration of security systems.

Additionally, we’ve been finishing up the front office area, and we’ve moved into the new facility.

Flywheel UPSs

July 21, 2009

In late June, the UPSs were set into place, wired up, and installed.

Outside and Inside

May 22, 2009

Work is going on both outside and inside the data center.

Taking Delivery

December 5, 2008

In the past two weeks, we’ve taken delivery of a number of pieces of critical data center infrastructure.
