A disaster exposes the need for data center redundancy testing
Equipment explosions are rare in our industry, but one occurred a couple of months ago in California. In downtown Los Angeles, a generator located in the basement of a shared office building exploded, injuring 4 people (thankfully, none seriously) and causing a ripple effect that impacted multiple data centers in the immediate area. Much of that impact came from a nearby power station, which was heavily damaged by the explosion.
I want to look at 2 of the data centers that were impacted the most by this event, and at the lessons that can be learned from it.
Data center #1: Cooling system power loss
One of the impacted data centers was located close to the site of the explosion. While the facility didn’t lose power entirely, it did lose power to its entire cooling system, which resulted in highly elevated temperatures throughout the facility. There was no reported damage or customer outages as a result, but the incident did place staff and customers on high alert.
Data center #2: Connectivity outage
A connectivity provider also operates a facility near the site of the explosion, and when the blast damaged the power station, that facility lost utility power. While it appears that the provider had a generator for backup power, the facility was unable to switch over to it successfully. As a result, its connectivity services were out for a noticeable amount of time.
A lesson to take from this
I think the biggest lesson from this incident is the importance of testing for any data center: specifically, testing your equipment as well as your redundancy procedures.
Equipment: You obviously never want to be in a position where a generator explodes on you. While the specific reason for the explosion still hasn’t been determined, the risk of failure for a generator or any piece of equipment can be reduced when that equipment is properly maintained and tested on a regular basis. When you have multiple types of equipment that your data center relies on for redundancy, this kind of testing is crucial. (Be on the lookout for a future blog post on this subject.)
Redundancy: It is equally important to test your redundancy measures themselves on a regular basis, to ensure that your failover processes will work when a piece of equipment fails. In the case of the connectivity provider’s facility, the switchover to generator power failed when utility power was lost. That kind of failure is exactly what ongoing testing and optimization can help mitigate.
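For teams that track their testing in software, the idea of testing "on a regular basis" can be sketched as a simple overdue check. This is only an illustration: the test names, dates, and intervals below are hypothetical, not recommendations, and a real facility would set its own schedule based on its equipment and requirements.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DrillRecord:
    name: str           # what was tested (hypothetical labels)
    last_tested: date   # date of the most recent successful test
    interval_days: int  # assumed maximum number of days between tests


def overdue_tests(records, today):
    """Return the names of tests whose last run is older than their interval."""
    return [
        r.name
        for r in records
        if today - r.last_tested > timedelta(days=r.interval_days)
    ]


# Hypothetical schedule covering both equipment and failover tests.
records = [
    DrillRecord("generator load test", date(2024, 1, 10), 30),
    DrillRecord("utility-to-generator transfer drill", date(2023, 11, 1), 90),
    DrillRecord("cooling system backup power check", date(2024, 2, 1), 30),
]

print(overdue_tests(records, today=date(2024, 2, 15)))
# prints ['generator load test', 'utility-to-generator transfer drill']
```

Keeping the equipment tests and the failover drills in the same list reflects the point above: both need to be exercised regularly, and either one lapsing is worth flagging.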
There are always lessons to be learned from any incident, and this one revealed all too clearly that having ongoing equipment and redundancy testing procedures in place is absolutely crucial for data centers.