How a Server Outage Affects Everyone
A story I came across recently reminded me of the consequences of a failure in a data center, and of the devastating ripple effect it can have on everyone who relies on it. It is definitely a story worth sharing (for the original story, click here).
Time: The beginning of the 2013 school year.
Place: The Davis Joint Unified School District in Davis, California. The district includes a total of 20 elementary, junior high, and high schools.
During the weekend before the scheduled start of the school year, the air conditioning unit in the district’s server room broke down, causing all of the servers to overheat (the room itself soared to 120 degrees as a result). The district’s IT team scrambled to get the servers back online, but it took multiple days of downtime before everything was up and running again.
Those few days of downtime had several negative effects:
- Teachers were unable to take attendance electronically.
- Students were not able to work because they could not access the school computers.
- Email across the district was completely down, both incoming and outgoing, so parents and teachers couldn’t effectively communicate with one another.
When you think about the wide range of people and activities that depend on a fully running data center (like all of those teachers, students, and parents), it is very easy to see how one outage can cause a fast-moving and devastating ripple effect that impacts everyone. This story serves as an excellent reminder that redundancy is a very big deal for any data center, whether it belongs to a school district, a bank, or any other type of organization.
With only one air conditioning unit in place for its entire data center, the school district had a single point of failure, and when that unit failed, everything that depended on it failed too. In situations like this, where so many stakeholders rely on the data center running reliably, a high level of redundancy is essentially a requirement. An outage like this could likely be prevented altogether with the following measures:
- Redundant cooling: By having multiple AC units in place in the data center, a failure on one of the units wouldn’t cause the entire room to overheat.
- Temperature monitoring: While it is unclear what monitoring the school district had in place, having an effective monitoring system that notifies on-call staff when there is a drastic change in temperature can dramatically improve the reaction time for identifying and resolving any issues that may arise in the data center.
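To make the temperature monitoring idea a little more concrete, here is a minimal Python sketch of the kind of check such a system might run on each new sensor reading. Everything in it is an assumption for illustration: the `Reading` type, the `check_reading` function, and the threshold values (85 degrees Fahrenheit, a 5 degree rise within 10 minutes) are hypothetical and do not describe the district's setup or any particular monitoring product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reading:
    minute: int    # minutes since monitoring started (hypothetical time base)
    temp_f: float  # room temperature in degrees Fahrenheit


def check_reading(history: List[Reading], new: Reading,
                  max_temp_f: float = 85.0,   # assumed absolute limit
                  max_rise_f: float = 5.0,    # assumed allowed rise
                  window_min: int = 10) -> Optional[str]:
    """Return an alert message if the new reading looks abnormal, else None."""
    # Rule 1: the room is at or above the absolute limit.
    if new.temp_f >= max_temp_f:
        return (f"Room at {new.temp_f:.1f} F, "
                f"at or above limit of {max_temp_f:.1f} F")

    # Rule 2: the room is warming quickly, even if still under the limit.
    recent = [r for r in history if new.minute - r.minute <= window_min]
    if recent:
        rise = new.temp_f - min(r.temp_f for r in recent)
        if rise >= max_rise_f:
            return (f"Temperature rose {rise:.1f} F "
                    f"in the last {window_min} minutes")

    return None


if __name__ == "__main__":
    history = [Reading(0, 70.0), Reading(5, 71.0)]
    # A sudden jump, as you might see shortly after a cooling unit fails.
    print(check_reading(history, Reading(10, 76.5)))
```

Checking the rate of change as well as the absolute limit is the point of the sketch: a fast rise can be flagged well before the room gets anywhere near 120 degrees. Whatever message comes back would then be handed to the on-call notification channel, which should not depend on the servers being monitored (in the district's case, email itself was down).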
These are just a couple of the redundancy measures that we have in place here at Data Cave (you can read more about our cooling infrastructure here), and these same measures would have gone a long way toward preventing the outage that was seen earlier in the month. While an outage like this one is never a good experience to deal with, it does serve as a good lesson in why redundancy is such a big deal for organizations everywhere, and especially for us here at Data Cave.