Redundancy and Uptime
Is your data center maximally redundant?
Running a data center is a tricky business. Operational controls and efficiencies are paramount. From the customer's perspective, though, the key word is uptime. Day-to-day operations are important, but the operations that keep customers online are the most critical.
Many data centers tout uptime statistics. In fact, the Uptime Institute is an organization that has developed standards describing levels of redundancy within data centers and the uptime to be expected at each level. Many data centers aspire to what is known as Tier IV, the highest tier. To get there, a data center must have multiple active systems (cooling, electrical, etc.) such that failure of any single system goes unnoticed. This allows the data center to target 99.995% uptime.
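To put the 99.995% figure in perspective, it can be converted into the downtime it allows per year. The following is a minimal sketch; the function name and the 365-day year are assumptions made for illustration, not part of any Uptime Institute definition.

```python
# Illustrative helper (hypothetical name): convert an uptime target
# into the downtime it permits over one year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an uptime target."""
    downtime_fraction = 1 - uptime_percent / 100
    return downtime_fraction * HOURS_PER_YEAR * 60


if __name__ == "__main__":
    minutes = allowed_downtime_minutes(99.995)
    print(f"99.995% uptime allows about {minutes:.1f} minutes of downtime per year")
```

In other words, a 99.995% target leaves roughly 26 minutes per year for all outages combined.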
Because of the expense of building to Tier IV, most data centers choose to build to Tier III, or close to it. They give up some of the inherent redundancy in exchange for lower operational and capital costs.
One major aspect of data center redundancy is the cooling system. Most data centers employ multiple CRAC (Computer Room Air Conditioner) units to keep the data center floor cool. Generally, an extra unit is installed so that the failure of any single unit goes unnoticed by the end user. This is known as an N+1 setup: N is the number of units needed to operate, and +1 denotes an extra unit running as a backup.
While this design is sound, the CRAC units are not the only critical part of the cooling system. The CRAC units themselves must be serviced by another piece of the HVAC chain, be it a refrigerant-based system or a chilled-water system. Thus, there should be some redundancy in those servicing systems as well.
Does the data center have more than one of these servicing units, so that the loss of any one of them can be absorbed? If not, then having an extra CRAC unit does not make the overall HVAC system N+1.
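The point that N+1 must hold at every stage of the HVAC chain, not just at the CRAC units, can be sketched as a simple check. The stage names and unit counts below are hypothetical and chosen only for illustration.

```python
# Minimal sketch: a chain is only N+1 if every stage can lose one unit
# and still have the N units it needs. All figures here are hypothetical.

def tolerates_single_failure(installed: int, required: int) -> bool:
    """True if losing any one unit still leaves at least the required number."""
    return installed - 1 >= required


def stages_without_spare(stages: dict[str, tuple[int, int]]) -> list[str]:
    """Return the names of stages that cannot absorb a single unit failure.

    `stages` maps a stage name to (installed units, units required).
    """
    return [
        name
        for name, (installed, required) in stages.items()
        if not tolerates_single_failure(installed, required)
    ]


# Hypothetical facility: five CRAC units where four are needed (N+1),
# but only one chiller feeding all of them.
hvac_chain = {
    "CRAC units": (5, 4),
    "chillers": (1, 1),
}

if __name__ == "__main__":
    print(stages_without_spare(hvac_chain))  # ['chillers']
```

Here the CRAC stage passes, but the single chiller means the cooling system as a whole is not N+1.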
Furthermore, what about the piping to and from the CRAC units? Are there multiple sets of pipes supporting the units in case a pipe breaks? If a single piece of infrastructure, such as a pipe, supports multiple units and has no backup, a failure there can take down every unit it serves.
Most data centers will offer redundant electrical circuits, commonly called A+B feeds. Again, it is necessary to follow the distribution chain of these circuits back to the source. Do they go to separate power distribution units? If not, there is a single point of failure. Do they go back to separate UPS units? If not, there is a single point of failure. Does the power come in from two separate transformers? If not, there is a single point of failure.
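Following each feed back to its source amounts to listing the components on each path and looking for anything the two paths share. The sketch below assumes each feed can be written as a simple list of component labels; the labels themselves are hypothetical.

```python
# Minimal sketch: any component that appears on both the A and B feed
# is a single point of failure. Component labels are hypothetical.

def shared_components(feed_a: list[str], feed_b: list[str]) -> list[str]:
    """Return components present on both feeds, in feed A's order."""
    on_feed_b = set(feed_b)
    return [component for component in feed_a if component in on_feed_b]


# Hypothetical A+B feeds traced from the rack back toward the utility.
feed_a = ["PDU-1", "UPS-1", "Transformer-1"]
feed_b = ["PDU-2", "UPS-1", "Transformer-2"]

if __name__ == "__main__":
    print(shared_components(feed_a, feed_b))  # ['UPS-1']
```

In this example the two feeds use separate power distribution units and transformers but converge on one UPS, so the A+B arrangement still has a single point of failure.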
Furthermore, what happens during a power outage? Just about every critical data center will have backup generator capacity, but is it just a single generator? What happens if the generator fails to start? What happens if the service department is performing an oil change on the generator when the power goes out? This is another single point of failure that must be analyzed.