Data Center Redundancy and Uptime
Is your data center fully redundant?
Running a data center is a tricky business. Operational controls and efficiencies matter a great deal, but from the customer's perspective, the key word is uptime. While a data center's day-to-day operations are important, the operations that maintain customer uptime are the most critical.
Many data centers tout uptime statistics. In fact, the Uptime Institute has developed standards that define levels of redundancy within data centers and the levels of uptime that can be expected at each. Many data centers aspire to Tier IV, the highest tier. To get there, a data center must run multiple active systems (cooling, electrical, etc.) such that the failure of any single system goes unnoticed. This allows the data center to target 99.995% uptime.
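A quick sanity check is to translate those uptime percentages into allowed downtime per year. The sketch below uses the uptime figures commonly cited for the Uptime Institute tiers; treat them as illustrative targets rather than contractual guarantees:

```python
# Convert an uptime percentage into allowed downtime per year.
# Tier figures below are the commonly cited Uptime Institute targets;
# treat them as illustrative.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes(uptime_percent: float) -> float:
    """Maximum minutes of downtime per year at the given uptime level."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)

tiers = {
    "Tier I": 99.671,
    "Tier II": 99.741,
    "Tier III": 99.982,
    "Tier IV": 99.995,
}

for tier, uptime in tiers.items():
    print(f"{tier}: {uptime}% uptime -> {downtime_minutes(uptime):.1f} min/year")
```

At 99.995%, that works out to roughly 26 minutes of downtime per year, versus about 95 minutes at the Tier III level of 99.982%.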
However, because of the expense required to achieve that uptime level, most data centers choose to build to Tier III, or close to it, sacrificing some redundancy to reduce operational and capital costs.
I’m going to look at some of the key redundant features that go into a data center’s tier ranking, primarily its cooling and electrical systems. As you read through these, consider whether your current data center aligns with any of these factors, or if it may be lacking in some areas.
For more information on data center tiers, check out our Data Center Tiers Explained blog post.
One major aspect of data center redundancy is the cooling system. Most data centers employ multiple CRAC (Computer Room Air Conditioner) units to keep the data center floor cool. Generally, an extra unit is installed so that the failure of any single unit goes unnoticed by the end user. This is known as an N+1 setup, where 'N' is the number of units needed to carry the load and '+1' denotes an extra unit running as a backup.
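The N+1 idea boils down to a simple capacity check: given a heat load and a per-unit capacity, how many units are needed (N), and does the installed count still cover the load with any one unit failed? A minimal sketch, with hypothetical load and capacity numbers:

```python
import math

def units_required(load_kw: float, unit_capacity_kw: float) -> int:
    """N: the minimum number of CRAC units needed to carry the load."""
    return math.ceil(load_kw / unit_capacity_kw)

def survives_single_failure(installed: int, load_kw: float,
                            unit_capacity_kw: float) -> bool:
    """True if the room stays cooled with any one unit out of service."""
    return (installed - 1) * unit_capacity_kw >= load_kw

load = 400.0      # hypothetical IT heat load, kW
capacity = 100.0  # hypothetical capacity per CRAC unit, kW

n = units_required(load, capacity)                     # N = 4
print(survives_single_failure(n, load, capacity))      # N alone: False
print(survives_single_failure(n + 1, load, capacity))  # N+1: True
```

With exactly N units, one failure leaves the room undercooled; the +1 unit is what absorbs the failure.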
While this design is sound, the CRAC units are not the only critical part of the cooling system. The CRAC units themselves must be served by another piece of the HVAC chain, be it a refrigerant-based or chilled-water-based system, so there should be redundancy in those upstream systems as well.
Does the data center have multiple upstream units, such as chillers or condensers, so the cooling chain can withstand a single point of failure? If not, an extra CRAC unit alone does not constitute an N+1 design for the overall HVAC system.
Furthermore, what about the piping to and from the CRAC units? Are there multiple sets of pipes supporting the units in case one does break? If a single piece of infrastructure, such as a pipe, supports multiple units and has no backup, that pipe is itself a single point of failure.
Most data centers will offer redundant electrical circuits, commonly called A+B feeds. Again, it is necessary to follow the distribution chain of these circuits back to the source. Do they go to separate power distribution units? If not, there is a single point of failure. Do they go back to separate UPS units? If not, there is a single point of failure. Does the power come in from two separate transformers? If not, there is a single point of failure.
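Tracing the A and B feeds back to their sources amounts to a shared-component check: if both feeds pass through the same PDU, UPS, or transformer, that component is a single point of failure. A minimal sketch, with hypothetical component names:

```python
def single_points_of_failure(feed_a: list, feed_b: list) -> set:
    """Components that appear in both feeds' distribution chains."""
    return set(feed_a) & set(feed_b)

# Each feed listed from the rack back toward the utility; names are hypothetical.
feed_a = ["PDU-A", "UPS-A", "Transformer-1"]
feed_b = ["PDU-B", "UPS-B", "Transformer-1"]  # both feeds share one transformer

print(single_points_of_failure(feed_a, feed_b))  # {'Transformer-1'}
```

An empty result means the two chains are fully independent; any component in the result defeats the purpose of having A+B feeds at all.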
Furthermore, what happens during a power outage? Just about every critical data center will have backup generator capacity, but is it just a single generator? What happens if the generator fails to start? What happens if the service department is performing an oil change on the generator when the power goes out? Again, this is another single point of failure risk that must be analyzed.
If these questions have got you thinking about your current data center, we encourage you to contact us to continue the conversation!