Our last data center dictionary entry covered Disaster Recovery. We discussed what disaster recovery is, how to reduce various threats, and how to begin your own disaster recovery plan. Next, we move to Colocation.
What is colocation?
Simply stated, colocation is the practice through which a business locates its servers and IT equipment in an offsite data center. These facilities are often designed to provide rich connectivity options that would otherwise be unavailable to a business or organization. Unlike dedicated server hosting, colocation allows businesses to own and manage their own servers in a secure environment designed to support and enhance server activity.
Why do businesses practice colocation?
Colocation provides businesses with several advantages, including:
- Improved facility and network security
- High uptime and availability
- Increased connectivity options
- Cooling, electrical and networking redundancy
- Scalability for future growth
- Cost-effective bandwidth
- Outage protection
Who should consider colocation?
While colocation can be a great resource for businesses of any size, medium and large organizations in particular should consider it. Industries that regularly handle highly sensitive information, such as financial services and healthcare, benefit from colocation because data centers have exceptional security measures in place.
Why should a financial service company consider colocation?
Today’s financial environment favors the quick, the connected, and the agile. Colocation gives companies the speed, availability, and compliance adherence necessary for success. The boom in electronic trading lets companies execute transactions almost instantly, but it has also created an environment in which speed directly affects success. The most successful companies in this industry obtain and analyze market information to make quick and accurate decisions, and every second matters.
Colocation also protects companies from losses caused by latency and downtime. With 100% availability and uptime, a financial organization can be certain it will not miss an opportunity, a miss that could mean a costly loss. Finally, because these companies handle sensitive data regularly, they must adhere to stringent compliance regulations. For more information about compliance, financial services, and colocation, we recommend reading our white paper, A Guide to Financial Services Regulations.
Healthcare and Colocation
In today’s healthcare environment, the IT infrastructure may be as important as the care itself. A new study published in the January/February Annals of Family Medicine estimates that 70% of family physicians now use Electronic Health Records (EHRs), and that by the end of the year more than 80% will. Healthcare providers at all levels, from hospitals to family care practices, are relying heavily upon EHRs and other technology. Technology in medicine is no longer just for operational efficiency; it is essential to effective patient care. Because technology has become a critical component of any healthcare organization, these organizations should consider colocation: it supports effective operations and excellent patient care as well as HIPAA and HITECH compliance.
Did you know that McDonald’s feeds more than 46 million people every day? That’s nearly the population of Spain! Additionally, McDonald’s represents 43% of the United States fast food market. One would think that a company like McDonald’s would practice appropriate server maintenance. We were horrified when a friend of Data Cave sent us this picture they snapped through the window of a local McDonald’s drive-thru.
So let’s play a game. What’s wrong with this picture?
1. Kitchens and Technology are a Recipe for Disaster
This McDonald’s chose to locate its servers near the kitchen. It doesn’t take a data center expert to see that this is a bad strategy. Consider your personal cell phone, for example. SquareTrade research found that 21% of all iPhone accidents occur in the kitchen. An iPhone is a critical device for many, but most of its vital information is backed up to iCloud, and while an iPhone isn’t cheap to replace, the price is nowhere near as prohibitive as purchasing and implementing a new server. Keeping servers near food and drink can only end in technology tragedy.
2. Exposure to the Elements
Not only did this McDonald’s place its servers near the kitchen, it exposed them to the elements by putting them at the drive-thru. It is estimated that an average McDonald’s serves 1,584 customers daily. If half of those customers come through the drive-thru and the window is open for an average of 10 seconds per customer, those servers are exposed to outside conditions for two hours and twelve minutes each day. This takes the idea of an uncontrolled environment to the extreme.
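The arithmetic is easy to check. Here is the back-of-the-envelope calculation, where every input is one of the estimates above rather than a measured figure:

```python
# Back-of-the-envelope exposure estimate (all inputs are the assumptions above).
customers_per_day = 1584      # estimated daily customers at an average McDonald's
drive_thru_share = 0.5        # assume half of them use the drive-thru
window_open_seconds = 10      # assume the window is open ~10 s per car

exposure_seconds = customers_per_day * drive_thru_share * window_open_seconds
hours, remainder = divmod(exposure_seconds, 3600)
print(f"Exposed to outside air for {hours:.0f} h {remainder / 60:.0f} min per day")
# -> Exposed to outside air for 2 h 12 min per day
```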
3. Crossed Wires
Messy wires aren’t just unsightly; they are dangerous. Tangled wires pose a fire threat (and we are willing to bet that McDonald’s didn’t employ a fire suppression system exclusively for its servers). Because of this cabling, it doesn’t even appear as if they can shut the door (see #4). In fact, the picture below details the challenges of having messy wires.
4. An Open Door Policy
Open door policies are great for dealing with employees, but they are less than optimal when it comes to technology. Having an open door to their servers poses many security risks: damage could be done both intentionally and unintentionally. One in every eight American workers has been employed by McDonald’s at some point, which is indicative of high employee turnover. A disgruntled employee could easily wreak havoc on McDonald’s because the technology is so readily accessible. Additionally, accidents happen, and an open door only increases the chance of them.
5. The Data Closet
Finally, it goes without saying that we encourage all organizations to protect their valuable technology, ideally offsite. McDonald’s has its main data center in Dallas, but its restaurants obviously still need local equipment. There are many risks that come with housing an internal data center, especially one in a closet with no ventilation or cooling. If you want cost savings and increased protection, it only makes sense to outsource your data center.
McDonald’s, we urge you to clean up your technology act! Otherwise, it is only a matter of time before something happens and you suffer for it!
Amazon Web Services had an outage last week that again got the media into a frenzy. There is some good analysis at Hacker News.
Rather than point fingers, I wanted to spend a minute writing about the particular failure mode that AWS experienced:
“Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit.”
The overall gist here is that a primary generator stopped due to overheating, and when the secondary generator had to carry the load, the breaker tripped, causing a total loss of power. This breaker trip, in particular, is what I want to go into more detail on.
From the time the power enters the data center facility until it gets to the server, it travels through a number of circuit breakers. Entering the facility, it likely goes through a large main breaker, much larger than anything you’d have in your home, something like this:
From there it may be distributed to subpanels, with progressively smaller breakers, until it gets to, say, a standard 20 or 30 amp breaker that connects to your equipment rack. These breakers all share one main purpose: protecting equipment and people. If too much electricity attempts to flow, the breaker is designed to trip. Plain and simple.
A big issue, however, comes with what’s known as “breaker coordination.” If you have a chain of circuit breakers, the idea is that you want the one closest to the “fault” to be the one that trips. If I accidentally overpower my computer rack, then we want the circuit breaker closest to my computer rack, the 20 or 30 amp breaker noted above, to trip. We most certainly DON’T want the main circuit breaker to the building to trip. In fact, it’s likely impossible that it would, since it’s rated for much higher currents than the 20 or 30 amp breaker in my cabinet; it would simply never trip, even though there was a problem.
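To make the coordination idea concrete, here is a toy sketch in Python. The ratings and trip delays are hypothetical, and real protective relays are far more sophisticated, but the principle is the same: the fault current flows through every breaker in series, and among the breakers whose rating it exceeds, the shortest delay should belong to the one closest to the fault.

```python
# Toy model of breaker coordination -- hypothetical ratings and delays,
# not a real protection study.

# (name, rating in amps, trip delay in seconds), closest-to-fault first.
chain = [
    ("30A rack breaker",       30, 0.1),
    ("800A subpanel breaker", 800, 0.5),
    ("4000A main breaker",   4000, 1.0),
]

def first_to_trip(fault_amps):
    """Among breakers whose rating the fault exceeds, the shortest delay opens first."""
    overloaded = [(name, delay) for name, rating, delay in chain if fault_amps > rating]
    return min(overloaded, key=lambda b: b[1])[0] if overloaded else None

print(first_to_trip(200))   # -> 30A rack breaker: only its rating is exceeded
print(first_to_trip(2000))  # -> 30A rack breaker again: shortest delay wins
```

Because the delays grow as you move upstream, even a large fault that exceeds several ratings at once still takes out only the breaker nearest the problem.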
However, these big circuit breakers have a number of custom settings that allow you to change the dynamics of their “trip curves.” That is, you can set values for instantaneous trip, long-term current trip values, and slopes of amps versus time for tripping. Why would you want to do this? For two reasons: one, so you can precisely ensure that the larger breakers don’t trip before the smaller ones do, and two, to ensure your current trip values make sense. For example, if you have a 4000A breaker but, due to the way your systems are designed, will only ever pull 3000A of current at most, then you can reduce the breaker’s trip setting down to 3200A so it trips at a more reasonable level. It’s the same reason you wouldn’t put a 100 amp breaker on a 20 amp wall outlet in your living room – you want the breaker size to be as close to the actual electrical load as possible.
In addition, there is a programmable instantaneous trip value, and it’s usually set much higher than the rating of the breaker, perhaps 2-3x higher. When transitioning between power systems (from grid power to generator power, for example), as the current turns off momentarily and then back on again, there is a quick inrush of current that may very rapidly exceed the normal steady-state value for a few moments. We need to make sure our breaker settings account for this and don’t prematurely trip the breaker.
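As a rough sketch of that transfer scenario (every number here is invented for illustration): with the instantaneous pickup set a couple of multiples above the rating, a brief inrush on re-energization passes through without a nuisance trip.

```python
# Hypothetical check of an instantaneous pickup against transfer inrush.
breaker_rating_amps = 3200                         # long-time pickup from the example above
instantaneous_pickup = 2.5 * breaker_rating_amps   # set 2-3x above rating -> 8000A

steady_state_amps = 2800                # assumed normal load on this circuit
inrush_amps = 2.0 * steady_state_amps   # assume ~2x draw on re-energization -> 5600A

if inrush_amps >= instantaneous_pickup:
    print("Nuisance trip: instantaneous pickup set too low for transfer inrush")
else:
    print(f"{inrush_amps:.0f}A inrush clears the {instantaneous_pickup:.0f}A pickup")
```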
In Amazon’s case, I don’t know exactly what happened, but for this example I will speculate. The timing from the RCA indicates that secondary generator power was provided at 8:53pm and the breaker tripped at 8:57pm, so we can likely conclude it was not an instantaneous trip; otherwise it would have happened within moments.
My best guess is that Amazon’s normal running setup relies on multiple power feeds, and potentially multiple generators when the main power feeds go out. In the normal scenario, the load to the end servers is balanced through a number of feeds. In this scenario, with the failures, the number of power paths to the server was reduced, which means that the amount of current through the remaining path(s) increased. In this case, it increased enough to cause an incorrectly configured breaker to trip too aggressively.
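A rough sketch of that failure mode, with made-up numbers: if N redundant feeds share the load and one drops out, the current on each survivor rises by a factor of N/(N-1), which is exactly the kind of step change a breaker with a too-low pickup would catch.

```python
# Made-up numbers showing how losing a feed raises current on the survivors.
total_load_amps = 2400
feeds = 2                     # assume two redundant power paths
misconfigured_pickup = 1400   # breaker mistakenly set to open at 1400A

for label, n in [("normal (2 feeds)", feeds), ("one feed lost", feeds - 1)]:
    amps = total_load_amps / n
    print(f"{label}: {amps:.0f}A per feed -> "
          f"{'TRIPS' if amps > misconfigured_pickup else 'holds'}")
# normal (2 feeds): 1200A per feed -> holds
# one feed lost: 2400A per feed -> TRIPS
```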
Could this have been caught earlier? Absolutely. When new electrical distribution systems are added to the fold, a re-coordination of the entire system should be done. More thorough testing methodologies probably could have caught the issue sooner, too. However, a test that caught this misconfiguration would likely have caught it by tripping the breaker, with much the same impact – which may be why Amazon doesn’t include it on their test schedule. In most colocation data centers, customers are notified during test windows in case something like this were to happen (and, in general, testing is designed to catch it), but since AWS doesn’t notify its end customers of its test windows, there’s no added customer value in testing some of these scenarios.
I think the biggest takeaway is that you need to ensure your data center provider has a solid grasp on what the testing schedule looks like, and the reasons why the tests are conducted. If that’s outsourced, or automated, it may be a sign that the operator doesn’t have the technical resources to understand the overall electrical system – and respond to issues quickly.
The number of Cisco and Microsoft certified guys on staff doesn’t matter a bit if nobody can troubleshoot why the power goes out.
That’s why we keep technically knowledgeable staff onsite to maintain and test our equipment. We don’t have to wait on someone else to troubleshoot for us. We are data center people; this is what we do, and it’s why we’ve experienced 100% uptime.
Data Cave is a privately owned and operated Tier IV Midwest data center located in Columbus, Indiana convenient to Indianapolis, Louisville and Cincinnati. Please contact us for more information at 866-514-2283.
No Raised Flooring: Back in the day, data centers began using raised flooring so they could install large cables under the floor to run to equipment. Then someone thought it was a good idea to use that space to run chilled water and under-floor air distribution for cooling server racks. These solutions are no longer necessary. We have found that you can get much better cooling without a raised floor. Our data suites feature overhead hot-aisle and cold-aisle containment. Our redundant chilled water pipes run OUTSIDE of the rooms containing customer equipment; there is no need for water piping to be in the same room as equipment in the first place. Cables are small and can be installed in overhead racks above the customer cabinets and around the data center. By doing away with raised floors, we achieve the cooling capacity we need, the facility stays a lot cleaner, and we don’t have to worry about water piping sharing space with customer equipment. Data center owners with raised floors also have to worry about buckling when tiles are removed, and raised flooring reduces a data center’s seismic rating, which could have catastrophic effects if an earthquake occurred.
Redundancy: We really do have two of every piece of critical equipment. Two utility lines, generators, cooling towers, water pumps, chillers, Uninterruptible Power Supply (UPS) systems, chilled water pipes, etc. We have two Power Distribution Units (PDUs) and Computer Room Air Conditioner (CRAC) units per suite. Our data center is split into four quadrants, each of which runs independently of the other. That’s like four independent data centers within one building. Even our bathrooms are redundant.
Uptime: Data Cave has experienced 100% uptime since we opened. Don’t get us wrong: we have lost utility power, but because we constantly monitor and test our critical (top-of-the-line) equipment and maintain a high level of redundancy, the switch to generator power, when necessary, is seamless. Our UPS flywheels run more than long enough for our 2MW generators to start, which takes about eight seconds.
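As a minimal sketch of the timing involved (the eight-second generator start comes from the paragraph above; the ride-through and transfer allowance figures are illustrative assumptions, not our published specs):

```python
# Illustrative ride-through check: the flywheel UPS must carry the load
# until the generator starts and can accept it.
generator_start_s = 8          # generator start time noted above
transfer_margin_s = 2          # assumed allowance for switchgear transfer
flywheel_ride_through_s = 20   # assumed flywheel runtime at full load

required = generator_start_s + transfer_margin_s
ok = flywheel_ride_through_s >= required
print(f"Need {required}s of ride-through, have {flywheel_ride_through_s}s -> "
      f"{'OK' if ok else 'AT RISK'}")
```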
Staff: We have an amazingly knowledgeable staff that includes programmers, technicians, electricians, plumbers, engineers, and more, who are able to handle facility issues without relying on outside contractors. This is an important point, because a lot of data centers don’t have this technical base to pull from when an issue arises. We don’t have to wait for a contractor to arrive; we can start working on a problem as soon as it happens. Most of our staff have been here since the planning stages of Data Cave. Working at Data Cave isn’t just a job: a lot of what we do in our jobs, we also do as hobbies. We love what we do, and it shows.
Come see for yourself the advantages of choosing Data Cave. Call us at 866.514.2283.
There’s something you might not know about Data Cave. Everything we distribute, from handouts to our website, says that we provide 99.999% uptime. However, that simply isn’t true. In reality, we can boast 100% uptime. This means that our facility has weathered storms and power outages with NO downtime. We took on our first client in early 2010, which amounts to nearly two years of uninterrupted power for our clients.
How are we able to maintain 100% uptime?