A day in the life of a DC manager, part 3
March 21, 2011 by Caleb Tennis
In the first part of this series we took a look at the HVAC and chilling systems, and in the second part we looked at the incoming electrical systems. In this third and final part, we’ll look at all of the other things that go on that we observe daily.
Network Access
Of course network access is a critical part of our infrastructure. Each customer within Data Cave has differing needs for connectivity – some need internet, some need redundant internet. Some customers have private point-to-point connections across either lit or dark fiber. We provide access to what the customer needs, and monitor everything along the way 24/7.
Perimeter Security
One important part of Data Cave is the fence and gate system that keeps us isolated from foot and car traffic. Every day we spot check the fence and gate and ensure no breaches have occurred. In addition, we use a number of video cameras around the outside of the building to spot motion and flag areas that might be of concern.
Our fence also has a major grounding system in it to help mitigate lightning strikes, and as part of the daily check we’ll inspect this grounding system to make sure it’s still in place and no lines have been cut.
Door Access System
Data Cave has over 80 doors in the building. Access into different parts of the facility is controlled by a door access system, granting both entry and exit to authorized personnel. Also integral to this is a biometric hand scanning system, to doubly ensure the correct person is entering the critical areas. One benefit of this system is that we know who is in the building, and in which places, at all times. This system is monitored by our NOC 24/7.
Weather
We are always keeping our eye on the weather. If a storm system is moving in, we’ll stay alert to the possibility of a power outage until everything passes. Besides monitoring radar and satellite images, we also have an emergency weather radio in our NOC to receive updates.
-
What this three-part series highlights is simply this: we are working hard every day to ensure the best protection for your IT equipment. Contact us today to see just how our impressive infrastructure can work for you.
A day in the life of a DC manager, part 2
March 7, 2011 by Caleb Tennis
Following on from part 1, today we’ll look at the electrical systems.
Incoming Power
The incoming power feeds that service Data Cave run through large 4000A breakers before distributing out through subfeed breakers to the equipment in the facility. At this stage the power is monitored for quality, harmonics, and proper voltage. We have status lights on the gear itself, as well as panel readouts. Each breaker in the switchgear has been individually tuned to support its own load, and the health of each breaker is watched.
Of course, this is all monitored remotely from our NOC as well.
Flywheel UPS
Two feeds of power go to the flywheel UPS systems. These systems condition the power, ensuring that the end customer IT load always sees available, clean, redundant power. These units have system status displays on them that we spot check daily, and the rotational speed (7,700 rpm) and DC bus voltages of the flywheels are monitored remotely in our NOC.
Power Distribution Units
Each data suite in the facility has separate power distribution units that take the power from the flywheel UPS systems, step down the voltage, and then provide the individual breaker feeds to the customer cabinets. We spot check these systems daily, and the health of each PDU is monitored from our NOC.
Generators
Each of our generators is spot checked daily, and more thoroughly checked weekly, to ensure it is ready to start at a moment’s notice. Remotely, our NOC monitors for any fault conditions that the Engine Control Module on the generator may have encountered, and we also monitor important standby conditions such as coolant temperature.
Stay tuned for part 3.
A day in the life of a DC manager, HVAC
February 21, 2011 by Caleb Tennis
ThePlanet has a blog post from a couple of years ago from their DC Manager highlighting some of the daily things they go through in keeping their data centers up and running. I’ll highlight some similar bits of information about Data Cave in the next couple of posts.
Air Conditioning
Each data suite in our center has multiple air conditioning units (CRACs). We monitor these units remotely from our Network Operations Center (NOC) and keep an eye on the temperature and humidity levels in the rooms. We keep historical data for trending analysis. We also will spot check the units daily, verify they are working properly, and ensure the screen readouts agree with our remote monitoring.
Water Chillers
The main part of our chilled water system is the set of centrifugal chillers that create the chilled water. These units run mostly autonomously, but we still spot check them daily for things like oil level and the level of refrigerant in the system. We also remotely monitor them in our NOC to ensure that no faults have occurred, and that the water temperatures and flows stay within bounds.
Process Water
The chillers make cold water, but they do so by rejecting heat into a separate water system, known as our process water. This separate water loop is also computer controlled via a system of pumps and cooling tower fans where it is taken outside and water is evaporated to reduce the temperature again. Makeup water is also brought in through wells located around the building, and is purified by reverse osmosis and softening systems. Again, the whole system is computer controlled and remotely monitored, and we spot check it daily.
Because the process water is warm, it is a breeding ground for bacteria. Thus, we have to periodically add chemicals to it to keep bacteria from growing and to prevent the water from corroding the steel and copper tubing throughout the cooling system. This treatment process is done in house, and is monitored weekly.
Stay tuned for part #2, where I’ll talk about our electrical systems.
Raid 5 is dead
February 17, 2011 by Caleb Tennis
A few weeks ago I posted an opinion that tape backups are dead – and that generated some feedback telling me I was plain wrong. For better or worse, I’m sticking with another “is dead” mantra: RAID (particularly, RAID 5).
Now, in all reality RAID5 isn’t dead. But you shouldn’t be using it. The meat of why is here, in an old piece at ZDNet. Now, let’s dive into why.
The main concept behind RAID5 is that in a disk set, one disk’s worth of capacity is used for storing parity information. The parity information is actually striped across all of the disks, not kept on a single disk. The main idea behind this is that any one disk in the set can fail, and the set can continue on. Once the failure is noticed, a spare disk can be brought into the set (usually automatically by modern SAN devices) and the contents of the failed disk are rebuilt onto it from the surviving data and parity.
The problem is that this rebuild takes time. A lot of time, for today’s modern disks. And disk failure rates are fairly high to begin with. So, statistically, there’s a meaningful likelihood of a second disk failing while the first disk is still being rebuilt. And if that happens, you are in for a really bad day. Wikipedia says it best:
As the number of disks in a RAID 5 group increases, the mean time between failures (MTBF, the reciprocal of the failure rate) can become lower than that of a single disk. This happens when the likelihood of a second disk’s failing out of N − 1 dependent disks, within the time it takes to detect, replace and recreate a first failed disk, becomes larger than the likelihood of a single disk’s failing.
Basically, the more disks there are in a RAID5 set, the more likely it is that a second disk fails before the first one has been rebuilt, and past a certain size the set as a whole becomes less reliable than a single disk.
Of course, RAID6 is an alternative to RAID5, with an additional disk used for parity storage, so that a two disk failure can be handled. But the same limitations exist as with RAID5: at some point it just won’t be reliable anymore. ZDNet even follows up. The problem still lies in that with increased disk sets and parity striping information, any failure takes a really long time to rebuild from, and that’s when things are most vulnerable.
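To put rough numbers on that rebuild window, here is a minimal Python sketch. It assumes disks fail independently at a constant rate (disks sharing an enclosure and a manufacturing batch often do not), and the 3% annual failure rate and 24 hour rebuild time are illustrative assumptions rather than measurements. The function name is hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365


def second_failure_probability(disks, annual_failure_rate, rebuild_hours):
    """Chance that at least one surviving disk fails before the rebuild ends."""
    # Turn the annual failure rate into a constant per-hour hazard rate.
    hourly_rate = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    # One disk has already failed, so disks - 1 are still at risk.
    survivors = disks - 1
    return 1.0 - math.exp(-survivors * hourly_rate * rebuild_hours)


# Assumed inputs: 3% annual failure rate, 24 hour rebuild.
for n in (4, 8, 16):
    p = second_failure_probability(n, annual_failure_rate=0.03, rebuild_hours=24)
    print(f"{n:2d} disks: {p:.4%} chance of a second failure during rebuild")
```

The absolute numbers depend entirely on the assumed inputs. What the sketch shows is the trend: the risk grows with both the number of disks and the length of the rebuild.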
What are the solutions? Well, for one, you could store the data on multiple RAIDsets – perhaps in completely different SAN units. This adds significantly more storage, but makes reliability much higher. You could just back everything up to tape (kidding!). Or start using a more reliable data store on top of the drives, like ZFS.
There are a lot of options. What are you doing to mitigate data loss?
Strata Conference Roundup
February 10, 2011 by Caleb Tennis
Last week I attended the inaugural Strata Conference, put on by O’Reilly.
This conference is about one thing: big data. The idea is that we are acquiring vast amounts of data, and new tools are needed to harness, understand, and visualize this vast amount of data.
The conference talks were given by people in various data industries, and highlighted some of their approaches to dealing with vast data sets.
While it kicked off a bit slowly, there were a few highlights and things I took away from it.
First, big data isn’t just a problem in the science industry. Every industry has this problem. One of the most obvious ones, and one that was represented in a few different presentations, was healthcare. Beyond just medical records, things like clinical trial data, drug interaction information, and all data points in between are being saved away – and tools are needed to help interpret all of this information. In fact, at the conference the Heritage Health Prize was announced – a $3 million reward for interpreting health care data to come up with an algorithm to help predict and prevent unnecessary hospitalizations.
Journalism areas were also presented. I attended a really nice talk by a gentleman from Forbes all about mining campaign contribution information for the world’s billionaires. I wasn’t so much interested in the results of the study as in the process by which they acquired and analyzed this data.
The best technical talk of the conference was by Ben Black of fastip. They’re collecting an enormous amount of real-time network flow data, and have to be able to not only acquire that data, and thus save it into the database, but also give back real time flow reports and statistics. His talk was about the ways that they failed at building their system a number of times before finally hitting on the right path. The sophistication of their database (a large amount of custom code built on top of the Cassandra database) is amazing.
The conference was well attended, and the interest is high. There were a few rough spots, but I think that with the feedback they received it will tighten up and be a better event next time. If there’s one main takeaway, it’s that big data is real, it’s a growing field (the number of job openings posted for data scientists was staggering), and to compete in the future, businesses are going to have to start being very smart about their data intelligence.
Contact us if you would like to learn more about our data center and how we can store all of that big data in a reliable, affordable way, for your business.
Tape backups go bye-bye
February 7, 2011 by Caleb Tennis
Our friends at Data Center Knowledge have a great article about how hard drives are replacing tapes for backup purposes.
This is something we’ve seen and experienced for a while now. The overall cost of a hard drive based system is on par with, if not cheaper than, a comparable tape system. But factor in “other” costs and it really puts the hard drive system over the top.
First, one should not underestimate the human cost. Someone has to rotate the tapes, and replace bad ones in the deck over time. This requires a person to constantly monitor the system, something which usually languishes over time.
Second, it’s hard to know if the tape has gone bad until you need it. While modern tape decks will issue error reports, it’s not always obvious that data was not on a tape until you go to restore from it – and it doesn’t work. And obviously, that’s the time it’s needed most.
Third, you simply never know when a vendor is going to stop supporting your tape hardware – something happening more and more often. With a SAN based system, it’s easier to migrate data between vendors over time. And hard drives are much faster at data transfer than tape based systems.
As the article points out, it’s sometimes even cheaper to buy a direct Gig-E link between your primary facility and your off-site data center and do a backup over-the-wire than use a tape system and rotate tapes offsite manually.
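Whether an over-the-wire backup is practical comes down to simple arithmetic on the link speed. The sketch below is a back-of-the-envelope estimate only: the 2 TB nightly backup size and the 70% usable share of the link are assumptions for illustration, and the function name is hypothetical.

```python
def backup_hours(size_bytes, link_bps, efficiency=0.7):
    """Estimated hours to move a backup set across a dedicated link.

    efficiency is the assumed share of the raw link rate left over
    after protocol overhead and other traffic.
    """
    usable_bps = link_bps * efficiency
    seconds = (size_bytes * 8) / usable_bps
    return seconds / 3600


GIG_E = 1_000_000_000  # 1 Gbit/s link, in bits per second

# Assumed example: a 2 TB nightly backup set.
hours = backup_hours(2_000_000_000_000, GIG_E)
print(f"2 TB over Gig-E: about {hours:.1f} hours")
```

If the estimate fits inside the nightly backup window, the link can stand in for tape rotation. Incremental backups shrink the number further.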
Interested in this approach? We’ve implemented it for multiple customers, and can help implement it for you as well – saving you money each month on your current tape based system. Contact us to learn more.
Stuxnet: The new type of virus
January 27, 2011 by Caleb Tennis
The past few months have seen a lot of discussion about the Stuxnet virus. The virus targets unprotected Siemens PLCs, but the main focus of the media attention has been more on the political side – that someone (the US govt?) has been targeting Iranian nuclear facilities to try and disrupt them.
That side of the story is interesting, but what I find more interesting isn’t the political part, but the underlying story: that viruses are now being made to target things other than PCs. While that’s not new news, it is something to think about.
Every modern industrial building, including Data Cave’s data center, has a number of building automation pieces that all work together to keep things ticking. For example, our chillers have small onboard computers that keep the units running – with external interfaces that we use to talk to them. These aren’t desktop computers, and they aren’t running Windows, but they are running software and someone could potentially target them with a virus.
What’s interesting about Stuxnet is it targets PLCs, Programmable Logic Controllers, which are very common products in the industry. In my experience, much of this equipment comes with factory default login information, things like “admin/password”. Very insecure. Historically, these products were quite standalone and required someone to physically connect to them to program them. But over time, these devices have been integrated into modern networks for convenience, without much thought to the attack vector this opens.
What this means is that all throughout the industrial world, important logic computers are running very advanced building automation systems, attached to an unprotected network, with default login credentials that have never been changed.
The virus/malware authors recognize this as a new method of attack and are now beginning to take advantage of it. This is something we’ve been thinking about for a long time, and we’ve gone to great lengths to ensure our data center is protected from these types of attacks – by keeping certain critical infrastructure off-net – or using secure gateways to contain that infrastructure which needs to be networked.
We’d love to share more about our facility and design. Contact us for more information.
An engineer’s guide to bandwidth
January 24, 2011 by Caleb Tennis
As a data center, we sell and resell bandwidth to customers in a number of different fashions. This can be anything from a small amount of “blended” bandwidth, a large amount of direct bulk bandwidth, or just some kind of point to point connection linking sites together. There are lots of options available.
But there are a lot of customers who don’t have a good fundamental understanding of exactly what bandwidth is, and there is a great Yahoo article with a good overview of it. This post isn’t all that basic, but it does a good job of trying to explain, particularly to someone who is a developer or administrator, exactly what “bandwidth” means and how it affects the end user application.
Particularly interesting is how it tries to convey the difference between bandwidth and latency. It’s a tough concept to think about until you sit right down and analyze it.
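One way to sit down and analyze it is with a deliberately simplified model: the time to fetch something is one round trip of latency plus the time to push the bits through the link. The sketch below ignores TCP slow start, packet loss, and connection setup, and the sizes, speeds, and 80 ms round trip are assumptions picked for illustration.

```python
def transfer_seconds(size_bytes, bandwidth_bps, rtt_seconds):
    """One round trip to request the data, plus the time to send the bits."""
    return rtt_seconds + (size_bytes * 8) / bandwidth_bps


RTT = 0.080  # assumed 80 ms round trip between client and server

for label, size in (("10 KB page", 10_000), ("1 GB file", 1_000_000_000)):
    slow = transfer_seconds(size, 100_000_000, RTT)    # 100 Mbit/s link
    fast = transfer_seconds(size, 1_000_000_000, RTT)  # 1 Gbit/s link
    print(f"{label}: {slow:.4f} s at 100 Mbit/s, {fast:.4f} s at 1 Gbit/s")
```

For the small page, ten times the bandwidth changes almost nothing, because latency dominates. For the large file, bandwidth is nearly the whole story.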
Got bandwidth woes? We can help. Contact us to learn more about what kind of bandwidth we can provide out of our Tier IV midwest data center.
The issue of data sovereignty
January 20, 2011 by Caleb Tennis
As the concept of the cloud, especially the data cloud, grows, the physical location of the storage is supposed to be unimportant. The whole notion of the cloud is to “trust your data to us, and don’t worry about the details”.
But, laws around data retention and storage may not see it that way. A recent article out of New Zealand talks about data sovereignty issues surrounding rules by the tax authority:
It is the Commissioner’s view that only business records stored in data centres physically located in New Zealand will comply with the record keeping obligations in the Inland Revenue Acts. Taxpayers are responsible for ensuring they comply with their record keeping obligations. Therefore, taxpayers using a cloud computing service will need to be satisfied that all their business records will be stored in data centres located in New Zealand
While this particular issue is probably going to be an easy one to address, it does bring up a number of questions. The most interesting one I’ve been thinking about lately: When an entity owns data, but that data is located outside of the country, what laws are that data subject to?
It’s a loaded question, and not an easy one to answer, because laws are subject to change at any time. Microsoft’s stance with Azure is to let users choose what data centers to keep their data in. But that also breaks the more abstract premise of the cloud.
What are your thoughts? Does the end location of the data matter? Would you prefer to keep business data out of certain countries?
The fallacies of distributed computing
January 13, 2011 by Caleb Tennis
NewRelic has an interesting new take on an old classic from Sun, “The fallacies of distributed computing”, a list of 8 assumptions that designers often take for granted when designing distributed computing systems. The list is very succinct and to the point, including gems like “the network is reliable” and “there is one administrator”.
Many applications that are being designed today for the cloud take advantage of …

