Recently, I had the opportunity to attend the Open Compute Summit in Santa Clara, CA; I believe this was the third such summit in the past two years. I also attended the one in New York in 2011, which drew about 300 people. This summit blew away everyone’s expectations with around 1,900 participants.
There was a definite interest in what the Open Compute Project (OCP) is pushing. The first summit was very Facebook centric; this summit was about vendors taking the ideas Facebook has been pushing and moving the mainstream forward. While I still think we are in the early days of the value of Open Compute and its offerings, the interest is extremely high and vendors are definitely taking notice.
The one major announcement that caught my attention was Intel’s release of a concept they are calling silicon photonics. The idea is that they have created a light-based data bus using their traditional silicon wafer and die plant systems. This allows them to build a traditional silicon chip that acts as a light source at a fraction of the cost of older methods. This means they can create a backplane data bus, currently at 100Gbps, that provides the interconnect between devices on the motherboard, and present it as either a PCI or an Ethernet device to the computer.
I got to see the device and concept firsthand, but the initial information I was able to glean was mostly marketing. I’m keeping an eye on this space, as I think if it comes to fruition it will really push forward motherboard design and the ability for multiple devices to talk to each other at extremely high speeds. In particular, I think about a rack of machines providing big data services, like Hadoop, no longer worrying about data locality, since data can be moved between machines so fast it’s simply not an issue.
One of the first major announcements of Open Compute a few years ago was a rethought motherboard and rack design that threw away many traditional 19″ rack concepts. This design pushed the envelope, but it was really only practical for data centers that had been designed to take advantage of it. Since at that time only Facebook had such a data center, the design itself wasn’t particularly useful other than to show what kinds of advances could be made if you had the opportunity to rethink from the ground up.
At this summit, it was evident the manufacturers are motivated to make it work. There were at least half a dozen vendors there making Open Compute rack and motherboard designs. It still isn’t clear to me who the buyers for those products are, other than Facebook and now Rackspace, but now that the vendors are taking charge of these designs, I think we’ll start seeing bits and pieces incorporated into mainstream motherboard and chassis designs over the coming years, which will only be a good thing for all of us.
Both shows were expertly run, and the presentations were very interesting. I’m always intrigued to learn about the way Facebook operates their data centers at scale and how they tackle big data problems, and the presentations by their engineers were excellent. Getting to see how seriously the vendors are taking the project and its feedback is, frankly, refreshing. I look forward to seeing these technology advances continue to be incorporated into mainstream server designs, and their utility grow for the “rest of us.”
We all know that backups are important. But beyond just having backups, having a comprehensive validation and restoration strategy is paramount as well.
Case in point: Toy Story 2, which was accidentally deleted through an errant command by an administrator. Beyond that, the admins then found that the backups were bad. Think about that for a second: they went to their backups and found out they were worthless. Nobody had thought to check that stuff ahead of time. Only when they really needed the backups did they realize they weren’t going to work.
Luckily, someone had archived a backup copy on a computer at home, which ultimately kept the movie from being lost forever. But the lesson is clear: the Popeil “set it and forget it” method isn’t sufficient for computer data backups.
Periodic restoration and checking of backups is essential to make sure you know you’ll have what you need when you need it.
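A restore check like this is easy to automate. Below is a minimal Python sketch (the file names and the copy-based “restore” step are stand-ins for whatever backup tool you actually use) that restores a backup into scratch space and compares checksums against the original:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    """Hash a file so the original and the restored copy can be compared."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(original: Path, backup: Path) -> bool:
    """Restore the backup to a scratch directory and compare checksums."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / original.name
        shutil.copy2(backup, restored)  # stand-in for a real restore step
        return sha256(restored) == sha256(original)

# Demo: a backup that silently went bad fails the check.
with tempfile.TemporaryDirectory() as d:
    src = Path(d) / "important_data.db"
    src.write_bytes(b"the data you cannot lose")
    backup = Path(d) / "backup.db"
    shutil.copy2(src, backup)
    print(verify_restore(src, backup))  # True: backup is restorable
    backup.write_bytes(b"corrupted")    # simulate silent corruption
    print(verify_restore(src, backup))  # False: caught before you need it
```

Run on a schedule, a check like this turns “we have backups” into “we have backups we know we can restore.”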
Data Cave is a privately owned and operated fully redundant Midwest data center located in Columbus, Indiana convenient to Indianapolis, Louisville and Cincinnati. Please contact us for more information at 866-514-2283.
Amazon Web Services had an outage last week that again got the media into a frenzy. There is some good analysis at Hacker News.
Rather than point fingers, I wanted to spend a minute writing about the particular failure mode that AWS experienced. From their root cause analysis:
Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit.
The overall gist here is that a primary generator stopped due to overheating, and when the secondary generator had to carry the load, the breaker tripped, causing a total loss of power. This, in particular, is what I want to go into more detail on.
From the time the power enters the data center facility until it gets to the server, it travels through a number of circuit breakers. Entering the facility it likely goes through a large main breaker, much larger than anything you’d have in your home, something like this:
From there it may be distributed to subpanels, with progressively smaller breakers, until it gets to, say, a standard 20 or 30 amp breaker that connects to your equipment rack. These breakers each have one main purpose: protecting equipment and people. If too much electricity attempts to flow, the breaker is designed to trip. Plain and simple.
A big issue, however, comes with what’s known as “breaker coordination.” If you have a chain of circuit breakers, you want the one closest to the “fault” to be the one that trips. If I accidentally overload my computer rack, we want the circuit breaker closest to the rack, the 20 or 30 amp breaker noted above, to trip. We most certainly DON’T want the main circuit breaker for the building to trip. In fact, it likely couldn’t, since it’s rated for much higher current than the 20 or 30 amp breaker in my cabinet, so it would never trip even though there was a problem.
However, these big circuit breakers have a number of custom settings that allow you to change the dynamics of their “trip curves.” That is, you can set values for instantaneous trip, long-term current trip values, and slopes of amps versus time for tripping. Why would you want to do this? For two reasons: one, so you can precisely ensure that the larger breakers don’t trip before the smaller ones do, and two, to ensure your current trip values make sense. For example, if you have a 4000A breaker but your systems are designed so you can only ever pull 3000A of current, you can reduce the capacity of the breaker down to 3200A so it trips at a more reasonable level. It’s the same reason you wouldn’t put a 100 amp breaker on a 20 amp wall outlet in your living room: you want the breaker size to be as close to the actual electrical load as possible.
In addition, there is a programmable instantaneous trip value, and it’s usually set much higher than the rating of the breaker, perhaps 2-3x higher. When transitioning between power systems (from grid power to generator power, for example), as the current turns off momentarily and then back on again, there is a quick inrush of current that may very rapidly exceed the normal steady-state value for a few moments. We need to make sure our breaker settings account for this and don’t prematurely trip the breaker.
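To make the coordination idea concrete, here is a toy Python model. The curve shape and constants are made up purely for illustration (real breakers publish manufacturer trip tables), but they capture the principle: the further the current exceeds a breaker’s pickup setting, the faster it opens, so a fault on the rack circuit clears at the small branch breaker long before the big main even registers it.

```python
def trip_time(current_a: float, pickup_a: float, k: float = 60.0) -> float:
    """Illustrative inverse-time trip curve (not a real manufacturer model):
    below the pickup current the breaker never trips; above it, trip time
    falls off rapidly as the overload grows."""
    if current_a <= pickup_a:
        return float("inf")  # below pickup: holds indefinitely
    return k / ((current_a / pickup_a - 1) ** 2)

# Coordination check: a 300 A fault on the rack circuit flows through both
# the 30 A branch breaker and the 3200 A main upstream of it.
fault = 300.0
branch = trip_time(fault, pickup_a=30.0)    # trips in well under a second
main = trip_time(fault, pickup_a=3200.0)    # never trips: 300 A < 3200 A
print(branch < main)  # True: the breaker nearest the fault clears it first
```

A coordination study is essentially this comparison carried out across every breaker pair and every plausible fault current, using the real curves.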
In Amazon’s case, I don’t know exactly what happened, but for this example I will speculate. The timing from the RCA indicates that secondary generator power was provided at 8:53pm, and at 8:57pm, the breaker tripped, so we can likely conclude it was not an instantaneous trip otherwise it would have happened within moments.
My best guess is that Amazon’s normal running setup relies on multiple power feeds, and potentially multiple generators when the main power feeds go out. In the normal scenario, the load to the end servers is balanced through a number of feeds. In this scenario, with the failures, the number of power paths to the server was reduced, which means that the amount of current through the remaining path(s) increased. In this case, it increased enough to cause an incorrectly configured breaker to trip too aggressively.
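The arithmetic behind that speculation is simple. With hypothetical numbers (none of these figures come from Amazon’s RCA):

```python
# Hypothetical: a fixed total IT load split evenly across active power paths.
TOTAL_LOAD_A = 3000.0

def per_path_current(active_paths: int) -> float:
    """Current carried by each remaining path as paths drop out."""
    return TOTAL_LOAD_A / active_paths

normal = per_path_current(3)    # 1000 A per feed with all paths up
degraded = per_path_current(2)  # 1500 A per feed after one path fails

# A breaker mis-set to open at 1200 A holds fine in normal operation,
# but trips exactly when the load concentrates onto fewer paths.
mis_set_pickup = 1200.0
print(normal < mis_set_pickup)    # True: survives day-to-day load
print(degraded < mis_set_pickup)  # False: trips during the failover
```

The failure hides in plain sight: the mis-set breaker looks healthy under normal load and only reveals itself in the degraded scenario it was supposed to carry.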
Could this have been caught earlier? Absolutely. When new electrical distribution systems are added, a re-coordination of the entire system should be done. More thorough testing methodologies probably could have caught the issue sooner too, although such a test may well have tripped the breaker and caused the same outage, which may be why Amazon doesn’t include it on their test schedule. In most colocation data centers, customers are notified during test windows in case something like this happens (which testing is, in general, designed to catch), but since AWS doesn’t notify its end customers of its test windows, there’s no added customer value in testing some of these scenarios.
I think the biggest takeaway is that you need to ensure your data center provider has a solid grasp on what the testing schedule looks like, and the reasons why the tests are conducted. If that’s outsourced, or automated, it may be a sign that the operator doesn’t have the technical resources to understand the overall electrical system – and respond to issues quickly.
The number of Cisco and Microsoft certified guys on staff doesn’t matter a bit if nobody can troubleshoot why the power goes out.
That’s why we keep technically knowledgeable staff onsite to maintain and test our equipment. We don’t have to wait on someone to troubleshoot for us. We are data center people, this is what we do and why we’ve experienced 100% uptime.
I expect to see more cloud security articles in the near future, and this one does a good job highlighting exactly the kinds of things that end users need to worry about.
Most IaaS cloud vendors rely on best practices and security baked into their software to protect their end clients, but with certain internally used bits of source code, as was the case with VPS.NET and Rackspace, the potential for security errors creeps in.
Case in point: the issue here was with regard to reusing hard disk space after it is relinquished by a customer. In the IaaS world, resources are reused once they are terminated by a user. This reuse includes hard disk space for file storage; the data previously stored on the drive by the old customer is supposed to be deleted before the space is presented as blank to another customer.
However, as these researchers found, a method for thoroughly destroying the data was not consistently employed, and bits and pieces of file data survived between customers. This means that data stored on disk by one customer could have been leaked to the next, if that customer had known where to look.
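Conceptually, the flaw is the difference between handing storage to the next tenant as-is and overwriting it first. Here is a toy Python sketch, with a bytearray standing in for a block device (real sanitization happens at the hypervisor or storage layer, and some policies call for more than a single overwrite pass):

```python
BLOCK = 4096  # pretend block size for the toy "device"

def release_naive(dev: bytearray) -> bytearray:
    """Hand the 'device' to the next tenant without scrubbing it."""
    return dev

def release_scrubbed(dev: bytearray) -> bytearray:
    """Overwrite every byte before reuse (a single zero pass here,
    purely for illustration)."""
    for i in range(len(dev)):
        dev[i] = 0
    return dev

# The departing customer's data, padded to fill the block.
disk = bytearray(b"customer-secret-key=hunter2".ljust(BLOCK, b"\x00"))

print(b"hunter2" in release_naive(bytearray(disk)))     # True: data leaks
print(b"hunter2" in release_scrubbed(bytearray(disk)))  # False: scrubbed
```

The naive path is what the researchers effectively observed: the space looks blank through the filesystem, but the bytes underneath are still readable to anyone who inspects the raw device.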
This is very bad, but not surprising. Low level details like this are easy to overlook, and until someone finds them out and points it out to a provider, they may be blissfully unaware of its existence.
This problem isn’t unique to these two vendors. Folks like Amazon, who have a pretty good history of security, are just as vulnerable; perhaps not on this specific issue, but on others. In fact, one of the biggest vulnerabilities is the ever-changing code behind the scenes that the end user is not aware of. Amazon in particular is constantly adding new features, releasing new internal versions of their cloud software, and deploying it strategically, all out of view of the end user. Because of that constantly changing code, new vectors for exploitation are always a possibility. Even security holes that didn’t exist a few days ago could open up as a result of a misstep by an Amazon engineer.
The bottom line is cloud security must be taken very seriously. While there is a certain level of trust that must be accepted, going the extra mile with internal security practices to ensure that data can’t be leaked is a prudent decision for any organization utilizing the cloud.
I recently read a short post by another data center that discussed the need to think about eminent domain and how it may impact your data center. The idea was that if the data center facility is located near railroad tracks, highways, or other certain utilities there may be a risk of seizure of property by the authoritative body in order to expand those services.
Yes, this probably does happen, but I would imagine fairly infrequently. Plus, the governing body would have to pay a reasonable sum for the property, and that isn’t going to be a small amount of money for a data center facility. I think that, in general, data centers are safe.
However, it does bring up another good topic: ownership of the building. You see, a nasty little secret of almost every colocation data center out there is that they don’t own the building they occupy. Our informal and unscientific calculations say at least 80% of colocation data center operators do not own their own facilities.
What this means to you, as an end customer, is that you’re at the mercy and whim of a landlord on top of your provider. While many of the facilities are on triple net leases, it’s not uncommon for the landlord to drastically raise rent when the lease comes up – and that cost gets directly passed on to the end customer.
As well, the tenant is always at risk if the landlord wishes to change ownership, or repurposes other parts of the building.
I say eminent domain is not an issue in the data center world; non-tenant owned facilities present a much larger threat to an end client.
Here we are at the beginning of March. This winter has been the most mild I can recall in Indiana history. I didn’t even shovel snow once this season.
On Feb 29th, the high temperature at Data Cave was 66 degrees. That warm spell was kicked off with two lines of thunderstorms that came during a 12 hour period, which created tornado and thunderstorm warnings in Southern Indiana and Northwestern Kentucky.
Three days later, on Fri Mar 2nd, most of Indiana is under a severe weather alert. Northern Kentucky, including Louisville, is a 6/6 on the tornado probability scale. There are reports that a tornado touched down in Southern Indiana near Henryville as well. The red patches on the radar overwhelm everything else.
If there’s one thing that’s evident about midwest weather, it’s that the thunderstorms are becoming more and more violent. Today’s line was the first of the year, and a reminder and wakeup call for just how powerful mother nature can be. Storms are predicted again Tuesday of next week. It’s going to be a busy season.
With all of this in mind, it’s the perfect time to be thinking: is your data center prepared? Any halfway decent facility will be prepared for power outages. But is yours protected from lightning strikes? Transient surges? Tornados?
What about your office building? Are your records properly backed up off site? Is that facility far enough away to be safe if something major happens to your primary site?
These are the questions you should be asking, and reasking, at least every year. The beginning of storm season is the perfect time to revisit it.
Let us help you build a solid disaster recovery plan.
With all of the confusion around the cloud, it’s no wonder vendors take advantage of every opportunity to remind us, non-stop, about their cloud offerings.
Why, did you know Dell was a big supporter of OpenStack? Yep, and they have the open source backing to show for it. Dell shows how they can make a team look like a scrappy startup.
Have no fear though, they also have a vCloud offering. Who would want to miss out on the enterprise space?
But wait, they also preship servers with Eucalyptus on them.
I think it’s clear that… it’s not clear. The cloud concept has muddied the water so much and created a cone of confusion for so many people. All of these vendors want to be everything to everybody, so they’re betting across the board in hopes that one bet pays off.
I don’t necessarily chastise this approach, but I think it doesn’t help customers when a vendor offers so many solutions and says that all of them are the right approach. Sometimes you have to pick a side; sadly, the cloud space seems riddled with those who don’t.
I’m interested to see how this plays out over the next year; the cloud camp seems to be dividing between two sides: the enterprise high end space, and the commodity low end space.
Personally I think the commodity space is the better play. Let’s see if I’m right.
Yesterday afternoon and evening was a rocky day for those of us in the midwest as a powerful line of storms rolled through, in multiple waves.
The picture to the right highlights just a small bit of the destruction from the first wave. This property is due south of Data Cave, about 200 yards from the building. You can see the lean of the trees on the left, and the tree in the middle missing a substantial limb. That limb was torn off, hit the power line you can see leaning, and eventually ended up across the street.
What isn’t shown is the house just out of picture on the left, which sustained bad roof damage and major structural damage to a barn next to it. The picture below is of that house, as taken by our local newspaper.
These types of events are always scary. Luckily, we’ve designed our systems to withstand them as best as possible. In this case, the major concern we were facing was wind. The lightning was an issue as well, but the wind damage was much more of a factor.
When this type of event occurs, there’s simply nothing you can do to stop it. Taking cover is your only option.
Our facility is designed to withstand an EF5 tornado. That’s over 200mph. If a tornado comes through and hits your data center, and the equipment is blown completely away, what’s your contingency plan? You don’t want to take those chances, do you?
It seems like the only stories that make the headlines are the ones with the most chilling titles. Unless you were looking, you may never have even known about rogue SSL certificates making their way into the wild.
The story goes like this: Certificate vendor Comodo issued SSL certificates for popular domains to end users/companies that didn’t actually control those domains. The whole concept of SSL certificates is based on trust, where everything leads back to a trusted authority to validate that the recipient of the certificates is actually who they say they are. If that’s broken, lots of bad things can happen.
What this means is that someone who wasn’t Google was able to obtain a certificate for mail.google.com. If they put up this certificate and somehow hijacked your browser to go to their site, thinking it was the real mail.google.com, you would never have known because the browser would not have put up a warning. That is, you would have been talking to the other end thinking it was Google, even though it wasn’t, and would have not had any good way of knowing.
This is scary. Since certificates were designed to validate that the end party is who they claim to be, the whole mechanism relies on the certificate vendor (or certificate “authority” as it’s called) to ensure that certificates only go to those who are valid.
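A toy model of that trust decision shows why a mis-issued certificate is so dangerous. The names and structures below are illustrative, not a real PKI API: the point is that once a trusted authority signs a name it shouldn’t have, the client-side check passes either way.

```python
# The client trusts a fixed set of certificate authorities; a certificate
# is accepted if (a) a trusted CA vouches for it and (b) it names the
# host being visited. All identifiers here are made up for illustration.
TRUSTED_CAS = {"GoodRootCA"}

def accept_certificate(cert: dict, hostname: str) -> bool:
    """Simplified browser-style check: trusted issuer + matching name."""
    return cert["issuer"] in TRUSTED_CAS and cert["subject"] == hostname

legit = {"subject": "mail.google.com", "issuer": "GoodRootCA"}
rogue = {"subject": "mail.google.com", "issuer": "GoodRootCA"}  # mis-issued

# Both pass: the client has no way to tell the rogue certificate from
# the real one, because the trust anchor itself vouched for both.
print(accept_certificate(legit, "mail.google.com"))  # True
print(accept_certificate(rogue, "mail.google.com"))  # True

# An untrusted issuer, by contrast, is rejected as designed.
forged = {"subject": "mail.google.com", "issuer": "EvilCA"}
print(accept_certificate(forged, "mail.google.com"))  # False
```

The system works exactly as built; the failure is that the authority at the root of the chain issued something it shouldn’t have.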
From the article:
What can you do with such a certificate?
Well, if you are a government and able to control Internet routing within your country, you can reroute all, say, Skype users to fake https://login.skype.com and collect their usernames and passwords, regardless of the SSL encryption seemingly in place. Or you can read their e-mail when they go to Yahoo, Gmail or Hotmail. Even most geeks wouldn’t notice this was going on.
It’s clear that methods of attack like this are the way things are heading in the cyber world. Vigilance is a must.
Two newsworthy items came out of Cisco last week. The first was that Cisco was shutting down its $550 million Flip camera business as part of an internal reorganization.
The second, and more pertinent, was that they released some specifications around their new Cloud Data Center in Allen, Texas. The data center has a number of features, including:
- rotary flywheels in lieu of batteries for the uninterruptible power supply (UPS).
- an air-side economizer Cisco estimates can use ambient outside air instead of mechanical cooling 65 percent of the time, resulting in $600,000 annual savings in cooling costs.
- 100 kilowatts of rooftop solar power for the building’s offices.
- a Power Usage Effectiveness (PUE) metric of 1.35.
- the ability to withstand tornado winds up to 175 mph.
Very interesting indeed. Data Cave utilizes flywheels, uses water side economization (which we argue is better than air-side), has a great PUE (better than 1.35 we say), and can withstand winds up to 207 mph. You could say that they copied our model pretty closely, in fact.
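For reference, PUE is simply total facility power divided by IT equipment power, so a value of 1.0 would mean zero overhead. A quick sketch, using hypothetical loads chosen to reproduce Cisco’s reported figure:

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power over IT power.
    Lower is better; 1.0 is the theoretical ideal."""
    return total_facility_kw / it_equipment_kw

# Hypothetical example: 1000 kW of IT gear plus 350 kW of cooling,
# conversion losses, and lighting yields a PUE of 1.35.
print(round(pue(1350.0, 1000.0), 2))  # 1.35
```

Shaving that overhead, through economization, flywheels, and efficient distribution, is where the design choices above directly show up in the number.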
This announcement validates that our vision, from two years ago when we first started construction, is in line with what Cisco says is needed today for cloud-centric data centers.
You can even see by the picture too that they are not using raised floors, but overhead ducting. Again, something we pursued from the start.
We think this shows that Data Cave’s design choices are well suited for the coming years’ cloud-centric computing loads.