A few weeks ago I posted an opinion that tape backups are dead, and that generated some feedback telling me I was plain wrong. For better or worse, I’m sticking with another “is dead” mantra: RAID (particularly, RAID5).
In all reality, RAID5 isn’t dead. But you shouldn’t be using it. The meat of why is in an old piece at ZDNet. Let’s dive in.
The main concept behind RAID5 is that one disk’s worth of capacity in a disk set is used for storing parity information. The parity is actually striped across all of the disks, not kept on a single disk. The point is that any one disk in the set can fail and the set can continue on. Once the failure is noticed, a spare disk can be brought into the set (usually automatically by modern SAN devices), and the failed disk’s contents are rebuilt onto it from the surviving data and parity.
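To make the parity idea concrete, here is a minimal sketch in Python. RAID5 parity is a bytewise XOR across the blocks in a stripe; everything else here (the `xor_blocks` helper, the three tiny "disks", the four-byte blocks) is made up for illustration and ignores how a real controller rotates parity between disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks, each living on a different disk.
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Simulate losing the second disk: only the other data blocks and parity survive.
survivors = [data[0], data[2], parity]

# XOR of everything that survived reproduces the missing block.
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
```

Note that the rebuild has to read every surviving block in the stripe, which is why rebuild time grows with the size of the disks.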
The problem is that this rebuild takes time. A lot of time, for today’s modern disks, because every surviving disk has to be read in full. And disk failure rates are fairly high to begin with. So, statistically, there’s a real chance of a second disk failing while the first one is still being rebuilt. And if that happens, you are in for a really bad day. Wikipedia says it best:
As the number of disks in a RAID 5 group increases, the mean time between failures (MTBF, the reciprocal of the failure rate) can become lower than that of a single disk. This happens when the likelihood of a second disk’s failing out of N − 1 dependent disks, within the time it takes to detect, replace and recreate a first failed disk, becomes larger than the likelihood of a single disk’s failing.
Basically, the more disks there are in a RAID5 set, the better the chances that a second disk fails before the first one has been rebuilt, to the point where the whole set can be less reliable than a single disk.
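A rough back-of-the-envelope version of that argument, as a sketch rather than a real reliability model: assume each surviving disk fails independently at a constant rate, and ask how likely it is that at least one of them dies during the rebuild window. The function name, the 3% annual failure rate and the 24-hour rebuild are all assumptions picked for illustration, not measured figures. The independence assumption is also optimistic, since the quote above talks about dependent disks.

```python
import math

HOURS_PER_YEAR = 24 * 365


def second_failure_probability(n_disks, annual_failure_rate, rebuild_hours):
    """Chance that at least one of the n_disks - 1 survivors fails during the rebuild.

    Assumes independent disks with a constant (exponential) failure rate.
    """
    # Convert the annual failure probability into a per-hour failure rate.
    rate_per_hour = -math.log(1 - annual_failure_rate) / HOURS_PER_YEAR
    survivors = n_disks - 1
    return 1 - math.exp(-survivors * rate_per_hour * rebuild_hours)


# Illustrative inputs only: 3% annual failure rate, 24-hour rebuild.
for n in (4, 8, 16):
    p = second_failure_probability(n, 0.03, 24)
    print(f"{n} disks: {p:.4%} chance of a second failure during the rebuild")
```

The absolute numbers depend entirely on the inputs, but the shape is the point: the risk climbs with both the number of disks and the length of the rebuild.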
Of course, RAID6 is an alternative to RAID5, with an additional disk used for parity storage, so that a two-disk failure can be handled. But the same limitations exist as with RAID5: at some point it just won’t be reliable anymore. ZDNet even follows up. The problem is still that with larger disk sets and more parity striped across them, any failure takes a really long time to rebuild from, and that’s when things are most vulnerable.
What are the solutions? Well, for one, you could store the data on multiple RAID sets, perhaps in completely different SAN units. This requires significantly more storage, but makes reliability much higher. You could just back everything up to tape (kidding!). Or start using a more reliable data store on top of the drives, like ZFS.
There are a lot of options. What are you doing to mitigate data loss?