Risk is a difficult concept and requires a lot of training, thought and analysis to properly assess given scenarios. Often, because risk assessments are so difficult, we substitute risk analysis with simply adding basic redundancy and assuming that we have appropriately mitigated risk.
But very often this is not the case. The introduction of complexity or additional failure modes often accompanies the addition of redundancy, and these new forms of failure have the potential to add more risk than the added redundancy removes. Data storage systems are especially prone to this kind of decision making, which is unfortunate, as few, if any, systems are as susceptible to failure or as important to protect.
RAID is a great example of where a lack of holistic risk thinking can lead to some strange decision making. For instance, look at a not uncommon scenario where the goal of protecting against drive failure can actually lead to an increase in risk even when additional redundancy is applied. In this scenario we will consider a single array of twelve three-terabyte SATA hard drives. It is not uncommon to hear of people choosing RAID 5 for this scenario to get "maximum capacity and performance" while having "adequate protection against failure."
The idea here is that RAID 5 protects against the loss of a single drive: the failed drive can be replaced and the array will rebuild itself before a second drive fails. That is great in theory, but the real risks of an array of this size, thirty-six terabytes of drive capacity, come not from multiple drive failures, as people generally suspect. Instead, risks arise from an inability to reliably rebuild the array after a single drive failure, or from a failure of the array itself with no individual drives failing.
The risk of a second drive failing is quite low, though not non-existent. Drives today are highly reliable. Once one drive fails, the likelihood of a second drive failing does increase, which is well documented, but I don't want this risk to distract us from the true risk: a failed resilvering operation.
What scares us during a RAID 5 resilver operation is that an unrecoverable read error (URE) can occur. When it does, the resilver operation halts and the array is left in a useless state: all data on the array is lost.
On common SATA drives the rated URE rate is one error per 10^14 bits read, or roughly once every twelve terabytes of read operations. That means that a six-terabyte array being resilvered has a roughly fifty percent chance of hitting a URE and failing. A fifty percent chance of failure is insanely high. Imagine if your car had a fifty percent chance of the wheels falling off every time that you drove it.
So with a small (by today's standards) six terabyte RAID 5 array using 10^14 URE SATA drives, if we were to lose a single drive, we have only a fifty percent chance that the array will recover, assuming the drive is replaced immediately. That doesn't include the risk of a second drive failing, only the risk of a URE failure.
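The fifty percent figure above is a linear rule of thumb: data read divided by twelve terabytes. As a rough sketch of the arithmetic, the Python below compares that rule of thumb with a slightly more careful estimate that assumes every bit read fails independently at the spec-sheet rate. That independence assumption, the function names, and the choice to treat the whole array size as the amount read are illustrative assumptions, not something a drive vendor guarantees.

```python
import math

# Spec-sheet figure quoted in the text: one URE per 10^14 bits read.
BITS_PER_URE = 1e14


def bits_read(terabytes: float) -> float:
    """Convert decimal terabytes to bits."""
    return terabytes * 1e12 * 8


def linear_estimate(terabytes: float, bits_per_ure: float = BITS_PER_URE) -> float:
    """Rule of thumb: expected number of UREs, capped at 1.0."""
    return min(1.0, bits_read(terabytes) / bits_per_ure)


def independent_estimate(terabytes: float, bits_per_ure: float = BITS_PER_URE) -> float:
    """Chance of at least one URE, assuming independent errors (Poisson model)."""
    expected_errors = bits_read(terabytes) / bits_per_ure
    return 1.0 - math.exp(-expected_errors)


if __name__ == "__main__":
    for size_tb in (6, 12, 36):
        print(
            f"{size_tb:>2} TB read: "
            f"linear {linear_estimate(size_tb):.0%}, "
            f"independent {independent_estimate(size_tb):.0%}"
        )
```

Under the independence assumption the six-terabyte case comes out somewhat below the linear estimate (a little under forty percent), but the trend is the same: the more data a rebuild has to read, the closer the chance of hitting a URE gets to certainty, and at thirty-six terabytes it is above ninety percent either way.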
It also assumes that the array is completely idle other than the resilver operation. If the drives are busily being used for other tasks at the same time then the chances of something bad happening, either a URE or a second drive failure, begin to increase dramatically.
With a twelve-terabyte array the chances of complete data loss during a resilver operation begin to approach one hundred percent, meaning that RAID 5 provides effectively no protection in that case. There is always a chance of survival, but it is very low. At six terabytes you can compare a resilver operation to a game of Russian roulette with one bullet and six chambers where you have to pull the trigger three times. With twelve terabytes you have to pull it six times! Those are not good odds.
But we are not talking about a twelve-terabyte array. We are talking about a thirty-six-terabyte array, which sounds large, but this is a size that someone could easily have at home today, let alone in a business. Every major server manufacturer, as well as nearly every low-cost storage vendor, makes sub-$10,000 storage systems in this capacity range today.
Resilvering a thirty-six-terabyte RAID 5 array after a single drive failure is like playing Russian roulette with one bullet and six chambers and pulling the trigger eighteen times! Your data doesn't stand much of a chance. Add to that the incredible amount of time needed to resilver an array of that size, and the risk of a second disk failing during that resilver window starts to become a rather significant threat.
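To put numbers on the roulette comparison, here is a minimal sketch. It assumes the cylinder is re-spun before every pull, so each pull is an independent one-in-six chance; the text does not say this, and without a re-spin six pulls would be certain to fire. The function name is made up for illustration.

```python
def survival_chance(pulls: int, chambers: int = 6, bullets: int = 1) -> float:
    """Chance of surviving every pull, re-spinning the cylinder each time."""
    if pulls < 0:
        raise ValueError("pulls must be non-negative")
    safe_fraction = (chambers - bullets) / chambers
    return safe_fraction ** pulls


if __name__ == "__main__":
    # Three, six and eighteen pulls match the six, twelve and
    # thirty-six terabyte arrays in the analogy.
    for pulls in (3, 6, 18):
        print(f"{pulls:>2} pulls: {survival_chance(pulls):.1%} chance of survival")
```

With that assumption, three pulls leave a bit under a sixty percent chance of survival, six pulls about one in three, and eighteen pulls under four percent.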
I've seen estimates of resilver times climbing into weeks or months on some systems. That is a long time to run without being able to lose another drive. When we are talking hours or days the risks are pretty low, but still present. When we are talking weeks or months of continuous abuse – resilver operations are extremely drive intensive – the failure rates climb dramatically.
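How much the length of the resilver window matters can be sketched with a simple constant-failure-rate model. Everything numeric here is an assumption for illustration: the three percent annual failure rate is a placeholder rather than a measured figure, and the model ignores the extra stress that a resilver puts on the surviving drives, so it likely understates the real risk.

```python
import math


def second_failure_probability(
    surviving_drives: int,
    rebuild_days: float,
    annual_failure_rate: float,
) -> float:
    """Chance that at least one surviving drive fails during the rebuild.

    Assumes each drive fails independently at a constant rate derived
    from the given annual failure rate (a simplifying assumption).
    """
    # Convert the annual failure probability to a per-day hazard rate.
    daily_hazard = -math.log(1.0 - annual_failure_rate) / 365.0
    return 1.0 - math.exp(-daily_hazard * surviving_drives * rebuild_days)


if __name__ == "__main__":
    HYPOTHETICAL_AFR = 0.03  # placeholder value, not from the article
    for days in (1, 7, 30, 90):
        risk = second_failure_probability(11, days, HYPOTHETICAL_AFR)
        print(f"{days:>2}-day rebuild, 11 surviving drives: {risk:.2%}")
```

Even in this optimistic model the risk grows steadily with the length of the window, from a fraction of a percent for a one-day rebuild to several percent for a rebuild that runs for months.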
With an array of this size we can effectively assume that the loss of a single drive means the loss of the complete array, leaving us with no drive failure protection at all. Compare that to an array of the same or better performance and the same or better capacity under RAID 0, which also has no protection against drive loss. In this case we need only eleven of the same drives, where our RAID 5 array needed twelve.
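The capacity side of that comparison is simple arithmetic: RAID 5 gives up one drive's worth of space to parity, while RAID 0 uses every drive for data. A small sketch, with a hypothetical helper name and only these two RAID levels handled:

```python
def usable_terabytes(level: str, drive_count: int, drive_terabytes: float) -> float:
    """Usable capacity for the two RAID levels discussed in the text."""
    if level == "raid0":
        data_drives = drive_count          # striping only, no parity
    elif level == "raid5":
        if drive_count < 3:
            raise ValueError("RAID 5 needs at least three drives")
        data_drives = drive_count - 1      # one drive's worth of parity
    else:
        raise ValueError(f"unsupported RAID level: {level}")
    return data_drives * drive_terabytes


if __name__ == "__main__":
    print("RAID 5, 12 x 3 TB:", usable_terabytes("raid5", 12, 3), "TB usable")
    print("RAID 0, 11 x 3 TB:", usable_terabytes("raid0", 11, 3), "TB usable")
```

Both configurations come out to thirty-three usable terabytes, which is why the eleven-drive RAID 0 array is the fair point of comparison.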