Data Storage: The Myth of Redundancy: Page 2

Redundancy in a data storage system does not by itself guarantee reliability. Are two straw houses safer than one brick house?

What this means is that instead of twelve hard drives, each with a roughly three percent chance of failing in a given year, we have only eleven. That alone makes our RAID 0 array more reliable, as there are fewer drives to fail. There is also no need to write parity blocks, nor to skip over them when reading back, which lowers, ever so slightly, the mechanical wear and tear on the RAID 0 array for the same utilization. This gives it a very slight additional reliability edge.
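To put a rough number on the "fewer drives to fail" point, here is a minimal sketch. It assumes each drive fails independently at the three percent annual rate quoted above; the function name is purely illustrative, and real drives in one chassis do not fail fully independently.

```python
def p_any_drive_fails(n_drives: int, annual_failure_rate: float = 0.03) -> float:
    """Chance that at least one of n drives fails within a year.

    Assumes independent failures at the same annual rate (a simplification).
    """
    # Every drive must survive for the array to see no failure at all.
    p_all_survive = (1.0 - annual_failure_rate) ** n_drives
    return 1.0 - p_all_survive


for n in (11, 12):
    print(f"{n} drives: {p_any_drive_fails(n):.1%} chance of at least one failure per year")
```

Under these assumptions the gap between eleven and twelve drives is only a couple of percentage points, which matches the "slight edge" described above. What differs between the two arrays is what a single failure costs: RAID 0 loses everything, while RAID 5 enters a resilver.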

The RAID 0 array of eleven drives will be identical in capacity to the twelve-drive RAID 5 array but will have slightly better throughput and latency. It is a win all around, plus there is the cost savings of not needing an additional drive.

So what we see here is that in large arrays (large in capacity, not in spindle count) RAID 0 can actually surpass RAID 5 in reliability in certain scenarios. With common SATA drives this happens at capacities reached even by power users at home and by many small businesses.

If we move to enterprise SATA drives or SAS drives then the capacity number where this occurs becomes very high and is not a concern today. But it will be in just a few years when drive capacities get larger still. This highlights how dangerous RAID 5 is in the sizes that we see today.
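Why the drive class moves the capacity threshold can be sketched with a little arithmetic. Everything numeric below is an assumption for illustration: the unrecoverable read error (URE) ratings of one error per 10^14, 10^15 and 10^16 bits are typical spec-sheet classes rather than figures from this article, the 2 TB drive size is hypothetical, and errors are treated as independent per bit.

```python
import math


def p_ure_during_resilver(bytes_to_read: float, bits_per_ure: float) -> float:
    """Chance of hitting at least one URE while reading bytes_to_read.

    Assumes independent bit errors at the spec-sheet rate (a simplification).
    """
    bits = bytes_to_read * 8
    # 1 - (1 - 1/bits_per_ure) ** bits, computed in a numerically stable way.
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_ure))


# Hypothetical resilver: read the eleven surviving 2 TB drives of a twelve-drive array.
surviving_bytes = 11 * 2e12

for label, rating in (("1 per 1e14 bits", 1e14),
                      ("1 per 1e15 bits", 1e15),
                      ("1 per 1e16 bits", 1e16)):
    risk = p_ure_during_resilver(surviving_bytes, rating)
    print(f"URE rating {label}: {risk:.1%} chance of a read error during resilver")
```

With these assumed inputs, each tenfold improvement in the URE rating cuts the resilver risk sharply, so the array has to grow much larger before the same risk level returns. Spec-sheet ratings are worst-case figures, so treat the output as a way to compare drive classes rather than as a prediction.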

Everyone understands the incredible risks of RAID 0. But it can be difficult to put into perspective that RAID 5's issues are so extreme that it might actually be less reliable than RAID 0.

That RAID 5 might be less reliable than RAID 0 in an array of this size based on resilver operations alone is just the beginning. In a massive array like this, the resilver can take so long and exact such a toll on the drives that a second drive failure starts to become a measurable risk as well.

And then there are additional risks caused by array controller errors, which can trigger resilver algorithms that destroy an entire array even when no drive failure has occurred. As RAID 0 (or RAID 1 or RAID 10) does not rely on parity-based resilver algorithms, it does not suffer this additional risk. These risks are hard to quantify, but what matters is that they accumulate when we choose a more complex system where a simpler system, without the redundancy, was more reliable from the outset.

The Dangers of RAID 0

Now that we have established that RAID 5 can be less reliable than RAID 0, I will point out the obvious dangers of RAID 0. RAID in general is used to mitigate the risk of a single, lone hard drive failing. We all fear a single drive simply failing and all data being lost.

RAID 0, being a large stripe of drives without any form of redundancy, takes the risk of data loss from a single drive failing and multiplies it across a number of drives. Any one drive failing causes total loss of the data on all drives. So in our eleven-disk example above, if any of the eleven disks fails, all is lost. It is easy to see why this is dramatically more dangerous than just using a single drive, all alone.

In short, redundancy does not mean reliability. The fact that something is redundant, like RAID 5, provides no guarantee that it will always be more reliable than something that is not redundant.

My favorite analogy here is to look at houses in a tornado. In one scenario we build a house of brick and mortar. In the second scenario we build two redundant houses, each out of straw (our builders are pigs, apparently). When the tornado (or big bad wolf) comes along, which is more likely to leave us with a standing house? Clearly one brick and mortar house has some significant reliability advantages over redundant straw houses. Redundancy didn't matter; reliability mattered in the end.
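The house analogy can be reduced to a toy calculation. The survival probabilities below are invented purely for illustration (there are no real figures for straw houses in tornadoes), and the function name is hypothetical.

```python
def p_at_least_one_survives(p_survive_each: float, copies: int) -> float:
    """Chance that at least one of several identical copies survives.

    Assumes each copy survives or fails independently of the others.
    """
    p_all_fail = (1.0 - p_survive_each) ** copies
    return 1.0 - p_all_fail


# Hypothetical numbers: a brick house usually survives, a straw house rarely does.
one_brick_house = p_at_least_one_survives(0.90, copies=1)
two_straw_houses = p_at_least_one_survives(0.05, copies=2)

print(f"One brick house still standing:            {one_brick_house:.1%}")
print(f"At least one of two straw houses standing: {two_straw_houses:.1%}")
```

Doubling a weak component barely moves the result, while the single strong component wins easily. The independence assumption is also generous to the straw houses: one tornado hits both of them at once, so in practice their failures are correlated and the redundancy helps even less.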

Redundancy is often misleading because it is easy to quantify but hard to qualify. Redundancy is a black or white question: Is it redundant? Yes or no. Simple. Reliability is not so simple. Reliability is about failure rates and likelihoods. It is about statistics and analysis. It’s hard to quantify reliability in a meaningful way, especially when selling a project to the business people, so redundancy often becomes a simple substitute for this complex concept.

The concept of using redundancy to misdirect questions of reliability also ends up applying to subsystems in very convoluted ways. Instead of making a "system" redundant it has become common to make a highly reliable, and low cost, subsystem redundant and treat subsystem redundancy as applying to the whole system.

The most common example of this is RAID controllers in SAN products. Rather than offering a redundant SAN (meaning two SANs), manufacturers will often take the one component that is not usually redundant in normal servers, make it redundant, and then call the SAN redundant. What this really means is that the SAN contains redundancy, which is not at all the same thing.

A good analogy here would be to compare having redundant cars: two complete, working cars, versus having one car with a spare water pump in the trunk in case the main one fails. Clearly, a spare water pump is not a bad thing. But it is also a trivial amount of protection against car failure compared to having a second car ready to go.

In one case the entire system is redundant, including the chassis. In the other we are making just one, highly reliable component redundant inside the chassis. It's not even on par with having a spare tire which, at least, is a car component with a higher likelihood of failure.
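The difference between a redundant system and a system that merely contains redundancy can be sketched with a simple series/parallel model. All of the reliability figures and component names here are hypothetical, chosen only to show the shape of the result, and failures are assumed to be independent.

```python
from math import prod


def series(*parts: float) -> float:
    """A chain of parts works only if every part works."""
    return prod(parts)


def parallel(p_works: float, copies: int = 2) -> float:
    """A redundant group works if at least one copy works."""
    return 1.0 - (1.0 - p_works) ** copies


# Hypothetical annual reliabilities: the controller is already the most reliable part.
controller, chassis, backplane = 0.999, 0.98, 0.99

single_unit = series(controller, chassis, backplane)
redundant_controller_only = series(parallel(controller), chassis, backplane)
two_complete_units = parallel(single_unit)

print(f"Single unit:               {single_unit:.4f}")
print(f"Redundant controller only: {redundant_controller_only:.4f}")
print(f"Two complete units:        {two_complete_units:.4f}")
```

In this toy model, duplicating the already reliable controller barely changes the outcome, because the shared chassis and backplane still dominate the risk. Duplicating the entire unit is what moves the number, which is the spare water pump versus second car distinction in numeric form.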

Single Point of Failure

Like the myth of RAID 5 reliability and the confusion of subsystem redundancy with system redundancy, shared storage technologies like SAN and NAS often get the same treatment, especially with regard to virtualization. A common scenario: a virtualization project is undertaken and people instinctively panic because a single virtualization host represents a single point of failure where, if it fails, many systems will all fail at once.

Using the term "single point of failure" induces a feeling of panic and is a great means of steering a conversation. But a SPOF, as we like to call it, while something we like to remove when possible, may not be the end of the world.

Think about our brick house. It is a SPOF. Our two houses of straw are not. Yet a single breeze takes out our redundant solutions faster than our reliable SPOF. Looking for SPOFs is a great way to find points of fragility in a system, but do not feel that every SPOF must be made redundant in every scenario.

