SHARE

RAID’s Era May Be Fading

The concept of parity-based RAID(levels 3, 5 and 6) is now pretty old in technological terms, and the technology’s limitations will become pretty clear in the not-too-distant future — and are probably obvious to some users already. In my opinion, RAID-6 is a reliability Band Aid for RAID-5, and going from one parity drive to […]

Written By

Henry Newman

Sep 25, 2009

6 minute read

Datamation content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More

The concept of parity-based RAID(levels 3, 5 and 6) is now pretty old in technological terms, and the technology’s limitations will become pretty clear in the not-too-distant future — and are probably obvious to some users already. In my opinion, RAID-6 is a reliability Band Aid for RAID-5, and going from one parity drive to two is simply delaying the inevitable.

The bottom line is this: Disk density has increased far more than performance and hard error rates haven’t changed much, creating much greater RAID rebuild times and a much higher risk of data loss. In short, it’s a scenario that will eventually require a solution, if not a whole new way of storing and protecting data.

We’ll start with a short history of RAID, or at least the last 15 years of it, and then discuss the problems in greater detail and offer some possible solutions.

Some of the first RAID systems I worked on used RAID-5 and 4GB drives. These drives ran at a peak of 9 MB/sec. This, of course, was not the first RAID system, but 1994 is a good baseline year. You’ll need to click on the image below for how the RAID rebuild picture has changed in the last 15 years.

RAID reconstruction time, 1994-2009 (Click for larger image)

A few caveats and notes on the data:

Except for 1998, the number of drives that were needed to saturate a channel has increased. Of course some vendors had 1Gb Fibre Channelin 1998, but most did not.
90 percent of full rate is in many cases a best-case number, and in many cases, full rebuild requires the whole drive.
The bandwidth assumes that two channels of the fastest type are available to be used for the 9 or 10 (RAID-5 or RAID-6) drives. So for 2009, I am assuming two 800 MB/sec channels are available for rebuild for the 9 or 10 drives.
The time to read a drive is increasing, as the density increases exceed performance increases.
Changes in the number of drives to saturate a channel — along with new technologies such as SSDs and pNFS — are going to affect channel performance and cause RAID vendors to rethink the back-end design of their controller.

We all know that density is growing faster than bandwidth. A good rule of thumb is that each generation of drives will improve bandwidth by 20 percent. The problem is that density is growing far faster and has been for years. While density percentages might be slowing now from 100 percent to 50 percent or less, drive performance is pretty fixed at about 20 percent improvement per generation.

Using the sample data in the table above, RAID rebuilds have gone up more than 300 percent over the last 15 years. If we change the formula from 90 percent of bandwidth to 10 percent of disk bandwidth — which might be the case if the device is heavily used with application I/O, thanks in part to the growing use of server virtualization — then rebuild gets ugly pretty, as in the RAID rebuild table below.

RAID Rebuild for Application I/O (Click for larger image)

Hard Errors Grow with Density

The hard error rate for disk drives has not, for the most part, improved with the density. For example, the hard error rate for 9GB drives was 1 in 10E¹⁴ bits, and that error rate has improved by an order of magnitude to 1 in 10E¹⁵ for the current generation of enterprise SATA and 1 in 10E¹⁶ for the current generation of FC/SASdrives. The problem is that the drive densities have increased at a faster rate.

Year	Drive Size in GB	Hard Error Rate	Density Increase in Times
1994	4	10E¹⁴
2009	450	10E¹⁶	113
2009	1500	10E¹⁵	375

What this means for you is that even for enterprise FC/SAS drives, the density is increasing faster than the hard error rate. This is especially true for enterprise SATA, where the density increased by a factor of about 375 over the last 15 years while the hard error rate improved only 10 times. This affects the RAID group, making it less reliable given the higher probability of hitting the hard error rate during a rebuild.

Device	Hard Error Rate in Bits	PB Equivalent
Consumer SATA	10E¹⁴	0.89
Enterprise SATA	10E¹⁵	8.88
Enterprise SAS/FC	10E¹⁶	88.82

Think about it: Only 8.88 PB of data has to move before you hit the hard error rate, and remember that these types of values are numbers that vendors say you cannot likely exceed. Consumer-grade SATA drives are pretty scary and must be a consideration when talking to low-end RAID vendors who might use these drives. A single 8+2 rebuild with 1.5 TB drives reads and writes about 28.5 TB (read 1.5 TB*9 drives, write 1.5 TB*10 drives). That means that in a perfect world using vendor specifications, we can expect one out of every 312 rebuilds to fail today with data loss. Moving to the next-generation 2 TB drives, this value drops to every 234 rebuilds, and with 4 TB drives it drops to every 117 rebuilds.

Some considerations:

Hard error rates are not likely to increase given the cost to make the drives more reliable
Given that what RAID vendors see for MTBF is often half of what drive vendors see, hard error rates are lower than what drive vendors report when the drives are put into RAID disk trays, with the need for both high performance and dense packaging increasing vibration and heat.

All of this means that the hard error rate is beginning to become a big factor in MTDL (Mean Time before Data Loss) as a part of rebuild, given the amount of data moved and the density of the drives.

OSD, Declustered RAID Might Help

Some vendors claim that file systems can address the storage reliability problem, but I don’t buy that argument, given the latency inherent in large operating systems and file systems, and the need for high performance (see File System Management Is Headed for Trouble and SSDs, pNFS Will Test RAID Controller Design). And running software RAID-5 or RAID-6 equivalent does not address the underlying issues with the drive. Yes, you could mirror to get out of the disk reliability penalty box, but that does not address the cost issue.

The disk reliability density problem is getting worse, and fast. Some vendors are using techniques such as write logging — keeping track of write on another disk during rebuild to allow the rebuild to occur faster — to get around the growing problem. Will this solve the problem for the long term, or is this the equivalent of the RAID-5 to RAID-6 fix that just delayed the inevitable problem? Personally, I think it will turn out to be just another short-term fix. The real fix must be based on new technology such as OSD, where the disk knows what is stored on it and only has to read and write the objects being managed, not the whole device, or something like declustered RAID. In essence, the disk drive layer needs to have more knowledge of what is storage, or fixed RAID devices must be rethought. Or both.

There are some technologies that are available today and on the way that could help alleviate the problem. The sooner they arrive, the better chance we have of avoiding MTDL.

Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 28 years experience in high-performance computing and storage.

Article courtesy of Enterprise Storage Forum.

Ethics and Artificial Intelligence: Driving Greater Equality

FEATURE | By James Maguire,
December 16, 2020
AI vs. Machine Learning vs. Deep Learning

FEATURE | By Cynthia Harvey,
December 11, 2020
Huawei’s AI Update: Things Are Moving Faster Than We Think

FEATURE | By Rob Enderle,
December 04, 2020
Keeping Machine Learning Algorithms Honest in the ‘Ethics-First’ Era

ARTIFICIAL INTELLIGENCE | By Guest Author,
November 18, 2020
Key Trends in Chatbots and RPA

FEATURE | By Guest Author,
November 10, 2020
Top 10 AIOps Companies

FEATURE | By Samuel Greengard,
November 05, 2020
What is Text Analysis?

ARTIFICIAL INTELLIGENCE | By Guest Author,
November 02, 2020
How Intel’s Work With Autonomous Cars Could Redefine General Purpose AI

ARTIFICIAL INTELLIGENCE | By Rob Enderle,
October 29, 2020
Dell Technologies World: Weaving Together Human And Machine Interaction For AI And Robotics

ARTIFICIAL INTELLIGENCE | By Rob Enderle,
October 23, 2020
The Super Moderator, or How IBM Project Debater Could Save Social Media

FEATURE | By Rob Enderle,
October 16, 2020
Top 10 Chatbot Platforms

FEATURE | By Cynthia Harvey,
October 07, 2020
Finding a Career Path in AI

ARTIFICIAL INTELLIGENCE | By Guest Author,
October 05, 2020
CIOs Discuss the Promise of AI and Data Science

FEATURE | By Guest Author,
September 25, 2020
Microsoft Is Building An AI Product That Could Predict The Future

FEATURE | By Rob Enderle,
September 25, 2020
Top 10 Machine Learning Companies 2021

FEATURE | By Cynthia Harvey,
September 22, 2020
NVIDIA and ARM: Massively Changing The AI Landscape

ARTIFICIAL INTELLIGENCE | By Rob Enderle,
September 18, 2020
Continuous Intelligence: Expert Discussion [Video and Podcast]

ARTIFICIAL INTELLIGENCE | By James Maguire,
September 14, 2020
Artificial Intelligence: Governance and Ethics [Video]

ARTIFICIAL INTELLIGENCE | By James Maguire,
September 13, 2020
IBM Watson At The US Open: Showcasing The Power Of A Mature Enterprise-Class AI

FEATURE | By Rob Enderle,
September 11, 2020
Artificial Intelligence: Perception vs. Reality

FEATURE | By James Maguire,
September 09, 2020

SEE ALL
ARTICLES

Henry Newman

Henry Newman has been a contributor to TechnologyAdvice websites for more than 20 years. His career in high-performance computing, storage and security dates to the early 1980s, when Cray was the name of a supercomputing company rather than an entry in Urban Dictionary. After nearly four decades of architecting IT systems, he recently retired as CTO of a storage company’s Federal group, but he rather quickly lost a bet that he wouldn't be able to stay retired by taking a consulting gig in his first month of retirement.