The concept of parity-based RAID (levels 3, 5 and 6) is now pretty old in technological terms, and the technology's limitations will become clear in the not-too-distant future; they are probably obvious to some users already. In my opinion, RAID-6 is a reliability Band-Aid for RAID-5, and going from one parity drive to two simply delays the inevitable.
The bottom line is this: Disk density has increased far faster than performance, and hard error rates haven't changed much, creating much longer RAID rebuild times and a much higher risk of data loss. In short, it's a scenario that will eventually require a solution, if not a whole new way of storing and protecting data.
We’ll start with a short history of RAID, or at least the last 15 years of it, and then discuss the problems in greater detail and offer some possible solutions.
Some of the first RAID systems I worked on used RAID-5 and 4GB drives. These drives ran at a peak of 9 MB/sec. This, of course, was not the first RAID system, but 1994 is a good baseline year. The table below shows how the RAID rebuild picture has changed in the last 15 years.
A few caveats and notes on the data:
- Except for 1998, the number of drives needed to saturate a channel has increased (see the sketch after this list). Of course, some vendors had 1Gb Fibre Channel in 1998, but most did not.
- 90 percent of full rate is a best-case number, and in many cases a full rebuild requires reading and writing the whole drive.
- The bandwidth assumes that two channels of the fastest type are available to be used for the 9 or 10 (RAID-5 or RAID-6) drives. So for 2009, I am assuming two 800 MB/sec channels are available for rebuild for the 9 or 10 drives.
- The time to read a drive is increasing, as the density increases exceed performance increases.
- Changes in the number of drives to saturate a channel — along with new technologies such as SSDs and pNFS — are going to affect channel performance and cause RAID vendors to rethink the back-end design of their controller.
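As a rough illustration of the first caveat, here is a minimal sketch of the drives-per-channel arithmetic. Only the 9 MB/sec drive rate (1994) and the 800 MB/sec channel rate (2009) come from the article; the 1994 channel rate and the 2009 per-drive rate are assumed ballpark figures.

```python
# Minimal sketch: how many drives it takes to saturate one back-end channel.
# The 1994 channel rate (Fast SCSI class) and the 2009 per-drive rate are
# assumptions for illustration.

configs = [
    # (year, channel MB/sec, per-drive streaming MB/sec)
    (1994, 20, 9),
    (2009, 800, 100),
]

for year, channel_mb_s, drive_mb_s in configs:
    drives = channel_mb_s / drive_mb_s
    print(f"{year}: ~{drives:.0f} drives saturate a {channel_mb_s} MB/sec channel")
```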
We all know that density is growing faster than bandwidth. A good rule of thumb is that each generation of drives will improve bandwidth by 20 percent. The problem is that density is growing far faster and has been for years. While density growth might be slowing from 100 percent per generation to 50 percent or less, drive performance is pretty fixed at about 20 percent improvement per generation.
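A minimal sketch of that compounding, assuming the article's 20 percent bandwidth improvement per generation against a 50 percent density increase (the starting capacity and rate are illustrative, not vendor figures):

```python
# Compounding sketch: capacity grows faster than bandwidth, so the time
# to read a full drive rises every generation. Starting figures are
# assumptions; the growth rates follow the article's rule of thumb.

capacity_tb = 1.5        # assumed starting capacity
bandwidth_mb_s = 100.0   # assumed starting streaming rate

for generation in range(5):
    hours = capacity_tb * 1e6 / bandwidth_mb_s / 3600
    print(f"gen {generation}: {capacity_tb:5.1f} TB at {bandwidth_mb_s:5.0f} MB/sec "
          f"-> {hours:4.1f} hours to read")
    capacity_tb *= 1.5       # ~50 percent density growth per generation
    bandwidth_mb_s *= 1.2    # ~20 percent bandwidth growth per generation
```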
Using the sample data in the table above, RAID rebuild times have gone up more than 300 percent over the last 15 years. If we change the formula from 90 percent of disk bandwidth to 10 percent, which might be the case if the device is heavily used with application I/O (thanks in part to the growing use of server virtualization), then rebuild gets pretty ugly, as the RAID rebuild table below shows.
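A minimal sketch of the arithmetic behind those rebuild figures: the hours needed to stream one full drive when only a fraction of its bandwidth is available to the rebuild. The 100 MB/sec drive rate is an assumed figure for a 2009-era 1.5 TB drive.

```python
# Rebuild-time sketch: hours to read one full drive when the rebuild can
# only use a fraction of the drive's streaming bandwidth.

def rebuild_hours(drive_tb, drive_mb_s, usable_fraction):
    """Hours to stream one full drive at a fraction of its bandwidth."""
    return drive_tb * 1e6 / (drive_mb_s * usable_fraction) / 3600

for fraction in (0.90, 0.10):
    hours = rebuild_hours(1.5, 100, fraction)
    print(f"{fraction:.0%} of bandwidth available: {hours:5.1f} hours")
```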
Hard Errors Grow with Density
The hard error rate for disk drives has not, for the most part, improved with the density. For example, the hard error rate for 9GB drives was 1 in 10E14 bits, and that rate has since improved by one order of magnitude, to 1 in 10E15, for the current generation of enterprise SATA, and by two, to 1 in 10E16, for the current generation of FC/SAS drives. The problem is that drive densities have increased at a faster rate.
| Year | Drive Size in GB | Hard Error Rate (bits) | Density Increase (times) |
| --- | --- | --- | --- |
| 1994 | 4 | 1 in 10E14 | |
| 2009 | 450 | 1 in 10E16 | 113 |
| 2009 | 1,500 | 1 in 10E15 | 375 |
What this means for you is that even for enterprise FC/SAS drives, density is increasing faster than the hard error rate is improving. This is especially true for enterprise SATA, where density increased by a factor of about 375 over the last 15 years while the hard error rate improved only 10 times. This makes the RAID group less reliable, given the higher probability of hitting a hard error during a rebuild.
| Device | Hard Error Rate in Bits | PB Equivalent |
| --- | --- | --- |
| Consumer SATA | 1 in 10E14 | 0.89 |
| Enterprise SATA | 1 in 10E15 | 8.88 |
| Enterprise SAS/FC | 1 in 10E16 | 88.82 |
Think about it: only 8.88 PB of data has to move before you hit the hard error rate, and remember that these values are vendor specifications, which real-world drives are unlikely to beat. Consumer-grade SATA drives are pretty scary, and must be a consideration when talking to low-end RAID vendors who might use them. A single 8+2 rebuild with 1.5 TB drives reads and writes about 28.5 TB (read 1.5 TB*9 drives, write 1.5 TB*10 drives). That means that in a perfect world, using vendor specifications, we can expect one out of every 312 rebuilds to fail with data loss today. Moving to the next-generation 2 TB drives, that drops to one in every 234 rebuilds, and with 4 TB drives to one in every 117.
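A minimal sketch of that arithmetic, using the article's 8+2 formula and the 8.88 PB enterprise SATA figure from the table above (decimal TB and PB throughout):

```python
# Expected rebuilds between data-loss events, per the article's formula:
# an 8+2 rebuild reads (drives - 1) drives and writes all drives, and one
# hard error is expected per 8.88 PB moved (enterprise SATA, table above).

PB_PER_HARD_ERROR = 8.88   # enterprise SATA figure from the table
DRIVES = 10                # 8 data + 2 parity

def rebuilds_per_data_loss(drive_tb):
    tb_moved = drive_tb * (DRIVES - 1) + drive_tb * DRIVES  # read 9, write 10
    return PB_PER_HARD_ERROR * 1000 / tb_moved              # PB -> TB

for size_tb in (1.5, 2.0, 4.0):
    print(f"{size_tb} TB drives: one failed rebuild per "
          f"~{rebuilds_per_data_loss(size_tb):.0f} rebuilds")
```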
Some considerations:
- Hard error rates are not likely to improve, given the cost of making the drives more reliable.
- Given that the MTBF RAID vendors observe is often half of what drive vendors quote, real-world hard error rates are likely worse than the drive vendors' published numbers once the drives are put into RAID disk trays, where the need for both high performance and dense packaging increases vibration and heat.
All of this means that the hard error rate is becoming a big factor in MTDL (Mean Time before Data Loss) as part of rebuild, given the amount of data moved and the density of the drives.
OSD, Declustered RAID Might Help
Some vendors claim that file systems can address the storage reliability problem, but I don’t buy that argument, given the latency inherent in large operating systems and file systems, and the need for high performance (see File System Management Is Headed for Trouble and SSDs, pNFS Will Test RAID Controller Design). And running software RAID-5 or RAID-6 equivalent does not address the underlying issues with the drive. Yes, you could mirror to get out of the disk reliability penalty box, but that does not address the cost issue.
The disk reliability density problem is getting worse, and fast. Some vendors are using techniques such as write logging (keeping track of writes on another disk during a rebuild so the rebuild can complete faster) to get around the growing problem. Will this solve the problem for the long term, or is it the equivalent of the RAID-5 to RAID-6 fix that just delayed the inevitable? Personally, I think it will turn out to be just another short-term fix. The real fix must be based on new technology such as OSD, where the disk knows what is stored on it and only has to read and write the objects being managed rather than the whole device, or something like declustered RAID. In essence, the disk drive layer needs more knowledge of what it is storing, or fixed RAID devices must be rethought. Or both.
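To make the OSD point concrete, here is an illustrative sketch (not any vendor's design) of why an object-aware rebuild moves less data than a whole-device rebuild: it only reconstructs the space actually allocated.

```python
# Illustrative only: whole-device rebuild vs. object-aware (OSD-style)
# rebuild. An OSD knows which objects lived on the failed drive, so it
# rebuilds only allocated space; classic RAID reconstructs every sector.

DRIVE_TB = 1.5  # assumed drive size

def whole_device_tb(drive_tb):
    return drive_tb                     # every sector, used or not

def object_aware_tb(drive_tb, fill_fraction):
    return drive_tb * fill_fraction     # only allocated objects

for fill in (0.25, 0.50, 0.75):
    print(f"{fill:.0%} full: classic {whole_device_tb(DRIVE_TB):.2f} TB, "
          f"object-aware {object_aware_tb(DRIVE_TB, fill):.2f} TB")
```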
There are some technologies available today, and others on the way, that could help alleviate the problem. The sooner they arrive, the better our chances of staying ahead of MTDL.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 28 years of experience in high-performance computing and storage.
Article courtesy of Enterprise Storage Forum.