The Agony and Ecstasy of RAID: Page 3

(Page 3 of 3)

When arrays with parity perform a rebuild operation they perform a complex process by which they step through the entire contents of the array and write missing data back to the replaced drive. In and of itself this is relatively simple and should be no cause for worry.

What I and others have seen first hand is a slightly different scenario involving disks that have lost connectivity due to loose connectors to the array. Drives can commonly "shake" loose over time as they sit in a server, especially after several years of service in an always-on system.

What can happen, in extreme scenarios, is that good data on drives can be overwritten by bad parity data when an array controller believes that one or more drives have failed in succession and been brought back online for rebuild. In this case the drives themselves have not failed and there is no data loss. All that is required is that the drives be reseated, in theory.

On hot swap systems the management of drive rebuilding is often automatic, based on the removal and replacement of a failed drive. So this process of losing and replacing a drive may occur without any human intervention – and a rebuilding process can begin. During this process the drive system is at risk and should this same event occur again the drive array may, based upon the status of the drives, begin striping bad data across the drives overwriting the good filesystem.

It is one of the most depressing sights for a server administrator to see when a system with no failed drives loses an entire array due to an unnecessary rebuild operation.

In theory this type of situation should not occur and safeguards are in place to protect against it. But the determination of a low level drive controller as to the status of a drive currently and previously and the quality of the data residing upon that drive is not as simple as it may seem and it is possible for mistakes to occur.

While this situation is unlikely it does happen and it adds a nearly impossible to calculate risk to RAID 5 and RAID 6 systems. We must consider the risk of parity failure in addition to the traditional risk calculated from the number of drive losses that an array can survive out of a pool. As drives become more reliable the significance of the parity failure risk event becomes greater.

Additionally, RAID 5 and RAID 6 parity introduces system overhead due to parity calculation, which is often handled by way of dedicated RAID hardware. This calculation introduces latency into the drive subsystem that varies dramatically by implementation both in hardware and in software. This makes it impossible to state performance numbers of RAID levels against one another as each implementation will be unique.

Possibly the biggest problem with RAID choices today is that the ease with which metrics for storage efficiency and drive loss survivability can be obtained mask the big picture of reliability and performance as those statistics are almost entirely unavailable. One of the dangers of metrics is that people will focus upon factors that can be easily measured and ignore those that cannot be easy measured regardless of their potential for impact.

While all modern RAID levels have their place it is critical that they be considered within context and with an understanding as to the entire scope of the risks. We should work hard to shift our industry from a default of RAID 5 to a default of RAID 10. Drives are cheap and data loss is expensive.


Page 3 of 3

Previous Page
1 2 3
 



Tags: data, backup, RAID, disk failure, Storage


0 Comments (click to add your comment)
Comment and Contribute

 


(Maximum characters: 1200). You have characters left.