For simplicity, let’s define MTBF as the average time between failures. It is based either on historical data or on vendor estimates and is used as a benchmark for reliability. Organizations that trend MTBF over time can readily see which devices are failing more often than average and take appropriate action.
Where MTBF breaks down is when management puts too much faith in unproven MTBF estimates and uses them to justify inordinately large capital investments in complex systems. This may seem to be a bold statement and therefore requires some explanation.
If we assume that there are 8,760 hours per year (365 days x 24 hours per day), then we can divide a vendor’s MTBF claim by that figure to see how long the system should run in years. If we buy a system, or component, with an MTBF rating of 30,000 hours, then we might assume that, on average, the system would run 3.42 years without a failure (30,000 / 8,760). Granted, there are always statistical variations around the average, but 3.42 years doesn’t seem bad at all, does it?
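The conversion above, along with a caution about what an average hides, can be sketched in a few lines of Python. The survival calculation assumes a constant failure rate (an exponential failure model). That is a common simplification, not something a vendor rating guarantees, and the function names are purely illustrative.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, as in the text


def mtbf_years(mtbf_hours: float) -> float:
    """Convert an MTBF rating in hours to years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def survival_probability(run_hours: float, mtbf_hours: float) -> float:
    """Chance of running for run_hours with no failure at all.

    Assumption: constant failure rate (exponential model), so
    P(no failure by time t) = exp(-t / MTBF).
    """
    return math.exp(-run_hours / mtbf_hours)


rating = 30_000  # vendor MTBF claim, in hours

print(f"MTBF in years: {mtbf_years(rating):.2f}")
print(f"Chance of reaching the MTBF failure-free: "
      f"{survival_probability(rating, rating):.1%}")
print(f"Chance of a failure-free first year: "
      f"{survival_probability(HOURS_PER_YEAR, rating):.1%}")
```

Under that assumption, only about 37 percent of units would reach the 3.42-year mark without a failure, and roughly one in four would fail within the first year. The average is real, but it is not a promise.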
There’s a problem with this rationale, however, especially when applied to complex systems. First, as previously mentioned, it is both an estimate and an average. You run the risk of being one of the seeming statistical anomalies with a far higher frequency of failure that gets smoothed out by the averaging! The reason could simply be that the MTBF estimate assumed environmental conditions, such as heat and power, that differ from your own.
Second, fault-tolerance costs accelerate very rapidly as higher and higher MTBF levels are sought. Third, and perhaps most important, fault-tolerant systems (hardware, software, documentation and processes) are typically more complex than non-fault-tolerant systems and become increasingly complex as the level of fault tolerance increases. This increased level of complexity, in and of itself, creates fertile ground for disasters.
Coupling, Complexity and Normal Accidents
In 1984, Charles Perrow wrote an amazing book titled Normal Accidents: Living with High Risk Technologies. In it he observed that system accidents can be the result of one big failure, but most often are caused by the unexpected interactions between failures of multiple components.
In other words, complex systems whose components are tightly integrated typically fail through the culmination of multiple components failing and interacting in unexpected ways. For example, it’s very rare that a plane has a wing fall off mid-flight. It’s far more likely that several component failures interact in unpredictable ways that, when combined, cause a catastrophe. Let’s investigate this line of reasoning further.
First, errors can be readily visible or latent. The former we can deal with when we detect them. The latter are far more insidious because they can be in a system and undetected, “waiting to spring,” if you will.
Second, complex systems made up of hundreds, if not thousands, of components that interact tightly are considered to be tightly “coupled.” The possible pathways of interaction are not necessarily predictable. Perrow points out that during an accident, the interaction of failed components can initially be incomprehensible.
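To get a rough feel for why those pathways are hard to predict, the short Python sketch below counts only the possible pairwise interactions among components. This is a deliberate simplification for illustration and not Perrow’s own analysis: real systems do not couple every pair, and failures can involve three or more components at once.

```python
import math


def pairwise_interactions(components: int) -> int:
    """Number of distinct component pairs: n * (n - 1) / 2."""
    return math.comb(components, 2)


for n in (10, 100, 1_000):
    print(f"{n:>5} components -> {pairwise_interactions(n):>7} possible pairs")
```

Ten components yield 45 possible pairs, while a thousand yield 499,500. The number of places where an unexpected interaction can hide grows far faster than the component count.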
Let’s take a highly fault-tolerant database server with its own external RAID and then use clustering software to join it with another server located in another data center.
At this point, we have a pretty complex system comprised of thousands of components that are all tightly coupled. The IT operations staff is capable and diligent, performing nightly backups.
Now, imagine that there is a programming error on the RAID controller: an unexpected combination of data throughput and multi-threaded on-board processor activity causes a periodic buffer overflow, and the resulting corrupt data is then written to the drives. It doesn’t happen often, but it does happen. Because the systems are exact duplicates of one another, the issue occurs on both nodes of the cluster.
At an observable level, everything would seem to be OK because the error is latent. It isn’t readily apparent until one or more database structures become sufficiently corrupt to raise awareness of the issue. Once it does happen, the network and database people scramble to find out what is wrong and go tracking through the logs looking for clues and checking for security breaches because “it was running fine.” The point is that multiple components can interact in unforeseen ways to bring down a relatively fault-tolerant system.
Mean Time to Repair (MTTR)
Let’s face it, accidents can and will happen. Fault tolerance can create a false sense of security. From our 30,000-hour example, we could unrealistically expect 3.42 years of uninterrupted bliss, but reality and Mr. Murphy don’t like this concept.
Yes, fault tolerance reduces the chance of some errors, but as the system’s inherent complexity and level of interaction increase, the chance of an accident increases. How often is a fault-tolerant system simple? How many people in your organization fully understand your fault-tolerant systems? There are many questions that can be asked, but here is the most important question: “When the system fails, and it will fail, how easy will it be to recover?”
Not too surprisingly, there often is a dichotomy with highly fault-tolerant systems. On one hand, their likelihood of failure is lower than that of a standard system lacking redundancy, but on the other, when they do fail, they can be a bear to troubleshoot and get back online.
Instead of spending tens, if not hundreds, of thousands of dollars on fault-tolerant hardware, what if IT balanced the costs of the fault tolerance with an eye toward unrelentingly driving down the MTTR of the systems? True systemic fault tolerance is a combination of hardware, software, processes, training, and effective documentation. Sometimes, teams focus on the hardware first, treat the software requirements as a distant second, and then totally overlook the process, training, and documentation needs.
Always remember that availability can be addressed by trying to prevent downtime through fault tolerance as well as by reducing the time spent recovering when an actual outage does occur. Therefore, activities surrounding the rapid restoration of service and problem resolution are essential. The ITIL Service Support book provides great guidance on both initially restoring services through Incident Management and ultimately addressing the root causes of the outage via Problem Management.
The Recovery-Oriented Computing (ROC) research team, a joint project at Berkeley and Stanford, also provides great information about designing systems for rapid recovery.
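The two levers on availability described above, preventing failures and recovering faster, can be made concrete with the standard steady-state formula: availability = MTBF / (MTBF + MTTR). The Python sketch below is illustrative only. The 8-hour and 4-hour recovery times are hypothetical figures chosen for the example, and the function names are made up.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical scenarios built on the 30,000-hour example.
baseline = availability(30_000, 8)     # 8-hour recovery (assumed)
double_mtbf = availability(60_000, 8)  # buy twice the MTBF
halve_mttr = availability(30_000, 4)   # recover twice as fast

for label, value in [("baseline", baseline),
                     ("double MTBF", double_mtbf),
                     ("halve MTTR", halve_mttr)]:
    print(f"{label:<12} {value:.6f}")
```

Because availability depends only on the ratio of MTTR to MTBF, halving the recovery time yields the same availability as doubling the MTBF. Depending on the environment, better processes, training, and documentation may be the less expensive way to get there.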
Of course MTBF matters. It is an important metric to track in regard to system reliability. The main point is that even fault-tolerant servers fail. As the level of complexity and coupling increases, systemic failure due to the accumulation of component failures interacting in previously unexpected ways is inevitable. IT should look at availability holistically and consider addressing both the fault tolerance of the initial system design and the speed with which a failed system can be recovered.
In some cases, it may make far more sense to invest less in capital-intensive hardware and more in the training, documentation and processes necessary to both prevent and recover from failures.