In fact, operators are identified as the culprits in 80 percent of all accidents. Doesn't this seem odd in this day and age of automation and complex systems? Doesn't this seem like an easy out on the part of investigators? Rather than play the blame game for whatever reason, organizations should focus on causal factors and address the real issues.
It's no secret that our world is becoming more complex. Not only does everything from stoves to air conditioners to the space shuttle contain electronics and computers, but these devices are increasingly integrated into a web of connections, creating dependencies and interactions never before dreamed of. However, with all of these interconnections and dependencies comes the potential for some serious headaches.
When multiple things fail and interact in ways that, combined, cause the system to fail, this is known as systemic causality. The trick here is that systemic causality does not necessarily follow a linear path. Instead, it depends on multiple components, or subsystems, failing and interacting with one another.
For example, a power failure causing systems to crash seems easy to explain. But when we dig in, we find that there was a UPS (uninterruptible power supply) and a generator. The systems should have been protected.
However, we find out that the UPS died earlier than expected because a couple of space heaters were plugged in on the UPS circuit; an electrician had accidentally connected an outlet to the protected power circuit. The heaters ran only during business hours, when the staff was present, so the load was never noticed during regular weekend testing.
To make matters worse, the generator hadn't been exercised for a long time because its electronic remote starter had failed, and the fuel had evaporated and clogged the fuel injection system. Operations didn't notice because the generator's monitoring circuit had been reporting erroneous readings for quite some time, so operators discounted what the logs said. On top of that, the second generator was down for preventive maintenance.
The realistic answer is that nobody could know that all of these independent issues would interact to cause an outage. How would you address this type of scenario?
The hard part, and this is why operators get blamed the most, is that failures in the component systems often create scenarios that were previously unfathomable. If you visit this site and go to the accidents and errors page, you will see a long list of accidents, many of which are blamed on operators. Why? My hunch is that they needed to blame somebody. "We are going to put you into a situation for which you are totally unprepared and see how you do. If you fail, we'll blame you." That sure sounds promising, doesn't it?
So what does this mean? First, vendors and internal groups must place an emphasis on proper design and thorough testing. The testing needs to be formalized, and there are software test engineers versed in the proper methodologies. Note that quality must come first, before features. Testing is a much-needed detective control that assists in ensuring quality, but it is not the total solution; it must be integrated with the overall system so that its feedback drives process improvement loops. To borrow a phrase from manufacturing -- you don't inspect quality in, you build quality in.
Second, an effective change management process must be in place and followed. There must be detective controls that can assist with the mapping of changes found in production back to authorized change orders. Only authorized changes should be allowed to remain. The ITIL Service Support book and the ITPI Visible Ops methodology provide great guidance here.
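To make the idea of a detective control concrete, here is a minimal sketch in Python of mapping changes found in production back to authorized change orders. Everything in it is an assumption made for illustration: the DetectedChange record, the reconcile function, the change order IDs and the sample data are hypothetical and do not come from ITIL, Visible Ops or any particular tool.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class DetectedChange:
    """A change observed in production (hypothetical record shape)."""
    host: str
    item: str                  # file, package, or setting that changed
    change_id: Optional[str]   # change order cited for it, if any


def reconcile(
    detected: Iterable[DetectedChange], authorized_ids: Set[str]
) -> Tuple[List[DetectedChange], List[DetectedChange]]:
    """Split detected changes into (authorized, unauthorized)."""
    authorized: List[DetectedChange] = []
    unauthorized: List[DetectedChange] = []
    for change in detected:
        if change.change_id in authorized_ids:
            authorized.append(change)
        else:
            # No change order cited, or one that was never approved.
            unauthorized.append(change)
    return authorized, unauthorized


# Made-up sample data for illustration only.
approved = {"CHG-1001", "CHG-1002"}
found = [
    DetectedChange("web01", "/etc/httpd/conf/httpd.conf", "CHG-1001"),
    DetectedChange("db01", "/etc/my.cnf", None),
    DetectedChange("web02", "openssl package", "CHG-9999"),
]

ok, flagged = reconcile(found, approved)
for change in flagged:
    print(f"Unauthorized change on {change.host}: {change.item}")
```

In this sketch, anything that cannot be tied to an approved change order lands in the flagged list, which is where a review or rollback decision would start. How changes are actually detected in production is outside its scope.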
Third, we must evolve adaptive processes that rapidly recognize and adapt to variations from the understood mean. This applies not just to application logic, but to manual human processes as well. Systems and their operators must be adept at recognizing the need to change and then actually changing in a secure, timely and efficient manner.
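One simple way to recognize a variation from the understood mean is to compare a new reading against a baseline of past readings. The sketch below assumes a basic standard-deviation test; the function name deviates, the threshold of three standard deviations and the UPS load figures are all hypothetical choices for illustration, not a prescribed method.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates(baseline: Sequence[float], observation: float,
             threshold: float = 3.0) -> bool:
    """Return True when the observation lies more than `threshold`
    sample standard deviations away from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline readings")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any difference counts as a deviation.
        return observation != mu
    return abs(observation - mu) / sigma > threshold


# Made-up UPS load readings (percent of capacity) from weekend tests.
weekend_load = [40, 42, 41, 39, 43]

print(deviates(weekend_load, 42))  # a reading close to the baseline
print(deviates(weekend_load, 71))  # a weekday reading with extra load
```

A check like this is only as good as its baseline. If the readings are only ever taken on weekends, as in the UPS example earlier, the weekday load never gets compared against anything.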
Fourth, members of failure review boards must avoid taking the easy way out. Rather than flag the outcome as a result of operator error, ask yourself these two simple questions: "Could anyone have realistically known what to do in that situation?" and "Could anyone have computed the solution and acted in the timeframe allotted?" Quite often, the answer is "no," which points back to systemic issues that are process and/or technically based.
Rather than expect operators to perform superhuman acts of omniscience, we must confront the systemic issues in processes and technology in order to prevent the accidents from happening again. Yes, people can and do make mistakes. The point is that it is too simple to blame the operator for unexplained failures.
Organizations must dig in and ensure that there is learning after the accident and that appropriate measures are introduced to prevent recurrences. This is done by addressing root causes -- not just playing the blame game.