How often have you reviewed an incident and asked, “How could they fail
to see the cause of the error before it became such a huge problem?”
Certainly there are benefits to reviewing a negative event, or series of
events, and determining how to prevent them. During the review process,
what must be avoided is allowing knowledge of the outcome to cloud your
judgment. This distortion of perception by knowledge of the subsequent
outcome is known as “hindsight bias,” and it can degrade the quality of
the review process.
People involved with problem management must avoid this phenomenon and
ask probing questions that dig deeper into the causal elements of the
incident, gaining better insight.
Essentially, once you know the outcome of a chain of events, you tend to
view all actions performed and decisions made through the lens of the
final outcome.
For example, a series of warning alarms goes off in a sequence never
before considered. Shortly after, the system begins to fail and the
operator makes matters worse by making a decision on the spot with very
limited information and time. When going back over the accident, the
reviewer finds it very clear that the chaotic warnings indicated a
failing key subsystem, because the incident report lists the failure in
detail.
In hindsight, it’s obvious. But in the heat of battle, it may have been
anything but obvious for a variety of reasons.
Organizations must take care not to rush to judgment. Far
too often these days, groups are deploying systems with little testing,
little to no documentation, and virtually no training. And in this day
of compressed time and high-speed systems, once the operators do
encounter an issue, the issues often compound and mushroom out of
control at amazing speeds.
How can any operator, or group of operators, be expected to effectively
respond without proper training and support mechanisms?
In short, they can’t.
Problem Management
From ITIL, we know that Problem Management essentially involves focusing
on an incident, or series of incidents, in order to identify the
underlying causal factors (“problems”) and prevent repetition. To
identify the root causes accurately, problem managers and problem review
boards must beware of letting hindsight bias cloud the problem review
process, which leads to oversimplification and/or the personalization of
causality.
In other words, a board must not look at an accident and literally say,
“The sequence is so simple! How could they miss it? It must be operator
error.”
When complex systems are involved, there are often far more contributing
factors than one might initially think.
Focus on the Processes
First and foremost, instead of personalizing the causality and blaming
the operators, reviewers must recognize that there very often are levels
of complexity beyond what is superficially visible. Furthermore, they
need to take a step back and look at what key control points and
processes are lacking.
For example, without exception, as the level of complexity increases in
a system, the value of an effective change management process increases.
Yet, this incredibly valuable process and the associated controls are
all too often overlooked or even discounted as too bureaucratic.
Returning to the point, a great many problems are rooted in process
failures that the human element then exacerbates.
To reduce the risks associated with hindsight bias, develop post-problem
questionnaires in advance for each system, or class of system. When
incidents happen and it is time to interview and observe the team, use
the questionnaires as guides or templates.
Here are some questions that should give you a few starting points:
involved at the time of the incident:
- Did something about the process change? Was it formally documented
and adopted just prior to the incident?
- What processes failed and why?
- Are people bypassing the formal documented processes? Why?
- Are there processes and/or controls that need to be added?
- Are there processes and/or controls that need to be changed?
processes mature enough to be documented?
- Did documentation exist?
- Was the documentation readily available?
- Was it understandable?
- Could they find the needed topics in the manual?
- How can the documentation be improved?
the processes and documentation is another major step on the road to
maturity.
- Were the operators trained appropriately?
- Were they trained to handle the scenario that took place and if so,
how similar was the training to reality?
- How could training be improved?
cognitive abilities of the people operating systems. Be sure to factor
them in.
- Was fatigue a factor?
- Were the operators angry, upset, anxious?
- Were there corporate pressures, such as expense control or other
financial constraints?
- Were there pressures to meet unrealistic deadlines?
are last. You can fill in the questions pertinent to the systems used.
However, do consider including the following:
- Was there failure in multiple subsystems or just one?
- Was the failure, or sequence of failures, predictable?
- Did the alarms work?
- Were the parties involved experienced with the systems and
subsystems in production or was there a new variable?
- What testing was done prior to going into production?
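
For teams that keep their review material in a repository, one way to
prepare such a questionnaire in advance is as a small data structure
per class of system. The following is only a sketch under assumptions:
the section names ("process", "documentation", "training", "human
factors", "technical") are labels chosen here for illustration, and the
Questionnaire class and build_template helper are hypothetical, not part
of ITIL or any tool. Only a handful of the questions above are included.

```python
from dataclasses import dataclass, field


@dataclass
class Questionnaire:
    """A post-problem questionnaire prepared ahead of time (hypothetical)."""
    system_class: str
    # Section label -> questions. The labels are illustrative assumptions.
    sections: dict[str, list[str]] = field(default_factory=dict)

    def unanswered(self, answers: dict[str, str]) -> list[str]:
        """Return every question with no non-blank answer recorded."""
        return [
            question
            for questions in self.sections.values()
            for question in questions
            if not answers.get(question, "").strip()
        ]


def build_template(system_class: str) -> Questionnaire:
    """Build a starter questionnaire from a few of the questions above."""
    return Questionnaire(
        system_class,
        sections={
            "process": [
                "What processes failed and why?",
                "Are people bypassing the formal documented processes? Why?",
            ],
            "documentation": [
                "Did documentation exist?",
                "Was the documentation readily available?",
            ],
            "training": ["Were the operators trained appropriately?"],
            "human factors": ["Was fatigue a factor?"],
            "technical": [
                "Did the alarms work?",
                "What testing was done prior to going into production?",
            ],
        },
    )
```

During a review, the board could record answers as it interviews the
team and call unanswered() to see which questions have not yet been
asked, rather than stopping at the first explanation that fits the
known outcome.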
It is always beneficial to learn from mistakes and outages. Problem
review boards analyzing an incident after the fact need to beware of
allowing their knowledge of outcomes to bias their examination of the
steps that led up to the event. They must pay appropriate attention to
the processes and human factors that could create fertile environments
for failure, not just the technical elements.
In this age of ever-increasing complexity, there will always be
incidents and underlying problems that must be addressed with proper
organizational learning and corrective actions to keep the problem from
popping up again.