Asking Questions
To reduce the risks associated with hindsight bias, develop post-problem
questionnaires in advance for each system, or class of system. When
incidents happen and it is time to interview and observe the team, use
the questionnaires as guides or templates.
Here are some questions that should give you a few starting points:
Processes -- The first category to check is the processes
involved at the time of the incident:
Did something about the process change? Was it formally documented
and adopted just prior to the incident?
What processes failed and why?
Are people bypassing the formal documented processes? Why?
Are there processes and/or controls that need to be added?
Are there processes and/or controls that need to be changed?
Documentation -- Were the organization, system and surrounding
processes mature enough to be documented?
Was the documentation readily available?
Could the operators find the needed topics in the manual?
How can the documentation be improved?
Training -- Ensuring there is proper training and understanding of
the processes and documentation is another major step on the road to
maturity.
Were the operators trained appropriately?
Were they trained to handle the scenario that took place and, if so,
how similar was the training to reality?
How could training be improved?
The Operators -- Fatigue, emotions and pressures all affect the
cognitive abilities of the people operating systems. Be sure to factor
them in.
Were the operators angry, upset or anxious?
Were there corporate pressures, such as expense control or other
financial constraints?
Were there pressures to meet unrealistic deadlines?
Technical Questions -- Yes, the bits, bytes and technical details
are last. You can fill in the questions pertinent to the systems used.
However, do consider including the following:
Was there a failure in multiple subsystems or just one?
Was the failure, or sequence of failures, predictable?
Were the parties involved experienced with the systems and
subsystems in production or was there a new variable?
What testing was done prior to going into production?
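One way to have such a questionnaire ready before an incident is to keep it as structured data and generate a copy for each system. The following is only a minimal sketch in Python: the names BASE_QUESTIONS, build_questionnaire and render are hypothetical, as are the system name "billing-db" and its extra question in the usage lines. The base question wording is taken from the list above, abbreviated to two questions per category.

```python
# Base questions shared by every system (hypothetical structure; the
# question wording comes from the list above, two per category).
BASE_QUESTIONS = {
    "Processes": [
        "What processes failed and why?",
        "Are people bypassing the formal documented processes? Why?",
    ],
    "Documentation": [
        "Was the documentation readily available?",
        "How can the documentation be improved?",
    ],
    "Training": [
        "Were the operators trained appropriately?",
        "How could training be improved?",
    ],
    "The Operators": [
        "Were there pressures to meet unrealistic deadlines?",
    ],
    "Technical Questions": [
        "Was the failure, or sequence of failures, predictable?",
        "What testing was done prior to going into production?",
    ],
}


def build_questionnaire(system, extra_technical=()):
    """Return a per-system copy of the base questionnaire.

    System-specific questions are appended to the technical category,
    which stays last, matching the order of the list above.
    """
    categories = {name: list(qs) for name, qs in BASE_QUESTIONS.items()}
    categories["Technical Questions"].extend(extra_technical)
    return {"system": system, "categories": categories}


def render(questionnaire):
    """Format the questionnaire as plain text for the interviewer."""
    lines = ["Post-problem questionnaire: " + questionnaire["system"]]
    for category, questions in questionnaire["categories"].items():
        lines.append(category)
        lines.extend("  - " + q for q in questions)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical system and system-specific question, for illustration only.
    q = build_questionnaire(
        "billing-db",
        ["Did replication lag exceed its alert threshold?"],
    )
    print(render(q))
```

Because each system gets its own copy, questions added for one system do not leak into the shared base list, and the questions are fixed before anyone knows how a given incident turned out.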
It is always beneficial to learn from mistakes and outages. Problem
review boards analyzing an incident after the fact need to beware of
allowing their knowledge of outcomes to bias their examination of the
steps that led up to the event. They must pay appropriate attention to
the processes and human factors that could create fertile environments
for failure, not just the technical elements.
In this age of ever-increasing complexity, there will always be
incidents and underlying problems. They must be addressed with proper
organizational learning and corrective actions to keep them from
recurring.