At this point, three key terms should be clarified: 1) an incident is any deviation from the standard operation of a system that causes, or could cause, a service interruption; 2) a problem is the condition of having multiple similar incidents; and 3) a known error is the identified root cause of a problem.
Essentially, from ITIL we understand that there are two management forces at work. First, there is incident management, which is concerned with restoring service as quickly as possible, often using workarounds that address known errors. Second, problem management is geared toward both proactively and reactively addressing the underlying causal factors of incidents. Readers might want to review the ITIL Service Support volume's chapter on Problem Management to gain a better understanding.
As complexity increases, the percentage of total system understanding held by any one IT person decreases. This is because the level of expertise required to build complex systems demands the involvement of multiple parties. There is no realistic alternative. Whether a system is developed entirely in-house, outsourced or some combination thereof, multiple people, and often multiple organizations, are involved.
Problem Review Boards (PRB)
Just as change advisory boards (CABs) review updates to production systems, a parallel group or groups must review incidents to identify trends and problems and, ultimately, to determine root causes and mitigations.
Depending on the complexity of the organization, there may be one PRB overall or a PRB per system. For that matter, some organizations may be so small or simple that they do not need PRBs. In those cases, it is recommended that they still understand the ITIL Problem Resolution processes and adopt best practices into their organizations. For organizations with complex systems, regardless of size, the implementation of PRBs needs to be seriously considered.
The goal of the PRB is to govern problem management reactively and proactively. This is done by analyzing incidents as they happen, reviewing historic trend data and staying abreast of current industry news and vendor updates. For example, a switch may not have failed yet, but your PRB may know of an operating system bug that has been identified and eliminated by another organization using the same switch. Hence, it would be advisable to assess the risk and determine the best means of modifying the switch to mitigate it before an incident occurs.
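The proactive side of this can be illustrated with a small sketch that checks an asset inventory against known vendor advisories. It is only an illustration under stated assumptions: the `Asset` and `Advisory` records, the `assets_at_risk` helper, and the sample device names and versions are all hypothetical, and a real PRB would draw this data from its own configuration and vendor sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """A hypothetical inventory record for a production device."""
    name: str
    model: str
    os_version: str


@dataclass(frozen=True)
class Advisory:
    """A hypothetical vendor or industry notice about a known bug."""
    model: str
    affected_versions: frozenset
    summary: str


def assets_at_risk(assets, advisories):
    """Return (asset, advisory) pairs where the asset runs an affected version."""
    return [
        (asset, advisory)
        for asset in assets
        for advisory in advisories
        if asset.model == advisory.model
        and asset.os_version in advisory.affected_versions
    ]


# Illustrative data only; none of these devices or versions are real.
inventory = [
    Asset("core-sw-01", "SwitchX", "4.1"),
    Asset("core-sw-02", "SwitchX", "4.3"),
    Asset("edge-rt-01", "RouterY", "4.1"),
]
advisories = [
    Advisory("SwitchX", frozenset({"4.0", "4.1"}), "OS bug causing port resets"),
]

at_risk = assets_at_risk(inventory, advisories)
for asset, advisory in at_risk:
    print(f"{asset.name}: {advisory.summary}")
```

Each match would be a candidate for risk assessment by the PRB, not an automatic change to production.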
The PRB must include representation from the relevant stakeholders so that it can effectively review incident trend data for reactive problem management and search for risks from a proactive stance. This means the PRB could be composed of vendor personnel, consultants, IT operations, IT release management, security, and so on. Incidents that appear to be establishing a trend would then be assigned to teams that would search for the underlying problem.
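As a rough sketch of how incidents "establishing a trend" might be surfaced for the PRB, the example below groups recent incidents by component and symptom and flags any group that reaches a threshold. The `Incident` record, the `find_trends` function, the 30-day window and the threshold of three are all assumptions made for illustration; each organization would choose its own grouping criteria and limits.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """A hypothetical incident record as it might come from a service desk tool."""
    incident_id: str
    component: str
    symptom: str
    opened_at: datetime


def find_trends(incidents, now, window=timedelta(days=30), threshold=3):
    """Group incidents opened within the window by (component, symptom).

    Returns only the groups with at least `threshold` incidents, which are
    candidates for assignment to a team that searches for the underlying problem.
    """
    groups = defaultdict(list)
    for incident in incidents:
        if now - window <= incident.opened_at <= now:
            groups[(incident.component, incident.symptom)].append(incident.incident_id)
    return {key: ids for key, ids in groups.items() if len(ids) >= threshold}


# Illustrative data only.
now = datetime(2024, 6, 30)
incidents = [
    Incident("INC-1", "core-sw-01", "port reset", datetime(2024, 6, 2)),
    Incident("INC-2", "core-sw-01", "port reset", datetime(2024, 6, 11)),
    Incident("INC-3", "core-sw-01", "port reset", datetime(2024, 6, 25)),
    Incident("INC-4", "mail-01", "disk full", datetime(2024, 6, 20)),
    Incident("INC-5", "core-sw-01", "port reset", datetime(2024, 3, 1)),  # outside window
]

trends = find_trends(incidents, now)
for (component, symptom), ids in trends.items():
    print(f"Possible problem on {component} ({symptom}): {', '.join(ids)}")
```

A count threshold is a deliberately simple criterion; it only flags candidates, and the judgment about whether a real problem exists stays with the PRB and the assigned team.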
Once the underlying problem has been identified, a request for change would be issued through the change management process to validate and enact the recommendations of the PRB as they relate to the production systems and various stored configurations. It is important to note that the PRB must not spawn a separate change vector to production; rather, it must serve as an input to the standard change management process, which must itself have an established means of handling emergency change requests.
Complexity is increasing the need for effective communication and coordination among the various groups involved with complex systems. As complexity and technical specialization grow together, enhanced processes will be needed to meet service levels, including availability and security. To foster appropriate problem identification and root cause analysis, enterprises should use problem review boards, with the necessary stakeholders represented, to make decisions for both proactive and reactive problem management.
How does your organization handle incident and problem management? If you have any stories or examples you'd like to share, please email me at firstname.lastname@example.org.