So what happens when a change fails?
Is there chaos or is a contingency plan executed?
High-availability organizations are always evolving their contingency plans as failures occur or the prospect of failure appears. Rather than putting their entire focus into one way of implementing a change and never taking failure into account, these groups may develop one or two contingency approaches. But they always have an ultimate "rollback" plan to return the system to the last known good state.
The need for security and the ability to maintain service levels in increasingly complex business and IT environments should be incentives for all organizations to put rollback plans in place.
Imagine that the most critical server in the company has a planned maintenance window on Sundays from 2 a.m. to 5 a.m. That time slot has been negotiated into service level agreements (SLAs), which are taken very seriously. During that maintenance window, the planned changes are going great.
Then suddenly, at 3 a.m., the server stops working, with two hours left in the window. The engineers repeatedly try to force the change through with ad hoc modifications. By 5 a.m. they are still pushing to get the changes in, and all the while the system remains down.
What is wrong with this scenario?
The crucial failures come at the end of the scenario. The engineers should have known how long to keep trying to implement the change, and when to stop, change gears and execute the rollback plan to bring the system back online within the scheduled amount of time.
To explain, a rollback plan is a recovery plan that aims at returning the system to its last known good state. It may be a tape restore or a reload of a configuration file. The rollback plan is the emergency escape plan to get the system back up before the prescribed amount of time elapses. The allowable time factor is a key point.
There are times when one change is all that will happen, and other times when the team has to install multiple changes on one host or across many hosts. Getting them all done within a planned maintenance window requires planning.
To make things simple, if there is one host, one change and a three-hour window, then basic logic tells us that we have three hours to get the change done. If there were three changes, then each change would use up some portion of that three-hour window based on estimates. The change planning process should always include a documented rollback plan and estimate as to how long it would take.
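One way to keep this planning honest is to budget each change's estimated implementation time plus its rollback time against the window. The sketch below is illustrative, not a prescribed method; the change names and minute figures are hypothetical, and it conservatively budgets a rollback for every change.

```python
def fits_window(window_min, changes):
    """Check that estimated change plus rollback times fit the window.

    `changes` is a list of (name, change_estimate_min, rollback_estimate_min).
    Budgeting both figures for every change is deliberately conservative:
    it covers the worst case where each change has to be backed out.
    """
    budget = sum(est + rollback for _, est, rollback in changes)
    return budget <= window_min

# Three changes in a three-hour (180-minute) window -- illustrative numbers.
plan = [("patch OS", 45, 15), ("upgrade DB", 60, 20), ("rotate certs", 20, 10)]
print(fits_window(180, plan))  # True: 170 minutes budgeted against 180
```

If the check fails at planning time, a change gets deferred to the next window rather than squeezed in.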
However long that rollback plan would take to execute is a key milestone scheduled back from the end of the permissible time. If the change is allotted one hour and the rollback plan would take 15 minutes, then between 40 and 45 minutes into the change, the engineers must actively decide whether to push ahead and finish, or to execute the rollback plan and restore the system to the last known good state.
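The decision-point arithmetic above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not a tool from the text; the timestamps and the five-minute decision buffer mirror the example in this article.

```python
from datetime import datetime, timedelta

def decision_point(change_start, allotted_min, rollback_min, buffer_min=5):
    """Latest moment to choose between finishing the change and rolling back.

    The rollback must complete before the allotted time ends, and the team
    needs `buffer_min` minutes to assess the situation and decide.
    """
    deadline = change_start + timedelta(minutes=allotted_min)
    return deadline - timedelta(minutes=rollback_min + buffer_min)

# A one-hour change with a 15-minute rollback, started at 2 a.m.
start = datetime(2024, 1, 7, 2, 0)   # hypothetical maintenance window
print(decision_point(start, 60, 15)) # 02:40 -- the go/no-go call starts here
```

Putting the computed timestamp on the change schedule, with a named decision-maker beside it, turns "we should have stopped" into an explicit checkpoint.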
If things are going well, then finish implementing the change. If things are going badly, then the team must roll back what has been done so far.
To decide to stop and actually admit the process has failed takes a lot of discipline.
In the previous example, notice the allotment of up to five minutes to decide. The decision control point must take place with enough time to assess the situation and make a decision. Sometimes it even takes a dispassionate third party to make the call, because the engineers, or "change builders," are so immersed in the details of the implementation that they fail to recognize that a decision is needed.
To be optimally effective, there are some issues to bear in mind.
First, the engineers must be able to count on the current state matching the official last known good state, whether that state is documented manually or automatically.
If change management, configuration management and release management disciplines are not in place, then precious time can be lost when a rollback plan fails because the production build didn’t match the documented last known good build.
Not only is testing rendered less meaningful when the test system doesn't mirror production, but it is also far harder to restore a system for which there is no current, accurate configuration data. In these cases, when failure happens during change implementation, work shifts from the vital task of recovery to forensics. You'll need to ask, "Why is this configuration value 1,000 instead of 1,500? Who changed it, and why?"
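A basic drift check, run before the maintenance window opens, catches exactly this kind of surprise. The sketch below assumes configuration can be represented as key/value pairs; the key names and values are hypothetical, echoing the 1,000-versus-1,500 example above.

```python
def config_drift(last_known_good, current):
    """Report keys whose current value differs from the documented baseline.

    Returns a dict of {key: (baseline_value, current_value)} for every
    key that is missing, added, or changed relative to the baseline.
    """
    return {
        key: (last_known_good.get(key), current.get(key))
        for key in set(last_known_good) | set(current)
        if last_known_good.get(key) != current.get(key)
    }

baseline = {"max_connections": 1500, "timeout_s": 30}
live     = {"max_connections": 1000, "timeout_s": 30}
print(config_drift(baseline, live))  # {'max_connections': (1500, 1000)}
```

If this report is empty before the change begins, the rollback plan rests on a verified last known good state rather than an assumed one.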
System changes can and do fail. As systems become increasingly complex, the probability of subtle differences in production causing a planned change to fail during implementation will climb as well.
Groups worried about meeting service level agreements and stakeholder expectations must recognize this correlation and require rollback plans as part of the change management planning process. Without the ability to recognize failure and quickly recover, the probability of unplanned work and downtime increases… and nobody wants that.