Recovery Process Needs Rollback Planning

Written By

George Spafford

Apr 14, 2005

So what happens when a change fails?

Is there chaos or is a contingency plan executed?

High-availability organizations continually evolve their contingency plans, both when failure occurs and when the prospect of failure appears likely. Rather than put their entire focus into one way of implementing a change and never taking failure into account, these groups may develop one or two contingency approaches. But they always have an ultimate “rollback” plan to return the system to the last known good state.

The need for security and the ability to maintain service levels in increasingly complex business and IT environments should be incentives for all organizations to put rollback plans in place.

Imagine that the most critical server in the company has a planned maintenance window on Sundays from 2 a.m. to 5 a.m. That time slot has been negotiated into service level agreements (SLAs), which are taken very seriously. During that maintenance window, the planned changes are going great.

Then, at 3 a.m., the server suddenly stops working, with only two hours left in the window. The engineers repeatedly try to get the change in place by making ad hoc modifications. By 5 a.m. they are still pushing to get the changes in, and all the while, the system remains down.

What is wrong with this scenario?

First, either something in production didn’t mirror the testing environment or the change wasn’t appropriately tested.

Second, the allowable maintenance window has been blown, and other planned or dependent changes possibly could not be applied.

Third, once the maintenance window expired, the availability of the system under the service level agreement began to suffer.

Fourth, the change team didn’t plan for the possibility of failure.

Fifth, they never detached themselves and stopped working.

Items four and five are crucial. The engineers should have known how long to try implementing the change, and then when to stop, change gears and execute the rollback plan to bring the system back online within the scheduled amount of time.

To explain, a rollback plan is a recovery plan that aims at returning the system to its last known good state. It may be a tape restore or a reload of a configuration file. The rollback plan is the emergency escape plan to get the system back up before the prescribed amount of time elapses. The allowable time factor is a key point.

There are times when one change is all that will happen. There are other times when the team has to install multiple changes on one host or across many hosts. Getting them all done within a planned maintenance window requires planning.

To make things simple, if there is one host, one change and a three-hour window, then basic logic tells us that we have three hours to get the change done. If there were three changes, then each change would use up some portion of that three-hour window based on estimates. The change planning process should always include a documented rollback plan and an estimate of how long that rollback would take.
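As a rough illustration of that budgeting, the following Python sketch checks whether a set of planned changes fits a maintenance window. The `PlannedChange` fields and the `check_plan` function are hypothetical names invented for this example, and it assumes each change's allotment already includes the time reserved for its rollback.

```python
from dataclasses import dataclass


@dataclass
class PlannedChange:
    # Hypothetical record for this sketch, not from any real tool.
    name: str
    allotted_minutes: int   # total time reserved for the change, rollback included
    rollback_minutes: int   # documented estimate for executing the rollback plan


def check_plan(changes, window_minutes):
    """Return a list of problems; an empty list means the plan fits the window."""
    problems = []
    for change in changes:
        # A rollback that consumes the whole allotment leaves no time to try the change.
        if change.rollback_minutes >= change.allotted_minutes:
            problems.append(
                f"{change.name}: rollback estimate leaves no time to attempt the change"
            )
    total = sum(change.allotted_minutes for change in changes)
    if total > window_minutes:
        problems.append(
            f"plan needs {total} minutes but the window is {window_minutes} minutes"
        )
    return problems


# Three one-hour changes in a three-hour window: no problems reported.
plan = [
    PlannedChange("patch-os", 60, 15),
    PlannedChange("upgrade-db", 60, 15),
    PlannedChange("update-config", 60, 15),
]
print(check_plan(plan, window_minutes=180))  # prints []
```

Adding a fourth change to the same window would push the total past 180 minutes, and the function would report it before the maintenance window ever opens.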

The estimated time to execute the rollback plan sets a key milestone, scheduled backward from the end of the permissible time. If the change is allotted one hour and the rollback plan would take 15 minutes, then between 40 and 45 minutes into the change, the engineers must actively decide whether to push ahead and finish, or to execute the rollback plan and restore the system to the last known good state.

If things are going well, then finish implementing the change. If things are going badly, then the team must roll back what has been done so far.

To decide to stop and actually admit the process has failed takes a lot of discipline.

In the previous example, notice the allotment of up to five minutes to decide. The decision control point must take place with enough time to assess the situation and make a decision. Sometimes it even takes a dispassionate third party to make the decision because the engineers, or “change builders,” are so into the details of the implementation that they fail to recognize that a decision is needed.
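The timing in the one-hour example can be worked out mechanically. This is a minimal Python sketch, assuming the change starts at the opening of the window; the function name `decision_window` and its parameters are illustrative only.

```python
from datetime import datetime, timedelta


def decision_window(start, allotted_minutes, rollback_minutes, decision_minutes=5):
    """Return (decision_opens, latest_rollback_start) for one change.

    The rollback must start early enough to finish by the end of the allotment,
    and the go/no-go decision needs its own slice of time before that point.
    """
    deadline = start + timedelta(minutes=allotted_minutes)
    latest_rollback_start = deadline - timedelta(minutes=rollback_minutes)
    decision_opens = latest_rollback_start - timedelta(minutes=decision_minutes)
    if decision_opens < start:
        raise ValueError("allotment is too short for a decision plus a rollback")
    return decision_opens, latest_rollback_start


# One-hour allotment starting at 2 a.m., 15-minute rollback, 5 minutes to decide.
start = datetime(2005, 4, 17, 2, 0)
opens, latest = decision_window(start, allotted_minutes=60, rollback_minutes=15)
print(opens.strftime("%H:%M"), latest.strftime("%H:%M"))  # prints 02:40 02:45
```

Writing these two times into the change plan ahead of time gives the team, or a third-party decision maker, a fixed point at which to stop and assess rather than relying on someone noticing the clock.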

Preparations
For a rollback plan to be effective, there are some issues to bear in mind.

Above all, the engineers must be able to count on the current state matching the official last known good state, whether that state is documented manually or automatically.

If change management, configuration management and release management disciplines are not in place, then precious time can be lost when a rollback plan fails because the production build didn’t match the documented last known good build.

Not only is testing rendered less meaningful when the test system doesn’t mirror production, but it is also far harder to restore a system for which there is no current, accurate configuration data. In these cases, when failure happens during change implementation, work shifts from the vital task of recovery to forensics. You’ll need to ask, “Why is this configuration value 1,000 instead of 1,500? Who changed it and why?”
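One way to catch such mismatches before the maintenance window, rather than during it, is to compare the documented last known good configuration against what is actually running. The sketch below is only an illustration in Python: it assumes both configurations are available as simple key-value dictionaries, and `find_drift` and the setting names are hypothetical.

```python
MISSING = "<missing>"  # marker for a setting present on only one side


def find_drift(documented, actual):
    """Return {setting: (documented_value, actual_value)} for every mismatch."""
    drift = {}
    for key in sorted(documented.keys() | actual.keys()):
        expected = documented.get(key, MISSING)
        found = actual.get(key, MISSING)
        if expected != found:
            drift[key] = (expected, found)
    return drift


# Hypothetical settings: the documented build says 1,500, production says 1,000.
documented = {"max_connections": 1500, "log_level": "info"}
actual = {"max_connections": 1000, "log_level": "info"}
print(find_drift(documented, actual))  # prints {'max_connections': (1500, 1000)}
```

An empty result suggests the documented state can be trusted as a rollback target; anything else is a question to answer before the change begins, not at 3 a.m.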

System changes can and do fail. As systems become increasingly complex, the probability of subtle differences in production causing a planned change to fail during implementation will climb as well.

Groups worried about meeting service level agreements and stakeholder expectations must recognize this correlation and require rollback plans as part of the change management planning process. Without the ability to recognize failure and quickly recover, the probability of unplanned work and downtime increases… and nobody wants that.
