So what happens when a change fails?
Is there chaos or is a contingency plan executed?
High-availability organizations are always evolving their contingency plans for when failure occurs or appears likely. Rather than focusing entirely on one way of implementing a change and never taking failure into account, these groups may develop one or two contingency approaches. But they always have an ultimate “rollback” plan to return the system to the last known good state.
The need for security and the ability to maintain service levels in increasingly complex business and IT environments should be incentives for all organizations to put rollback plans in place.
Imagine that the most critical server in the company has a planned maintenance window on Sundays from 2 a.m. to 5 a.m. That time slot has been negotiated into service level agreements (SLAs), which are taken very seriously. During that maintenance window, the planned changes are going great.
Then suddenly at 3 a.m. the server stops working — with only two hours to spare. The engineers repeatedly try to get the change in place by making ad hoc modifications. By 5 a.m. they are still pushing to get the changes in, and all the while, the system remains down.
What is wrong with this scenario?
First, either something in production didn’t mirror the testing environment or the change wasn’t appropriately tested;
Second, the allowable maintenance window has been blown, and possibly other planned changes or dependent changes could not be applied;
Third, once the maintenance window expires, the availability of the system per the service level agreement is negatively impacted;
Fourth, the change team didn’t plan for the possibility of failure; and
Fifth, they never detached themselves and stopped working.
Items four and five are crucial. The engineers should have known how long to try implementing the change, and then when to stop, change gears and execute the rollback plan to bring the system back online within the scheduled amount of time.
To explain, a rollback plan is a recovery plan that aims at returning the system to its last known good state. It may be a tape restore or a reload of a configuration file. The rollback plan is the emergency escape plan to get the system back up before the prescribed amount of time elapses. The allowable time factor is a key point.
There are times when one change is all that will happen. There are other times when the team has to install multiple changes on one host or across many hosts. Getting them done within a planned maintenance window requires planning.
To make things simple, if there is one host, one change and a three-hour window, then basic logic tells us that we have three hours to get the change done. If there were three changes, then each change would use up some portion of that three-hour window based on estimates. The change planning process should always include a documented rollback plan and an estimate of how long the rollback would take.
The time the rollback plan would take to execute sets a key milestone, scheduled back from the end of the permissible time. If the change is allotted one hour and the rollback plan would take 15 minutes, then between 40 and 45 minutes into the change, the engineers must actively decide whether to push ahead and finish, or to execute the rollback plan and restore the system to the last known good state.
If things are going well, then finish implementing the change. If things are going badly, then the team must roll back what has been done so far.
To decide to stop and actually admit the process has failed takes a lot of discipline.
In the previous example, notice the allotment of up to five minutes to decide. The decision control point must take place with enough time to assess the situation and make a decision. Sometimes it even takes a dispassionate third party to make the decision because the engineers, or “change builders,” are so deep in the details of the implementation that they fail to recognize that a decision is needed.
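The timing in that example can be worked out mechanically. The following is only a minimal sketch in Python, not a prescribed tool: the function name `decision_window`, its parameters and the start date are hypothetical, and the five-minute decision allowance is simply the figure used in the example above.

```python
from datetime import datetime, timedelta


def decision_window(start, allotted, rollback_estimate,
                    decision_time=timedelta(minutes=5)):
    """Return (decide_from, rollback_by) for a single change.

    rollback_by is the latest moment the rollback can begin and still
    finish inside the allotted time; decide_from leaves decision_time
    before that point to assess the situation.
    """
    if rollback_estimate + decision_time > allotted:
        raise ValueError("rollback and decision time do not fit in the allotted time")
    rollback_by = start + allotted - rollback_estimate
    decide_from = rollback_by - decision_time
    return decide_from, rollback_by


# Hypothetical start time: a change begun at 2 a.m., allotted one hour,
# with a rollback estimated at 15 minutes.
start = datetime(2024, 1, 7, 2, 0)
decide_from, rollback_by = decision_window(
    start, timedelta(hours=1), timedelta(minutes=15)
)
print(decide_from.strftime("%H:%M"), rollback_by.strftime("%H:%M"))  # 02:40 02:45
```

Scheduling the checkpoint back from the end of the allotted time, rather than forward from the start, keeps the rollback estimate as the fixed constraint. If the estimate grows, the decision point moves earlier automatically.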
Preparations
For a rollback plan to be optimally effective, there are some issues to bear in mind.
Above all, the engineers must be able to count on the current state matching the official last known good state, whether that state is documented manually or automatically.
If change management, configuration management and release management disciplines are not in place, then precious time can be lost when a rollback plan fails because the production build didn’t match the documented last known good build.
Not only is testing rendered less meaningful since the test system doesn’t mirror production, but it is also far harder to restore a system for which there is no current, accurate configuration data. In these cases, when failure happens during change implementation, work shifts from the vital task of recovery to forensics. You’ll need to ask, “Why is this configuration value 1,000 instead of 1,500? Who changed it and why?”
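One way to surface such mismatches before the maintenance window, rather than during it, is to compare the documented last known good configuration with what is actually running. The sketch below is illustrative only: it assumes both configurations are available as simple key-value mappings, and the names `config_drift`, `max_connections` and `timeout_s` are hypothetical.

```python
MISSING = "<missing>"  # placeholder for a key present on only one side


def config_drift(documented, current):
    """Return {key: (documented_value, current_value)} for every mismatch."""
    drift = {}
    for key in sorted(set(documented) | set(current)):
        expected = documented.get(key, MISSING)
        actual = current.get(key, MISSING)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


# Hypothetical settings: the documented build says 1500, production says 1000.
documented = {"max_connections": 1500, "timeout_s": 30}
current = {"max_connections": 1000, "timeout_s": 30}

for key, (expected, actual) in config_drift(documented, current).items():
    print(f"{key}: documented={expected} current={actual}")
```

An empty result means the documented state can be trusted as the rollback target. Any entry is a question to answer before the change begins, not at 3 a.m.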
System changes can and do fail. As systems become increasingly complex, the probability of subtle differences in production causing a planned change to fail during implementation will climb as well.
Groups worried about meeting service level agreements and stakeholder expectations must recognize this correlation and require rollback plans as part of the change management planning process. Without the ability to recognize failure and quickly recover, the probability of unplanned work and downtime increases… and nobody wants that.