Wednesday, December 4, 2024

Creating a Resilient IT System


When we look at how important IT systems have become to organizations and

society as a whole, we need to factor in resiliency when designing them.

Resiliency pertains to the system’s ability to return to its original

state after encountering trouble. In other words, if a risk event knocks

a system offline, a highly resilient system will return to work and

function as planned as soon as possible.

While many may take this process for granted, not all systems recover

cleanly. Sometimes a system recovers without IT staff being involved at

all. Other times, many staff members work under a great deal of stress

to get the system back online. Whether all the data survives is

another story altogether.

From now on, we cannot afford fragile systems or systems that require an

unmanageable amount of time and effort to recover. We must take

resiliency into account.

How many systems do you have that will come back online if the power is

cut and the UPS runs to the point of exhaustion, causing a hard crash?

Stand-alone PCs and network devices are usually pretty good about coming

back. However, as the level of complexity and interdependency increases,

simply coming back online after a hard crash may mean corrupted storage,

split clusters and the failure of dependent services.

That means you shouldn’t bet on highly complex systems simply coming back

up after whatever negative event you experience — be it hardware or

software failure or some form of security incident.

From an organizational perspective, if power is lost at a plant for two

days, can it recover? If a key service is lost because a database becomes

corrupt, can the business recover? Organizations that can bounce back are

resilient, and the ones that can’t may face prolonged disruption.

Making your system resilient takes a lot of planning.

To build resilient systems, you need a holistic mentality. Prioritize

every foreseeable risk and then determine not only how to reduce its

likelihood in the first place, but also how to minimize its impact on the

system and the organization. Those are two different issues.
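
The distinction between likelihood and impact can be made concrete with a small risk register. The following is a minimal sketch, not a prescribed method: it assumes a simple exposure score (probability times impact), and the `Risk` class, the helper functions and all the numbers are hypothetical placeholders.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Risk:
    name: str
    probability: float  # assumed: estimated chance per year, 0.0 to 1.0
    impact: float       # assumed: estimated cost if the event occurs, arbitrary units

    @property
    def exposure(self) -> float:
        # Simple expected-loss score: probability times impact.
        return self.probability * self.impact


def prioritize(risks):
    """Return risks sorted from highest to lowest exposure."""
    return sorted(risks, key=lambda r: r.exposure, reverse=True)


def apply_preventive_control(risk, factor):
    """Model a control that makes the event less likely (factor < 1)."""
    return replace(risk, probability=risk.probability * factor)


def apply_recovery_control(risk, factor):
    """Model a control that shrinks the damage once the event occurs (factor < 1)."""
    return replace(risk, impact=risk.impact * factor)


# Hypothetical register; the numbers are placeholders, not measurements.
register = [
    Risk("Cooling failure", probability=0.20, impact=100.0),
    Risk("Extended power loss", probability=0.10, impact=500.0),
    Risk("Database corruption", probability=0.05, impact=800.0),
]

for risk in prioritize(register):
    print(f"{risk.name}: exposure {risk.exposure:.1f}")
```

A preventive control changes only `probability` and a recovery control changes only `impact`, which is why the two have to be planned separately even though both lower the overall exposure.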

Granted, recovery controls, also known as corrective controls, are

risk-mitigating controls. However, we need to make sure that teams

managing systems take into account not just controls that reduce the

probability of a risk event, but also controls that reduce its impact.

They must plan for failure rather than assume the best.

Resiliency directly targets minimizing the impact by bringing people,

processes and technology either back to their original state or a

modified state until the risk has been reasonably addressed.

Any system has three dimensions that must be considered — people,

processes and technology. To build in resiliency, all must be taken into

account because if one of them fails, then the likelihood of poor

resiliency and overall system failure increases.

In addressing the ‘people’ dimension, there must be identified backups

and cross-training to ensure that if anyone is sick, on vacation or

incapacitated, there is another person, if not entire other teams, who

can fill in. For example, if a data center is damaged due to a natural

disaster and the staff there are trying to address their legitimate family

crises, is there another group that can do the work from another site and

take some of the pressure off?
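
One way to check for this kind of gap is to keep a cross-training matrix and query it. The sketch below is illustrative only: the task names, people and the two helper functions are hypothetical, and it assumes a task counts as covered if at least one trained person is available.

```python
def thin_coverage(skills, minimum=2):
    """Tasks with fewer than `minimum` trained people."""
    return sorted(task for task, people in skills.items() if len(set(people)) < minimum)


def uncovered_tasks(skills, unavailable):
    """Tasks nobody can perform if everyone in `unavailable` is out."""
    out = set(unavailable)
    return sorted(task for task, people in skills.items() if not set(people) - out)


# Hypothetical cross-training matrix: task -> people able to perform it.
skills = {
    "restore database": ["ana", "raj"],
    "fail over storage": ["ana"],
    "rebuild cluster": ["raj", "li", "sam"],
}
site_a_staff = ["ana", "raj"]  # assumed to be unavailable after a disaster at their site

print(thin_coverage(skills))
print(uncovered_tasks(skills, site_a_staff))
```

The first check flags tasks that depend on a single person; the second shows which tasks would stall if an entire site's staff were unavailable at once.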

When addressing process issues, IT administrators must spend some time

assessing how existing processes hold up under stress.

Are the current processes so rigid that they break under any variation?

Or are there logical emergency processes that can be triggered in the

event of a problem?

Technology is interesting. If you have the right people and the right

processes supporting it, then magic happens. If either element is weak,

the technology is standing on a bad foundation.

With that said, resiliency can apply directly to the technology. To

illustrate, some systems are very sensitive to power or temperature

fluctuations. If there is a high risk that either of those elements cannot

be reasonably safeguarded, then you have a fragile system, one that

will be prone to break during regular operations, let alone in an

emergency.

Be sure to consider environmental and other risks when evaluating systems

and subsystems, and factor resiliency into the evaluation.

What if the power spikes? What if there is a brownout? What if the room

temperature goes to 100 degrees Fahrenheit for 48 hours? How will any of

these risks affect the system’s ability to recover afterward?

Again, the key is to identify risks, prioritize them, and then figure out

how to mitigate the highest-priority ones in an effective and cost-efficient

manner.
