Beyond Backups: The Next Steps For Fault Tolerance

For many organizations, particularly smaller ones, the concept of fault tolerance extends only as far as a nightly tape backup. The reason often cited for stopping there is a lack of available funds, but perhaps more prevalent is simply a lack of appreciation for the impact a downed server has. Tape backups provide an insurance policy against one thing: data loss. They do not protect against downtime and, quite often, they constitute the slowest part of a system recovery process.

Organizations must understand the distinction between fault tolerance and data protection. The data is of value, obviously, and the hardware is of value, but the cost of downtime is somewhat harder to determine. Backups protect the data, and the hardware is protected by virtue of its being kept in a safe location, but the prevention of downtime is a little more complex. Even those working in environments where clustering and failover systems are used must consider downtime.

Many of the same organizations that rely solely on tape backups become very willing to implement fault-tolerant measures after a damaging event has occurred. As with most things, the benefit of hindsight is great. The principle of fault tolerance is that an ounce of prevention is worth a pound of cure, and it should be viewed as an investment like any other aspect of the business. That is truer than ever now that more companies find themselves unable to function without a server, and the price of hardware that can provide fault tolerance continues to fall.

I remember that when I was teaching technical training courses some years ago, the downside of disk mirroring was cited as the fact that it sacrifices 50 percent of your disk space. In today's market, disk space is one of the cheapest commodities we have. So should we all mirror our drives? In the absence of a RAID 5 array, I would say yes, why not? Heck, for the sake of a few hundred bucks you could even consider implementing disk duplexing, but more about that in part two.
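
To put that 50 percent figure in context, here is a minimal sketch of the capacity arithmetic. The disk counts and sizes are made-up placeholders, not figures from any particular system: a mirrored pair gives you half the raw space, while a RAID 5 array gives up only one disk's worth to parity.

# Usable capacity of a mirrored pair (RAID 1) versus a RAID 5 array.
# Disk counts and sizes below are illustrative placeholders.

def raid1_usable_gb(disk_size_gb):
    # Every block is written twice, so usable space is one disk's worth.
    return disk_size_gb

def raid5_usable_gb(num_disks, disk_size_gb):
    # One disk's worth of space across the array is consumed by parity.
    if num_disks < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (num_disks - 1) * disk_size_gb

if __name__ == "__main__":
    disk_gb = 500
    print(f"RAID 1, 2 x {disk_gb} GB: {raid1_usable_gb(disk_gb)} GB usable (50% overhead)")
    print(f"RAID 5, 4 x {disk_gb} GB: {raid5_usable_gb(4, disk_gb)} GB usable (25% overhead)")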

For each fault-tolerant step you consider, you must look at a number of factors. Possibly the biggest consideration is how likely a given component is to fail. I attended a seminar given by Intel recently, where we discussed a feature called Fault Resilient Booting (FRB): if one processor fails, the system disables the failed processor and reboots. Someone had to ask the question, so I did: how often does a processor fail? (In my 13 years on the job I have never, to my knowledge, had a processor fail.) The answer was 'very, very seldom', though of course what would you expect someone from Intel to say? Unless you are looking to create a supremely fault-tolerant system, features that protect against 'very, very seldom' occurrences must be weighed against those that protect a more susceptible component. But that raises another question: what is a susceptible component?
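
One way to do that weighing is a back-of-the-envelope expected-loss comparison. The sketch below is not from the article, and every probability, outage length, and dollar figure in it is a hypothetical placeholder; it simply shows the shape of the calculation: the annual likelihood of a failure, multiplied by what the resulting outage would cost, set against the price of the feature that prevents it.

# Weighing a fault tolerance feature against the risk it addresses.
# All probabilities, hours, and dollar figures are hypothetical placeholders.

def expected_annual_loss(failure_probability, outage_hours, cost_per_hour):
    # Expected yearly cost of leaving this failure mode unprotected.
    return failure_probability * outage_hours * cost_per_hour

def worth_buying(feature_cost, failure_probability, outage_hours, cost_per_hour):
    # True if the feature costs less than the loss it is expected to prevent.
    return feature_cost < expected_annual_loss(failure_probability, outage_hours, cost_per_hour)

if __name__ == "__main__":
    # A 'very, very seldom' event: 0.1% chance per year, 8-hour outage, $10,000/hour.
    print(worth_buying(2000, 0.001, 8, 10000))  # False: expected loss is only about $80 a year
    # A more susceptible component: 5% chance per year, same outage and hourly cost.
    print(worth_buying(2000, 0.05, 8, 10000))   # True: expected loss is about $4,000 a year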

Some years ago, when I was working for a major financial institution, I arrived at work one Monday morning (have you ever noticed how these things always happen on a Monday?) to find that three drives in the RAID array of one of the servers had gone down. The cause? It wasn't a faulty batch of drives; it was a faulty backplane. This server was the full meal deal, 'biggie sized': dual power supplies, adapter teaming, RAID 5 with a hot spare, and a vastly oversized UPS. None of that could prevent the system from falling foul of a $90 component. The fact is that no matter how many fault-tolerant measures are in place, there is always an unknown factor. In other words, the Holy Grail of reliability, 100 percent uptime, is unattainable. But increased availability can be achieved.
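
It helps to see what realistic availability targets actually mean in hours. The short sketch below converts an availability percentage into expected downtime per year; the percentages are common illustrative targets, not claims about any particular system.

# Translate an availability percentage into expected downtime per year.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def annual_downtime_hours(availability_percent):
    # The fraction of the year the system can be expected to be unavailable.
    return (1 - availability_percent / 100) * HOURS_PER_YEAR

if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        print(f"{pct}% availability -> roughly {annual_downtime_hours(pct):.1f} hours of downtime per year")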

In part two of this article, we will look in more detail at some of the options available for fault tolerance on server-based systems, and evaluate their effectiveness in relation to the investment they require.

Drew Bird (MCT, MCNI) is a freelance instructor and technical writer. He has been working in the IT
industry for 12 years and currently lives in Kelowna, B.C., Canada.
