Tuesday, September 17, 2024

Beyond Backups: The Next Steps For Fault Tolerance

For many organizations, particularly smaller ones, the concept of fault tolerance extends only as far as doing a nightly tape backup. The reason usually cited for relying on this measure alone is a lack of available funds, but perhaps more prevalent is simply a lack of full appreciation of the impact a downed server has. Tape backups provide an insurance policy against one thing: data loss. They do not protect against downtime and quite often constitute the slowest part of a system recovery process.

Organizations must understand the distinction between fault tolerance and data protection. The data is of value, obviously, and the hardware is of value, but the cost of downtime is somewhat harder to determine. Backups protect the data, and the hardware is protected by virtue of its being kept in a safe location, but preventing downtime is a little more complex. Even those working in environments where clustering and fail-over systems are used must consider downtime.

Many of the same organizations that rely solely on tape backups become very willing to implement fault-tolerant measures after a damaging event has occurred. As with most things, the benefit of hindsight is great. The principle of fault tolerance is that an ounce of prevention is worth a pound of cure, and it should be viewed as an investment just like any other aspect of business. This is even truer nowadays, as more companies find themselves unable to function without a server and the price of hardware that can provide fault tolerance continues to fall.

I remember that when I taught technical training courses some years ago, the downside of disk mirroring was cited as the fact that it costs 50 percent of disk space. In today’s market, disk space is one of the cheapest commodities we have. So should we all mirror our drives? In the absence of a RAID 5 array, I would say yes, why not? Heck, for the sake of a few hundred bucks you could even consider implementing disk duplexing, but more about that in part two.
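
To put the 50 percent figure in context, here is a minimal Python sketch comparing the usable share of raw disk space under a two-disk mirror and under RAID 5. It assumes equal-size disks and the standard layouts (a mirror keeps one full copy, RAID 5 gives up one disk's worth of space to parity); the function name `usable_fraction` is purely illustrative.

```python
def usable_fraction(level: str, disks: int) -> float:
    """Return the fraction of raw capacity available for data.

    Assumes all disks are the same size. Illustrative helper only.
    """
    if level == "mirror":
        if disks != 2:
            raise ValueError("this sketch models a simple two-disk mirror")
        return 0.5  # one disk holds data, the other holds the copy
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return (disks - 1) / disks  # one disk's worth of space goes to parity
    raise ValueError(f"unknown level: {level}")


for level, disks in [("mirror", 2), ("raid5", 3), ("raid5", 5)]:
    print(f"{level} with {disks} disks: {usable_fraction(level, disks):.0%} usable")
```

The mirror comes out at 50 percent usable, a three-disk RAID 5 at roughly 67 percent, and a five-disk RAID 5 at 80 percent, which is why mirroring is the natural fallback when a RAID 5 array is not an option.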

For each fault-tolerant step you consider, you must look at a number of factors. Possibly the biggest consideration is how likely a given component is to fail. I recently attended a seminar given by Intel where we discussed a feature called Fault Resilient Booting (FRB): if one processor fails, the system disables the failed processor and reboots. Someone had to ask the question, so I did. How often does a processor fail? (In my 13 years on the job I have never, to my knowledge, had a failed processor.) The answer was ‘very, very seldom’, though of course what would you expect someone from Intel to say? Unless you are looking to create a supremely fault-tolerant system, features that protect against ‘very, very seldom’ occurrences must be weighed against those that protect a more susceptible component. But that raises another question. What is a susceptible component?

Some years ago, when working for a major financial institution, I arrived at work one Monday morning (have you ever noticed how these things always happen on a Monday?) to find that three drives in the RAID array of one of the servers had gone down. The cause? No, it wasn’t a faulty batch of drives; it was a faulty backplane. This server was the full meal deal, ‘biggie sized’. It had dual power supplies, adapter teaming, RAID 5 with a hot spare and a vastly oversized UPS. None of which could prevent the system falling foul of a $90 component. The fact is that no matter how many fault-tolerant measures are in place, there is always an unknown factor. In other words, the Holy Grail of reliability, 100 percent uptime, is not attainable. But increased availability can be achieved.
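
The backplane story can be illustrated with some simple availability arithmetic. The Python sketch below assumes components fail independently and uses made-up availability figures (0.99 per power supply, 0.999 for the backplane); none of these numbers come from the server described above. Redundant parts are combined in parallel, while a single non-redundant part sits in series with everything else.

```python
HOURS_PER_YEAR = 24 * 365


def parallel(*availabilities: float) -> float:
    """Redundant components: the group is down only if every member is down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


def series(*availabilities: float) -> float:
    """Components that must all work: the chain is up only if every link is up."""
    all_up = 1.0
    for a in availabilities:
        all_up *= a
    return all_up


def downtime_hours_per_year(availability: float) -> float:
    """Convert an availability fraction into expected hours of downtime per year."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical figures, for illustration only.
dual_power = parallel(0.99, 0.99)      # two redundant power supplies
backplane = 0.999                      # a single, non-redundant part
server = series(dual_power, backplane)

print(f"dual power supplies: {dual_power:.4%}")
print(f"whole server:        {server:.4%}")
print(f"expected downtime:   {downtime_hours_per_year(server):.1f} hours/year")
```

However much redundancy is added elsewhere, the result of `series` can never exceed the availability of its weakest non-redundant link, and it never reaches 100 percent.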

In part two of this article, we will look in more detail at some of the options available for fault tolerance on server-based systems, and evaluate their effectiveness in relation to investment.

Drew Bird (MCT, MCNI) is a freelance instructor and technical writer. He has been working in the IT
industry for 12 years and currently lives in Kelowna, B.C., Canada.
