by David Strom
I spent last week visiting a data center tucked into an anonymous office park in Champaign, Ill. The data center is operated by Amdocs, a company that makes its money doing managed back office applications for telecom companies, such as Sprint, Metro PCS, and others. The visit was part of a general press briefing about what Amdocs is doing, but the term "single point of failure" kept coming up.
If you are going to host apps for telecom vendors, you have to know
what you are doing in terms of providing uptime. You need redundant
everything, from the plug that a router connects to for power to the
backup of the backup diesel generator that has to fire up when you
lose main AC power from the utility.
Actually, the most impressive part of the tour was the empty
"situation rooms" that Amdocs has built. They were empty because there
wasn't any crisis going on - each room is dedicated to a particular
customer and is where the account team gathers when they have a
problem to work on. Think "24" but with far nerdier people. And that
brings up a good point: what is the rest of CTU doing to protect the
other 300 million of us who aren't directly threatened by the current
plot? All the action is happening on the main stage. But I digress.
I started thinking about other IT managers I have met over the years
who hadn't completely thought through this issue.
There was one manager at a very large financial services firm near
Washington DC whom I interviewed a few years ago. Gazillions of
dollars a day pass through its computer networks, and as you might
imagine the firm had three Internet providers - not just two, but
three - to provide connectivity. Each provider had a separate path and
pole for their line from the firm's server room. Well, that sounded
all well and good until the day that a truck collision happened in the
Baltimore Harbor Tunnel - a main north-south artery about 50 miles
away. Trouble was, all three of the Internet providers' lines went
through that tunnel and the firm was offline from the Internet until
they got things re-routed. Now they have four Internet providers, and
they got them to share their route maps (try doing this with yours,
and good luck) to make sure there was no single point of failure.
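If you do manage to get route maps out of your providers, the check itself is simple. Here is a minimal sketch in Python, assuming each map can be reduced to a list of named physical segments (poles, conduits, tunnels, points of presence). The provider and segment names below are made up for illustration and are not from the firm in this story.

```python
from collections import defaultdict


def shared_segments(routes):
    """Return segments used by more than one provider, with the providers on each."""
    users = defaultdict(set)
    for provider, segments in routes.items():
        for segment in segments:
            users[segment].add(provider)
    return {seg: provs for seg, provs in users.items() if len(provs) > 1}


def single_points_of_failure(routes):
    """Return segments that every provider's line passes through."""
    segment_sets = [set(segments) for segments in routes.values()]
    if not segment_sets:
        return []
    return sorted(set.intersection(*segment_sets))


# Hypothetical route maps: three providers, three poles, one shared tunnel.
routes = {
    "provider_a": ["pole_1", "harbor_tunnel", "pop_north"],
    "provider_b": ["pole_2", "harbor_tunnel", "pop_east"],
    "provider_c": ["pole_3", "harbor_tunnel", "pop_north"],
}

print(single_points_of_failure(routes))
print(shared_segments(routes))
```

The first function flags partial overlaps (two providers sharing a segment), and the second flags the real killers: segments that every line depends on. The hard part is not the code but getting accurate segment data in the first place.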
Another time, back in the late 1990s, I was helping a firm in Florida
upgrade one of its high-end network servers. This was a Tricord
server, which took an ordinary Intel CPU and wrapped all sorts of
redundant things around it: two power supplies, RAID hard drives, two
physical processors, separate memory, and so forth. We had to pull and
replace the network cards from this $40,000 server. This required
powering down the beast and opening it up. Sadly, the one thing that
wasn't redundant was the physical power plug that went from the server
into the wall - and the $25 part that the ordinary plug fit into went
south when we powered the unit down. It took a few white-knuckle hours
to locate a new part and get it over to us before we could bring the
Tricord up again. I bet no one thought that probably the least
sophisticated part in the whole machine was going to fail.
These days, you see lots of gear that has two physical power plugs,
and at Amdocs' data center they have two separate power paths just in
case one goes out. That means extending each path all the way back to
a generator and line conditioning gear too.
Here is a story from my own mistakes, lest you think I am just harping
on my subjects here. Several years ago, I was running this email list
server on a friend's Linux server that was in his California basement.
The friend is one of the original Internet heavyweights, and knows his
systems and has plenty of backups. However, the day came when a lot of
flooding in his area knocked out all of his Internet connections, and
I wasn't able to access my list. Well, I thought I had all sorts of
backup procedures in place and had saved copies of the server list
configuration, so I could bring it up on someone else's server.
However, I had neglected to do one simple task - make a copy of the
names of everyone on my list. Now I do. You would think something this
simple would not have eluded me, but you would think wrong.
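A check along these lines might have caught my mistake. This is only a sketch, assuming the backup lives in a single directory and that the configuration and the subscriber names are saved as two separate files. The file names are hypothetical and do not come from any particular list server.

```python
from pathlib import Path

# Hypothetical file names: one for the list configuration, one for the names.
REQUIRED_FILES = ("list_config.txt", "subscribers.txt")


def missing_from_backup(backup_dir, required=REQUIRED_FILES):
    """Return the required files that are absent or empty in backup_dir."""
    root = Path(backup_dir)
    missing = []
    for name in required:
        path = root / name
        # An empty file is treated the same as a missing one.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_from_backup("/path/to/backup")` returns an empty list only when every required piece is present, which makes it easy to run right after each backup.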
So avoiding a single point of failure is easier said than done. And if
you saw what Amdocs had to do to deliver on this maxim, you would be
impressed.