Most businesses will find their best value in having many SPOFs in place. Our real goal is reliability at appropriate cost. Redundancy, as we have seen, is no substitute for reliability; it is simply a tool that we can use to achieve reliability.
The theory that many people follow when virtualizing is to look at their virtualization host and say, "This host is a SPOF, so I need to have two of them and use High Availability features to allow for transparent failover!"
This is spurred by the leading virtualization vendor making their money firstly by selling expensive HA add-on products and secondly by being owned by a large storage vendor, so selling unnecessary or even dangerous additional shared storage is a big monetary win for them. It could easily be the reason that they championed the virtualization space from the beginning. Redundant virtualization hosts with shared storage sound great but can be extremely misguided for several reasons.
The first reason is that the initial SPOF, the virtualization host, is simply replaced with a new SPOF, the shared storage. This accomplishes nothing. Assuming that we are using servers and shared storage of comparable quality, all we've done is move where the risk is, not change how big it is.
The likelihood of the storage system failing is roughly equal to the likelihood of the original server failing. But in addition to shuffling the SPOF around like in a shell game, we've also done something far, far worse – we have introduced chained or cascading failure dependencies.
In our original scenario we had a single server. If the server stayed working we were good; if it failed we were not. Simple. Now we have two virtualization hosts, a single storage server (SAN, NAS, whatever) and a network connecting them together. The risk of the shared storage failing is approximately equal to our total system risk in the original scenario.
But now we have the additional dependencies of the network and the two front-end virtualization nodes. Each of these components is more reliable than the fragile shared storage (anything with mechanical drives is going to be fragile). But the fact that they are lower risk is not the issue. The issue is that the risks are combinatorial.
If any of these three components (storage, network or the front-end nodes) fails, then everything fails. The solution to this is to make the shared storage redundant on its own and to make the network redundant on its own.
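The combinatorial effect can be sketched with some simple arithmetic. The availability figures below are purely illustrative assumptions, not measurements of any real hardware, and the model assumes that components fail independently of one another. It also treats the two front-end hosts as a properly redundant pair, which is the most favorable reading for the clustered design.

```python
from math import prod

def series(*availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)

def parallel(*availabilities):
    """Availability of a redundant group in which one survivor is enough."""
    return 1 - prod(1 - a for a in availabilities)

# Hypothetical availability figures, chosen only for illustration.
SERVER = 0.99    # the original standalone server
STORAGE = 0.99   # shared storage of comparable quality
NETWORK = 0.999  # the network between hosts and storage
HOST = 0.995     # each front-end virtualization host

# Original design: one server, one thing to fail.
single_server = series(SERVER)

# Two hosts in front of one storage device and one network.
host_pair = parallel(HOST, HOST)
shared_storage_cluster = series(STORAGE, NETWORK, host_pair)

# Same design after storage and network are each made redundant.
fully_redundant = series(
    parallel(STORAGE, STORAGE),
    parallel(NETWORK, NETWORK),
    host_pair,
)

print(f"single server:              {single_server:.6f}")
print(f"two hosts, shared storage:  {shared_storage_cluster:.6f}")
print(f"everything made redundant:  {fully_redundant:.6f}")
```

Under these assumed numbers the two-host cluster with a single shared storage device comes out slightly less available than the lone server, because the storage and network terms multiply against the host pair. Only once storage and network are duplicated as well does the figure climb past the single server, and by then one device has become six. The model ignores correlated failures and operator mistakes, so real-world results may well be worse.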
With enough work we can overcome the fragility and risk that we introduced by adding shared storage. But the shared storage on its own is not a form of risk mitigation; it is a risk itself, which must be mitigated. The spiral of complexity begins, and the cost of bringing this new system up to par with the reliability of the original, single server system can be astronomical.
Now that we have all of this redundancy we have one more risk to worry about. Managing all of this redundancy, all of these moving parts, requires a lot more knowledge, skill and preparation than does managing a simple, single server. We have moved from a simple solution to a very complex one.
In my own anecdotal experience the real dangers of solutions like this come not from the hardware failing but from human error. Not only has little been done to keep human error from causing this new system to fail, but we've added countless points where a human might accidentally bring down the entire system, redundancy and all.
I've seen it first hand; I've heard the horror stories. The more complex the system the more likely a human is going to accidentally break everything.
It is critical that we, as IT professionals, step back, look at complete systems, consider reliability and risk, and think of redundancy simply as a tool to use in the pursuit of reliability.
Redundancy itself is not a panacea. Neither is simplicity. Reliability is a complex problem to tackle. Avoiding simplistic replacements is an important first step in moving from covering up reliability issues to facing and solving them.