In speaking with small business IT professionals, one of the key sources of hesitancy around deploying virtualization is an attitude best described as “don’t put all your eggs in one basket.”
I can see where this concern comes from. Virtualization allows many guest operating systems to be contained in a single physical system, so a hardware failure on that host causes all of the guests residing on it to fail together, all at once.
This sounds bad, but perhaps it’s not as bad as we would first presume.
The fear behind the “eggs and baskets” worry is that we should not put all of our resources at risk at the same time. This is generally applied to investing, encouraging investors to diversify and invest in many different companies and types of securities like bonds, stocks, funds and commodities. In the case of eggs (or money) we are talking about an interchangeable commodity. One egg is as good as another. A set of eggs is naturally redundant.
If we have a dozen eggs and we break six, we can still make an omelette, maybe a smaller one, but we can still eat. Eating a smaller omelette is likely to be nearly as satisfying as a larger one – we are not going hungry in any case.
Putting our already redundant eggs into multiple baskets allows us to hedge our bets. Yes, carrying two baskets means that we have less time to pay attention to either one, so it increases the risk of losing some of the eggs – but reduces the chances of losing all of the eggs.
In the case of eggs, a wise proposition indeed. Likewise, a smart way to prepare for your retirement.
This theory, because it is repeated as an idiom without careful analysis or proper understanding, is then applied to unrelated areas such as server virtualization. Servers, however, are not like eggs.
Servers, especially in smaller businesses, are rarely interchangeable commodities where having six working instead of the usual twelve is good enough. Typically, each server plays a unique role, and all of them are relatively critical to the functioning of the business.
If a server is not critical, then it is unlikely to justify the cost of acquiring and maintaining it in the first place, and so it would probably not exist. (When servers are interchangeable, such as in a large, stateless web farm or compute cluster, they are configured that way as a means of expanding capacity beyond the confines of a single physical box, and so they fall outside the scope of this discussion.)
IT services in a business are usually, at least to some degree, a “chain dependency.” That is, they are interdependent and the loss of a single service may impact other services. This is true either because they are technically interdependent (such as a line of business application being dependent on a database) or because they are workflow interdependent (an office worker needs the file server working to collaborate).
In these cases, the loss of a single critical service such as email, network authentication or file services may create a disproportionate loss of working ability. If there are ten key services and one goes down, company productivity from an IT services perspective likely drops by far more than ten percent, possibly nearing one hundred percent in extreme cases.
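To see how disproportionate this effect can be, here is a small sketch in Python. The dependency map is entirely hypothetical – the service and workflow names are invented for illustration – but it shows how one failed service can block far more than its “share” of the work:

```python
# Hypothetical example: each workflow lists the services it depends on.
# Service and workflow names are invented purely for illustration.
workflows = {
    "invoicing":      {"lob_app", "database", "auth"},
    "correspondence": {"email", "auth"},
    "collaboration":  {"file_server", "auth"},
    "reporting":      {"database", "auth"},
}

def fraction_blocked(failed_service):
    """Fraction of workflows that stop when a single service fails."""
    blocked = [w for w, deps in workflows.items() if failed_service in deps]
    return len(blocked) / len(workflows)

print(fraction_blocked("auth"))         # every workflow needs authentication
print(fraction_blocked("file_server"))  # only collaboration is blocked
```

In this toy map, losing authentication – one service out of six – blocks one hundred percent of the workflows, while losing the file server blocks only a quarter of them.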
This is not always true – in some unique cases workers are able to “work around” a lost service effectively, but this is very uncommon. Even if people can remain working, they are likely far less productive than usual.
When dealing with physical servers, each server represents its own point of failure. So if we have ten servers, we have ten times the likelihood of an outage compared to having only one of those same servers. Each server that we add brings its own risk with it.
If each failure has an outage factor of 0.25 – that is, it financially impacts the business for twenty-five percent of revenue for, say, one day – and each of the ten servers fails, on average, once over a decade, then our total average impact over that decade is the equivalent of two and a half total site outages.
I use the concept of factors and averages here to make this easy. Determining the length or impact of an average outage is not necessary, as we only need relative impact in this case to compare the scenarios.
It’s just a means of comparing the cumulative financial impact of one event type with another without needing specific figures – this doesn’t help you determine what your spend should be, just relative reliability.
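The arithmetic behind these factors can be sketched in a few lines of Python. The figures match the illustrative example above; the once-per-decade failure rate is an assumption made purely so the scenarios can be compared:

```python
def cumulative_impact(server_count, failures_per_server, impact_per_failure):
    """Relative financial impact, expressed in 'full site outage' equivalents.

    impact_per_failure is the fraction of a full site outage that a single
    failure represents (0.25 means 25% of revenue lost for the period).
    """
    return server_count * failures_per_server * impact_per_failure

# Ten physical servers, each assumed to fail once per decade,
# each failure costing 25% of revenue for the outage period:
physical = cumulative_impact(10, 1, 0.25)
print(physical)  # 2.5 total-site-outage equivalents over the decade
```

Because everything is relative, you can plug in your own factors without knowing absolute outage costs; only the ratio between scenarios matters.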
With virtualization we have the obvious ability to consolidate. In this example we will assume that we can collapse all ten of these existing servers down into a single server. When we do this we often trigger the “all our eggs in one basket” response.