SMB Virtualization: All Your Eggs in One Basket?

In speaking with small business IT professionals, one of the key reasons for hesitancy around deploying virtualization is an attitude best described as “don’t put your eggs in one basket.”

I can see where this concern arises. Virtualization allows for many guest operating systems to be contained in a single physical system, which, in the event of a hardware failure, causes all guest systems residing on it to fail together, all at once.

This sounds bad, but perhaps it’s not as bad as we would first presume.

The fear behind the “eggs and baskets” worry is that we should not put all of our resources at risk at the same time. This is generally applied to investing, encouraging investors to diversify and invest in many different companies and types of securities like bonds, stocks, funds and commodities. In the case of eggs (or money) we are talking about an interchangeable commodity. One egg is as good as another. A set of eggs is naturally redundant.

If we have a dozen eggs and we break six, we can still make an omelette, maybe a smaller one, but we can still eat. Eating a smaller omelette is likely to be nearly as satisfying as a larger one – we are not going hungry in any case.

Putting our already redundant eggs into multiple baskets allows us to hedge our bets. Yes, carrying two baskets means that we have less time to pay attention to either one, so it increases the risk of losing some of the eggs – but reduces the chances of losing all of the eggs.

In the case of eggs, a wise proposition indeed. Likewise, a smart way to prepare for your retirement.

This theory, because it is repeated as an idiom without careful analysis or proper understanding, is then applied to unrelated areas such as server virtualization. Servers, however, are not like eggs.

Servers, especially in smaller businesses, are rarely interchangeable commodities where having six working, instead of the usual twelve, is good enough. Typically servers each play a unique role and all are relatively critical to the functioning of the business.

If a server is not critical then it is unlikely to be able to justify the cost of acquiring and maintaining it in the first place and so would probably not exist. (When servers are interchangeable, such as in a large, stateless web farm or compute cluster, they are configured that way as a means of expanding capacity beyond the confines of a single, physical box and so fall outside the scope of this discussion.)

IT services in a business are usually, at least to some degree, a “chain dependency.” That is, they are interdependent and the loss of a single service may impact other services. This is true either because they are technically interdependent (such as a line of business application being dependent on a database) or because they are workflow interdependent (an office worker needs the file server working to collaborate).

In these cases, the loss of a single critical service such as email, network authentication or file services may create a disproportionate loss of working ability. If there are ten key services and one goes down, company productivity from an IT services perspective likely drops by far more than ten percent, possibly nearing one hundred percent in extreme cases.

This is not always true – in some unique cases workers are able to “work around” a lost service effectively, but this is very uncommon. Even if people can remain working, they are likely far less productive than usual.

When dealing with physical servers, each server represents its own point of failure. So if we have ten servers, an outage somewhere is ten times as likely as it would be with only one of those same servers. Each server that we add brings with it its own risk.

Assume that each server fails, on average, once per decade, and that each failure has an outage factor of 0.25 – that is, it financially impacts the business for twenty-five percent of revenue for, say, one day. Across ten servers, our total average impact over a decade is then the equivalent of two and a half total site outages (10 × 0.25 = 2.5).

I use the concept of factors and averages here to make this easy. Determining the length of an average outage or impact of an average outage is not necessary, as we only need to determine relative impact in this case to compare the scenarios.

It’s just a means of comparing the cumulative financial impact of one type of outage against another without needing specific figures – this doesn’t help you determine what your spend should be, just relative reliability.

Virtualization and Consolidation

With virtualization we have the obvious ability to consolidate. In this example we will assume that we can collapse all ten of these existing servers down into a single server. When we do this we often trigger the “all our eggs in one basket” response.

But if we run some risk analysis we will see that this is usually just fear and uncertainty and not a mathematically supported risk. If we assume the same risks as the example above then our single server will, on average, incur just a single total site outage, once per decade.

Compare this to the first example, which did the damage equivalent to two and a half total site outages – the risk of the virtualized, consolidated solution is only forty percent that of the traditional solution.
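To make the arithmetic concrete, here is a minimal Python sketch of the comparison. It uses only the illustrative figures from this example: ten servers, a single-service failure costing a quarter of a total site outage, and an assumed average of one failure per physical box per decade. The function and constant names are my own, chosen purely for illustration, and none of the numbers are measured data.

```python
# Illustrative figures from the article's example; not measured data.
SERVERS = 10                        # physical servers before consolidation
FAILURES_PER_SERVER_PER_DECADE = 1  # assumed average failure rate per box
PARTIAL_OUTAGE_FACTOR = 0.25        # one lost service costs 25% of a total site outage
TOTAL_OUTAGE_FACTOR = 1.0           # losing the single consolidated host costs 100%


def decade_impact(servers: int, failures_per_server: float, outage_factor: float) -> float:
    """Expected impact over a decade, in units of 'total site outages'."""
    return servers * failures_per_server * outage_factor


separate = decade_impact(SERVERS, FAILURES_PER_SERVER_PER_DECADE, PARTIAL_OUTAGE_FACTOR)
consolidated = decade_impact(1, FAILURES_PER_SERVER_PER_DECADE, TOTAL_OUTAGE_FACTOR)

print(f"Ten physical servers:  {separate:.2f} total-site-outage equivalents per decade")
print(f"One consolidated host: {consolidated:.2f} total-site-outage equivalents per decade")
print(f"Relative risk of consolidation: {consolidated / separate:.0%}")
```

Because both scenarios assume the same failure rate per box, that rate cancels out of the final ratio. Only the number of servers and the relative cost of a partial outage drive the result.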

Now keep in mind that this is based on the assumption that losing some services means a financial loss greater than the strict value of the service that was lost, which is almost always the case. Even if the loss is no greater than the strict value of the individual service, we are only at break even and need not worry.

In rare cases, impact from losing a single system can be less than its “slice of the pie,” normally because people are flexible and can work around the failed system – like if instant messaging fails and people simply switch to using email until instant messaging is restored. But these cases are rare and are normally isolated to a few systems out of many, with the majority of systems, say ERP, CRM and email, having disproportionately large impacts in the event of an outage.
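The break-even point can be sketched the same way. The snippet below is only an illustration under stated assumptions: every box, including the consolidated host, fails at the same average rate, and a service's “slice of the pie” is taken to be one tenth for a ten-server shop. The helper name and the sample factors are hypothetical.

```python
def relative_risk(servers: int, per_service_factor: float) -> float:
    """Consolidated impact divided by separate-server impact.

    Assumes every box, including the consolidated host, fails at the same
    average rate, so the failure rate cancels out of the ratio.
    """
    separate = servers * per_service_factor  # many partial outages
    consolidated = 1.0                       # one total site outage
    return consolidated / separate


SERVERS = 10  # same count as the article's example

# Hypothetical per-service outage factors, as fractions of a total site outage.
for factor in (0.05, 0.10, 0.25, 0.50):
    ratio = relative_risk(SERVERS, factor)
    if abs(ratio - 1.0) < 1e-9:
        verdict = "break even"
    elif ratio < 1.0:
        verdict = "consolidation lowers risk"
    else:
        verdict = "consolidation raises risk"
    print(f"factor {factor:.2f}: relative risk {ratio:.2f} ({verdict})")
```

With ten services, a factor of 0.10 is the strict slice and gives break even. Anything above it favors consolidation, and only the rare “work around it” services that fall below it tip the other way.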

So what we see here is that under normal circumstances moving ten services from ten servers to a single server will generally lower our risk, not increase it – in direct contrast to the “eggs in a basket” theory. And this is purely from a hardware failure perspective. Consolidation offers several other important reliability factors, though, that can have a significant impact on our case study.

With consolidation we reduce the amount of hardware that needs to be monitored and managed by the IT department. Fewer servers means that more time and attention can be paid to those that remain. More attention means a better chance of catching issues early and more opportunity to keep parts on hand. Better monitoring and maintenance leads to better reliability.

Cost Savings?

Possibly the most important factor with consolidation, however, is that there are significant cost savings and these, if approached correctly, can provide opportunities for improved reliability. With the dramatic reduction in total cost for servers it can be tempting to keep budgets tight and simply pocket the savings.

This is understandable and for some businesses may be the correct approach. But it is not the approach that I would recommend when struggling against the notion of eggs and baskets.

Instead, by applying a more moderate approach – keeping significant cost savings but still spending more, relatively speaking, on a single server – you can acquire a higher end (read: more reliable) server, use better parts, have on-site spares, etc.

The cost savings of virtualization can often be turned directly into increased reliability, further shifting the equation in favor of the single server approach.

As I’ve noted before, one brick house is more likely to survive a wind storm than either one or two straw houses. Having more of something doesn’t necessarily make it the more reliable choice.

These benefits come purely from the consolidation aspect of virtualization and not from the virtualization itself. Virtualization provides extended risk mitigation features separately as well. System imaging and rapid restores, as well as restores to different hardware, are major advantages of most any virtualization platform. This can play an important role in a disaster recovery strategy.

Of course, all the concepts I’ve mentioned demonstrate that single box virtualization and consolidation can beat the legacy “one app to one server” approach and still save money – showing that the example of eggs and baskets is misleading and does not apply in this scenario. There should be little trepidation in moving from a traditional environment directly to a virtualized one based on these factors.

It should be noted that virtualization can then extend the reliability of traditional commodity hardware, providing mainframe-like failover features that are above and beyond what non-virtualized platforms are able to provide. This moves commodity hardware more firmly into line with the larger, more expensive RISC platforms.

These features can bring an extreme level of protection but are often above and beyond what is appropriate for IT shops that initially migrate from a non-failover, legacy hardware server environment. High availability is a great feature but is often costly and very often unnecessary, especially as companies move from, as we have seen, relatively unreliable environments in the past to more reliable environments today.

Given that we have already increased reliability over what was considered necessary in the past there is a very good chance that an extreme jump in reliability is not needed now. But due to the large drop in the cost of high availability, it is quite possible that it will be cost justified where previously it could not be.

Is Virtualization Still Too New?

In the same vein, virtualization is often feared because it is seen as a new, unproven technology. This is certainly untrue but there is an impression of this in the small business and commodity server space.

In reality, though, virtualization was first introduced by IBM in the 1960s and ever since then has been a mainstay of high end mainframe and RISC servers – those systems demanding the best reliability. In the commodity server space, virtualization was a larger technical challenge and took a very long time before it could be implemented efficiently enough to make it effective to use in the real world.

But even in the commodity server space virtualization has been available since the late 1990s and so is approximately fifteen years old today, which is very far past the point of being a nascent technology – in the world of IT it is positively venerable.

Commodity platform virtualization is a mature field with several highly respected, extremely advanced vendors and products. The use of virtualization as a standard for all or nearly all server applications is a long established and accepted “enterprise pattern” and one that now can easily be adopted by companies of any and every size.

Virtualization, perhaps counter-intuitively, is actually a very critical component of a reliability strategy. Instead of adding risk, virtualization can almost be approached as a risk mitigation platform – a toolkit for increasing the reliability of your computing platforms through many avenues.