This week Amazon suffered a massive unplanned outage in its EC2 service. The result was a cascade of failures at companies that depend on it, ranging from technology publications to enterprise service providers.
This serves as a reminder that cloud services, while often thought of as a utility, actually aren’t, and that the necessary redundancies, while often seemingly built in, haven’t been fully tested. That means anything critical you run in the cloud should have a fall-back plan.
Nature of the Cloudburst
Note that there has been no security breach in this instance. This is a failure of service and very similar to what would happen if you had a power outage.
So while my premise is that the cloud is not as reliable as a utility, that really isn’t true everywhere: in much of the world, power is actually less reliable than Amazon’s service is.
It does suggest a similar approach to the problem, though, and one consistent with any service where the reliability can’t be adequately assured for the class of service required by the company.
In the case of power, if you need a higher level of reliability than what the electrical utility can provide, you put in place backup generating capacity adequate to the task. That way you can assure the reliability you need even if the utility can’t meet the requirement.
In fact, in some parts of the world it isn’t that uncommon for companies to forgo the utility altogether and live off their own power generation capability. This is very similar to companies choosing a private over a public cloud solution for their business. And because most services can’t yet provide the reliability needed for mission-critical applications, large enterprise providers like EMC, IBM, and HP continue to do great business with their private cloud offerings.
The solution is that if you want to save money using the public cloud for services that require a higher service level than you believe the provider can deliver, you have to provide redundancy. This isn’t that different from having a hot backup site in case of a natural disaster.
And while it clearly will eat into the savings of using a cloud service, the result also keeps you closer to the solution and likely provides a much better path if you need to switch cloud providers. In short, you can fail over into the backup system, switch providers and enter into test, leaving the backup system as primary until you’re ready to cut over to the new service. It could actually give you added flexibility in terms of solutions providers.
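The failover sequence described above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the `Provider` class, the `choose_active` function, and the provider names are hypothetical, and a real health check would probe the service rather than return a fixed value.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Provider:
    """A hypothetical service endpoint with a health probe."""
    name: str
    is_healthy: Callable[[], bool]


def choose_active(providers: Sequence[Provider]) -> Provider:
    """Return the first healthy provider, checked in priority order."""
    for provider in providers:
        if provider.is_healthy():
            return provider
    raise RuntimeError("no healthy provider available")


# Hypothetical state during an outage: the original public cloud is down.
old_cloud = Provider("old-public-cloud", lambda: False)
backup = Provider("backup-site", lambda: True)
new_cloud = Provider("new-public-cloud", lambda: True)

# 1. Outage: traffic fails over to the backup system.
during_outage = choose_active([old_cloud, backup])

# 2. Migration: the backup stays primary while the new provider is in test.
during_test = choose_active([backup, new_cloud])

# 3. Cutover: the new provider becomes primary, with the backup behind it.
after_cutover = choose_active([new_cloud, backup])
```

The point of the priority list is that switching providers becomes a reordering exercise rather than an emergency, which is where the added flexibility comes from.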
Cloud Computing “Airbags”
Think of the cloud failover solution like airbags in a car: they are expensive components required of every automobile that are rarely used and often painful – but a lot less painful than being tossed through the window or having your chest or head crushed.
The failover solution doesn’t have to take the entire peak load of the business, just enough of that load to keep the company operating until the cloud service recovers. And over time these services will themselves improve in redundancy and recovery speed.
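To make the sizing point concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is a made-up assumption for illustration (the peak load, the share of traffic you consider essential, and the per-server throughput), and `backup_servers_needed` is a hypothetical helper, not a formula from any provider.

```python
def backup_servers_needed(peak_rps: int, essential_percent: int,
                          rps_per_server: int) -> int:
    """Servers needed to carry only the essential share of peak load.

    All arguments are integers so the ceiling division below is exact.
    """
    if peak_rps < 0:
        raise ValueError("peak_rps must not be negative")
    if not 0 < essential_percent <= 100:
        raise ValueError("essential_percent must be between 1 and 100")
    if rps_per_server <= 0:
        raise ValueError("rps_per_server must be positive")

    numerator = peak_rps * essential_percent
    denominator = 100 * rps_per_server
    # Integer ceiling division: round up so capacity is never short.
    return (numerator + denominator - 1) // denominator


# Hypothetical figures: 10,000 requests/sec at peak, 250 per server.
full_size = backup_servers_needed(10_000, 100, 250)
keep_the_lights_on = backup_servers_needed(10_000, 30, 250)
```

Under these assumed figures, a backup sized for 30 percent of peak needs well under a third of the servers a full-size mirror would, which is the difference between an affordable airbag and a second car.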
What you want to avoid is what is happening today: companies being partially or completely shut down. Customers will stay with you if they have wait times (at least they will if they understand the problem is short-lived), but you’ll lose them if they can’t connect at all.
Like any other critical service, having a fall-back plan that is tested and ready to execute can make you and your group look brilliant when others are failing.
And that alone is worth the added cost.
Wrapping Up: A Warning
Use this Amazon failure as a warning that cloud services are not bulletproof and that they can fail at any time for the same reasons any complex system can fail. When they do, they will take many, most, or all of their customers off-line with them.
Design in redundancies with adequate failover and you’ll look like a hero when something like this happens. And while this comes with cost, it is a vastly lower cost than having your CEO wonder if his IT department needs new leadership.
In the end the Amazon failure is a reminder to us all that systems, even cloud services, need redundancies within your control and that failing to put those redundancies in place can be career limiting.