This serves as a reminder that cloud services, while often thought of as a utility, actually aren't, and that the necessary redundancies, while often seemingly built in, haven't been fully tested. That means the cloud itself should have a fall-back plan for critical services.
My premise is that the cloud is not as reliable as a utility, but that really isn't true: in much of the world, power is actually less reliable than Amazon's service is.
It does suggest a similar approach to the problem, though, and one consistent with any service where the reliability can't be adequately assured for the class of service required by the company.
In the case of power, if you need a higher level of reliability than the electrical utility can provide, you put in place backup generating capacity adequate to the task. That way you can assure the reliability you need even if the utility can't meet the requirement.
In fact, in some parts of the world it isn't that uncommon for companies to forgo the utility altogether and live off their own power generation capability. This is very similar to companies choosing a private over a public cloud solution for their business. And given that most services can't yet provide the reliability needed for mission-critical applications, large enterprise providers like EMC, IBM, and HP continue to do great business with their private cloud offerings.
And while a backup system clearly will eat into the savings of using a cloud service, it also keeps you more intimate with the solution and likely provides a much better path if you need to switch cloud providers. In short, you can fail over into the backup system, switch providers, and enter testing, leaving the backup system as primary until you're ready to cut over to the new service. It could actually give you added flexibility in your choice of solution providers.
The failover solution doesn't have to take the entire peak load of the business, just enough of that load to keep the company operating until the cloud service recovers. And over time these services will themselves improve in redundancy and recovery speed.
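To make the sizing point concrete, here is a minimal sketch of a failover that carries only part of the load. Everything in it is an assumption for illustration: the `Backend` class, the `route` function, and the capacity figures (a backup sized at 30 percent of the primary) are hypothetical and do not describe any particular provider's tooling.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    capacity: int         # requests it can serve per interval (assumed figure)
    healthy: bool = True


def route(requests: int, primary: Backend, backup: Backend) -> dict:
    """Decide where one interval's requests go.

    Returns how many requests each backend serves and how many must wait.
    """
    if primary.healthy:
        served = min(requests, primary.capacity)
        return {"primary": served, "backup": 0, "queued": requests - served}
    # Primary is down: the backup takes what it can, and the rest wait
    # in a queue instead of being refused outright.
    served = min(requests, backup.capacity)
    return {"primary": 0, "backup": served, "queued": requests - served}


cloud = Backend("cloud", capacity=1000)
standby = Backend("standby", capacity=300)  # hypothetical: 30% of primary capacity

normal = route(800, cloud, standby)   # cloud service is up
cloud.healthy = False                 # simulate a cloud outage
outage = route(800, cloud, standby)   # backup carries a reduced load

print(normal)
print(outage)
```

During the simulated outage the backup serves what it can and the overflow waits, so the business keeps operating at reduced throughput rather than going dark.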
What you want to avoid is what is happening today: companies being partially or completely shut down. Customers will stay with you if they face wait times (at least they will if they understand the problem is short-lived), but you'll lose them if they can't connect at all.
As with any other critical service, having a fall-back plan that is tested and ready to execute can make you and your group look brilliant when others are failing.
And that alone is worth the added cost.
Design in redundancies with adequate failover and you'll look like a hero when something like this happens. And while this comes with a cost, it is a vastly lower cost than having your CEO wonder if his IT department needs new leadership.
In the end the Amazon failure is a reminder to us all that systems, even cloud services, need redundancies within your control and that failing to put those redundancies in place can be career limiting.