Amazon Describes Cloud Outage Fix - Apologizes

Sometimes it only takes a single “human” error to cause major repercussions for many in an increasingly cloud-driven world.

That was the case with Amazon’s (NASDAQ: AMZN) cloud computing service outage in late April, according to a lengthy post mortem that the company posted online.

Additionally, Amazon apologized — as much for poor customer communications as for the outage itself — and promised a ten-day service credit to all users running in the affected portion of Amazon’s cloud.

The outage, which began on April 21 and lasted for more than two days in some instances, affected an undisclosed number of customers of the Amazon Web Services (AWS) cloud computing offering.

A few, though the company is not saying how many, lost some data in the process.

The result was a lesson in how cloud computing can have its drawbacks as well as its advantages.

Amazon’s problem started while it was trying to upgrade network capacity at its AWS datacenter for the eastern region located in northern Virginia.

“During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS [Elastic Block Store] network to allow the upgrade to happen,” the post mortem said.

Instead of doing the cloud equivalent of blocking off one direction of a superhighway and rerouting the traffic through the other lanes, however, AWS staff accidentally routed that traffic off the freeway and onto a secondary road not built for the volume.

What happened next only partly fits the highway analogy, but traffic almost immediately came to a screeching halt on many, if not all, of the roads in the area.

The system was designed to handle such mishaps but not something of this magnitude and, like a multi-car pileup on an icy road, the result was a cascading failure of many customers’ services.

Among the sites that went down were Quora, Reddit, and Foursquare.

What magnified the impact, according to the post mortem, was the cluster-based system’s automatic attempts to reconnect all of the affected sites with uncorrupted versions of their data — which hampered recovery efforts.

In some cases, even customers who had architected their applications to provide redundancy outside of one so-called “availability zone” in order to protect them from just such an occurrence found that the problem extended to their redundant clusters as well.

One AWS customer that didn’t have that problem was cloud-based file exchange service ShareFile.

“ShareFile’s system is set up so that when Amazon experienced the outage, the affected servers were automatically dropped from ShareFile’s server farm without any human intervention and the upload and download success rates were normal,” a ShareFile spokesperson said in an email to InternetNews.com.

However, that specifically requires that customers’ developers architect their systems to handle some of the more sophisticated features of AWS themselves, analysts told InternetNews.com on April 25.
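The kind of automatic failover ShareFile describes can be illustrated with a small sketch. This is not ShareFile's or Amazon's actual code; the `ServerPool` class, the `max_failures` threshold, and the server names are all hypothetical, chosen only to show the general idea of dropping unhealthy servers from a farm without human intervention.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ServerPool:
    """Hypothetical pool that only routes to servers passing a health check."""

    health_check: Callable[[str], bool]  # returns True if the server responds
    max_failures: int = 2                # consecutive failures before removal
    _failures: Dict[str, int] = field(default_factory=dict)

    def add(self, server: str) -> None:
        """Register a server with a clean failure count."""
        self._failures[server] = 0

    def probe_all(self) -> None:
        """Run one round of health checks and update failure counts."""
        for server in self._failures:
            if self.health_check(server):
                self._failures[server] = 0   # a healthy probe resets the count
            else:
                self._failures[server] += 1  # an unhealthy probe adds a strike

    def active(self) -> List[str]:
        """Servers still eligible for traffic, in registration order."""
        return [s for s, n in self._failures.items() if n < self.max_failures]


if __name__ == "__main__":
    # Assumed scenario: one of three servers sits in the affected zone.
    unreachable = {"web-2"}
    pool = ServerPool(health_check=lambda s: s not in unreachable)
    for name in ("web-1", "web-2", "web-3"):
        pool.add(name)

    pool.probe_all()
    pool.probe_all()  # second consecutive failure removes web-2 from rotation
    print(pool.active())
```

Requiring more than one failed probe before removal is a common way to avoid dropping a server over a single transient error. In a real deployment the health check would be a network request, and the servers would need to span more than one availability zone for the remaining pool to survive a zone-level outage.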

Amazon’s post mortem made similar points but didn’t attach any blame other than to itself, although that may be small solace for some customers.

“Though we recovered nearly all of the affected database instances, 0.4 percent of single-availability zone database instances in the affected Availability Zone had an underlying EBS storage volume that was not recoverable,” the post mortem added.

As part of its ongoing efforts, Amazon intends to provide customers with better tools and more information in real time.

Amazon also said it is working to fix some previously unidentified “bugs” in the way the failover system works, and to expand network capacities to handle higher volumes of traffic if such a “re-mirroring storm” should happen again.

“Last, but certainly not least, we want to apologize. We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services,” the post mortem concluded.

Stuart J. Johnston is a contributing editor at InternetNews.com, the news service of Internet.com, the network for technology professionals. Follow him on Twitter @stuartj1000.