Sometimes it only takes a single “human” error to cause major repercussions for many in an increasingly cloud-driven world.
That was the case with Amazon’s (NASDAQ: AMZN) cloud computing service outage in late April, according to a lengthy post mortem that the company posted online.
Additionally, Amazon apologized — as much for poor customer communications as for the outage itself — and promised a ten-day service credit to all users running in the affected portion of Amazon’s cloud.
The outage, which began on April 21 and lasted for more than two days in some instances, impacted an undisclosed number of customers of the Amazon Web Services (AWS) cloud computing offering.
A few, though the company is not saying how many, lost some data in the process.
The result was a lesson in how cloud computing can have its drawbacks as well as its advantages.
Amazon’s problem started while it was trying to upgrade network capacity at its AWS datacenter for the eastern region located in northern Virginia.
“During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS [Elastic Block Store] network to allow the upgrade to happen,” the post mortem said.
Instead of doing the cloud equivalent of blocking off one direction of a superhighway and rerouting the traffic through the other lanes, however, AWS staff accidentally routed that traffic off the freeway and onto a secondary road not built for the volume.
What happened next only partly fits the highway analogy, but traffic almost immediately came to a screeching halt on many, if not all, of the roads in the area.
The system was designed to handle such mishaps but not something of this magnitude and, like a multi-car pileup on an icy road, the result was a cascading failure of many customers’ services.
Among the sites that went down were Quora, Reddit, and Foursquare.
What magnified the impact, according to the post mortem, was the cluster-based system’s automatic attempts to reconnect all of the affected sites with uncorrupted versions of their data — which hampered recovery efforts.
In some cases, even customers who had architected their applications to provide redundancy outside of one so-called “availability zone” in order to protect them from just such an occurrence found that the problem extended to their redundant clusters as well.
One AWS customer that didn’t have that problem was cloud-based file exchange service ShareFile.
“ShareFile’s system is set up so that when Amazon experienced the outage, the affected servers were automatically dropped from ShareFile’s server farm without any human intervention and the upload and download success rates were normal,” a ShareFile spokesperson said in an email to InternetNews.com.
However, that approach requires customers’ developers to architect their systems to handle some of the more sophisticated features of AWS themselves, analysts told InternetNews.com on April 25.
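The kind of automatic failover ShareFile describes can be illustrated with a minimal sketch. This is not ShareFile's or Amazon's actual code: the `ServerPool` class, the server names, and the health check are all hypothetical, and a real deployment would probe servers over the network rather than consult a hard-coded set.

```python
from typing import Callable, Iterable, List


class ServerPool:
    """Hypothetical pool that routes requests only to servers passing a health check."""

    def __init__(self, servers: Iterable[str], is_healthy: Callable[[str], bool]):
        self.servers: List[str] = list(servers)
        self.is_healthy = is_healthy
        self.active: List[str] = list(self.servers)

    def refresh(self) -> List[str]:
        """Re-run the health check; unhealthy servers drop out with no human step."""
        self.active = [s for s in self.servers if self.is_healthy(s)]
        return self.active

    def pick(self, request_id: int) -> str:
        """Spread requests across whichever servers are currently active."""
        if not self.active:
            raise RuntimeError("no healthy servers available")
        return self.active[request_id % len(self.active)]


# Simulated outage: the first two (made-up) servers stop responding.
down = {"east-zone-a-web1", "east-zone-a-web2"}
pool = ServerPool(
    ["east-zone-a-web1", "east-zone-a-web2", "east-zone-b-web1", "west-zone-a-web1"],
    is_healthy=lambda server: server not in down,
)
pool.refresh()
```

After `refresh()`, requests go only to the two servers outside the simulated failure. The sketch also shows why the analysts' caveat matters: the surviving servers must sit somewhere the failure does not reach, which is a design decision left to the customer.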
Amazon’s post mortem made similar points but didn’t attach any blame other than to itself, although that may be small solace for some customers.
“Though we recovered nearly all of the affected database instances, 0.4 percent of single-availability zone database instances in the affected Availability Zone had an underlying EBS storage volume that was not recoverable,” the post mortem added.
As part of its ongoing efforts, Amazon intends to provide customers with better tools and more information in real time.
Amazon also said it is working to fix some previously unidentified “bugs” in the way the failover system works, and to expand network capacities to handle higher volumes of traffic if such a “re-mirroring storm” should happen again.
“Last, but certainly not least, we want to apologize. We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services,” the post mortem concluded.
Stuart J. Johnston is a contributing editor at InternetNews.com, the news service of Internet.com, the network for technology professionals. Follow him on Twitter @stuartj1000.