Microsoft to Credit Customers for Azure Leap Day Bug

Written By

Pedro Hernandez

Mar 12, 2012

Microsoft suffered a widespread outage on its Azure cloud platform on February 29 and it’s trying to make amends to its customers with a 33 percent credit for “affected billing month(s),” according to the company. Given Azure’s global footprint, the bug may have stretched into the next day, March 1, for some customers.

Azure customers, “regardless of whether their service was impacted,” should see the credits applied to the next billing period.

Although cloud outages are hardly new — Amazon, Google and Microsoft have all suffered “unplanned downtime” — they continue to pose a particularly thorny challenge for cloud providers. A widespread outage on a cloud the size of Microsoft’s can have the knock-on effect of downing the online services of hundreds or thousands of businesses and startups.

Microsoft’s mea culpa provides an uncharacteristically transparent look at the factors behind the Leap Day outage and the steps the company is taking to prevent similar occurrences in the future.

In an Azure Blog post, Bill Laing, corporate vice president of Microsoft’s server and cloud division, explained how Azure’s infrastructure lost its footing on that ill-fated day. It boils down to a date-based software bug that affected Azure’s Access Control Service, Windows Azure Service Bus, SQL Azure Portal, and Data Sync Services. (Windows Azure Storage and SQL Azure were unaffected.)

According to Laing, the Leap Day outage was caused by the way security certificates are managed by Azure’s virtual machine “guest agents” (GA), “host agents” (HA) and the fabric controllers that oversee clusters of 1,000 servers each. Laing writes, “When the GA creates the transfer certificate, it gives it a one year validity range. It uses midnight UST of the current day as the valid-from date and one year from that date as the valid-to date.”

The problem with that setup is that Leap Day occurs only once every four years, so February 29 has no counterpart in the following year.

“The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail,” he writes.
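The failure mode Laing describes is easy to reproduce. The following is a minimal Python sketch, not Microsoft’s code: the function names `naive_valid_to` and `safe_valid_to` are hypothetical, and the fallback to February 28 is just one possible fix chosen for illustration, not the fix Microsoft applied.

```python
from datetime import date


def naive_valid_to(valid_from: date) -> date:
    """Add one to the year and keep month and day, as the bug is described."""
    # date.replace raises ValueError when the resulting date does not exist,
    # which is what happens for February 29 in a non-leap year.
    return valid_from.replace(year=valid_from.year + 1)


def safe_valid_to(valid_from: date) -> date:
    """Illustrative fix (an assumption): fall back to February 28."""
    try:
        return valid_from.replace(year=valid_from.year + 1)
    except ValueError:
        # Only February 29 can fail here, so day=28 is always valid.
        return valid_from.replace(year=valid_from.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_valid_to(leap_day)
    except ValueError as err:
        print("naive calculation failed:", err)
    print("safe calculation:", safe_valid_to(leap_day))
```

On any other day of the year the two functions agree. Only a valid-from date of February 29 exposes the difference.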

Domino Effect

“When a GA fails to create its certificates, it terminates,” Laing writes. “The HA has a 25-minute timeout for hearing from the GA. When a GA doesn’t connect within that timeout, the HA reinitializes the VM’s OS and restarts it.”
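The timeout-and-restart loop in that quote can be sketched as a small simulation. This is a simplified model under stated assumptions, not Azure’s implementation: `run_host_agent`, `ga_connects` and the `max_attempts` cap are hypothetical names and values introduced for illustration. Only the 25-minute timeout comes from Laing’s description.

```python
from typing import Callable, Tuple

GA_TIMEOUT_MINUTES = 25  # timeout quoted in the article


def run_host_agent(ga_connects: Callable[[], bool],
                   max_attempts: int = 3) -> Tuple[str, int]:
    """Simulate one HA supervising one VM.

    ga_connects: returns True if the GA created its certificates and connected.
    max_attempts: hypothetical cap on restarts before a human is asked to step in.
    Returns the final state and the total minutes spent waiting on timeouts.
    """
    minutes_waited = 0
    for _ in range(max_attempts):
        if ga_connects():
            return "healthy", minutes_waited
        # The GA terminated without connecting: the HA waits out its timeout,
        # then reinitializes the VM's OS and restarts it.
        minutes_waited += GA_TIMEOUT_MINUTES
    return "needs_intervention", minutes_waited


if __name__ == "__main__":
    # On leap day every certificate creation fails, so every restart fails too.
    print(run_host_agent(lambda: False))
    # On a normal day the GA connects on the first try.
    print(run_host_agent(lambda: True))
```

Because a restart cannot change the date, each retry in this model fails the same way, which is why restarting alone could not clear the fault.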

From there, entire clusters were teetering on the brink. After a prolonged period of inaccessibility, the fabric controller called for human intervention, first for the affected servers and then for the fabric controller itself. Eventually, large swaths of Azure’s infrastructure were affected, and the platform wasn’t fully brought back online until early on March 1.

Laing says that this trial by fire has given Microsoft clearer insights into Azure’s cloud configuration and management shortcomings. To prevent a Leap Day bug or other mishap from having such a widespread effect in the future, the company is taking new steps to strengthen Azure.

These include improved testing and better code analysis tools that will look out for time-related bugs. Microsoft has already analyzed its own code, says Laing. The company is also working to improve its fault isolation technology to better distinguish whether failures stem from hardware or software — in this case the fabric controllers incorrectly attributed the error to faulty hardware.

Pedro Hernandez is a contributor to the IT Business Edge Network, the network for technology professionals. Follow him on Twitter @ecoINSITE.


Pedro Hernandez

Pedro Hernandez is a contributor to Datamation, eWEEK, and the IT Business Edge Network, the network for technology professionals. Previously, he served as a managing editor for the Internet.com network of IT-related websites and as the Green IT curator for GigaOM Pro.
