
Microsoft to Credit Customers for Azure Leap Day Bug

March 12, 2012

Microsoft suffered a widespread outage on its Azure cloud platform on February 29 and it’s trying to make amends to its customers with a 33 percent credit for “affected billing month(s),” according to the company. Given Azure’s global footprint, the bug may have stretched into the next day, March 1, for some customers.

Azure customers, “regardless of whether their service was impacted,” should see the credits applied to the next billing period.

Although cloud outages are hardly new — Amazon, Google and Microsoft have all suffered “unplanned downtime” — they continue to pose a particularly thorny challenge for cloud providers. A widespread outage on a cloud the size of Microsoft’s can have the knock-on effect of downing the online services of hundreds or thousands of businesses and startups.

Microsoft’s mea culpa provides an uncharacteristically transparent look at the factors behind the Leap Day outage and the steps the company is taking to prevent similar incidents in the future.

In an Azure Blog post, Bill Laing, corporate vice president of Microsoft’s server and cloud division, explained how Azure’s infrastructure lost its footing on that ill-fated day. It boils down to a date-based software bug that affected Azure’s Access Control Service, Windows Azure Service Bus, SQL Azure Portal, and Data Sync Services. (Windows Azure Storage and SQL Azure were unaffected.)

According to Laing, the Leap Day outage was caused by how security certificates are managed by Azure’s virtual machine “guest agents” (GA), “host agents” (HA) and the fabric controllers that oversee clusters of 1,000 servers each. Laing writes, “When the GA creates the transfer certificate, it gives it a one year validity range. It uses midnight UST of the current day as the valid-from date and one year from that date as the valid-to date.”

The problem with that setup is that leap day comes around only once every four years, so February 29 has no counterpart in the following year.

“The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail,” he writes.
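
The failure Laing describes is easy to reproduce in most date libraries. The sketch below is an illustration in Python, not Microsoft's code; the function names and the choice to clamp to February 28 are assumptions made for the example.

```python
from datetime import date


def naive_valid_to(valid_from: date) -> date:
    # Mirrors the described bug: keep month and day, add one to the year.
    # For February 29 the result is not a real date, so this raises ValueError.
    return valid_from.replace(year=valid_from.year + 1)


def safe_valid_to(valid_from: date) -> date:
    # Hypothetical fix (an assumption, not Microsoft's actual patch):
    # fall back to February 28 when the target year has no February 29.
    try:
        return valid_from.replace(year=valid_from.year + 1)
    except ValueError:
        return valid_from.replace(year=valid_from.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_valid_to(leap_day)
    naive_failed = False
except ValueError:
    naive_failed = True

print("naive calculation failed:", naive_failed)
print("safe valid-to date:", safe_valid_to(leap_day))
```

Whether an expiry date should clamp to February 28 or roll forward to March 1 is a policy choice. Either one avoids producing a date that does not exist.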

Domino Effect

“When a GA fails to create its certificates, it terminates,” Laing writes. “The HA has a 25-minute timeout for hearing from the GA. When a GA doesn’t connect within that timeout, the HA reinitializes the VM’s OS and restarts it.”
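
The timeout behavior in that quote can be sketched as a small simulation. Only the 25-minute figure comes from Laing's description; the class, function names and the minute-by-minute loop are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass

GA_TIMEOUT_MINUTES = 25  # timeout quoted by Laing


@dataclass
class VmState:
    minutes_since_start: int
    ga_connected: bool


def host_agent_action(vm: VmState) -> str:
    # Hypothetical decision rule for the host agent (HA).
    if vm.ga_connected:
        return "healthy"
    if vm.minutes_since_start >= GA_TIMEOUT_MINUTES:
        return "reinitialize"
    return "wait"


def count_reinitializations(ga_can_start: bool, total_minutes: int) -> int:
    # Step through time one minute at a time and count how often the HA
    # reinitializes a VM whose guest agent never connects.
    restarts = 0
    elapsed = 0
    for _ in range(total_minutes):
        elapsed += 1
        vm = VmState(minutes_since_start=elapsed, ga_connected=ga_can_start)
        if host_agent_action(vm) == "reinitialize":
            restarts += 1
            elapsed = 0  # the VM starts over and the GA fails again
    return restarts


print(count_reinitializations(ga_can_start=False, total_minutes=75))
print(count_reinitializations(ga_can_start=True, total_minutes=75))
```

Because the certificate failure depended on the date and not on the machine, restarting the VM could not help: a guest agent that terminates on every start simply triggers the timeout again.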

From there, entire clusters teetered on the brink. After a prolonged period of inaccessibility, the fabric controller called for human intervention, first for the affected servers and eventually for the fabric controller itself. Ultimately, large swaths of Azure’s infrastructure were affected, and service wasn’t fully restored until early on March 1.

Laing says that this trial by fire has given Microsoft clearer insights into Azure’s cloud configuration and management shortcomings. To prevent a Leap Day bug or other mishap from having such a widespread effect in the future, the company is taking new steps to strengthen Azure.

These include improved testing and better code analysis tools that will look out for time-related bugs. Microsoft has already analyzed its own code, says Laing. The company is also working to improve its fault isolation technology to better distinguish whether failures stem from hardware or software — in this case the fabric controllers incorrectly attributed the error to faulty hardware.
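
One simple form of testing for time-related bugs is to run date logic against every day in a range that spans leap years. The sketch below is a generic illustration, not a description of Microsoft's tooling; `find_failing_dates` and `add_one_year_naive` are hypothetical names.

```python
from datetime import date, timedelta


def find_failing_dates(func, start: date, end: date) -> list:
    # Call func on every date from start to end (inclusive) and collect
    # the dates on which it raises ValueError.
    failures = []
    day = start
    while day <= end:
        try:
            func(day)
        except ValueError:
            failures.append(day)
        day += timedelta(days=1)
    return failures


def add_one_year_naive(d: date) -> date:
    # Same year-plus-one arithmetic that failed on leap day.
    return d.replace(year=d.year + 1)


failing = find_failing_dates(add_one_year_naive, date(2012, 1, 1), date(2016, 12, 31))
print(failing)
```

A sweep like this only catches errors that surface as exceptions. Date logic that silently returns a wrong value needs explicit expected results to compare against.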

Pedro Hernandez is a contributor to the IT Business Edge Network, the network for technology professionals. Follow him on Twitter @ecoINSITE.
