Microsoft suffered a widespread outage on its Azure cloud platform on February 29, 2012, and it's trying to make amends to its customers with a 33 percent credit for "affected billing month(s)," according to the company. Given Azure's global footprint, the outage may have stretched into the next day, March 1, for some customers.
Azure customers, “regardless of whether their service was impacted,” should see the credits applied to the next billing period.
Although cloud outages are hardly new — Amazon, Google and Microsoft have all suffered “unplanned downtime” — they continue to pose a particularly thorny challenge for cloud providers. A widespread outage on a cloud the size of Microsoft’s can have the knock-on effect of downing the online services of hundreds or thousands of businesses and startups.
Microsoft's mea culpa provides an uncharacteristically transparent look into the factors behind the Leap Day outage and the steps the company is taking to prevent it and similar occurrences in the future.
In an Azure Blog post, Bill Laing, corporate vice president of Microsoft's server and cloud division, explained how Azure's infrastructure lost its footing on that ill-fated day. It boils down to a date-based software bug that affected Azure's Access Control Service, Windows Azure Service Bus, SQL Azure Portal, and Data Sync Services. (Windows Azure Storage and SQL Azure themselves were unaffected.)
According to Laing, the Leap Day outage was caused by how security certificates are managed by Azure's virtual machine "guest agents" (GA), "host agents" (HA) and the fabric controllers that oversee clusters of 1,000 servers each. Laing writes, "When the GA creates the transfer certificate, it gives it a one year validity range. It uses midnight UST of the current day as the valid-from date and one year from that date as the valid-to date."
The problem with that setup surfaces only once every four years: on Leap Day, adding a year to the current date produces a date that doesn't exist.
“The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail,” he writes.
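The broken year arithmetic is easy to reproduce. Here is a minimal Python sketch of the failure mode Laing describes, along with one possible fix; the function names are illustrative, not Azure's actual code:

```python
from datetime import date

def valid_to_buggy(valid_from: date) -> date:
    """Naive year arithmetic: take the current date and add one to the year.
    On Feb 29, 2012 this asks for Feb 29, 2013, which doesn't exist."""
    return valid_from.replace(year=valid_from.year + 1)

def valid_to_safe(valid_from: date) -> date:
    """One safe alternative: fall back to Feb 28 when the naive date is invalid."""
    try:
        return valid_from.replace(year=valid_from.year + 1)
    except ValueError:
        return valid_from.replace(year=valid_from.year + 1, day=28)

leap_day = date(2012, 2, 29)
print(valid_to_safe(leap_day))  # 2013-02-28
try:
    valid_to_buggy(leap_day)
except ValueError as e:
    print("certificate creation fails:", e)
```

On any other day of the year the two functions agree, which is why ordinary testing never tripped over the bug.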
“When a GA fails to create its certificates, it terminates,” Laing writes. “The HA has a 25-minute timeout for hearing from the GA. When a GA doesn’t connect within that timeout, the HA reinitializes the VM’s OS and restarts it.”
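That timeout logic turned one failing certificate into a feedback loop: each restart re-ran the same GA on the same leap day, so the VM could never come back. A rough Python simulation of the cycle (names and the restart count are illustrative assumptions, not Azure internals):

```python
from datetime import date

def guest_agent_starts(today: date) -> bool:
    """The GA terminates if transfer-certificate creation fails (the leap-day bug)."""
    try:
        today.replace(year=today.year + 1)  # compute the valid-to date
        return True
    except ValueError:
        return False  # GA terminates; the HA never hears from it

def host_agent(today: date, max_restarts: int = 3) -> str:
    """After each 25-minute timeout the HA reinitializes the VM's OS and
    restarts it, which simply re-runs the same failing GA."""
    for attempt in range(max_restarts):
        if guest_agent_starts(today):
            return f"VM healthy after {attempt} restart(s)"
    return "VM still down; fabric controller escalates"

print(host_agent(date(2012, 2, 29)))  # restarts never help on leap day
print(host_agent(date(2012, 3, 1)))   # succeeds immediately the next day
```

The escalation path at the end of the loop is what ultimately pulled the fabric controllers, and then humans, into the incident.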
From there, entire clusters were teetering on the brink. After a prolonged period of inaccessibility, the fabric controller called for human intervention, first for the affected servers and eventually for the fabric controller itself. Ultimately, large swaths of Azure's infrastructure were affected, and service wasn't fully restored until early on March 1.
Laing says that this trial by fire has given Microsoft clearer insights into Azure’s cloud configuration and management shortcomings. To prevent a Leap Day bug or other mishap from having such a widespread effect in the future, the company is taking new steps to strengthen Azure.
These include improved testing and better code analysis tools that will look out for time-related bugs. Microsoft has already analyzed its own code, says Laing. The company is also working to improve its fault isolation technology to better distinguish whether failures stem from hardware or software — in this case the fabric controllers incorrectly attributed the error to faulty hardware.