Microsoft Azure experienced a significant outage lasting approximately 10.5 hours, affecting Azure DevOps services in the South Brazil (SBR) region.

The outage was caused by a typo in the snapshot deletion job, which deleted the entire Azure SQL Server rather than the intended Azure SQL Database.

This led to the deletion of all seventeen production databases for the scale unit, leaving it unable to process customer traffic.
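
Microsoft has not published the job's code, but the failure mode amounts to issuing a delete call one level too high in the resource hierarchy. The sketch below illustrates that distinction with the Python azure-mgmt-sql SDK; the subscription, resource group, server, and database names are hypothetical.

```python
# Hypothetical sketch of the failure mode, using the Python azure-mgmt-sql SDK.
# Microsoft has not published the job's actual code; all resource names are made up.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

credential = DefaultAzureCredential()
client = SqlManagementClient(credential, subscription_id="<subscription-id>")

RESOURCE_GROUP = "devops-sbr-rg"        # hypothetical
SERVER_NAME = "devops-sbr-sqlserver"    # hypothetical
SNAPSHOT_DB_NAME = "snapshot-temp-db"   # hypothetical

# Intended call: delete only the stale snapshot database.
client.databases.begin_delete(RESOURCE_GROUP, SERVER_NAME, SNAPSHOT_DB_NAME).wait()

# A call one level up in the resource hierarchy removes the whole server,
# and with it every database it hosts -- the kind of mistake described above.
# client.servers.begin_delete(RESOURCE_GROUP, SERVER_NAME).wait()
```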

Was There Any Data Loss?

Fortunately, no data was lost during the outage. The issue was detected within 20 minutes, and on-call engineers were promptly engaged to resolve the problem. However, several factors contributed to the extended recovery time.

Firstly, since customers cannot restore Azure SQL Servers themselves, the Azure SQL team had to be involved in the restoration process.

This process, from identifying the need for the Azure SQL on-call engineer to restoring the server, took approximately one hour, according to Eric Mattingly, Principal Software Engineering Manager at Microsoft Azure.

Secondly, restoring the databases added extra time due to their backup configurations. While some databases were configured with Geo-zone-redundant backup, others were created before this feature was available and only had Zone-redundant backup.

As a result, the restoration process included copying the data to the paired region, increasing the recovery time depending on the database sizes.

Moving forward, Microsoft Azure said it would ensure that all database backups are configured as Geo-zone-redundant across all scale units.
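
For databases created today, that redundancy can be requested explicitly. The sketch below shows how such a setting might be applied with the Python azure-mgmt-sql SDK; the resource names are hypothetical, and the requested_backup_storage_redundancy field assumes a recent SDK and API version.

```python
# Sketch: requesting geo-zone-redundant backup storage for an existing database
# with the Python azure-mgmt-sql SDK. Resource names are hypothetical, and the
# requested_backup_storage_redundancy field assumes a recent API version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import DatabaseUpdate

client = SqlManagementClient(DefaultAzureCredential(), subscription_id="<subscription-id>")

poller = client.databases.begin_update(
    resource_group_name="devops-sbr-rg",       # hypothetical
    server_name="devops-sbr-sqlserver",        # hypothetical
    database_name="partition-db-01",           # hypothetical
    parameters=DatabaseUpdate(
        requested_backup_storage_redundancy="GeoZone",
    ),
)
poller.result()  # block until the update completes
```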

Lastly, even after the databases were restored, the entire scale unit remained inaccessible because of complications with the web servers. Recycling the w3wp processes on the servers triggered periodic warm-up tasks, which encountered errors and took far longer than usual to complete.

This caused the web servers' health probes to fail, so the load balancer stopped routing customer traffic to them. To address this, Microsoft Azure implemented measures to gradually unblock users and allow the web servers to warm up properly.

"Towards the end of the outage window, we blocked all traffic to the scale unit with our Resource Utilization feature to allow all web servers to warm-up and successfully enter the load balancer," Mattingly said in a statement.

"This resulted in users receiving rate limit and usage errors. Once all databases were healthy, we gradually unblocked users to ramp the customer traffic up to normal levels."

Preventing Similar Incidents

Microsoft Azure has since taken steps to prevent similar incidents and improve the resilience of their services. They have fixed the bug in the snapshot deletion job, created comprehensive tests, and implemented Azure Resource Manager Locks to prevent accidental deletions.

Additionally, they are ensuring that all Azure SQL Database backups are configured with Geo-zone redundancy and segregating snapshot databases from production databases.
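
A management lock of the kind mentioned above can be applied at the resource level so that delete calls fail until the lock is explicitly removed. Below is a sketch using the Python azure-mgmt-resource SDK; the resource names are hypothetical.

```python
# Sketch: applying a CanNotDelete management lock to an Azure SQL server with
# the Python azure-mgmt-resource SDK, the kind of guard described above.
# Resource names are hypothetical; check your SDK version's exact model names.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource.locks import ManagementLockClient
from azure.mgmt.resource.locks.models import ManagementLockObject

lock_client = ManagementLockClient(DefaultAzureCredential(), subscription_id="<subscription-id>")

lock_client.management_locks.create_or_update_at_resource_level(
    resource_group_name="devops-sbr-rg",          # hypothetical
    resource_provider_namespace="Microsoft.Sql",
    parent_resource_path="",
    resource_type="servers",
    resource_name="devops-sbr-sqlserver",         # hypothetical
    lock_name="do-not-delete",
    parameters=ManagementLockObject(
        level="CanNotDelete",
        notes="Production server; deletion requires removing this lock first.",
    ),
)
```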

Mattingly apologized to all the customers impacted by the outage and assured them of the measures being taken to prevent future occurrences. 
