- Google Cloud’s API service was to blame for widespread outage
- Most regions were back online in 40 minutes, but some took even longer
- The company has promised to protect against future outages and improve communication
Following Google Cloud’s recent widespread outage, which took sites like Spotify, Cloudflare and Discord offline, the company released its detailed report sharing exactly why it failed customers.
The company says the root cause was a code issue in Service Control – part of the company’s API management and policy checking system.
Specifically, an invalid automated quota update and a lack of proper error handling triggered a global crash loop, with 503 errors seen not only across Google Cloud services, but also across services using its APIs.
Google Cloud outage caused by API issue
The outage affected Google Cloud infrastructure as well as popular Google Workspace apps like Drive, Docs, Gmail and Calendar. Third-party services that rely on Google Cloud’s APIs were also affected, including music streaming platform Spotify, which boasts 678 million users, and some Cloudflare services.
“On May 29, 2025, a new feature was added to Service Control for additional quota policy checks,” the company wrote in its incident report. “The issue with this change was that it did not have appropriate error handling nor was it feature flag protected.”
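To make the two missing safeguards concrete, here is a minimal Python sketch of a new check that is both feature-flag protected and wrapped in error handling. It is not Google's code: the `QuotaPolicy` record, the `FLAGS` dictionary, the `check_quota` function and the choice to let requests through when a policy is malformed are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    # Hypothetical policy record; `limit` may be missing in a malformed update.
    limit: Optional[int]


# Hypothetical feature flag store. The new check stays off until it is
# explicitly enabled, so it can be rolled out gradually and switched off fast.
FLAGS = {"extra_quota_checks": False}


def check_quota(policy: Optional[QuotaPolicy], usage: int) -> bool:
    """Return True if the request should be allowed."""
    if not FLAGS.get("extra_quota_checks", False):
        # New code path disabled: behave exactly as before the change.
        return True
    try:
        if policy is None or policy.limit is None:
            raise ValueError("malformed quota policy")
        return usage <= policy.limit
    except ValueError:
        # Assumed design choice: allow the request rather than crash the
        # process when the policy data is bad.
        return True
```

With the flag off, the new check never runs. With the flag on, a policy with a missing limit is handled as an error and the request is allowed, instead of an unhandled exception taking down the process.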
Google Cloud boasted that its Site Reliability Engineering team started triaging the incident within two minutes and identified the root cause within 10 minutes. “The red-button [to disable the serving path] was ready to roll out ~25 minutes from the start of the incident,” Google said, with the rollout complete within 40 minutes.
Although smaller regions recovered relatively quickly, larger regions took longer to come back online. The us-central1 region, for example, took around two hours and 40 minutes.
In the mini incident report it issued on the day of the outage, Google Cloud promised to “do better.” Its more detailed report promises the usual responses, such as improving static analysis and testing practices, and auditing and modularizing Service Control’s architecture to contain future incidents. The company has also pledged to “improve [its] external communications” to better inform customers, and to ensure that its communications infrastructure remains online even during such outages.