On 2019-03-06 between 02:07 and 05:36 UTC, a large percentage of API requests made against our US East API cluster, which is used by customers mainly located in the eastern USA and Europe, failed with an HTTP 502 error. The problem was due to a defect in a Google Cloud Networking deployment that broke the Load Balancing service and was resolved after the Google Cloud engineering team repaired the fault.
Root Cause
A new rollout of HTTP LB software in Google Cloud region us-east4 was defective and resulted in our API cluster’s frontend Load Balancer to return an elevated rate of HTTP 502 errors.
After verifying our API service was fully operational within our network we escalated the issue with Google Cloud support and their engineering team identified and rolled back the defective update to their HTTP LB service.
Our internal monitoring systems missed the elevated 502 error rate since these errors were occurring prior to reaching our backend services. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.