Elevated 502 errors for our us-east data collection API
Incident Report for Customer.io
Postmortem

Overview

On 2019-03-06 between 02:07 and 05:36 UTC, a large percentage of API requests made to our US East API cluster, which primarily serves customers in the eastern USA and Europe, failed with HTTP 502 errors. The problem was caused by a defect in a Google Cloud Networking deployment that broke the Load Balancing service, and was resolved after the Google Cloud engineering team repaired the fault.

Root Cause

A new rollout of HTTP LB software in Google Cloud region us-east4 was defective and caused our API cluster's frontend Load Balancer to return an elevated rate of HTTP 502 errors.

Resolution

After verifying that our API service was fully operational within our network, we escalated the issue to Google Cloud support. Their engineering team identified and rolled back the defective update to their HTTP LB service.

Post-incident Remediation

Our internal monitoring systems missed the elevated 502 error rate because the errors were returned by the load balancer before requests reached our backend services. We will conduct an internal investigation of this gap and make appropriate improvements to our systems to help prevent or minimize future recurrence.
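One common way to close this kind of gap is to probe the public endpoint from outside the network, so that errors emitted by the load balancer itself are observed. The sketch below is a minimal, hypothetical example of such an external check; the endpoint URL and thresholds are assumptions, not our actual configuration:

```python
import urllib.request
import urllib.error

# Hypothetical public endpoint; the real URL is an assumption for illustration.
ENDPOINT = "https://track.example.com/healthz"

def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code as seen from outside the network."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # 5xx responses (e.g. a 502 emitted by the load balancer) land here,
        # which is exactly the class of failure internal monitoring missed.
        return e.code

def is_edge_healthy(status: int) -> bool:
    # Treat any 5xx as unhealthy, including 502s returned before
    # requests ever reach the backend services.
    return status < 500
```

Run on a schedule from a vantage point outside the serving network, a probe like this would have surfaced the 502s regardless of whether backend services saw the traffic.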

Posted Mar 07, 2019 - 16:46 UTC

Resolved
Our cloud provider confirmed that a recent rollout of their HTTP LB software was the cause of these elevated errors. They rolled back the update and we confirmed recovery at 2019-03-06 05:36:00 UTC. Over the past hour we have not seen any regression in the eastern API servers.

We will publish a postmortem for this incident within the next 48 hours.
Posted Mar 06, 2019 - 06:31 UTC
Monitoring
We've finished redeploying our eastern data collection API and can confirm the 502 errors are no longer being returned. We are working with our cloud provider to determine the root cause of the elevated errors and will continue monitoring to ensure the issue doesn't return.
Posted Mar 06, 2019 - 05:50 UTC
Update
We're continuing to investigate the issue with our eastern API cluster. Our SRE team has redeployed our API service on the affected cluster. We'll update the incident by 2019-03-06 05:30:00 UTC.
Posted Mar 06, 2019 - 05:16 UTC
Update
Our SRE team is continuing to investigate the elevated failure rate on our eastern API servers. We'll provide an update by 2019-03-06 05:15:00 UTC.
Posted Mar 06, 2019 - 04:52 UTC
Investigating
We are currently investigating elevated failure rates on our edge API servers located in the eastern United States. Our SRE team is looking into the failures, and we'll provide an update by 2019-03-06 04:50:00 UTC.
Posted Mar 06, 2019 - 04:35 UTC
This incident affected: Data Collection.