Increased number of API failures in our EU data centre

Incident Report for CustomerIO

Postmortem

Incident Summary

On Thursday, September 9th, between 09:48 and 13:29 UTC, our API endpoint hosted in the EU (https://track-eu.customer.io) experienced an elevated rate of failures. During that period, between 3% and 4% of all API calls to this endpoint failed. The affected calls received HTTP 502 errors.

The incident affected only our data collection API in the EU. All other services were unaffected.

Root Cause

This incident was caused by a bug in our service that stores and forwards API requests to our back-end infrastructure. An account migration from our US data centre to our EU data centre triggered the bug, which generated zero-sized API requests to our EU API endpoint. These requests caused our services to crash and restart, and incoming requests could not be handled during each restart. The failures were sporadic and irregular in frequency.

Resolution and Recovery

The team was notified of the issue on September 9th at 09:54 UTC and started investigating. The cause was determined at 12:40 UTC, and a fix that gracefully handles zero-sized requests was deployed at 13:29 UTC, resolving the issue. The team then continued investigating to determine the source of these requests, and developed and deployed additional fixes to prevent similar issues from affecting our services. The duration of this incident, roughly 3.5 hours until resolution, was outside our target SLO, and we are working on improving this.
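
To illustrate what gracefully handling a zero-sized request can look like, here is a minimal sketch in Python. It is not the code we deployed: the function name `handle_request`, the JSON payload format, and the status codes and messages are all assumptions made for the example. The idea is simply that an empty or malformed body is rejected with an error response instead of raising an exception that could take the process down.

```python
import json
from typing import Optional, Tuple


def handle_request(body: Optional[bytes]) -> Tuple[int, str]:
    """Hypothetical handler: return (status_code, message) instead of raising."""
    # A missing or zero-sized body is rejected up front, before any parsing.
    if not body:
        return 400, "empty request body"

    # Malformed input is also turned into an error response, not an exception.
    # json.JSONDecodeError and UnicodeDecodeError are both subclasses of ValueError.
    try:
        payload = json.loads(body)
    except ValueError:
        return 400, "malformed request body"

    # Assumption for this sketch: a valid request is a JSON object.
    if not isinstance(payload, dict):
        return 400, "unexpected payload type"

    return 200, "accepted"
```

With a guard like this, a zero-sized request costs one rejected call rather than a service restart that also drops unrelated requests.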

Posted Sep 17, 2021 - 18:37 UTC

Resolved

This incident has been resolved.

Posted Sep 09, 2021 - 14:42 UTC

Monitoring

A fix is applied and we are monitoring the service. Failures have stopped.

Posted Sep 09, 2021 - 13:32 UTC

Identified

The cause of these failures has been identified and we are working on a fix.

Posted Sep 09, 2021 - 12:45 UTC

Investigating

We are currently investigating this issue.

Posted Sep 09, 2021 - 12:21 UTC

This incident affected: Data Collection.