On Thursday, September 9th, between 09:48 and 13:29 UTC, our API endpoint hosted in the EU (https://track-eu.customer.io) experienced an elevated rate of failures. During that period, between 3% and 4% of all API calls failed. The affected calls received HTTP 502 errors.
The incident affected only our data collection API in the EU. The rest of our services were unaffected.
This incident was caused by a bug in our service that stores and forwards API requests to our back-end infrastructure. An account migration from our US data centre to our EU data centre triggered the bug, which generated zero-sized API requests to our EU API endpoint. These requests caused our services to crash and restart, and incoming requests failed while a restart was in progress. The failures were sporadic and irregular in frequency.
The team was notified about the issue on September 9th at 09:54 UTC and started investigating. The cause was determined at 12:40 UTC, and a fix that gracefully handles zero-sized requests was deployed at 13:29 UTC, resolving the issue. The team continued its investigation to determine the source of these requests. Additional fixes were developed and deployed to prevent similar issues from affecting our services. The duration of this incident, 3.5 hours until resolution, was outside our target SLO, and we are working on improving this.
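
For illustration only, the kind of guard described above (handling a zero-sized request gracefully instead of crashing) might look like the following minimal Python sketch. It is not our production code: the names ForwardingQueue and handle, the in-memory list, and the choice of a 400 response are all hypothetical and assumed for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ForwardingQueue:
    """Hypothetical in-memory stand-in for a store-and-forward service."""

    stored: List[bytes] = field(default_factory=list)

    def handle(self, body: bytes) -> Tuple[int, str]:
        # Reject zero-sized requests up front so they never reach code
        # that assumes a non-empty payload (assumed behaviour, not the
        # actual fix that was deployed).
        if len(body) == 0:
            return 400, "empty request body"
        # Store the request so it can be forwarded to the back end later.
        self.stored.append(body)
        return 200, "accepted"


# Usage: an empty body is rejected, a normal body is stored.
queue = ForwardingQueue()
empty_result = queue.handle(b"")
normal_result = queue.handle(b'{"event": "signup"}')
```

The point of the sketch is that the empty request gets an explicit error response and the process keeps serving other requests, rather than restarting and dropping calls that arrive during the restart.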