Partial Data Processing Outage - Denial of Service
Incident Report for Customer.io
Postmortem

Incident Summary

On 2020/01/10, one of our backend databases was unavailable for approximately 10 minutes because API traffic from a single customer rose to levels that amounted to an unintentional denial of service. Our SRE team identified the issue at 2020/01/09 23:51 UTC. At 2020/01/10 00:06 UTC, the problem caused the affected database to become unavailable. Our incident response team resolved the issue at 2020/01/10 00:16 UTC.

This period of downtime impacted customers with a workspace on the affected backend database (roughly 1/6 of our users). The downtime caused a delay in inbound API call processing and outbound message sending. All delayed processing was completed successfully once the downtime was resolved. No customer data was lost.

Root Cause

Abnormally elevated rates of API calls resulted in a single backend database locking up from resource exhaustion.

Resolution and Recovery

Our team took two actions in parallel: we deployed a hotfix that disabled API access for the account causing the denial of service, and we restored the backend database that had locked up. Once the database was back online and serving requests, the downtime was resolved.

Corrective and Preventative Measures

Our team has been in communication with the customer whose traffic caused this downtime, and we're working directly with them to improve their integration with Customer.io. Additionally, our team is exploring safeguards to prevent a recurrence of this issue.
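
This report does not say which safeguards are being considered. As an illustration only, one common safeguard against a single account exhausting a shared backend is per-account rate limiting. The sketch below is a minimal in-memory token bucket in Python. The class name, the parameters, and the choice of algorithm are assumptions made for this example and do not describe Customer.io's actual implementation.

```python
import time
from typing import Callable, Dict


class PerAccountRateLimiter:
    """Hypothetical per-account token bucket limiter (illustrative only).

    Each account gets its own bucket that refills at `rate` tokens per
    second up to `capacity`. One request consumes one token.
    """

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested
        self._tokens: Dict[str, float] = {}
        self._updated_at: Dict[str, float] = {}

    def allow(self, account_id: str) -> bool:
        """Return True if this account may make a request right now."""
        now = self.clock()

        # A new account starts with a full bucket.
        tokens = self._tokens.get(account_id, self.capacity)
        last = self._updated_at.get(account_id, now)

        # Refill in proportion to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        self._updated_at[account_id] = now

        if tokens >= 1.0:
            self._tokens[account_id] = tokens - 1.0
            return True

        # Out of tokens: reject, but keep the partially refilled balance.
        self._tokens[account_id] = tokens
        return False
```

With a limiter like this in front of the API, a burst from one account would be rejected once that account's bucket is empty (typically with an HTTP 429 response), while traffic from other accounts on the same backend database would be unaffected. A production version would also need shared state across API servers, which this sketch leaves out.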

Posted Jan 10, 2020 - 22:11 UTC

Resolved
This incident has been resolved.
Posted Jan 10, 2020 - 00:16 UTC