On 2020/01/10, one of our backend databases was unavailable for approximately 10 minutes because API traffic from a single customer rose to levels that amounted to an unintentional denial of service. Our SRE team identified the issue at 2020/01/09 23:51 UTC. At 2020/01/10 00:06 UTC, the problem caused the affected database to become unavailable. Our incident response team resolved the issue at 2020/01/10 00:16 UTC.
This period of downtime impacted customers with a workspace on the affected backend database (roughly 1/6 of our users). The downtime caused a delay in inbound API call processing and outbound message sending. All delayed processing was completed successfully once the downtime was resolved. No customer data was lost.
Abnormally elevated rates of API calls resulted in a single backend database locking up from resource exhaustion.
Our team deployed a hotfix that disabled API access for the account causing the denial of service and, at the same time, restored the backend database that had locked up. Once the database was back online and servicing requests, the downtime was resolved.
Our team has been in communication with the customer whose traffic caused this downtime, and we're working directly with them to improve their integration with Customer.io. Additionally, our team is exploring safeguards to prevent a recurrence of this issue.
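
This report does not say which safeguards are under consideration. One common option for this class of problem is a per-account rate limit on API calls, so that a single integration cannot exhaust a shared backend. The following is a minimal sketch of a token-bucket limiter, not a description of our actual implementation. The names `TokenBucket` and `AccountRateLimiter`, and the `rate` and `capacity` values, are hypothetical and chosen only for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    """Token bucket for one account (hypothetical helper)."""
    rate: float        # tokens added per second (sustained request rate)
    capacity: float    # maximum tokens held (allowed burst size)
    tokens: float      # tokens currently available
    updated_at: float  # clock reading at the last refill

    def try_consume(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class AccountRateLimiter:
    """Keeps one bucket per account so a noisy account only throttles itself."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._capacity = capacity
        self._clock = clock  # injectable so the limiter can be tested
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, account_id: str) -> bool:
        now = self._clock()
        bucket = self._buckets.get(account_id)
        if bucket is None:
            # New accounts start with a full bucket.
            bucket = TokenBucket(self._rate, self._capacity, self._capacity, now)
            self._buckets[account_id] = bucket
        return bucket.try_consume(now)


if __name__ == "__main__":
    # Assumed limits for illustration: 100 requests/second, bursts up to 200.
    limiter = AccountRateLimiter(rate=100.0, capacity=200.0)
    print(limiter.allow("account-123"))
```

In a design like this, a request for which `allow` returns `False` would typically be rejected with an HTTP 429 (Too Many Requests) response before it reaches the database, while other accounts on the same backend are unaffected.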