The management interface is temporarily inaccessible and sending has stopped
Incident Report for Customer.io
Postmortem

Overview

Between 2018-12-21 05:29 UTC and 2018-12-21 10:08 UTC, processing of incoming data was paused for approximately 25% of our customers. For affected customers, the Customer.io app was unavailable, no data was processed, and no messages were delivered. For the remainder of our customers, there was no service impact. No data was lost during the outage: queued data began processing after 10:08 UTC with a small delay, and all backlogged data was processed by 11:05 UTC.

Root Cause

A kernel bug was triggered by MySQL on the affected server. This bug caused MySQL to lock up and required us to restart the affected server.

Resolution

We first attempted to restart MySQL on the crashed server. When this failed, we brought the server down completely and restarted it. On restart, MySQL was able to recover and data began to process again. We did not lose data during the crash; processing was delayed pending the recovery.

Post-incident Remediation

We encountered this kernel bug previously on a different database server and upgraded the kernel on all of our database servers in response. Given the repeat crash seen in this incident, we've improved our response plans for handling this class of database server failure. We're also investigating alternatives such as a vendor-specific kernel and rolling back to an earlier kernel version. We'll test and evaluate these options while continuing to monitor for and respond to failures of this type.

Posted Jan 01, 2019 - 00:30 UTC

Resolved
The backlog has been processed. Back to preparing for the winter holidays - campaigns and all! 🎄
Posted Dec 21, 2018 - 11:12 UTC
Monitoring
Everything is working correctly again, no data was lost, and the backlog is currently being processed. Our engineering team is monitoring the situation.

We will provide a final update at 11:22 UTC.
Posted Dec 21, 2018 - 10:24 UTC
Update
We are continuing to investigate. We will provide an update at 10:15 UTC.
Posted Dec 21, 2018 - 09:16 UTC
Investigating
Started: December 21, 2018 6:49am UTC
Severity: high - customers are unable to use the Customer.io management interface and sending has stopped
Who is affected: still investigating the impact

The cause of the outage is currently unknown and is being investigated. We will provide an update at 09:15 UTC.
Posted Dec 21, 2018 - 08:12 UTC
This incident affected: Email Sending and Management Interface.