App is down for some customers
Incident Report for Customer.io
Postmortem

Overview

Between 2018-12-04 17:27 UTC and 2018-12-04 18:12 UTC, processing of incoming data was paused for approximately 25% of our customers. For affected customers, the Customer.io app was unavailable, no data was processed, and no messages were delivered. For the remainder of our customers there was no service impact. No data was lost during the outage, and any queued data was processed with a small delay after 18:12 UTC.

Root Cause

A kernel bug was triggered by MySQL on the affected server. This bug caused a MySQL deadlock and required us to restart the affected server.

Resolution

We first attempted to restart MySQL on the crashed server. When this failed, we brought the server down completely and restarted it. On restart, MySQL was able to recover and data began to process again. We did not lose data during the crash; processing was delayed pending the recovery.

Post-incident Remediation

We're planning an upgrade to the kernel of our database servers given the bug encountered today. In addition, we're re-examining our timelines for failover to the hot standby.

Posted 9 months ago. Dec 04, 2018 - 20:48 UTC

Resolved
No further problems detected. Sorry for the inconvenience.
Posted 9 months ago. Dec 04, 2018 - 18:58 UTC
Update
Broken infrastructure has been repaired. Everything should be running again. Further updates at 18:45 UTC.
Posted 9 months ago. Dec 04, 2018 - 18:11 UTC
Investigating
The application is down for roughly 25% of our customer base. We're investigating. Further updates at 18:15 UTC.
Posted 9 months ago. Dec 04, 2018 - 17:43 UTC
This incident affected: Data Processing, Email Sending, and Management Interface.