Between 2018-12-04 17:27 UTC and 2018-12-04 18:12 UTC, processing of incoming data was paused for approximately 25% of our customers. For affected customers, the Customer.io app was unavailable, no data was processed, and no messages were delivered. For the remainder of our customers there was no service impact. No data was lost during the outage, and any queued data was processed with a small delay after 18:12 UTC.
A kernel bug was triggered by mysql on the affected server. This bug caused a mysql deadlock and required us to restart the affected server.
We first attempted to restart mysql on the crashed server. When this failed, we brought the server down completely and restarted it. On restart, mysql was able to recover and data began to process again. We did not lose data during the crash; processing was only delayed pending the recovery.
We're planning an upgrade to the kernel of our database servers given the bug encountered today. In addition, we're re-examining our timelines for failover to the hot standby.