Between 2018-12-21 05:29 UTC and 2018-12-21 10:08 UTC, processing of incoming data was paused for approximately 25% of our customers. For affected customers, the Customer.io app was unavailable, no data was processed, and no messages were delivered. For the remainder of our customers there was no service impact. No data was lost during the outage; queued data was processed with a small delay after 10:08 UTC, and all backlogged data was processed by 11:05 UTC.
A kernel bug was triggered by MySQL on the affected server. This bug caused MySQL to lock up and required us to restart the affected server.
We first attempted to restart MySQL on the crashed server. When this failed, we brought the server down completely and restarted it. On restart, MySQL was able to recover and data began to process again. We did not lose data during the crash; processing was delayed pending the recovery.
We encountered this kernel bug previously on a different database server and upgraded the kernel on all our database servers in response. Given the repeat crash seen in this incident, we've improved our response plans for handling this class of database server failure. We're also investigating alternatives such as a vendor-specific kernel and rolling back to an earlier kernel version. We'll test and evaluate these options while continuing to monitor for and respond to failures of this type.