Between 14:00 and 14:20 UTC on 2018-10-30, there was a pause in the processing of incoming data for a subset of our customers. No data was lost during that time; the requests that took place during that window were processed with a small delay after 14:20 UTC. A very small number of requests were stored in a corrupted copy of our database. These requests were recovered and processed later that day, at about 16:50 UTC.
Timeline
- 2018-10-30 14:00 UTC: A single database shard crashes due to a host hardware error and is restarted. The Customer.io SRE team is notified by our alerting system, and the on-call SRE responds at 14:01.
- 2018-10-30 14:03 UTC: The database shard is back online, and the on-call SRE inspects it to verify that services are running and no data was affected. We notice that our ingress service, which is responsible for ingesting inbound data from our edge API servers, is not running. After inspecting the logs, we determine that data corruption on disk is preventing the service from starting and processing incoming data.
- 2018-10-30 14:13 UTC: We publish a service status update to notify our customers of the issue.
- 2018-10-30 14:19 UTC: After discussion with the backend team, we decide to restart the service in a clean state so that incoming data can be processed again. The corrupted data is set aside to be inspected and restored once we determine the nature of the corruption.
- 2018-10-30 14:29 UTC: We confirm that data processing has restarted and everything is working as expected. We update our service status to notify our customers.
- 2018-10-30 14:30 UTC: The backend team starts working to determine the nature of the corruption and to find a way to process the requests that were stored on disk prior to the crash.
- 2018-10-30 16:50 UTC: The backend team restores the corrupted data and queues it for processing. The incident is resolved.
Root Cause
A host hardware issue with our data center provider caused the server crash. This unexpected restart led to on-disk data corruption for our ingress service. Upon restarting the service, there was no way to process the corrupted data, so processing was stopped.
Initially, the corrupted data was copied and removed from the disk so the ingress service could be restarted in a clean state and resume data processing. The corrupted data was then examined, the nature of the corruption was determined, and all data was recovered and sent for processing with a small delay.
Post-Incident Remediation
The SRE and backend teams worked together to determine the exact causes that led to the corruption of data. Two bugs were identified and fixed:
- A race condition in our service startup scripts that can allow a service to start before all filesystems are successfully mounted on the server.
- LevelDB's default behavior when calling Put() is not to sync the data to disk. This can cause data loss in the case of a server or OS crash.
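As an illustration of the first fix, on a systemd-based host a service can be ordered after its data filesystem so it never starts before the mount succeeds. This is a sketch under assumptions: the unit name, binary path, and data directory below are hypothetical, not our actual configuration.

```ini
# /etc/systemd/system/ingress.service (hypothetical unit)
[Unit]
Description=Ingress data ingestion service
# Do not start until /var/lib/ingress is successfully mounted;
# this closes the race between service startup and filesystem mounts.
RequiresMountsFor=/var/lib/ingress
After=network.target

[Service]
ExecStart=/usr/local/bin/ingress --data-dir /var/lib/ingress
Restart=on-failure
```

RequiresMountsFor pulls in and orders the unit after the mount units for the given path, which is the systemd-native alternative to checking mounts inside a startup script.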