Here's a timeline of the outage Customer.io customers experienced on Thursday, August 13th.
At approximately 14:30 Eastern, one of the nodes in a 12-node database cluster became unresponsive. Our datacenter staff took the node offline, identified a failed motherboard as the root cause, and began a repair.
After losing the node, the cluster began healing itself by redistributing the missing node's data to the remaining nodes.
Under the load of new data being inserted, the database struggled to heal and serve full traffic at the same time. Customers reported symptoms such as being unable to log in or seeing timeouts in the application.
After allowing the normal healing process to run for a while, our ops team decided to intervene and began taking operational steps to keep the service available.
At this point, service was intermittent as the healing process affected availability.
Datacenter staff brought the repaired node back online. However, even with the node back, the normal healing process wasn't working, and the operational steps we had been using to keep the service available also stopped working. The cluster was unable to make further progress healing.
Even with intervention, we were unable to get the cluster's healing process to complete.
After several hours of examining the logs, we identified that nodes were running out of memory and crashing as they tried to come online. We increased the available memory for each node, which allowed the nodes to boot successfully.
The cluster was back up but not fully healed. To avoid risking further outages, we let it finish healing before allowing new traffic to hit it.
We were then able to process the queues. Most accounts were back to real-time processing in about 30 minutes. Larger accounts with big backlogs took up to 90 minutes.
We continued to monitor the cluster. Everything was back to normal and performant at this time.
Given Customer.io's typical usage pattern (a constant stream of incoming data, with a high volume collected every day), how our data stores perform under stress and recover from failure is critical.
Load typically peaks in the afternoon Eastern during the work week. A single node failure should not make the cluster inoperable during the repair process; in this case, however, the failure coincided with peak load and caused a service interruption and outage.
One of the general challenges with data stores for a multi-tenant system like Customer.io is that recovery and repair can take a long time due to the volume of data involved.
Our back-end team is working on strategies to reduce the impact of service outages across the platform and to shorten time to recovery in the event of an outage. We'll share more specifics about these infrastructure changes soon.
Thanks for the continued support and understanding. Feel free to reach out if you have any questions - colin @ customer.io