Issues accessing the application

Incident Report for Customer.io Status

Postmortem

Here's a timeline of the outage that Customer.io customers experienced on Thursday, August 13th.

14:30 EDT - A motherboard fails

At approximately 14:30 Eastern, one of the nodes in a 12-node database cluster became unresponsive. Our datacenter staff took the node offline, identified a failed motherboard as the root cause, and began a repair operation.

After losing the node, the database cluster began healing itself by moving data to the remaining nodes to compensate for the missing one.

Due to the load from new data being inserted, the database struggled to heal and serve full traffic at the same time. Customers reported symptoms such as being unable to log in or seeing timeouts in the application.

15:30 EDT - Ops team intervenes

After allowing the normal healing process to run, our ops team decided to intervene and focused on:

  1. Keeping up with traffic / requests
  2. Healing the datastore
  3. Determining how best to recover from the lost node

At this point, service was intermittent because the datastore healing was affecting availability.

16:45 EDT - Motherboard replaced, server is back online

Datacenter staff brought the failed node back online. However, even with the node restored, the normal healing process wasn't working. The operational steps we had been using to keep the service available also stopped working, and the cluster was unable to make further progress healing.

17:00 - 19:30 EDT - Automated healing process is failing

Even with manual intervention, we were unable to get the cluster's healing process to complete.

19:30 EDT - Root cause identified

After several hours of examining the logs, we identified that nodes attempting to come online were running out of memory and crashing. We increased the available memory for each node, which resolved the issue during boot.

19:40 EDT - Cluster is operational

The cluster was back up but not fully healed. To avoid risking further outages, we let it finish healing before allowing new traffic to hit it.

21:30 EDT - Cluster is up to date

We were then able to process the queues. Most accounts were back to real-time processing in about 30 minutes. Larger accounts with big backlogs took up to 90 minutes.

23:00 EDT - Cluster fully repaired

We continued to monitor the cluster. Everything was back to normal and performant at this time.

Summary Thoughts

Given Customer.io's typical usage pattern (a constant, high-volume stream of data collected every day), the performance of our data stores under stress, as well as their recovery from failure, is critical.

Load typically peaks in the afternoon (Eastern time) during the work week. A single node failure should not make the cluster inoperable during the repair process; however, in this case the failure coincided with peak load and caused a service interruption and outage.

One of the general challenges with data stores for a multi-tenant system like Customer.io is that recovery and repair can take a long time due to the volume of data involved.

Our back end team is working on strategies that reduce the impact of service outages across the platform as well as shorten the time to recovery in the event of an outage. We'll be sharing more specifics about these changes to infrastructure soon.

Thanks for the continued support and understanding. Feel free to reach out if you have any questions - colin @ customer.io

Posted Aug 17, 2015 - 20:28 UTC

Resolved

We've been monitoring the database overnight and everything is operational. The event backlog has also fully cleared. We'll be writing a postmortem soon.
Posted Aug 14, 2015 - 16:00 UTC

Monitoring

After processing a backlog of events, most accounts should be operating in real-time again. High volume accounts may require another 60-90 minutes to catch up. We'll be monitoring the progress and will follow up with a postmortem.
Posted Aug 14, 2015 - 01:11 UTC

Update

It’s now possible to log in to the application. We're still doing work to make the cluster fully operational. Currently you'll be able to make changes to your account, but event processing is still queued. This means new event processing is backlogged and no emails are sending, but nothing is lost. It will only be delayed until full service is restored.
Posted Aug 13, 2015 - 23:18 UTC

Update

The faulty node (1 out of 12) has had a motherboard replacement and is back online. However, we're still working to bring the cluster back online. All events are queuing (including emails) and nothing will be lost, but processing will be delayed until we're back online.
Posted Aug 13, 2015 - 22:27 UTC

Identified

We've tracked down the issue to a server fault on a node in our main database cluster. As the cluster attempted to recover, bad things happened and it instead became unavailable. We're working to bring everything back online and the system is currently unstable and intermittently inaccessible.
Posted Aug 13, 2015 - 19:24 UTC

Investigating

We're currently experiencing trouble accessing one of our servers, so logging in is not possible. Our developers are already working on the issue.
Posted Aug 13, 2015 - 18:44 UTC