Extended outage since 11:41 pm EDT
Incident Report for Customer.io

What happened:

We have an 8.5 TB installation of FoundationDB that we use to store the following data for Customer.io accounts:

  • Customer Attributes
  • Customer Activity Summaries
  • Customer Segment Memberships
  • Email Delivery Information
  • Segment Configuration

We started to notice sluggish performance 2 weeks ago during peak hours. To address this, we added more capacity to the cluster and also deleted unused data by "vacuuming" the cluster.

While these two processes were running, the cluster's performance decreased to the point that we were unable to keep up with peak load. After several days of trying different strategies, we determined that we could not increase capacity, delete old data, and keep up with current load at the same time.

A newer version of FoundationDB included fixes that improved the performance of these processes, so we attempted the upgrade yesterday evening during the low-traffic period of our day, at ~12:30 am EDT. After bringing down the cluster, we upgraded FoundationDB and attempted to bring the cluster back up.

An error prevented the cluster from returning to "operational".

Impact to Customer.io Services

Which services went down?

Because the cluster could not be brought back up, Email Sending, Data Processing, the Management Interface, and Unsubscribe functionality were inoperable.

Which services stayed up?

Data collection and the JavaScript tracker did not have downtime. We queued events during the outage for processing when the service returned.

The outage lasted approximately 12 hours, from 12:31 AM EDT Monday night until ~11:30 AM EDT Tuesday morning. When the database cluster came back up, we returned 90% of accounts to normal status within an hour.

How we resolved the issue

After realizing we were unable to bring the cluster up ourselves, we contacted FoundationDB. We have a support contract that gives us coverage during business hours EDT. Following the acquisition by Apple, we were uncertain how responsive the FoundationDB team would be; to their credit, we received excellent help from the team.

After we provided logs and diagnostic information, FoundationDB engineers identified the issue as corruption in the data used to orchestrate the cluster (not your data). They were swift in providing instructions on what to disable in order to bring the cluster back up in a healthy state. Doing so required temporarily turning off some important, but non-critical, features.

Short term fixes

We need to perform some additional maintenance on the cluster before turning the important, but non-critical, features like failover and self-healing back on. To do this, over the next few weeks we'll be adding newer, larger machines into the cluster and cycling out the existing nodes. Cycling out a node clears its data and creates a fresh, uncorrupted set of FoundationDB data. We'll be doing this for every node currently in the cluster.
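
The cycling process can be pictured as a rolling replacement, one node at a time, so the cluster never loses capacity mid-cycle. Below is an illustrative simulation of that ordering only; the node names and one-at-a-time sequencing are our assumptions for the sketch, not actual FoundationDB commands:

```python
# Sketch: rolling node replacement. For each old node we first join a
# fresh node (data rebalances onto it), then drain and remove the old
# node along with its possibly corrupted on-disk state.

def cycle_nodes(old_nodes, new_nodes):
    """Return (steps, final_cluster): the ordered (action, node) steps
    and the cluster membership after every old node is cycled out."""
    steps = []
    cluster = list(old_nodes)
    for fresh, stale in zip(new_nodes, old_nodes):
        steps.append(("add", fresh))     # fresh node joins; data rebalances
        cluster.append(fresh)
        steps.append(("drain", stale))   # move data off the old node
        steps.append(("remove", stale))  # old node leaves with its old files
        cluster.remove(stale)
    return steps, cluster

steps, final = cycle_nodes(["node-a", "node-b"], ["node-c", "node-d"])
# final cluster contains only the fresh, uncorrupted nodes
```

Because each removal waits until a replacement has joined, total capacity never dips below the starting size during the maintenance window.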

Longer term fixes

The FoundationDB acquisition by Apple meant that, although we like the technology, we can't continue to use the database long term. As soon as the acquisition was announced, we started planning to migrate away. At that point we had already begun doubling the engineering team and hiring to fill "Senior Scaling Engineer" positions.

We have one Senior Scaling Engineer starting this week. Another Senior Scaling Engineer starts in a month.

We've had good results testing a strategy to replace FoundationDB with PostgreSQL and had begun to implement that strategy prior to the outage. We've contracted with PostgreSQL experts to help us address any architecture and scaling concerns as well as to establish monitoring and best practices for maintaining the service.

Rather than using a distributed datastore to house all customer data, the new strategy gives each customer their own PostgreSQL database. This lets us address load for each customer individually; while the operational overhead is larger, it gives us the flexibility to increase performance for a specific customer by moving them to dedicated servers if needed.
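
As a minimal sketch of the database-per-customer routing idea (the hostnames, account ids, and mapping table here are illustrative assumptions, not our actual implementation):

```python
# Sketch: one PostgreSQL database per customer account. A router maps
# an account id to a connection string, so moving a hot account to a
# dedicated server is a single mapping change rather than a data model
# change.

DEFAULT_HOST = "pg-shared.internal"

# Accounts that have outgrown the shared server get dedicated hosts.
DEDICATED = {
    "acct_42": "pg-dedicated-1.internal",
}

def dsn_for(account_id: str) -> str:
    """Build the connection URI for one account's private database."""
    host = DEDICATED.get(account_id, DEFAULT_HOST)
    return f"postgresql://{host}:5432/customer_{account_id}"

print(dsn_for("acct_7"))   # routed to the shared host
print(dsn_for("acct_42"))  # routed to its dedicated host
```

The design choice this illustrates: isolation per customer means one account's load can't degrade another's database, at the cost of managing many more databases.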

What are your questions?

If you have questions or concerns, feel free to reach out. My email is colin@customer.io. However, I'd ask that you email win@customer.io and address it to me so that the rest of the team can stay in the loop.

Colin Nederkoorn

Posted over 2 years ago. Apr 28, 2015 - 23:45 UTC

We've been monitoring the service for the past several hours and are ready to call things back to normal. We'll follow up with a post-mortem later today.
Posted Apr 28, 2015 - 20:46 UTC
We're still working through a backlog of data on larger accounts. We're adding more capacity to process data faster as we approach the peak time of the day for new data.
Posted Apr 28, 2015 - 17:08 UTC
You should now be able to log in to https://fly.customer.io. We're still working through the backlog of data.
Posted Apr 28, 2015 - 16:15 UTC
We've re-enabled processing. Emails are being sent. We're processing data that was queued during the downtime. The 13 million attribute changes and 18 million events are being processed at a rate of about 5,000 per second. Those queues will take approximately 1 hour.
Posted Apr 28, 2015 - 15:43 UTC
We’re able to read from & write to the database again. Forward progress, but many more steps before the service is fully operational.
Posted Apr 28, 2015 - 15:26 UTC
We've received advice from our database vendor and are currently acting on that advice to attempt to restore the database.
Posted Apr 28, 2015 - 14:36 UTC
We've been unable to bring up one of our data stores after performing a routine upgrade. We've contacted the vendor for assistance.
Posted Apr 28, 2015 - 09:10 UTC
We're working through issues related to the upgrade of one of the data stores. This is causing delays in returning services to full operation.
Posted Apr 28, 2015 - 06:37 UTC
We're performing several major upgrades tonight and will be offline for roughly 15-30 minutes.

This will cause delays in data processing, email delivery, and availability of our user interface.
Posted Apr 28, 2015 - 03:41 UTC