We have an 8.5 TB installation of FoundationDB that we use to store data for Customer.io accounts.
We started noticing sluggish performance two weeks ago during peak hours. To address this, we added more capacity to the cluster and deleted unused data by "vacuuming" the cluster.
While these two processes ran, the cluster's performance decreased to the point that we were unable to keep up with peak load. After several days of trying different strategies, we determined we could not increase capacity, delete old data, and keep up with current load all at the same time.
A newer version of FoundationDB had fixes that improved the performance of these processes, so we attempted the upgrade yesterday evening during the day's low-traffic period, at ~12:30 AM EDT. After bringing down the cluster, we upgraded FoundationDB and attempted to bring the cluster back up. An error prevented the cluster from returning to an "operational" state.
Which services went down?
With the cluster unable to come back up, Email Sending, Data Processing, the Management Interface, and Unsubscribe functionality were all inoperable.
How long did the outage last?
The outage lasted roughly 11 hours, from 12:31 AM EDT Monday night until ~11:30 AM EDT Tuesday morning. Once the database cluster came up, we returned 90% of accounts to normal status within an hour.
After realizing we were unable to bring the cluster up ourselves, we contacted FoundationDB. We have a support contract that gives us coverage during business hours (EDT). After the acquisition by Apple, we were uncertain how responsive the FoundationDB team would be. To their credit, we received excellent help from the team.
After we provided logs and diagnostic information, FoundationDB engineers identified the issue as corruption in the data used to orchestrate the cluster (not your data). They were swift in providing instructions on what to disable in order to bring the cluster back up in a healthy state. To do this, we had to temporarily turn off some important, but non-critical, features.
We need to perform additional maintenance on the cluster before turning these important, but non-critical, features like failover and self-healing back on. Over the next few weeks we'll be adding newer, larger machines into the cluster and cycling out existing nodes. Cycling out a node clears out its data and creates a fresh, uncorrupted set of FoundationDB data. We'll be doing this for every node currently in the cluster.
The FoundationDB acquisition by Apple meant that, although we like the technology, we can't continue to use the database long term. After the acquisition was announced, we immediately started planning to migrate away. At that point we had already begun doubling the engineering team and hiring to fill "Senior Scaling Engineer" positions.
We have one Senior Scaling Engineer starting this week. Another Senior Scaling Engineer starts in a month.
We've had good results testing a strategy to replace FoundationDB with PostgreSQL, and had begun implementing it prior to the outage. We've contracted with PostgreSQL experts to help us address architecture and scaling concerns, and to establish monitoring and best practices for maintaining the service.
Rather than using a distributed datastore to house all customer data, the new strategy gives each customer their own PostgreSQL database. This allows us to address load for each customer individually. While there's larger operational overhead, it gives us the flexibility to increase performance for a specific customer by moving them to dedicated servers if needed.
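To give a rough idea of what "a database per customer" implies, here's a minimal sketch of the routing layer such a design needs. All names, hosts, and connection details below are hypothetical illustrations, not our actual implementation:

```python
# Hypothetical sketch: resolving an account to its own PostgreSQL database.
# Hosts and naming conventions are made up for illustration.

# A directory mapping each account to the server hosting its database.
# In practice this would live in a small shared metadata store.
ACCOUNT_DIRECTORY = {
    1001: "pg-shared-01.internal",
    1002: "pg-shared-01.internal",
    2001: "pg-dedicated-07.internal",  # a heavy customer on dedicated hardware
}

def dsn_for_account(account_id: int) -> str:
    """Resolve an account ID to the connection URI of its database."""
    host = ACCOUNT_DIRECTORY[account_id]
    return f"postgresql://{host}:5432/customer_{account_id}"

def move_account(account_id: int, new_host: str) -> None:
    """Re-point an account at a bigger server.

    The data copy itself is not shown; once it completes, routing a
    customer to dedicated hardware is just a directory update.
    """
    ACCOUNT_DIRECTORY[account_id] = new_host
```

The appeal of this shape is that each customer's database can be scaled, tuned, or relocated independently, at the cost of operating many more databases.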
If you have questions or concerns, feel free to reach out. My email is firstname.lastname@example.org. However, I'd ask that you email email@example.com and address it to me, so that the rest of the team can stay in the loop.