On October 30th, 2025, between 17:12 UTC and 22:10 UTC, some customers in our US data center may have experienced delays in data processing and message sending, as well as user interface errors. This issue did not affect data ingestion, and no data was lost.
A kernel-level issue caused filesystem corruption on a database in our cluster, which resulted in a database crash. Using internal recovery procedures, we restored service and confirmed full data integrity.
A filesystem-related fault on one of the production databases that stores customer journey information caused the service to fail when reading certain data.
Under normal conditions, a replica database would maintain continuity of service. In this instance, however, the database experienced replication lag, which prevented an immediate failover and resulted in a longer recovery period.
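To illustrate why replication lag can block an otherwise automatic failover, here is a minimal sketch. The names and the threshold are hypothetical, not our actual tooling: a failover controller typically refuses to promote a replica whose lag exceeds a safety limit, because promoting it would discard writes that had not yet replicated.

```python
# Illustrative sketch only: hypothetical names and threshold,
# not our production failover system.

MAX_SAFE_LAG_SECONDS = 5.0  # hypothetical safety threshold


def can_fail_over(replica_lag_seconds: float) -> bool:
    """Return True only if the replica is close enough to the primary
    that promoting it would not lose recent, un-replicated writes."""
    return replica_lag_seconds <= MAX_SAFE_LAG_SECONDS


# A healthy replica can be promoted; a lagging one blocks failover,
# forcing a slower manual recovery path instead.
healthy = can_fail_over(0.8)
lagging = can_fail_over(120.0)
```

In this sketch, the lagging replica fails the safety check, so the controller falls back to a longer recovery path rather than risking data loss, which mirrors the extended recovery window described above.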
Engineers repaired and validated the affected database, ensuring data integrity throughout the recovery process. Using recent copies of the data, the team restored the impacted areas, verified system health, and restarted dependent services to confirm normal operation. Following the restoration, the replica database was rebuilt.
To ensure greater resilience, the team is enhancing database monitoring to detect early signs of filesystem-related faults, updating database recovery strategies based on learnings from this incident, and increasing the frequency of recovery drills to validate the new procedures. These measures are being tracked and prioritized within our internal development process.
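As one example of the kind of early-warning check such monitoring can include, the sketch below shows a periodic write-read-verify probe against a database volume. This is a hypothetical illustration, not our production monitoring: the function name and approach are assumptions, and a real system would also watch kernel logs and filesystem error counters.

```python
# Illustrative sketch only: a hypothetical probe, not our actual monitoring.
# The probe writes a small random payload to the monitored volume, reads it
# back, and compares checksums; a mismatch or I/O error is an early signal
# of a possible filesystem-related fault.

import hashlib
import os
import tempfile


def probe_volume(mount_path: str) -> bool:
    """Write, sync, and read back a payload on `mount_path`.
    Returns False on an I/O error or checksum mismatch."""
    payload = os.urandom(4096)
    expected = hashlib.sha256(payload).hexdigest()
    try:
        with tempfile.NamedTemporaryFile(dir=mount_path, delete=True) as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the data through to the filesystem
            f.seek(0)
            actual = hashlib.sha256(f.read()).hexdigest()
        return actual == expected
    except OSError:
        return False
```

Run on a schedule, a failing probe can page an on-call engineer before corruption spreads far enough to crash the database, which is the early-detection goal named above.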