On Tuesday, August 28th at 21:15 UTC, we experienced an incident with our backend data processing services. Our engineering team was alerted and, within 15 minutes of the first alert, began investigating why the service had crashed.
Our team determined that an edge-case bug in our account cleanup process allowed a canceled account to be deleted before its in-progress data processing jobs had been canceled or completed. The presence of in-progress jobs for this deleted account caused the data processing service to halt.
Once our team identified the issue, we manually removed the jobs associated with the deleted account and brought the service back online. We confirmed service restoration at 21:53 UTC.
During the incident, in-app data processing of segment builds and data updates was interrupted, and we saw an elevated rate of 500 errors from our app due to the crashed data processing service. Our data collection API at track.customer.io was unaffected and all API calls were recorded.
To remediate this edge case and prevent future incidents, our engineers deployed changes to the account cleanup process so that any in-progress jobs are canceled before an account is fully deleted.
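
The corrected ordering can be illustrated with a minimal sketch. This is not the production code: the `Store`, `Job`, and `JobState` types and the `delete_account` function are hypothetical names chosen for illustration, and the in-memory store stands in for whatever storage the real cleanup process uses. It only shows the idea of canceling in-progress jobs first and removing the account last.

```python
from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):
    IN_PROGRESS = "in_progress"
    CANCELED = "canceled"
    COMPLETED = "completed"


@dataclass
class Job:
    job_id: int
    account_id: int
    state: JobState = JobState.IN_PROGRESS


@dataclass
class Store:
    """Hypothetical in-memory stand-in for account and job storage."""
    accounts: set = field(default_factory=set)
    jobs: list = field(default_factory=list)


def delete_account(store: Store, account_id: int) -> int:
    """Cancel the account's in-progress jobs, then delete the account.

    Returns the number of jobs that were canceled.
    """
    canceled = 0
    for job in store.jobs:
        if job.account_id == account_id and job.state is JobState.IN_PROGRESS:
            job.state = JobState.CANCELED
            canceled += 1

    # Guard: refuse to delete while any in-progress job still references
    # the account, so the processing service never sees an orphaned job.
    if any(
        j.account_id == account_id and j.state is JobState.IN_PROGRESS
        for j in store.jobs
    ):
        raise RuntimeError("in-progress jobs remain; account not deleted")

    store.accounts.discard(account_id)
    return canceled
```

In this sketch, completed jobs and jobs belonging to other accounts are left untouched; only the ordering of "cancel jobs, then delete account" is the point.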