Incident Summary
On December 8th, 2025, beginning at 17:38 UTC, some customers experienced delays in data processing and message delivery. Normal functionality was fully restored at 18:19 UTC, for a total duration of 41 minutes. No data was lost.
A failure during the startup of an internal processing service prevented it from becoming fully operational, leading to reduced throughput and increased retry activity in upstream components.
Root Cause
During startup, one of our processing services loads information about message queues before beginning normal operation. A queue state left over from a previous configuration was not one the service expected; this step failed, and the service entered a restart loop without ever completing initialization.
Because this service is responsible for handing off work to downstream processors, its unavailability resulted in a drop in throughput and a rise in retry traffic. The elevated retries added load to our underlying data layer and contributed to further delays.
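For illustration only, the sketch below shows how a startup check of this shape can turn a single stale queue record into a crash loop. The actual service, queue model, and names here (load_queue_metadata, EXPECTED_STATES, the "migrating" state) are hypothetical assumptions; the report does not describe the real implementation.

```python
# Illustrative sketch of the failure mode, not the service's actual code.
import logging
import sys

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("processor-startup")

EXPECTED_STATES = {"active", "draining"}  # assumed set of valid queue states


def load_queue_metadata():
    """Stand-in for reading queue records from the data layer."""
    return [
        {"name": "orders", "state": "active"},
        {"name": "legacy-exports", "state": "migrating"},  # stale leftover record
    ]


def initialize():
    for queue in load_queue_metadata():
        if queue["state"] not in EXPECTED_STATES:
            # Treating any unexpected state as fatal means startup aborts
            # here on every attempt, producing the restart loop described above.
            raise RuntimeError(
                f"unknown state {queue['state']!r} for queue {queue['name']!r}"
            )
    log.info("initialization complete; accepting work")


if __name__ == "__main__":
    try:
        initialize()
    except RuntimeError as exc:
        log.error("startup failed: %s", exc)
        sys.exit(1)  # the supervisor restarts the process, and the cycle repeats
```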
Resolution and Recovery
Engineers identified the failing service, corrected the underlying queue state, and restored the service to full operation. Once the service stabilized, normal processing resumed and retry volumes returned to expected levels. We continued to monitor the system to confirm full recovery.
Corrective and Preventative Measures
To prevent recurrence, the team is improving validation during service startup to better handle unexpected queue conditions, refining deployment procedures to detect stalled services sooner, and enhancing monitoring for repeated restart patterns. These improvements are being incorporated into ongoing reliability work.
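As an illustration of the startup hardening described above, the following sketch validates each queue record individually and sets aside unexpected states for operator review instead of aborting initialization, while logging a count that monitoring could alert on. It is a minimal sketch under the same assumed names (load_queue_metadata, EXPECTED_STATES), not the team's actual implementation.

```python
# Illustrative sketch of a more tolerant startup path, not the actual fix.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("processor-startup")

EXPECTED_STATES = {"active", "draining"}  # assumed set of valid queue states


def load_queue_metadata():
    """Stand-in for reading queue records from the data layer."""
    return [
        {"name": "orders", "state": "active"},
        {"name": "legacy-exports", "state": "migrating"},  # stale leftover record
    ]


def initialize():
    healthy, quarantined = [], []
    for queue in load_queue_metadata():
        if queue["state"] in EXPECTED_STATES:
            healthy.append(queue)
        else:
            # An unexpected state is logged and quarantined rather than
            # treated as fatal, so one bad record no longer blocks startup.
            quarantined.append(queue)
            log.warning(
                "quarantining queue %r with unknown state %r",
                queue["name"], queue["state"],
            )
    # A summary count like this gives monitoring something concrete to alert on.
    log.info("startup complete: %d healthy, %d quarantined",
             len(healthy), len(quarantined))
    return healthy


if __name__ == "__main__":
    initialize()
```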
We apologize for any disruption this caused.