On 2018-10-22 at 14:46:23 UTC, we suffered a 162-minute major incident affecting a single database shard. The root cause was a bug in our account cleanup process that left behind jobs our queue workers could never complete. These unprocessable jobs caused resource saturation on the affected shard, resulting in delayed processing of outbound messages.
Customer data was not lost or compromised during the incident; processing was only delayed.
The processing backlog occurred because we shut down the service responsible for fetching unprocessed work. This was necessary because the service was exhausting available memory. The memory exhaustion was the result of two compounding factors: our worker config set too low a limit for gRPC message size, and a bug in our account cleanup process left inbound data processing enabled for deleted accounts, increasing the volume of work.
Invalid work was manually cleared from our worker queues, and the worker config was updated to accept larger gRPC messages. Additionally, the affected database server was temporarily scaled up with more memory to process the backlog.
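For illustration, here is a minimal sketch of this kind of config change using gRPC-Go; the 64 MB limit and the dial target are assumptions for the example, not our production values. By default, gRPC-Go clients cap received messages at 4 MB, and anything larger fails with a RESOURCE_EXHAUSTED error rather than being delivered, which can strand oversized work items.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	// gRPC-Go clients default to a 4 MB receive limit; messages larger
	// than this fail with RESOURCE_EXHAUSTED instead of being delivered.
	const maxMsgSize = 64 * 1024 * 1024 // 64 MB; illustrative value only

	conn, err := grpc.Dial(
		"workers.internal:50051", // hypothetical target
		grpc.WithInsecure(),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(maxMsgSize),
			grpc.MaxCallSendMsgSize(maxMsgSize),
		),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```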
The bug in account cleanup was identified and fixed so that deleted accounts have their ingress queues disabled automatically. This prevents future incidents caused by deleted accounts with leftover work.
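As a rough sketch of the corrected ordering (all types and names below are hypothetical stand-ins for internal services, since the actual cleanup code is not public): ingress is disabled before the account record is deleted, so no new inbound work can be enqueued once deletion begins.

```go
package cleanup

import (
	"context"
	"fmt"
)

// QueueAdmin and AccountStore are hypothetical interfaces standing in
// for internal queue and account services.
type QueueAdmin interface {
	DisableIngress(ctx context.Context, accountID string) error
}

type AccountStore interface {
	Delete(ctx context.Context, accountID string) error
}

// DeleteAccount disables the account's ingress queue before deleting
// the account, so no leftover inbound work can accumulate afterwards.
func DeleteAccount(ctx context.Context, q QueueAdmin, s AccountStore, accountID string) error {
	if err := q.DisableIngress(ctx, accountID); err != nil {
		return fmt.Errorf("disable ingress for %s: %w", accountID, err)
	}
	if err := s.Delete(ctx, accountID); err != nil {
		return fmt.Errorf("delete account %s: %w", accountID, err)
	}
	return nil
}
```

The ordering matters: disabling ingress first means a failure partway through leaves the account in a safe state (no new work accepted) rather than a deleted account that still receives inbound data.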