Message sending is paused for some customers
Incident Report for Customer.io
Postmortem

Overview

On 2018-10-22 at 14:46:23 UTC, we suffered a 162-minute major incident affecting a single database shard. The root cause was a bug in our account cleanup process that left behind jobs our queue workers could never complete. This caused resource saturation and memory exhaustion on a single database shard, resulting in delayed processing of outbound messages.

Customer data was not lost or compromised during the incident, only delayed.

Incident Timeline

  • 2018-10-22 14:46 UTC: An SRE notices unusually high memory usage on a database server.
  • 2018-10-22 14:52 UTC: We restart the service that is consuming memory to determine whether the issue is a fast or slow leak.
  • 2018-10-22 15:37 UTC: Incident is created on our status page once we identify that the issue is severe.
  • 2018-10-22 16:24 UTC: Our SRE team restarts the affected database server with increased memory to help process backlogged work.
  • 2018-10-22 16:41 UTC: We restart one of our worker services that was persisting invalid jobs.
  • 2018-10-22 16:56 UTC: The affected database shard is back to processing messages in real-time. We still have a backlog of delayed messages from during the incident to clear.
  • 2018-10-22 17:28 UTC: The processing backlog clears, resolving the incident.
  • 2018-10-22 17:34 UTC: Our engineering team deploys a configuration change that prevents one cause of the incident from recurring.
  • 2018-10-26 15:08 UTC: We deploy a bug fix for the other root cause of this incident, preventing account cleanup from leaving behind invalid jobs in the future.

Root Cause

The processing backlog was caused by shutting down a service responsible for fetching unprocessed work. This shutdown was necessary because the service had exhausted available memory. The memory exhaustion was the result of our worker config setting too low a value for the maximum gRPC message size, combined with an increase in work due to a bug in our account cleanup process that left inbound data processing enabled for deleted accounts.
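To make the failure mode concrete, here is a minimal, hypothetical sketch (the names, sizes, and queue mechanics are illustrative, not our actual worker code) of why a too-small message-size limit produces a permanent backlog: any job whose payload exceeds the limit is rejected and requeued, so it can never complete and simply recirculates on every pass.

```python
from collections import deque

# Illustrative default: gRPC's commonly cited 4 MiB receive limit.
MAX_MESSAGE_BYTES = 4 * 1024 * 1024

def process_queue(jobs, max_message_bytes=MAX_MESSAGE_BYTES, max_passes=3):
    """Drain the queue; jobs larger than the limit are requeued, never completed."""
    queue = deque(jobs)
    completed = []
    for _ in range(max_passes):
        for _ in range(len(queue)):
            job = queue.popleft()
            if job["size"] > max_message_bytes:
                # Impossible to complete at this limit: the job is stuck
                # and will be retried on every pass, holding resources.
                queue.append(job)
            else:
                completed.append(job["id"])
    return completed, list(queue)

done, stuck = process_queue([
    {"id": "a", "size": 1024},                # completes normally
    {"id": "b", "size": 8 * 1024 * 1024},     # exceeds the limit: stuck
])
```

In this sketch, job "b" survives every pass no matter how many retries occur; raising the configured limit (as we did in the resolution) is the only way to clear it.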

Resolution

Invalid work was manually cleared from our worker queues, and the worker config was updated to accept larger gRPC messages. Additionally, the affected database server was temporarily scaled up with more memory to process the backlog.

Post Incident Remediation

The bug in account cleanup was identified and fixed so that deleted accounts have their ingress queues disabled automatically. This prevents future incidents from occurring due to deleted accounts with leftover work.
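The shape of the fix can be sketched as follows. This is a hypothetical illustration (the class and field names are invented for clarity, not our actual codebase): deleting an account now disables its ingress queue in the same step, so no new inbound work can accumulate for a deleted account.

```python
class Account:
    """Minimal stand-in for an account with an inbound-data queue."""

    def __init__(self, account_id):
        self.account_id = account_id
        self.deleted = False
        self.ingress_enabled = True
        self.ingress_queue = []

    def enqueue(self, event):
        """Accept inbound work only while ingress is enabled."""
        if not self.ingress_enabled:
            return False  # dropped: deleted accounts accept no new work
        self.ingress_queue.append(event)
        return True

def cleanup_account(account):
    """Delete an account and disable its ingress queue atomically.

    Disabling ingress here is the step the original bug omitted,
    which left inbound processing running for deleted accounts.
    """
    account.ingress_enabled = False
    account.ingress_queue.clear()
    account.deleted = True

acct = Account("c123")
acct.enqueue({"type": "page_view"})
cleanup_account(acct)
acct.enqueue({"type": "page_view"})  # rejected: no leftover work accrues
```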

Posted Nov 17, 2018 - 00:57 UTC

Resolved
The backlog has been processed. Back to speedy delivery for awesome messages. 😊
Posted Oct 22, 2018 - 17:28 UTC
Monitoring
New messages are being sent promptly again and we're processing the backlog.

We'll follow up as soon as all the messages have been sent.
Posted Oct 22, 2018 - 16:56 UTC
Identified
We’re continuing to investigate the issue delaying message delivery for some accounts. We’re also deploying additional resources to assist in processing the backlog of messages.

We will provide another update by 16:40 UTC.
Posted Oct 22, 2018 - 16:12 UTC
Investigating
Some of our accounts are currently not sending messages. Data processing is unaffected and our engineering team is on the case.

We will provide another update by 16:06 UTC.
Posted Oct 22, 2018 - 15:37 UTC
This incident affected: Email Sending.