US infrastructure delays affecting a subset of deliveries

Incident Report for Customer.io Status

Postmortem

Duration: 2h 14m, Nov 24 8:08 AM UTC - 10:22 AM UTC

Severity: P2

Impact: Reduced message processing rates across US infrastructure

What Happened

On November 24th, our message rendering service experienced degraded performance during a period of high traffic. The service, which renders your messages for delivery, encountered memory constraints that caused intermittent service restarts and slower processing rates for priority queues across our US infrastructure.

Customer Impact

  • Date: November 24, 2025
  • Affected Services: All message types, including campaigns, transactional messages, and journey workflows
  • Geographic Scope: US-based workspaces
  • Performance Impact: Message processing slowed to approximately 50% of normal capacity during peak impact
  • Duration: 2 hours 14 minutes of degraded performance
  • Data Integrity: No messages were lost; all queued messages were successfully processed

Root Cause

Our message rendering service runs on an auto-scaling infrastructure that automatically adjusts capacity based on workload. During this incident, sudden traffic spikes caused individual servers to consume memory faster than our auto-scaling could compensate. When servers reached memory limits, they restarted automatically (as designed for resilience), but these rolling restarts reduced our overall processing capacity during a time of peak demand, creating a compound effect.

Resolution

Immediate fix: We deployed updated code to our rendering service that better manages memory consumption during traffic bursts, preventing the cascade of restarts that degraded performance.

Why this works: The update implements more efficient memory allocation patterns and adds throttling mechanisms that prevent any single traffic burst from overwhelming individual servers, regardless of auto-scaling speed.
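The per-server throttling described above can be sketched as a concurrency cap: each node admits only a bounded number of render jobs at once and defers the rest back to the queue, so a burst cannot exhaust a single server's memory. This is a minimal illustrative sketch, not Customer.io's actual implementation; the class name and limit are hypothetical.

```python
import threading

class RenderThrottle:
    """Hypothetical per-node throttle: caps concurrent render jobs so a
    traffic burst cannot exhaust one server's memory."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_acquire(self) -> bool:
        # Non-blocking: if all slots are taken, the caller re-queues the
        # message rather than piling more work onto this node.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()

throttle = RenderThrottle(max_concurrent=2)
results = [throttle.try_acquire() for _ in range(3)]
print(results)  # [True, True, False] -- third job is deferred
```

The key design choice is failing fast instead of blocking: a deferred job stays safely in the durable queue (which is why no messages were lost), while the node keeps memory headroom for the jobs it has already admitted.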

What We're Doing to Prevent This

  1. Smarter Resource Management (Completed)
* Deployed code optimizations that prevent memory exhaustion during traffic spikes
* Implemented per-node workload throttling to maintain stability
  2. Improved Auto-scaling (In Progress - Q1 2026)
* Tuning our auto-scaling to be more predictive rather than reactive
* Increasing baseline capacity to handle larger bursts without scaling delays
  3. Better Early Warning (In Progress)
* Adding memory pressure alerts that trigger before critical thresholds
* Implementing graduated responses to traffic spikes (pre-scaling based on queue depth trends)
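Pre-scaling on queue depth trends, as item 3 describes, can be sketched as a simple rule: if the backlog is growing, provision capacity for the projected depth before memory pressure hits, rather than reacting after servers are saturated. The function below is a hedged illustration; the thresholds, parameters, and trend rule are assumptions, not the production autoscaler.

```python
import math

def desired_capacity(queue_depths, current_nodes, per_node_rate, window=3):
    """Hypothetical pre-scaling rule based on queue depth trend.

    queue_depths: recent backlog samples, oldest first
    per_node_rate: messages one node can process per sample interval
    """
    recent = queue_depths[-window:]
    # Average growth per interval across the window.
    growth = (recent[-1] - recent[0]) / (len(recent) - 1)
    if growth <= 0:
        return current_nodes  # backlog draining: no pre-scale needed
    # Provision for the projected depth one window ahead.
    projected = recent[-1] + growth * window
    return max(current_nodes, math.ceil(projected / per_node_rate))

# Backlog growing 400/interval: scale up ahead of the burst.
print(desired_capacity([100, 400, 900], current_nodes=5, per_node_rate=300))  # 7
# Backlog draining: hold current capacity.
print(desired_capacity([900, 400, 100], current_nodes=5, per_node_rate=300))  # 5
```

Acting on the trend rather than the instantaneous depth is what makes scaling "predictive rather than reactive": new nodes come online while the burst is still building instead of after it has peaked.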

Our Commitment

While our platform maintained data integrity throughout this incident (no messages were lost), we understand that processing delays impact your customer engagement timing. We are actively working to ensure all customers’ messages are processed and delivered as efficiently and reliably as possible.

Questions?

Your Customer Success Manager has details specific to your workspace's impact during this incident. For technical questions or to discuss our infrastructure roadmap, please reach out to your account team.

Posted Dec 11, 2025 - 16:12 UTC

Resolved

We confirm the incident has been resolved. The system is fully operational.
Posted Nov 24, 2025 - 08:51 UTC

Update

The transactional queue backlog has been processed; the priority queue backlog is still being worked through.
Posted Nov 24, 2025 - 08:45 UTC

Identified

We have identified the issue; the backlog is processing now.
Posted Nov 24, 2025 - 08:38 UTC

Investigating

Our team is aware of an issue affecting outgoing Campaign and Transactional deliveries. We are currently looking into this.
Posted Nov 24, 2025 - 08:15 UTC
This incident affected: Message Sending.