App access issues

Incident Report for Customer.io Status

Postmortem

During scheduled maintenance of our infrastructure, we upgraded our etcd cluster. The upgrade itself succeeded, but some of the sanity tests that followed exhausted resources on our SQL servers by reaching the max_connections limit. This affected most of our services, and for a brief period our web interface was inaccessible. API and data collection were not affected. Our team responded quickly, and services were fully restored at about 09:50 UTC.

Incident timeline:
* 09:20 UTC: Upgrade started.
* 09:30 UTC: Upgrade finished successfully, sanity tests performed.
* 09:32 UTC: Above tests caused resource exhaustion on our SQL servers. Incident acknowledged and our SRE team started working on restoring the services.
* 09:40 UTC: Services restored for most of our customers.
* 09:50 UTC: Services fully restored.
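
For illustration only, here is one way post-upgrade sanity tests could be kept from reaching a database's max_connections limit: run them under a fixed connection budget. This is a minimal sketch, not the tooling used during this incident. The names ConnectionBudget, run_sanity_checks and MAX_TEST_CONNECTIONS are hypothetical, and the checks are plain callables standing in for real database queries.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical budget: chosen to sit well below the server's max_connections.
MAX_TEST_CONNECTIONS = 5


class ConnectionBudget:
    """Caps how many checks may hold a database connection at the same time."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.in_use = 0  # checks currently holding a slot
        self.peak = 0    # highest concurrency observed so far

    def __enter__(self):
        self._slots.acquire()  # blocks while the budget is fully used
        with self._lock:
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self.in_use -= 1
        self._slots.release()
        return False  # do not swallow exceptions raised by a check


def run_sanity_checks(checks, budget, workers=20):
    """Run every check in a thread pool, each one inside the shared budget."""

    def guarded(check):
        with budget:
            return check()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(guarded, checks))
```

Even with 20 worker threads, at most MAX_TEST_CONNECTIONS checks run at once, so the tests queue up instead of competing with production traffic for connections.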

Posted Jun 22, 2018 - 10:48 UTC

Resolved

This incident has been resolved.

Posted Jun 22, 2018 - 10:44 UTC

Monitoring

Everything is back to normal, no dark clouds on the horizon, we're just checking once again to make sure all services are working correctly.

We will provide a final update by 10:30 UTC.

Posted Jun 22, 2018 - 09:58 UTC

Update

We are continuing to investigate this issue.

Posted Jun 22, 2018 - 09:53 UTC

Investigating

Started: 09:32 UTC

You might currently see errors when accessing your Customer.io account or viewing pages inside the app. Our site reliability engineers are currently investigating and we'll provide another update by 10:20 UTC.

Posted Jun 22, 2018 - 09:46 UTC

This incident affected: Management Interface.