During a scheduled maintenance of our infrastructure, we upgraded our etcd cluster. While the upgrade was successful some of the sanity tests that followed caused a resource exhaustion on our SQL servers (reached max_connections
limit). This affected most of our services and for a brief period of time, our web interface was inaccessible. API and data collection was not affected. Our team quickly responded and the services were fully restored at about 09:50 UTC.
Incident timeline: * 09:20 UTC: Upgrade started * 09:30 UTC: Upgrade finished successfully, sanity tests performed. * 09:32 UTC: Above tests caused resource exhaustion on our SQL servers. Incident acknowledged and our SRE team started working on restoring the services. * 09:40 UTC: Services restored for most of our customers. * 09:50 UTC: Services fully restored.