Web App Degraded Performance
Incident Report for Intercom
Postmortem

We want to provide some detail about the problem we experienced yesterday. First we'd like to apologize to all our customers for the disruption to our services. We make significant efforts to keep Intercom working well at all times, and yesterday’s events were far from the level of service that we are committed to providing.

Intercom was mostly down or impaired for over 2 hours. Starting just after 14:10 UTC, our MySQL (RDS Aurora) read-replica hosts for our main application database went to near 100% CPU utilisation. We tried a number of things to get to a stable state, including restarting the RDS Aurora cluster and rolling back many code deploys and feature flag changes. Ultimately we got control of the situation by identifying the problem queries, killing them automatically and disabling the parts of the Intercom application that were creating them. By 17:00 UTC, the error rates across all of Intercom’s services were back to normal. Around that time we identified an unnecessary association in our ORM that was causing the queries in question and removed it.
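As an illustration of what "killing them automatically" can look like, here is a minimal sketch rather than the tooling we actually ran. It assumes rows shaped like the output of MySQL's SHOW FULL PROCESSLIST (Id, Command, Time, Info); the table name big_table, the pattern and the 10-second threshold are hypothetical. It only builds KILL QUERY statements; a managed database service may require a different mechanism to execute them.

```python
import re


def select_queries_to_kill(processlist, pattern, max_seconds):
    """Return KILL QUERY statements for long-running queries matching `pattern`.

    `processlist` is an iterable of dicts using SHOW FULL PROCESSLIST column
    names. Nothing is executed here; the caller decides how to run the result.
    """
    matcher = re.compile(pattern, re.IGNORECASE)
    statements = []
    for row in processlist:
        # Skip idle connections and anything that is not an active query.
        if row.get("Command") != "Query":
            continue
        # Skip queries that have not yet run long enough to be a concern.
        if row.get("Time", 0) < max_seconds:
            continue
        sql = row.get("Info") or ""
        if matcher.search(sql):
            statements.append(f"KILL QUERY {int(row['Id'])}")
    return statements


# Hypothetical snapshot of the process list (all values are made up).
sample = [
    {"Id": 11, "Command": "Query", "Time": 42,
     "Info": "SELECT * FROM big_table WHERE owner_id = 7"},
    {"Id": 12, "Command": "Sleep", "Time": 300, "Info": None},
    {"Id": 13, "Command": "Query", "Time": 1,
     "Info": "SELECT * FROM big_table WHERE owner_id = 9"},
    {"Id": 14, "Command": "Query", "Time": 90, "Info": "SELECT 1"},
]

kill_statements = select_queries_to_kill(sample, r"FROM\s+big_table", 10)
print(kill_statements)
```

Matching on both the query text and the running time keeps the blast radius small: short-lived instances of the same query and unrelated slow queries are left alone.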

Subsequent investigations have identified the root cause as a change in the execution plan used by MySQL for the query. The query itself was quite frequent, but the execution plan change meant that it was no longer using an index, instead scanning every row in a large table. This caused a massive increase in IO and CPU utilization on the database hosts, increasing latencies for almost all our services and causing them to be effectively down. The trigger for the query execution plan change is still being established. However, it was not caused by any direct change to Intercom, such as a deployment or configuration change, and this contributed to the time taken to resolve the issue.
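To make the failure mode concrete, the sketch below flags a plan that has fallen back to a full table scan. It is an illustration under assumptions, not our monitoring code. It reads rows shaped like MySQL EXPLAIN output, where a type of ALL means a full table scan and a NULL key means no index was chosen. The table name, index name, row counts and the min_rows threshold are all hypothetical.

```python
def find_full_scans(explain_rows, min_rows=100_000):
    """Return the tables an EXPLAIN plan scans in full without using an index.

    `explain_rows` is an iterable of dicts using MySQL EXPLAIN column names
    (table, type, key, rows). Small tables are ignored via `min_rows`.
    """
    flagged = []
    for row in explain_rows:
        full_scan = row.get("type") == "ALL"   # ALL = every row is read
        no_index = row.get("key") is None      # NULL key = no index chosen
        estimated_rows = row.get("rows") or 0
        if full_scan and no_index and estimated_rows >= min_rows:
            flagged.append(row["table"])
    return flagged


# Hypothetical plans for the same query before and after the plan change.
healthy_plan = [
    {"table": "big_table", "type": "ref", "key": "index_on_owner_id", "rows": 12},
]
regressed_plan = [
    {"table": "big_table", "type": "ALL", "key": None, "rows": 48_000_000},
]

print(find_full_scans(healthy_plan))
print(find_full_scans(regressed_plan))
```

A check like this only reports what the optimizer chose at the moment EXPLAIN was run, so it would need to be run periodically against frequent queries to catch a plan change that happens without any deployment.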

We're continuing to dig into yesterday's outage, so we can learn how to improve our operational response and avoid this type of problem entirely in future. Once again we'd like to apologize to our customers for the outage and any disruption it caused. Please do get in touch with us if you have any concerns or questions about the outage.

Brian Scanlan, Engineering Manager

Posted Jan 20, 2017 - 17:16 UTC

Resolved
We have worked through the backlog of incoming data and returned to being fully operational. Data will be updated in the UI at normal rates. Delivery times for inbound & outbound email notifications have also returned to normal.
Posted Jan 19, 2017 - 23:54 UTC
Update
Our web app and API endpoints are back up. We are going to continue to monitor their status.

Conversation email notifications and outbound notifications will be delayed. While user updates are being accepted by our API endpoints they will not be surfaced in the UI until our user data pipeline is fully re-enabled and the backlog is worked through. We're working to resolve this and we'll continue to post updates.
Posted Jan 19, 2017 - 18:38 UTC
Monitoring
Our web app and API endpoints are back up. We are going to continue to monitor their status.

Conversation email notifications and outbound notifications will be delayed. While user updates are being accepted by our API endpoints they will not be surfaced in the UI until our user data pipeline is fully re-enabled and the backlog is worked through. We're working to resolve this and we'll continue to post updates.
Posted Jan 19, 2017 - 17:24 UTC
Investigating
Our web app and APIs are currently down. We are working to resolve this.
Posted Jan 19, 2017 - 15:27 UTC
Identified
Our web app is back up, but our API endpoints are down. We are working to resolve this.
Posted Jan 19, 2017 - 14:59 UTC
Investigating
Our web app is currently down. We are investigating and working to resolve this.
Posted Jan 19, 2017 - 14:43 UTC