Postmortem: Service disruption on April 29th, 2020

Kris Rasmussen
CTO, Figma

As reported on our status page and Twitter, a number of users were unable to access Figma at 12pm PST on April 29th, 2020. We know how disruptive it is when Figma is unavailable, especially in these times. We take downtime very seriously. While there is still more work to be done to ensure we’ve learned everything we can from this incident, we want to share details about what happened and what we did to resolve it.

What happened?

Long story short, the CPU our Proxy servers needed to establish secure connections exceeded the capacity we had available. Our Proxy servers are primarily responsible for two things:

  1. Routing requests to the appropriate backend services that power Figma.
  2. Negotiating secure TLS connections between you and Figma.

Figma clients maintain a secure connection to our Proxy servers at all times. This connection ensures you have a realtime view of what’s happening in your files and across your organization.

Just before this incident, we restarted our Proxy servers to patch an issue that had been released the day before. At that time, our traffic was significantly higher than when we normally restart these servers in the afternoon. As each proxy server was being restarted, clients attempted to reconnect en masse. The work required to negotiate a secure connection overloaded the CPU on some of our Proxy servers. This slowed connection establishment just long enough for clients to time out the connection before it completed. The retries perpetuated the incident until enough clients backed off sufficiently for connections to be established successfully.
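To make the feedback loop concrete, here is a minimal sketch (not our actual client or proxy code; the function name and parameters are hypothetical) of why fixed-timeout retries drain a reconnect surge so slowly. Each client that times out aborts and immediately retries, so the backlog shrinks only as fast as the server's raw handshake capacity allows:

```python
def simulate_reconnect_storm(clients: int, handshakes_per_sec: float,
                             client_timeout: float, rounds: int = 5) -> list[int]:
    """After a restart, `clients` all reconnect at once. The proxy can
    complete `handshakes_per_sec` TLS handshakes; any client not served
    within `client_timeout` aborts and retries in the next round.
    Returns the number of still-unconnected clients after each round.

    In reality it is worse than this model: every aborted handshake has
    already consumed server CPU, so effective capacity drops further.
    """
    served_per_round = int(handshakes_per_sec * client_timeout)
    pending, history = clients, []
    for _ in range(rounds):
        pending = max(0, pending - served_per_round)
        history.append(pending)
    return history
```

With 1,000 clients, 100 handshakes per second of capacity, and a 1-second client timeout, half the fleet is still retrying after five rounds.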

While we’ve been focused on scaling and hardening our backend services against sudden disconnects and reconnects, we missed that our Proxy servers were using significantly more CPU to establish secure connections in the brief moments following a restart than they do at other times.

How did we resolve it?

We were alerted to the incident within seconds of the first request failures by our monitoring systems.

Our first instinct was to look for errors in our dashboards that indicated which type of requests were the problem. To our surprise, there weren't any obvious errors, but the total number of requests was lower than normal.

This led us to look closer at the Proxies themselves, where we saw that many connections were stuck in a “waiting” state. Upon closer inspection of the logs, we found a stream of INFO messages indicating that clients were terminating the connections while attempting to establish a secure connection.

The combined number of “active” and “waiting” connections was comparable to the number of “active” connections before the incident. So the proxies were not overloaded by an increase in simultaneous connections; rather, there was a substantial increase in the rate at which connections were being closed and reopened as clients timed out and reconnected.

Once we confirmed that the proxies were overloaded, we launched additional capacity. By the time the additional servers came online, enough clients had already reconnected successfully that the service was able to recover on its own.

Why wasn’t everyone affected?

Some of our users in Europe and China connect to a separate fleet of Proxy servers outside the United States so we can route traffic more efficiently back to our servers. If you were in one of the regions where we route traffic through these proxy servers, then you were not affected.

Next steps

We doubled the number of Proxy servers in response to this incident. We’re also updating our capacity planning with new monitoring that will alert us when it’s time to add more Proxies going forward. These immediate changes will help prevent a recurrence of this particular incident but aren’t enough on their own to mitigate the risk of similar classes of incidents.
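One way to express that kind of capacity alert is to check steady-state CPU against the surge a restart would cause. This is only an illustrative sketch, not our monitoring configuration; the function name, the surge multiplier, and the threshold are all hypothetical:

```python
def needs_more_proxies(peak_cpu_pct: float,
                       restart_surge_factor: float = 3.0,
                       limit_pct: float = 80.0) -> bool:
    """Alert when the fleet could not absorb a restart-time handshake storm.

    `peak_cpu_pct` is the observed steady-state peak CPU. We assume
    (hypothetically) that a restart multiplies handshake load by
    `restart_surge_factor`, and alert if the projected CPU would exceed
    `limit_pct` -- i.e. it's time to add proxies before the next restart.
    """
    return peak_cpu_pct * restart_surge_factor > limit_pct
```

The key idea is that capacity planning has to budget for the reconnect surge, not just the steady state.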

We also need to rethink our reconnection strategy. If our clients had waited longer before attempting to reconnect, the service would have recovered more quickly. Based on the postmortem from this incident, we're working to create a roadmap of additional infrastructure work we can do to keep Figma stable if similar conditions emerge in the future.

If you have any questions about this incident, please don’t hesitate to get in touch at