As much as we love to build and ship new features, the performance and reliability of our platform is always our top priority. We take downtime seriously. We know that if Figma is unavailable, your work is disrupted, and we wanted to detail the service incidents that occurred on Tuesday, January 21, 2020 and Wednesday, January 22, 2020.
As reported on our status page and Twitter, a number of our users experienced intermittent issues or were unable to load Figma between the hours of 6 and 7 AM PST on January 21st, and at multiple points during the day on January 22nd.
These incidents do not live up to your expectations or our standards, and we apologize for the disruption they caused. In this post, we’ll dive into our analysis of the issue, share technical details that illustrate why we’re confident in our diagnosis, and discuss next steps we’re taking to prevent future incidents.
Tuesday’s incident was primarily caused by a long-running, expensive query. To prevent this issue from recurring, we will be improving our monitoring of expensive queries and setting tighter bounds on allowed running time.
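To illustrate the kind of guardrails we mean: PostgreSQL can enforce a hard bound on statement runtime, and its activity view makes long-running queries easy to spot. A sketch (the timeout value is illustrative, not our production setting):

```sql
-- Bound how long any single statement may run (value is illustrative):
SET statement_timeout = '30s';

-- Surface the currently longest-running active queries:
SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC
LIMIT 10;
```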
The root cause of Wednesday’s incident was that a routine change in database statistics caused PostgreSQL 9 to mis-plan the execution of one of our queries, causing expensive table scans and writes to temporary buffers. This was exacerbated by concurrent aggressive autovacuuming operations, which were happening because of a vacuuming backlog from the long-running query that had been terminated on Tuesday.
The upgrade to PostgreSQL 11 mitigated both of these causes: improvements to the query planner eliminated the possibility of the bad query plan, and the performance characteristics of autovacuuming have been significantly improved between versions 9 and 11.
For a detailed technical explanation of our investigation, conclusions, and next steps, read on…
In the early stages of the incident, our leading theory was that the performance problems were primarily due to automatic vacuuming. Vacuuming is a routine process that helps maintain database health. Normally it runs automatically and with minimal impact on other database tasks, but in this case, the long-running query that had been cancelled on Tuesday produced a backlog of data to be vacuumed. This crossed the threshold for a more aggressive form of automatic vacuuming intended to prevent transaction ID wraparound. This autovacuum mode has a greater impact on locking and writes, especially in the version of PostgreSQL that we were running.
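For readers who want to see how close their own tables are to this aggressive autovacuum mode, a diagnostic along these lines works on stock PostgreSQL (a sketch; the exact trigger threshold also depends on per-table settings):

```sql
-- Compare each table's oldest unfrozen transaction ID against the
-- threshold that triggers anti-wraparound (aggressive) autovacuum:
SELECT c.relname,
       age(c.relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
FROM pg_class c
WHERE c.relkind = 'r'        -- ordinary tables only
ORDER BY xid_age DESC
LIMIT 10;
```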
During the incident, we canceled autovacuum statements running on our largest tables and saw database metrics improve temporarily. However, the stability was short-lived: eventually autovacuuming would resume, and write IOPS and lock contention would climb along with it. Fully preventing aggressive autovacuuming required changing the `autovacuum_freeze_max_age` database parameter and rebooting, which we eventually did. (This reboot was responsible for a portion of the downtime between 2 PM and 3 PM.) While this brought relative stability to the site, write IOPS, write latency, and lock contention were still higher than normal.
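This parameter cannot be changed in a live session; it is a server-level setting that only takes effect after a restart, which is why a reboot was required. A sketch of the change (the value shown is illustrative, not the one we used; the PostgreSQL default is 200 million):

```sql
-- Raise the transaction-ID age at which anti-wraparound autovacuum
-- kicks in. The value here is purely illustrative.
ALTER SYSTEM SET autovacuum_freeze_max_age = 800000000;

-- Takes effect only after a server restart; verify afterwards with:
SHOW autovacuum_freeze_max_age;
```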
The next mitigation we discussed was upgrading the version of PostgreSQL we were running. We had two reasons to consider an upgrade: newer versions include substantially less disruptive autovacuuming, and we hoped a newer query planner might also resolve the performance issues we had not yet fully diagnosed.
We had previously upgraded our staging database to PostgreSQL 11, tested it for several months without issues, and written a detailed plan for performing the upgrade in production. Therefore we felt confident about performing the upgrade even under less than ideal conditions. Given the risk of ongoing instability, we determined that it was best to do the upgrade in the evening rather than waiting until the weekend. Our preparation paid off: the upgrade was successful, and with site stability returned to normal, we could focus fully on diagnosing the root causes.
Because we continued to see performance issues even after disabling autovacuums, we concluded that while they may have been a contributing factor, they were not the only culprit. We turned our attention to a second factor: an unexpectedly expensive query. During the incident, we had observed that one particular statement with complex subqueries showed up repeatedly as being involved in lock contention. We restored a database snapshot from immediately prior to the database upgrade and examined the query plan for the problematic statement. Several facts about the plan stood out, chief among them a wildly inflated row estimate.
The query was a complex one, generated by our ORM layer. (It’s admittedly more complex than it needs to be, and one of our follow-up tasks will be to review and optimize this and similarly complex queries.) We removed parts of the query that proved non-essential to replicating the issue, reducing it to a query on the `roles` table with nested sub-selects:
```sql
SELECT "roles"."id"
FROM "roles"
WHERE id IN (
    SELECT "roles"."id"
    FROM "roles"
    WHERE roles.id IN (?)
);
```
With this reduced query, we could quickly investigate hypotheses about the query planner’s behavior. We restored a database snapshot from several days prior to the incident and observed that the plan for this query on that date had a reasonable estimate (199) for the number of rows:
```
Nested Loop  (cost=17055.00..18144.32 rows=199 width=8) (actual time=0.076..0.090 rows=3 loops=1)
  ->  HashAggregate  (cost=17054.43..17056.42 rows=199 width=8) (actual time=0.068..0.068 rows=3 loops=1)
        [...]
  ->  Index Only Scan using roles_pkey on roles  (cost=0.56..5.45 rows=1 width=8) (actual time=0.005..0.006 rows=1 loops=3)
        Index Cond: (id = roles_1.id)
```
The fact that the same query had wildly different row estimates at two different times suggested an issue with statistics. Statistics are a form of database metadata the query planner uses to estimate the relative costs of potential query plans. They are updated by the `ANALYZE` command, which, like `VACUUM`, is typically run as an automated background process. Looking at data from `pg_stat_user_tables`, we observed that one of the tables involved in the query had completed an automatic `ANALYZE` at exactly the same time as the incident began. And indeed, after we manually ran `ANALYZE` on this table on our test database, the query plan went from estimating 199 rows returned to estimating over 20 million:
```
Nested Loop  (cost=16161.53..17255.46 rows=20682690 width=8) (actual time=0.098..0.115 rows=3 loops=1)
  ->  HashAggregate  (cost=16160.96..16162.96 rows=200 width=8) (actual time=0.081..0.081 rows=3 loops=1)
        [...]
  ->  Index Only Scan using roles_pkey on roles  (cost=0.56..5.44 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=3)
        Index Cond: (id = roles_1.id)
```
In the original, unreduced query, this bad row estimate had a cascading effect, triggering expensive table scans and writes to temporary buffers. Execution times stretched into multiple minutes.
We now had strong evidence that a change in table statistics had caused the query planner to switch from a good plan to one with catastrophically bad performance. We began to research query planner behavior, and a key turning point in our investigation was the discovery of this post from the PostgreSQL mailing list. The symptoms closely matched our own: an inner HashAggregate query crossing a threshold of 200 estimated rows, causing the total estimated number of rows to jump wildly.
Based on the discussion on the mailing list thread, the “deficiency in the query planner” hypothesis now seemed much more likely, and was bolstered by the fact that we were able to replicate a test case from the mailing list post on our test database, but not on the upgraded database. Something had evidently been fixed between PostgreSQL 9.5 and 11, but we weren’t sure what. Using the official PostgreSQL Docker images, we bisected PostgreSQL versions: the test case replicated in 9.6.16, but not in 10.0. Reviewing the release notes for 10.0, the following change caught our eye: “Use uniqueness guarantees to optimize certain join types (David Rowley)”. Because the join in our query involved the primary key, we believed the uniqueness guarantee applied to our case. We located the relevant commit, and from it learned that the output of `EXPLAIN VERBOSE` would include the text “Inner Unique: true” when this optimization was in use. We confirmed that on the upgraded database, the plan did include this indicator.
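This check is easy to reproduce: the indicator appears on the join node in the verbose plan output. For example, against the reduced query from earlier (the literal IDs are placeholders):

```sql
EXPLAIN VERBOSE
SELECT "roles"."id" FROM "roles"
WHERE id IN (
    SELECT "roles"."id" FROM "roles" WHERE roles.id IN (1, 2, 3)
);
-- On PostgreSQL 10 and later, the Nested Loop node reports
-- "Inner Unique: true" when the planner has proven the inner
-- side returns at most one row per outer row.
```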
A final question we asked was why the query planner’s row estimate had crossed this critical threshold of 200 at all when the actual row count was only 3. By examining several database snapshots over time, we discovered that PostgreSQL had increasingly underestimated the selectivity of the innermost subquery. As this article explains, this can happen because `ANALYZE` uses a statistical sample of rows, which may underestimate the number of unique values of a column if the same values for that column tend to be clustered together.
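One way to inspect, and partially mitigate, this sampling effect on a PostgreSQL instance (a sketch; the statistics target value is illustrative, and the default per-column target is 100):

```sql
-- See the planner's current distinct-count estimate for the column:
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename = 'roles' AND attname = 'id';

-- Mitigation: have ANALYZE sample more rows for this column,
-- improving the distinct-value estimate at some ANALYZE cost:
ALTER TABLE roles ALTER COLUMN id SET STATISTICS 1000;
ANALYZE roles;
```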
At this point, we felt confident we had located the root cause: a routine change in database statistics triggered a deficiency in the PostgreSQL 9 query planner’s row estimates, causing catastrophically expensive table scans and writes to temporary buffers. This was exacerbated by concurrent aggressive autovacuuming, triggered by a vacuuming backlog from a long-running query that had been terminated the previous day.
We’re also confident this issue will not recur. The upgrade to PostgreSQL 11 mitigated both of these causes: the query planner is able to take advantage of the uniqueness of the inner relation to avoid the problematic query plan, and the performance characteristics of autovacuuming have been significantly improved.
We learned a lot from these incidents and are already working on a number of follow-up actions to improve our resilience to future incidents, including better monitoring of expensive queries and tighter bounds on allowed query running time.
In addition to these infrastructure improvements, we’re also revisiting our product priorities and front-loading some project work that was planned for later. To start, we’re kicking off an initiative to ensure you can keep working safely in files you already have open after going offline, even if you close your browser before coming back online.
We’ll be sharing more details soon about this upcoming product work. In the meantime, if you have any questions about the service disruption, please don’t hesitate to get in touch at firstname.lastname@example.org.