Server-side sandboxing: Virtual machines



With so many sandboxing options, it’s daunting to choose the right one. Here, we dive into the virtual machine (VM) security model for sandboxing, including the engineering trade-offs to consider and how we use them at Figma to achieve security isolation.
A VM is a guest virtual computer that behaves like a real physical computer with its own memory, disk, and CPU.
Seccomp can restrict the system calls a program is allowed to make.
At Figma, rather than try to prevent security vulnerabilities entirely, we employ server-side sandboxing (also known as workload isolation) to minimize the resulting security risks. In our overview of server-side sandboxing, we introduced the idea of application sandboxing, explained why it’s useful, and briefly discussed three different high-level technical approaches: VMs, containers, and secure computing mode (seccomp). For sandboxing using VMs, we mainly care about two security questions:
- Can a malicious job inside the VM break out of the VM?
- Even if a malicious job can’t break out of the VM, can it use the VM’s permissions to access other systems or otherwise do harm?
The VM security model
To answer these questions, we have to examine the two main relevant boundaries in the VM security model: the hypervisor and the VM’s own permissions.
A hypervisor manages VMs and allows you to run many of them at once on a single physical machine.
A VM escape is when a virtual program breaks out of the isolation of a VM and accesses the host system.
First, the hypervisor must correctly separate the host system from guest VMs, as well as guest VMs from one another. For many sandboxing use cases, this security boundary is more than sufficient and unlikely to be the weak link in a broader security system. However, because hypervisors tend to mediate many operating system (OS) and hardware operations, they often expose a large and complex attack surface. Sometimes, vulnerabilities allow a guest VM to take over the host, as in VM escapes, or to interact with or learn information about other guests.

Bare-metal instances give workloads direct access to the underlying physical server, with no hypervisor layer in between.
Use cases that involve building specialized hypervisors—or otherwise require a thorough understanding of the hypervisor attack surface—are complicated, but there are many resources that offer helpful overviews.
For example, consider that most large cloud Infrastructure as a service (IaaS) providers—like AWS or Microsoft Azure—rely on VMs as a key primitive to isolate tenants. If your system runs in this type of cloud provider and you don’t limit yourself to bare-metal instances, then your security model has to rely on the hypervisor security boundary to some extent anyway.
Second, VM escapes aside, simply placing a compromised workload inside a VM may be insufficient. You’ll need other controls if, for example, a malicious job in the VM can still make network calls to exfiltrate data or call other services with the VM’s credentials. While we often have limited ability to modify the hypervisor itself, we can configure the VM’s capabilities to reduce the blast radius of a malicious job taking over the VM. For example, you can (and should) restrict network access, minimize the VM’s permissions to contact other services, and limit the lifetime of the VM’s credentials—and of the VM itself.
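To make that concrete, here is a minimal sketch, in Python with boto3, of how a host might mint short-lived, scoped-down credentials before handing work to a VM. The role ARN, bucket prefix, and job name are placeholders rather than anything from our systems, and the right permissions and lifetime depend entirely on the workload.

```python
import json
import boto3

sts = boto3.client("sts")

# A session policy further restricts whatever the assumed role already allows:
# the resulting credentials can only read a single job's input prefix, nothing else.
session_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": ["arn:aws:s3:::example-sandbox-inputs/job-1234/*"],  # placeholder
    }],
}

creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/sandbox-job-role",  # placeholder
    RoleSessionName="sandbox-job-1234",
    DurationSeconds=900,            # shortest allowed lifetime: 15 minutes
    Policy=json.dumps(session_policy),
)["Credentials"]

# Only these narrowly scoped, short-lived credentials are passed into the VM;
# the VM never sees the host's own credentials.
```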

Engineering considerations
In our introduction to server-side sandboxing, we laid out the engineering considerations we use to evaluate sandboxing options and the trade-offs between them. Here’s how those considerations apply to VMs.
Environment
Since VMs can run full-fledged operating systems, they provide a significant advantage: Most workloads probably require little to no modification to execute in that environment. You may even require a full VM if you need to run certain workloads—like converting a text document to a PDF—that involve third-party software such as office productivity suites or web browsers.
Security and performance
Creating and running a full VM tends to have a greater impact on performance compared to other sandboxing primitives, like containers or seccomp. A big part of the cost is that each VM boots its own kernel and OS, so isolating every workload in its own VM means a significant setup and teardown cost per workload, as well as increased latency waiting for a VM to become ready from a cold start. Keeping a warmed-up pool of available VM workers helps address the cold start problem, but adds overhead: it’s complex to implement and operate this kind of orchestration system. The performance impact of a full VM also often translates to increased monetary cost compared to lighter-weight options. Ask yourself whether every individual workload needs to be isolated, or whether some level of sandbox sharing across workloads (e.g., those belonging to the same user) would be acceptable.
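For a sense of what that orchestration layer involves, here is a rough sketch of a warm pool in Python. The boot_vm callable and the handle’s run and destroy methods are hypothetical stand-ins for whatever VM technology you use; a production version would also need health checks, timeouts, and metrics.

```python
import queue
import threading

POOL_SIZE = 4

class WarmVmPool:
    """Keeps a few pre-booted VMs ready so jobs don't pay the cold-start cost."""

    def __init__(self, boot_vm):
        self._boot_vm = boot_vm          # hypothetical: boots a VM and returns a handle
        self._ready = queue.Queue()
        for _ in range(POOL_SIZE):
            self._add_vm_async()

    def _add_vm_async(self):
        # Boot replacements in the background so callers rarely wait on a cold start.
        threading.Thread(
            target=lambda: self._ready.put(self._boot_vm()), daemon=True
        ).start()

    def run(self, job):
        vm = self._ready.get()           # blocks only if the pool is exhausted
        try:
            return vm.run(job)           # hypothetical per-VM API
        finally:
            vm.destroy()                 # one job per VM: tear down, never reuse
            self._add_vm_async()         # immediately start booting a replacement
```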
Development costs and friction
Since full VMs often allow you to run an entire OS, many types of workloads require little to no modification to run inside of a VM compared to more restrictive sandboxing techniques. Specialized VM technologies like Firecracker, a security-focused microVM built by Amazon, may require a custom runtime to support your workload.
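To illustrate what “bring your own runtime” can mean in practice, here is a rough sketch of configuring and booting a Firecracker microVM through its Unix-socket API. The kernel and rootfs paths are placeholders, the example assumes the requests-unixsocket package, and the root filesystem has to be built to contain everything your workload needs.

```python
import requests_unixsocket  # pip install requests-unixsocket

# Firecracker exposes its control API over a Unix domain socket
# (started with: firecracker --api-sock /tmp/firecracker.sock).
session = requests_unixsocket.Session()
api = "http+unix://%2Ftmp%2Ffirecracker.sock"

# You bring your own kernel...
session.put(f"{api}/boot-source", json={
    "kernel_image_path": "/images/vmlinux.bin",        # placeholder path
    "boot_args": "console=ttyS0 reboot=k panic=1",
})

# ...and your own root filesystem, which must contain your workload's runtime.
session.put(f"{api}/drives/rootfs", json={
    "drive_id": "rootfs",
    "path_on_host": "/images/workload-rootfs.ext4",    # placeholder path
    "is_root_device": True,
    "is_read_only": False,
})

session.put(f"{api}/machine-config", json={"vcpu_count": 1, "mem_size_mib": 256})
session.put(f"{api}/actions", json={"action_type": "InstanceStart"})
```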
Maintenance and operational overhead
Overall, VMs usually don’t make debugging more painful or add much additional development cost for engineers actively modifying workload code. There are many well established ways to provide engineers with access to common debugging techniques, and the tooling is generally quite usable.
However, it can be complicated to manage, operate, and orchestrate a VM cluster or service for sandboxing. Doing this well may require reliability engineering knowledge and experience. For instance, managing VM orchestration directly often requires deep visibility into the core internals of the underlying systems, including detailed tracking of each VM’s state and tuning for unexpected sources of performance degradation. Depending on the performance constraints, this orchestration system may also be responsible for complex routing decisions about which user requests to associate with which VMs. So while VMs can work mostly out-of-the-box as a drop-in sandboxing solution in many situations, they often come with additional development and maintenance costs.
How we use VMs at Figma
At Figma, for situations where virtualization-based sandboxing is appropriate, we rely on AWS Lambda, which uses Firecracker under the hood. While we lean on AWS Lambda to handle most of the details of virtualization management, we still have to make interesting, ongoing trade-offs. For example, to keep Lambda viable in a synchronous core serving path, AWS has chosen to reuse individual VMs from the same tenant across multiple requests: Firecracker offers reasonably quick VM boot times, but the overhead is still too high to pay on many core workflows. However, we have minimal control over routing at this level. For many workloads, this is a reasonable security trade-off.
At Figma, we use Lambdas for stateless work such as fetching external images for use in the Figma canvas, or fetching link metadata and converting or resizing the associated images for link previews in FigJam. For these operations, low latency is important: when users add content to the Figma canvas, they expect to see it quickly. Further, much of the code involved comes in the form of large open-source libraries. Forking those libraries to fit them into a less generic sandboxing primitive would be expensive at the outset and would require substantial ongoing maintenance work.
For the link metadata fetcher, we frequently need to resize images or convert them into a format that the Figma frontend understands and can render. To do this, we fetch data from the third-party link and run an image processing library in an AWS Lambda, which also means a Firecracker VM. Because this execution environment runs with no special privileges in our AWS account, outside our production VPC, exploiting a vulnerability in our link fetching logic or in ImageMagick doesn’t grant any ability to speak to Figma’s internal services or to use that foothold to touch anything outside the Lambda execution environment.
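As a simplified sketch of this kind of stateless handler (not our actual code, and using Pillow rather than ImageMagick for brevity), a Lambda that fetches and normalizes an external image might look something like this:

```python
import base64
import io
import urllib.request

from PIL import Image  # assumes Pillow is bundled in the Lambda deployment package

MAX_DIMENSION = 512

def handler(event, context):
    """Hypothetical handler: fetch an external image, downscale it, return it as PNG."""
    url = event["image_url"]

    with urllib.request.urlopen(url, timeout=5) as resp:
        raw = resp.read()

    image = Image.open(io.BytesIO(raw))
    image.thumbnail((MAX_DIMENSION, MAX_DIMENSION))   # resize in place, keeping aspect ratio

    out = io.BytesIO()
    image.save(out, format="PNG")                     # normalize to a format the frontend renders
    return {"image_base64": base64.b64encode(out.getvalue()).decode("ascii")}
```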
Unlike our experience with nsjail, which we cover in detail in our containers and seccomp deep dive, we have to do relatively little sandbox tuning for these virtualization use cases. Since there is no allowlist of system calls (syscalls) to manage, there’s no concern about rarely evaluated codepaths. That said, there are some gotchas we always need to avoid. First, the Lambda environment includes a localhost HTTP listener that will give any caller the contents of the request that triggered the Lambda, or allow any caller within the Lambda environment to forge a response. While we can’t prevent code executing within the Lambda from calling this API, we don’t want a server-side request forgery (SSRF) vulnerability to hand out this power, so we needed to ensure our application code wouldn’t make requests to localhost. Second, Lambdas run within a cloud environment, not as “raw” compute: it’s easy to configure Lambdas so that they do have special privileges, whether by placing them within your internal network or by granting them identity and access management (IAM) permissions to touch other cloud resources.
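One way to enforce that first constraint in application code is to resolve a URL’s destination and refuse internal addresses before ever fetching it. This is a simplified sketch rather than our production logic; it doesn’t handle redirects or DNS rebinding.

```python
import ipaddress
import socket
from urllib.parse import urlparse

def assert_safe_destination(url: str) -> None:
    """Reject URLs that would let a fetch reach localhost or other internal addresses."""
    host = urlparse(url).hostname
    if host is None:
        raise ValueError("URL has no hostname")

    # Resolve the hostname and check every address it maps to.
    for family, _, _, _, sockaddr in socket.getaddrinfo(host, None):
        ip = ipaddress.ip_address(sockaddr[0])
        if ip.is_loopback or ip.is_private or ip.is_link_local:
            raise ValueError(f"refusing to fetch {url!r}: resolves to internal address {ip}")

# assert_safe_destination("http://localhost:9001/...")  # raises before any request is made
```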
With our sandboxing approach, we also had to iterate on performance and resource limit tuning. When we first implemented this Lambda, calls to it could take up to 10 seconds, a significantly higher latency than we wanted for this feature. As the Lambda runtime warmed up, our average latencies dropped, but we also had to invest direct engineering effort into minimizing startup and processing costs and into reducing contention on the Lambda concurrency limit, which is a quota shared among all Lambdas in an AWS account and region.
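One lever for the shared-quota problem is to reserve concurrency for a function so it neither starves nor gets starved by other Lambdas in the account; a minimal sketch with boto3 (the function name and limit are placeholders) looks like this:

```python
import boto3

lambda_client = boto3.client("lambda")

# Reserve a slice of the account's regional concurrency for this function so noisy
# neighbors in the same account can't starve it (and it can't starve them).
lambda_client.put_function_concurrency(
    FunctionName="link-metadata-fetcher",      # placeholder name
    ReservedConcurrentExecutions=50,           # placeholder limit
)
```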
There’s no one right approach to sandboxing, but understanding the trade-offs and engineering considerations can help you choose the best solution for your team. For more on different sandboxing primitives, read our overview of server-side sandboxing and our deep dive into containers and seccomp.
If you enjoy this type of technical work, consider working with us on the Figma security team!
Artwork by Pete Gamlen.