# Design

This document describes how forkd implements fork-on-write for Firecracker microVMs and the constraints that drove the design.

## Overview

```mermaid
flowchart LR
    subgraph PARENT_LIFE["1. Parent lifecycle (once per parent image)"]
        direction TB
        boot["Vm::boot(BootConfig)<br/>Firecracker InstanceStart"]
        warm["userspace warm-up<br/>Python imports, JIT, model load"]
        pause["Vm::pause<br/>PATCH /vm Paused"]
        snapshot["Vm::snapshot_to<br/>writes memory.bin + vmstate"]
        boot --> warm --> pause --> snapshot
    end
    subgraph FORK["2. Fork-out (per cohort of N children)"]
        direction TB
        spawn["Snapshot::restore_many_with<br/>spawn N Firecracker procs in parallel"]
        load["PUT /snapshot/load on each<br/>mmap(memory.bin, MAP_PRIVATE)"]
        place["place each PID into<br/>/sys/fs/cgroup/forkd/child-i<br/>memory.max = quota"]
        ns["each child runs inside<br/>netns forkd-child-i"]
        spawn --> load --> place --> ns
    end
    subgraph RUN["3. Runtime per child"]
        direction TB
        cow["kernel CoWs diverged pages<br/>shared pages stay shared"]
        agent["forkd-agent.py listens on :8888<br/>(ping / exec / eval over TCP)"]
        cow --- agent
    end
    PARENT_LIFE --> FORK --> RUN
    classDef phase fill:#ffffff,stroke:#52606d,color:#1f2933;
    class PARENT_LIFE,FORK,RUN phase;
```

The kernel does the hard part (CoW page management). forkd's job is correctness, isolation, and orchestration.

## Runtime: Firecracker

forkd builds on Firecracker rather than a container runtime or gVisor because:

- **Snapshot/restore is first-class.** Firecracker's `/snapshot/load` with `MEMORY_LOAD_PRIVATE` is the exact primitive we need.
- **KVM-backed.** Each child gets a hardware isolation boundary, not a syscall filter or namespace.
- **Small.** ~5 MiB of resident memory per VM process before any guest state.
- **Stable API.** Rust ecosystem, well-trodden by AWS Lambda.

forkd uses upstream Firecracker, with no vendored fork.

## Component layout

```
crates/
  forkd-vmm          Firecracker wrapper. BootConfig, Vm, Snapshot,
                     ForkOpts, cgroup helpers, network namespace
                     plumbing, raw HTTP/1.1 over Unix socket.
  forkd-cli          `forkd` binary. CLI surface: snapshot, fork, run,
                     exec, eval, ping.
  forkd-controller   `forkd-controller serve`. REST API, persistent
                     registry, audit log, /metrics, bearer-token auth,
                     graceful shutdown.
rootfs-init/
  forkd-init.sh      PID 1 inside the guest. Mounts pseudo-fs, fixes DNS,
                     launches the agent.
  forkd-agent.py     TCP server on :8888 (ping / exec / eval).
sdk/python/          E2B-compatible Python SDK.
```

## Hard problems and how forkd addresses them

### 1. Memory image backing

Putting `memory.bin` on tmpfs invites OOM kill. Slow disk kills restore latency. forkd writes the image to ext4 by default and relies on the page cache; on hosts with hugepages provisioned (per `scripts/setup-host.sh`) the kernel transparently backs hot pages with 2 MiB pages.

Future: explicit `memfd_create(MFD_HUGETLB)`-backed memory for high-N fork-out, where the savings on page-table size dominate.

### 2. RNG and TSC

Children boot with the parent's RNG state and the parent's `tsc_offset`.
Both are cryptographically unsafe if exposed externally: children that share RNG state can generate identical keys and nonces.

- **RNG**: Linux 5.18+ ships a `vmgenid` driver for the Virtual Machine Generation ID, an ACPI device that exposes a generation identifier the guest kernel watches; Firecracker changes the identifier on restore and the guest's CRNG reseeds automatically. forkd relies on this; no userspace daemon is required.
- **TSC**: Firecracker assigns a fresh `tsc_offset` on each restore via its `--rdtsc` handling. This is enabled by default.

### 3. MAC / IP collisions

All children inherit the parent snapshot's MAC and guest IP. Without network isolation they would collide on the host bridge. forkd places each child in its own pre-provisioned network namespace (`forkd-child-1` … `forkd-child-N`). Each namespace has:

- An independent tap (same name, same IP, different network stack).
- A `veth` pair into a shared `forkd-br0` bridge for outbound NAT.
- SNAT on egress so the bridge can reverse-route replies.

See `scripts/netns-setup.sh` for the exact iptables rules.

### 4. Block device CoW

Children need a writable rootfs but should share the base image. Today forkd uses Firecracker's built-in read-write attachment with overlayfs on the host. Each child gets a fresh upper dir; the lower dir is the shared rootfs. Writes are per-child; nothing persists post-exit.

Future: `dm-thin` for production density beyond a few hundred concurrent children.

### 5. KSM aggressiveness

Default KSM (kernel same-page merging) is too lazy: it takes minutes to reach steady-state sharing. `scripts/setup-host.sh` tunes `pages_to_scan` and `sleep_millisecs` for forkd's workload. With CoW mmap, KSM is a backstop for divergent-but-similar pages; it doesn't need to do the heavy lifting.

### 6. OOM cascades

If the host hits memory pressure and the OOM killer takes the parent process, every child loses its backing pages. forkd nudges each child's `oom_score_adj` up by +500 so the kernel picks a runaway child first.
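
The `oom_score_adj` nudge can be sketched roughly as follows. This is a minimal Python illustration, not forkd's own (Rust) code: the function name `nudge_oom_score_adj` and the `proc_root` parameter are hypothetical, introduced only so the logic can be run against a fake `/proc` tree. It assumes the nudge is additive and clamps the result to the kernel's range of -1000 to 1000.

```python
from pathlib import Path

# Kernel-enforced bounds for /proc/<pid>/oom_score_adj.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def nudge_oom_score_adj(pid: int, delta: int = 500,
                        proc_root: Path = Path("/proc")) -> int:
    """Raise a process's oom_score_adj by `delta`, clamped to the valid range.

    `proc_root` is a hypothetical parameter (not part of forkd) that lets
    the sketch be exercised against a temporary directory.
    """
    path = proc_root / str(pid) / "oom_score_adj"
    current = int(path.read_text().strip())
    # Clamp so an already-high score never exceeds the kernel maximum.
    new = max(OOM_SCORE_ADJ_MIN, min(OOM_SCORE_ADJ_MAX, current + delta))
    path.write_text(f"{new}\n")
    return new
```

Under these assumptions, an orchestrator would call `nudge_oom_score_adj(child_pid)` once for each Firecracker child process after it is spawned.
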
With per-child `memory.max` set via cgroup v2, runaway children are bounded before they push the host into global pressure.

### 7. Per-child resource quotas

forkd creates one cgroup v2 leaf per child under `/sys/fs/cgroup/forkd/child-N/` and writes the Firecracker PID to `cgroup.procs`. Today only `memory.max` is wired into `ForkOpts`; `cpu.max` / `io.max` / `pids.max` land before 1.0.

### 8. Scheduling affinity

Children must land on the same host as their parent (otherwise CoW becomes copy-everything across the wire). v0.1 is single-host only; multi-host scheduling is a v1.x problem and will require either a warm parent on each scheduling target or a fast snapshot-replication path.

## Authentication and audit

The controller daemon optionally requires a bearer token (`--token-file`) on every request except `/healthz`. The check uses length-aware constant-time comparison to avoid trivial timing oracles.

Every request is appended to a JSON-Lines audit log (`/var/log/forkd/audit.log` by default): RFC3339 timestamp, method, path, status, latency in microseconds, user-agent. Log rotation is out of process (`logrotate`, `vector`, the journal).

## Related work

The sandbox-runtime space has been growing fast. forkd's contribution is the fork-from-warm primitive on a full Linux microVM, with an open-source operator surface (REST + auth + TLS + audit + metrics). This section sketches how that compares to the projects most worth benchmarking against.

### Tencent CubeSandbox

[CubeSandbox](https://github.com/TencentCloud/CubeSandbox) is the closest open-source project to forkd in primitive choice: RustVMM-based microVMs, KVM isolation, Apache 2.0. The published P95 cold-start is "**<60 ms**" with per-instance memory overhead below 5 MiB, which beats forkd's pure cold-boot path (forkd's snapshot fork wins on the fan-out workload because it skips guest userspace warm-up, not because the VM boots faster).
CubeSandbox's roadmap mentions "event-level snapshot rollback" with "high-frequency snapshot rollback at millisecond granularity, enabling rapid fork-based exploration environments from any saved state"; when that lands, the two projects will overlap meaningfully. Until then forkd's distinct value is that fork-from-warm exists today.

### Daytona

[Daytona](https://github.com/daytonaio/daytona) is OCI-workspace oriented (Docker-compatible images, per-workspace kernel claim). They advertise "**<90 ms** spinning up... from code to execution" and a stateful-snapshot model for resume. There is no fork-from-warm primitive; each workspace is its own resource. License is **AGPL-3.0**, which is a meaningful difference for commercial users embedding the runtime in proprietary services.

Daytona's polish at the workspace + agent-protocol layer is well ahead of forkd; the projects target different shapes of workload.

### Alibaba OpenSandbox

[OpenSandbox](https://github.com/alibaba/OpenSandbox) is best thought of as an **abstraction layer** over Docker / Kubernetes / gVisor / Kata / Firecracker. It exposes a unified ingress gateway, per-sandbox egress policy, and multi-language SDKs (Python, Java, JS, .NET, Go). Apache 2.0, actively maintained.

OpenSandbox does not itself implement fork-from-warm; if you want that on top of OpenSandbox, you'd plug a runtime that supports it. Conceptually forkd could be slotted in as such a runtime in a future integration.

### BoxLite

[BoxLite](https://github.com/boxlite-ai/boxlite) bills itself as "the SQLite of sandbox": an embeddable microVM runtime with no daemon, designed to scale from a laptop to the cloud without changing primitives. Each Box runs an OCI container inside a hardware-isolated VM (KVM on Linux, Hypervisor.framework on macOS), with seccomp on the host side and `allow_net` for egress policy per Box. Apache 2.0, primarily Rust + TypeScript + Go.

The interesting contrast with forkd is the **stateful Box model**.
BoxLite Boxes persist packages and files across stop/restart cycles: you `apt install` once and that survives the next boot. forkd's parent snapshot does something superficially similar (the parent's RAM survives), but the semantics differ:

- BoxLite optimises for **one long-lived sandbox per workload**: start it, fill it with state, suspend, resume later, never rebuild.
- forkd optimises for **N short-lived children per parent**: the parent is the template, every child gets a fresh CoW divergence, children die freely.

Both designs avoid re-paying setup cost. They're complementary, not competing: BoxLite is the right call when each agent owns a persistent workspace; forkd is the right call when you fan out from a warmed template.

BoxLite's cross-platform support (macOS via Hypervisor.framework) is a real differentiator forkd doesn't try to match: forkd is Linux+KVM only and unlikely to add macOS support because the snapshot-fork primitive depends on the host kernel's CoW semantics on `mmap(MAP_PRIVATE)`.

### E2B

[E2B](https://github.com/e2b-dev/E2B) ships an open-source self-host path (Apache 2.0) and a managed service. The OSS infra repo uses Firecracker under the hood; specific spawn-time numbers are mostly quoted from the managed product. There is no fork-from-warm primitive in the open repo.

forkd's Python SDK is **E2B wire-compatible at the `Sandbox` class level** so existing E2B agents can switch by import alone, with `sandbox.eval(...)` as the forkd-only extra that uses the warmed-PID-1 interpreter.

### Modal

Modal is the only production system known to expose a fork-from-warm primitive ("Modal Sandbox", proprietary). They are not open source; forkd is the open-source analogue of that primitive specifically. Pricing, scheduling, and the full developer-platform layer remain their differentiator.

### Firecracker, Docker, gVisor

These are runtimes, not full sandbox products. forkd builds on Firecracker directly.
Docker (runc) and gVisor (runsc) are in our benchmark chart as honest reference points: they cold-boot every sandbox and pay the `import numpy` cost N times for an N-sandbox fan-out.

## What forkd does not do

- Replace Modal or E2B as a SaaS.
- Beat function-level snapshot runtimes (single-vCPU, serial I/O) on raw spawn time; forkd targets full Linux microVMs with networking and multi-vCPU.
- Support arbitrary guest OSes. v0.1 targets Linux x86-64.

## Stability

API versioning: the REST surface is at `/v1`. Breaking changes move to `/v2` and the previous major is supported for one minor release after the new one ships.

On-disk snapshot format is currently tied to Firecracker's `Full` snapshot version; we do not promise forward compatibility across forkd versions until 1.0.