# v0.4: live-fork via userfaultfd write-protect

**Status:** IMPLEMENTED — the design described below is wired up end-to-end
on the user surface (Phases 6 + 7, May 2026). REST `mode: "live"`, CLI
`--live`, Python / TypeScript / MCP SDKs, and `forkd doctor` capability
checks all shipped via PRs
[#194](https://github.com/deeplethe/forkd/pull/194)–[#207](https://github.com/deeplethe/forkd/pull/207).
The vendored Firecracker dependency lives at
[deeplethe/firecracker:forkd-v0.4-mem-backend-shared-v1.12](https://github.com/deeplethe/firecracker/tree/forkd-v0.4-mem-backend-shared-v1.12);
upstream proposal is open
([`FIRECRACKER-UPSTREAM-PROPOSAL.md`](./FIRECRACKER-UPSTREAM-PROPOSAL.md)).
Clean-parent bench (`bench/live-fork-pause-window.md`) still pending —
Phase 6 E2E saw pause_ms = 41-48 ms, but on a parent with pre-baked
guest Oopses contaminating the measurement.

The original DRAFT below is preserved verbatim as the architecture
record; the implementation tracks it closely.
**Tracking issue:** [#101](https://github.com/deeplethe/forkd/issues/101)

## Motivation

v0.3.4 BRANCH (diff snapshot) takes ~150–300 ms on ext4 + SSD, of
which essentially all is a *hard pause window* — the source VM cannot
execute guest code while `memory.bin` is being written. For an agent
that does interactive inference, 150 ms straddles the perceptible-delay
boundary. For an agent that BRANCHes often (speculative-execution
patterns, live-rollout evaluation), it compounds: every branch point
freezes the parent.

The pause is structural in v0.3.4. The daemon issues Firecracker's
`Snapshot.Create`, which:

1. Pauses the source VM (microseconds).
2. Writes `vmstate` JSON (KB-scale, microseconds).
3. Writes `memory.bin` (500 MiB+ for a typical Python+JIT parent,
   tens of milliseconds even on tmpfs, hundreds of milliseconds on
   ext4 — see `bench/pause-window/PROBE-multi-branch-anomaly.md` for
   the v0.3.4 fix story).
4. Resumes the source VM.

Step 3 dominates. As long as `memory.bin` is written synchronously
inside the pause, we can only optimize within the disk-write cost.
v0.3.4 squeezed out the ext4 metadata penalty via `posix_fallocate`;
that's about as far as the synchronous path can go.

## Goal

Reduce the BRANCH pause window from ~150 ms to **< 10 ms** by removing
the synchronous memory write entirely. The vCPU + device state dump
still requires a pause (KVM_GET_REGS, KVM_GET_SREGS, virtio
descriptor snapshotting, kvmclock fixup), but that's a few KB of state
and tens of microseconds, not hundreds of milliseconds.

Stretch goal: pause < 1 ms.

## Non-goals

- Cross-host BRANCH (deferred to v0.5).
- Non-Linux backends (libkrun port is its own multi-month effort).
- Reducing child-spawn latency (already ~20 ms/child, not the
  bottleneck — children just `mmap(MAP_PRIVATE)` the snapshot).
- Lazy-restore on the child side (children already inherit memory via
  CoW, the cost is in BRANCH not in spawn).

## Proposed approach

Three building blocks:

### 1. `memfd_create` for source RAM

Replace the current file-backed guest memory mmap with anonymous memfd.
This is necessary because `UFFDIO_WRITEPROTECT` is supported on
anonymous and shmem-backed VMAs but not on arbitrary
host-filesystem-backed mmaps. memfd is technically tmpfs-backed and
qualifies. (Reference: kernel commit `1df319f0837c`, "userfaultfd: wp:
add WP support for shmem".)

Practically this is a swap of the backing in `forkd-vmm`'s memory
setup — the guest still sees a contiguous physical address space, the
host backing just changes from a file to a memfd.

### 2. `UFFDIO_WRITEPROTECT` on the source memfd before BRANCH

Register a `userfaultfd` against the source's memory region, then
issue `UFFDIO_WRITEPROTECT` over the full guest physical address space
in one syscall. The source VM continues running. Any subsequent guest
write to a still-WP'd page traps into the userspace handler before
the write commits.

The WP-arming cost is approximately O(VMA size / page-table walk
cost). On tested kernels (6.14, 5.7+) this is sub-millisecond for
multi-GiB regions when THPs are split appropriately.

### 3. Async dirty-page copier

A handler thread polls the uffd file descriptor. For each WP fault:

```
1. Read the page out of the source memfd at (faulting_addr - base).
2. Append the page (with its offset) to the in-flight snapshot file.
3. Clear the WP bit for that page (UFFDIO_WRITEPROTECT with mode=0).
4. Wake the faulting thread (UFFDIO_WAKE).
```

In parallel, a *bulk copier* reads still-clean pages from the source
memfd directly (no faulting involved, the memfd is just memory) and
writes them to the snapshot file. The two flows coordinate through a
per-page state map (clean / dirty-copying / final) so each page is
written exactly once.

The snapshot file is therefore complete some time *after* the BRANCH
pause exits, but it represents the consistent point-in-time view from
the moment WP was armed.

### What the pause window contains

After the changes above, the BRANCH critical section reduces to:

- vCPU dump: `KVM_GET_REGS` + `KVM_GET_SREGS` + a few model-specific
  registers, microseconds.
- Device state dump: virtio descriptor heads, MMIO state, microseconds.
- WP arming: `UFFDIO_WRITEPROTECT` over the whole RAM region, target
  sub-millisecond.
- kvmclock + TSC offset snapshot for guest time continuity, microseconds.

Total: well under 10 ms, and most of it independent of guest RAM size.

## Alternatives considered

### A) Status quo: pause-based snapshot

What we have today. Simple, robust, well-understood. Cost: ~150 ms
pause per BRANCH on ext4 + SSD. Becomes prohibitive when BRANCHing
>1/s, which is exactly the speculative-execution pattern this project
exists to enable.

### B) Pre-copy (à la live migration)

Iteratively dirty-track pages via `KVM_GET_DIRTY_LOG` and copy them in
rounds while the source keeps running, ending with a small "stop and
copy" final pass. This is the standard cross-host VM migration design
(Clark et al. NSDI 2005).

Downsides for our use case:

- `KVM_GET_DIRTY_LOG` requires `KVM_MEM_LOG_DIRTY_PAGES` to be set on
  memslots, which has its own per-`KVM_RUN` overhead.
- The "convergence" problem: if the guest's dirty rate exceeds copy
  bandwidth, pre-copy never finishes. Some agent workloads
  (`memset`-heavy initialization, large allocations during training)
  hit this regime.
- More implementation surface than uffd_wp.

### C) Full memcpy-out-then-snapshot

Pause briefly, `memcpy()` the entire guest RAM into a second buffer,
resume the guest, then async-write the buffer to disk. Pause cost:
memcpy time, roughly 5 ms/GiB on modern DDR. Memory cost: 2× peak
RAM usage.

The 2× RAM cost is a dealbreaker for the AI fan-out use case, where
parent VMs are routinely 4-8 GiB and the host already runs many of
them.

### D) Block-device CoW (LVM, dm-snapshot, btrfs reflink)

Snapshot the underlying block device, not the RAM. Doesn't apply:
guest RAM lives in memfd/file mappings, not on a block device. The
disk-backed virtio-blk *content* could be CoW'd this way, but that's
a separate problem from RAM snapshots.

uffd_wp is the right choice because it's the only mechanism that
gives us per-page lazy copy with no pause for clean pages and no
second memory buffer.

## Open questions

These are genuine unknowns. Reach out via issue if you have
experience here:

1. **Behavior of `UFFD_WP` on memfd-backed VMAs under `KVM_RUN`.** Are
   there any KVM paths that bypass userspace faulting and access
   guest memory directly (e.g., for MMIO emulation, virtio descriptor
   walking, kvmclock updates from the host side)? If so, do those
   paths get `UFFD_WP` write-faults, or do they silently violate the
   WP invariant? My current reading of `kvm_main.c` is that
   `gfn_to_hva_*` paths *do* go through the WP, but I haven't
   verified empirically.

2. **Interaction with transparent hugepages.** If the source memfd is
   backed by THPs, `UFFD_WP` works at the 4 KiB level — does the
   kernel split the hugepage on the first WP-fault, or does it WP the
   whole 2 MiB region? Splitting on each fault could be expensive
   for sparse-write workloads. May need to disable THP for source
   VMAs explicitly.

3. **vCPU dirty-bitmap vs uffd_wp.** KVM tracks its own dirty pages
   via `KVM_GET_DIRTY_LOG`. Is there value in combining both (e.g.,
   pre-write the KVM-dirty subset eagerly, then arm WP only on the
   clean remainder) or does uffd_wp on the whole region subsume it?
   The combined approach saves faults for the hottest pages but
   doubles the bookkeeping.

4. **Snapshot file format compatibility.** v0.3.4's snapshot is
   `vmstate JSON + memory.bin (contiguous raw 4 KiB pages)`. v0.4
   needs either (a) sparse memory.bin with page offsets, or
   (b) a chunked/segmented memory.bin format. Leaning (a) since
   stock Firecracker's restore expects contiguous; (b) breaks
   restore compatibility.

5. **Children spawned mid-BRANCH.** A child could in principle start
   `mmap`'ing the snapshot file before all dirty pages have been
   flushed, since the parent's pre-BRANCH state is consistent the
   moment WP is armed. Implementation requires the snapshot reader
   to block on in-flight pages with proper synchronization. Out of
   scope for v0.4 first cut, but a fast follow.

## Implementation phases

### Phase 1: standalone PoC (Week 1-2)

A separate Rust binary, not yet integrated with forkd. Allocates a
1 GiB memfd, populates with patterns, registers uffd, arms WP, forks
a writer process that randomly writes the memfd, captures faults,
copies dirty pages to a snapshot file, validates that the snapshot is
a consistent point-in-time view. Goal: prove the kernel mechanics work
as expected outside the KVM context.

### Phase 2: integrate into `forkd-uffd` crate (Week 3-4)

Extend the existing `crates/forkd-uffd/` (currently used for
restore-side lazy paging) with a snapshot-side WP path. Plumb the new
flow through `forkd-controller::branch_sandbox`. Add a `--live-fork`
feature flag (default off) so the v0.3.4 pause-based path remains
available during stabilization.

### Phase 3: pause-window benchmarking (Week 5)

Reproduce the v0.3.4 multi-BRANCH sweep
(`bench/pause-window/sweep-diff.sh`) but with `--live-fork`. Target:
pause < 10 ms across all 10 consecutive BRANCHes. Compare *distribution*,
not just mean — the v0.3.4 fix was a story about tail behavior.

### Phase 4: hardening (Week 6-7)

Edge cases to specifically test:

- Write-heavy guest (`stress-ng --vm 1 --vm-bytes 90%` running inside).
- NUMA cross-node guest RAM (force memfd allocations across nodes).
- Concurrent BRANCHes on different parents (shared uffd handler thread
  pool? Or one handler per BRANCH?).
- Kernel < 5.7 (no `UFFD_WP`) — graceful detection + fallback to
  v0.3.4 pause-based path.
- THP enabled/disabled.
- Memory pressure during BRANCH (host actively swapping).

### Phase 5: launch (Week 8)

- Switch `--live-fork` to default-on after a stabilization pass.
- Write up the implementation as a post-mortem-style article (same
  cadence as the v0.3.4 ext4 story).
- Ship v0.4.
- File any upstream kernel/Firecracker issues discovered along the way.

## Risks

- **Kernel < 5.7 doesn't have `UFFDIO_WRITEPROTECT`.** Mitigation:
  detect at startup, fall back to v0.3.4 path, document minimum
  supported kernel. Ubuntu 20.04 LTS has 5.4 — that's a real
  deployment hit. Possible workaround: backport detection so 5.4
  users transparently get v0.3.4 behavior.

- **Write-fault storms.** A guest scribbling all of RAM during BRANCH
  generates one fault per page. At 4 KiB pages × 1 GiB RAM that's
  262,144 faults. Each fault is microseconds of kernel + userspace
  work; bound is ~1 s to drain — *worse* than v0.3.4 pause for this
  pathological case. Mitigation: measure, document the regime, add
  a "give up, fall back to pause" escape hatch when fault rate
  exceeds threshold.

- **Snapshot consistency under uffd_wp ordering.** Need careful proof
  that the snapshot represents a consistent point-in-time even with
  async page copying. Plan: write a model + property test using
  `loom` or similar to fuzz the page-state machine.

- **Restore-time regression.** The new snapshot format (if it ends up
  different from v0.3.4) might restore slower. Need to bench both
  paths under the same workload before declaring v0.4 a win
  end-to-end.

## References

- Linux kernel docs: `Documentation/admin-guide/mm/userfaultfd.rst`
- `userfaultfd(2)`, `ioctl_userfaultfd(2)` man pages
- CRIU lazy-migration implementation:
  [github.com/checkpoint-restore/criu](https://github.com/checkpoint-restore/criu)
  (especially `criu/lib/uffd.c`)
- Firecracker UFFD restore support:
  [github.com/firecracker-microvm/firecracker](https://github.com/firecracker-microvm/firecracker)
  (`src/vmm/src/persist.rs`)
- "Live Migration of Virtual Machines" — Clark et al., NSDI 2005
  (the original pre-copy paper, for the alternative-design comparison)
- forkd v0.3.4 ext4 fix retrospective:
  [`bench/pause-window/PROBE-multi-branch-anomaly.md`](./bench/pause-window/PROBE-multi-branch-anomaly.md)
- Tracking issue:
  [#101](https://github.com/deeplethe/forkd/issues/101)