# Pause-window: v0.3 phase 1 results (diff snapshots)

**Status:** Phases 1a (primitive + sidecar measurement), 1b (real
`"diff": true` BRANCH path), 1c (agent-workload threshold), and 1d
(multi-BRANCH via previous-output chain) all landed. Phase 1d ships
in v0.3.1; v0.3.0 had the diff path restricted to first-BRANCH-only.

## Headlines

- **Idle source, 4 GiB SSD: pause 29 s → 205 ms = 143 ×.** Best
  case, included for comparability with prior art (CodeSandbox 2 s
  clone demo etc.). Phase 1b sweep.
- **Typical agent workload (2 GiB source, 30-300 MiB dirty):
  6-15 × pause reduction.** What you'll actually see in production
  fan-out. Phase 1c sweep.
- **Crossover at ~50-65 % source RAM dirty.** Above that, Full and
  Diff converge; pick Full. Phase 1c sweep.

## Phase 1b: 5-size pause sweep (idle source)

The phase 1b real-mode A/B (5 memory sizes × 3 trials × 2 modes ×
2 backends = 60 trials):

| Source memory | SSD Full | **SSD Diff** | SSD speedup | tmpfs Full | tmpfs Diff | tmpfs speedup |
|---:|---:|---:|---:|---:|---:|---:|
| 256 MiB | 1807 ms | 241 ms | 7.5 × | 172 ms | 200 ms | 0.86 × |
| 512 MiB | 3414 ms | 226 ms | 15.1 × | 178 ms | 149 ms | 1.2 × |
| 1024 MiB | 6902 ms | 229 ms | 30.1 × | 324 ms | 194 ms | 1.7 × |
| 2048 MiB | 14508 ms | 222 ms | 65.4 × | 630 ms | 199 ms | 3.2 × |
| 4096 MiB | **29322 ms** | **205 ms** | **143 ×** | 1190 ms | 190 ms | 6.3 × |

Source pause-window is now essentially **constant at ~200 ms regardless
of source memory size**, because Diff's only cost is the
control-plane round-trip plus the small write of the dirty pages
(~900 KB for an idle source). Full pause scales linearly with memory
× storage bandwidth.

Caveats up front (details below):
- These are **idle-source** numbers (3 s settle). Real workloads with
  larger dirty footprints see proportionally smaller wins.
- Diff mode is **restricted to first BRANCH per sandbox** in v0.3.0
  (Firecracker's dirty bitmap is cleared on every snapshot). Multi-
  BRANCH support needs a per-sandbox shadow file, deferred.
- 256 MiB on tmpfs is a wash — diff's control-plane floor exceeds
  a fast-storage memcpy. Use Full for small-memory + fast-storage.
- Total BRANCH API latency is unchanged on SSD (the memory.bin copy
  still runs ~30 s in the background). Only **source downtime**
  shrinks. Right trade-off for live BRANCH from a running agent;
  wash for create-then-BRANCH-once.

## Phase 1a: the primitive in isolation

forkd v0.2 BRANCHes a running source by pausing it, writing the full
`memory.bin` to disk, and resuming. The pause is bandwidth-bound on
the snapshot-write step: 4.26 s ± 0.41 s on SATA SSD for a 513 MiB
source, scaling linearly with source RAM
([`RESULTS-v0.2.md`](./RESULTS-v0.2.md)).

v0.3 phase 1 swaps that for Firecracker's **Diff snapshot** mode,
which writes only the pages dirtied since the previous snapshot (or
since restore). Phase 1a took a Diff alongside the existing Full to
measure its cost in isolation — the numbers below predicted what
phase 1b's real diff-mode BRANCH would deliver. The phase 1b table
above is the actual user-visible cost; the phase 1a table here is the
underlying primitive cost.

Phase 1a numbers, idle source, 3 trials per cell:

| Source memory | SSD Full mean | SSD Diff mean | **SSD speedup** | tmpfs Full mean | tmpfs Diff mean | **tmpfs speedup** |
|---:|---:|---:|---:|---:|---:|---:|
| 256 MiB | 2198 ms | 267 ms | **8.2 ×** | 317 ms | 225 ms | 1.4 × |
| 512 MiB | 4053 ms | 233 ms | **17.4 ×** | 362 ms | 209 ms | 1.7 × |
| 1024 MiB | 7654 ms | 267 ms | **28.7 ×** | 539 ms | 236 ms | 2.3 × |
| 2048 MiB | 14993 ms | 242 ms | **62.0 ×** | 1097 ms | 223 ms | 4.9 × |
| 4096 MiB | 30414 ms | 239 ms | **127.3 ×** | 1394 ms | 268 ms | 5.2 × |

Raw data: [`diff-sweep-ssd.csv`](./diff-sweep-ssd.csv) and
[`diff-sweep-tmpfs.csv`](./diff-sweep-tmpfs.csv). 3 trials per cell;
SETTLE_SECS=3 between source spawn and BRANCH.

## What you're seeing

**Diff time is roughly constant** because the source is idle. The
dirty footprint reported in `diff_physical_bytes` is ~900 KiB across
all sizes — that's Linux kernel runtime overhead (init, timekeeping,
internal allocator activity) accumulating over 3 s. **The
diff-to-logical compression ratio drops from 0.34 % at 256 MiB to
0.02 % at 4 GiB**: the bigger the source, the smaller the fraction
of its memory the dirty bitmap covers.

**Full time scales linearly with memory** because writing the full
memory.bin is bandwidth-bound. The SSD column tracks 148 MB/s fsync
throughput (matches the `dd conv=fsync` floor measured in
`RESULTS-v0.2.md`). The tmpfs column tracks ~3 GB/s memcpy bandwidth.

**Diff floor is ~200-270 ms** even at 256 MiB — that's the
control-plane cost (PUT /snapshot/create round-trip, vCPU state
harvest, sparse file write of the tiny dirty pages). This floor
doesn't shrink with source memory.

## The caveat that matters

These numbers are the **best case**. Idle-source diffs are tiny, so
Diff timing approaches the control-plane floor. **Real fan-out
workloads — agents that have been running for 30 s and dirtied
maybe 100 MB of working set — will see proportionally smaller
speedups**, because the diff write itself becomes the bottleneck
again.

Back-of-envelope for 100 MB dirty footprint on SSD:
- Diff cost ≈ control-plane (~200 ms) + write 100 MB / 148 MB/s
  ≈ 200 + 676 = ~880 ms.
- Full cost (4 GiB source) ≈ 30 s.
- Speedup: ~34 ×.

Still a huge win for fan-out, but not the **127 ×** the idle bench
shows. Phase 1b's measurement will inject a real workload (an agent
allocating and touching a buffer between BRANCHes) and re-measure.

## When does Diff *not* help?

- **First BRANCH on a long-running source.** Firecracker's dirty
  bitmap starts populated at restore time — every page touched since
  the source booted from snapshot counts as dirty until the first
  snapshot clears it. A source that's been running for an hour can
  have a near-full dirty set on its first Diff, degrading to Full
  performance. Subsequent Diffs are fast (the bitmap was cleared).
- **Sources with high memory churn** (large workloads, ML inference
  with KV-cache turnover, browsers under heavy use). Dirty footprint
  per BRANCH approaches full memory, so Diff loses its advantage.
- **One-shot BRANCH** (create source, BRANCH once, discard). The
  Full path is one operation; Diff requires keeping a base around
  for the merge. Phase 1b's shadow-file machinery is amortized
  across multiple BRANCHes, not a one-shot win.

## Phase 1b: real diff-mode BRANCH (`"diff": true`)

The phase 1a numbers above used the `measure_diff` sidecar — they
measure how long a Diff snapshot WOULD take, while the user still
paid the Full pause. Phase 1b ships the actual diff-mode BRANCH:
`POST /v1/sandboxes/:id/branch` with `"diff": true` parallelizes the
source-tag memory.bin copy with the source running, takes a Diff
snapshot during pause, resumes the source, and merges the diff onto
the (already-copied) snapshot output. **The pause-window is the Diff
window — nothing else.**

15 trials per backend (5 sizes × 3 trials) per mode (Full vs Diff)
on fresh sources. Phase 1b restricts diff BRANCH to the first BRANCH
per sandbox (Firecracker clears the dirty bitmap on every
snapshot/create, so a second Diff would miss pages dirtied before
BRANCH 1 — see "First-BRANCH-only restriction" in the design doc).

### User-visible pause_ms — Full vs Diff (n=3 per cell)

| Source memory | SSD Full | SSD Diff | **SSD speedup** | tmpfs Full | tmpfs Diff | **tmpfs speedup** |
|---:|---:|---:|---:|---:|---:|---:|
| 256 MiB | 1807 ms | 241 ms | **7.5 ×** | 172 ms | 200 ms | 0.86 × |
| 512 MiB | 3414 ms | 226 ms | **15.1 ×** | 178 ms | 149 ms | 1.2 × |
| 1024 MiB | 6902 ms | 229 ms | **30.1 ×** | 324 ms | 194 ms | 1.7 × |
| 2048 MiB | 14508 ms | 222 ms | **65.4 ×** | 630 ms | 199 ms | 3.2 × |
| 4096 MiB | 29322 ms | 205 ms | **143 ×** | 1190 ms | 190 ms | **6.3 ×** |

Raw data: [`diff-real-sweep-ssd.csv`](./diff-real-sweep-ssd.csv) and
[`diff-real-sweep-tmpfs.csv`](./diff-real-sweep-tmpfs.csv). Sweep
script: [`sweep-diff-real.sh`](./sweep-diff-real.sh).

### What changed vs phase 1a

The phase 1a numbers were the THEORETICAL diff cost (the Diff sidecar
inside the still-Full pause window). Phase 1b's numbers are the
ACTUAL pause cost the user experiences with `"diff": true`. They
match phase 1a's projections within measurement noise:

- 4 GiB SSD phase 1a: 239 ms diff. Phase 1b: 205 ms pause. Match.
- 4 GiB tmpfs phase 1a: 268 ms diff. Phase 1b: 190 ms pause. Match.

The match confirms the architecture works: source pauses for the
diff window, then resumes; the cp + apply_diff happens off the
critical path.

### What 256 MiB tmpfs is telling us

The tmpfs 256 MiB cell shows diff (200 ms) being SLOWER than full
(172 ms). At small memory + fast storage, Firecracker's control-plane
floor for taking a Diff snapshot (~190 ms — call setup, sparse-file
allocation, vCPU state harvest) exceeds the cost of just memcpy'ing
256 MiB to tmpfs. **Diff is the wrong tool when source memory is
small AND the storage backend is fast.** Recommendation: leave the
default at Full; opt into Diff via the request body when source is
≥512 MiB and snapshot_root is on real disk.

### Where the time actually goes in diff mode

For 4 GiB SSD diff mode, the user sees `pause_ms = 205`. The
breakdown:

- Source pause window: 205 ms (this is `pause_ms`).
- Background memory.bin copy: ~30 s (runs in parallel with source).
- Post-resume apply_diff merge: ~10 ms (962 KB of diff data onto the
  pre-copied 4 GiB base).
- Total BRANCH wall-clock (sandbox-create returns to caller): ~30 s,
  bottlenecked by the copy.

**Source downtime drops 143 ×; total BRANCH API latency is unchanged.**
That's the right trade-off for forkd's killer use case (live BRANCH
from a long-running agent where TCP connections and timers matter)
and a wash for create-then-BRANCH-once-and-discard (where total time
is what matters).

## Phase 1c: agent-workload threshold — where does Diff stop winning?

The phase 1a/1b numbers above are **idle-source best case** (3 s
settle, ~12-15 MiB dirty footprint coming from kernel init + runtime
overhead). A real fan-out workflow has the source running for some
time before BRANCH, dirtying more memory. At some dirty-page
threshold Diff's write cost catches up with Full's write cost and the
speedup collapses. **Phase 1c finds that threshold.**

Experiment: a guest-internal workload (`dirtier.py`) allocates
`--dirty-mib N` MiB as a `bytearray` and writes one non-zero byte
per 4 KiB page — exactly setting N MiB of KVM dirty bits. The
orchestrator (`sweep-agent.sh`) execs it, polls for a marker on
stdout, then BRANCHes. 3 trials per cell on a `mem-2048` source,
SATA SSD snapshot_root. Raw data:
[`agent-sweep-ssd.csv`](./agent-sweep-ssd.csv).

### Pause vs dirty footprint (mem-2048 SSD, mean ms, n=3)

| Dirty (MiB) | Full pause | Diff pause | **Speedup** | Measured diff size |
|---:|---:|---:|---:|---:|
| 0 (idle) | 13746 | 594 | **23.1 ×** | 12.2 MiB |
| 10 | 15207 | 673 | 22.6 × | 22.2 MiB |
| 50 | 13734 | 921 | 14.9 × | 62.8 MiB |
| 100 | 13803 | 1253 | 11.0 × | 113.7 MiB |
| 250 | 14527 | 2398 | **6.1 ×** | 266.6 MiB |
| 500 | 14090 | 5728 | 2.5 × | 521.0 MiB |
| 1000 | 14403 | 10708 | **1.3 ×** | 1029.5 MiB |

### Reading the curve

- **Full pause is flat** at ~14 s. 2048 MiB / 148 MB/s SATA fsync
  bandwidth = 13.8 s, matches the measurement. Full always writes
  every page regardless of dirty state.
- **Diff pause scales linearly with dirty footprint.** Slope is
  ~10 ms per dirtied MiB, exactly the SSD write bandwidth plus a
  ~500 ms control-plane floor (call round-trip + vCPU state harvest).
  Linear regression: `diff_ms ≈ 500 + 10.2 × dirty_mib`.
- **Crossover at ~1 GiB dirty** on this 2 GiB source — Diff catches
  Full when dirty footprint ≈ 65 % of source memory. Above that,
  Full is faster (no extra control-plane round-trip).
- **Diff_physical_bytes ≈ dirty_mib + 12 MiB** of fixed overhead
  (Python interpreter, dirtier process, kernel runtime activity
  during the dirty loop). Predictable enough to budget for.

### Practical guidance

| Workload | Dirty MiB | Recommend |
|---|---:|---|
| Just-spawned source, BRANCH immediately | <30 | **Diff** (15-23 ×) |
| Short agent run (few ReAct steps, 5-30 s) | 30-100 | **Diff** (11-15 ×) |
| Medium agent run (multi-minute, modest state) | 100-300 | **Diff** (6-11 ×) |
| Heavy agent run (many minutes, large buffers) | 300-700 | Diff (2-6 ×; still wins) |
| Memory-saturating workload | >700 (>35 % of source) | **Full** is comparable or faster |

The thresholds shift by source size: a 4 GiB source crosses over at
~2.6 GiB dirty; a 512 MiB source at ~330 MiB. Rule of thumb: **opt
into Diff whenever you expect dirty footprint to be <50 % of source
RAM at BRANCH time.** That covers essentially all realistic
fan-out scenarios where the source has been alive for seconds-to-
minutes, not hours.

### What this means for the 143× headline

The phase 1b 4 GiB SSD 143× number was measured on a 3-second-idle
source (~900 KiB dirty). Phase 1c's curve says that's the **asymptote**,
not the typical experience. For the modal "spawn → run agent for
30 s → BRANCH" workflow, the realistic speedup is **10-25 ×** — still
a category change, but not 143 ×.

The honest framing: phase 1's win is "**source pause drops by 10-25 ×
for typical agent workloads, up to 143 × for idle sources, declining
to 1× as the source dirties >50 % of its RAM**." Diff is the right
default for fan-out; Full remains the right tool when you know the
source has churned through most of its memory.

### v0.3.0's first-BRANCH-only restriction — lifted in v0.3.1 (phase 1d)

Phase 1b (v0.3.0) restricted diff mode to a sandbox's first BRANCH.
Firecracker clears the dirty bitmap on every snapshot/create, so:

- BRANCH 1 (Full or Diff): dirty bitmap cleared.
- BRANCH 2 (Diff): dirty bitmap captures only pages dirtied between
  BRANCH 1 and BRANCH 2 — applying that to source_tag/memory.bin
  (boot state) loses everything dirtied between restore and
  BRANCH 1.

**Phase 1d (v0.3.1) lifts this** without a separate per-sandbox
shadow file. The insight: each BRANCH's output (`snap_dir/memory.bin`)
is, by construction, source's state at that BRANCH's pause time —
exactly the base the next diff needs. The daemon tracks
`SandboxInfo.last_branch_memory_path` and uses it as the cp source
on the next diff BRANCH (falling back to source_tag/memory.bin with
a logged warning if the user has deleted the intermediate snapshot).

See [`docs/design/diff-snapshots.md`](../../docs/design/diff-snapshots.md)
§ "Multi-BRANCH diff: the previous-output chain (phase 1d)".

## Phase 1d: multi-BRANCH diff — N consecutive BRANCHes on the same sandbox

The phase 1d ship lifts the v0.3.0 single-BRANCH restriction.
Verification: 3 trials × 5 consecutive `diff: true` BRANCHes per
sandbox, mem-2048 SSD, 3 s gap between BRANCHes. Raw data:
[`multi-branch-sweep.csv`](./multi-branch-sweep.csv).

### pause_ms and diff size per BRANCH (mean of 3 trials)

| BRANCH | pause_ms | diff_physical_bytes |
|---:|---:|---:|
| 1 | 288 | 1.16 MB |
| 2 | 263 | 0.53 MB |
| 3 | 1321 | 0.39 MB |
| 4 | 1389 | 0.55 MB |
| 5 | 1446 | 0.41 MB |

### What's confirmed

- **All 5 BRANCHes succeed.** In v0.3.0 the second BRANCH with
  `diff: true` would have 400'd. The previous-output chain handles
  correctness.
- **Diff sizes are small** (0.4–1.2 MB per BRANCH) — Firecracker's
  per-snapshot bitmap clear is correctly captured by the chain;
  each BRANCH's diff covers only "since last BRANCH," not since
  restore.
- **Aggregate downtime: 14×** vs Full. 5 × 14 s = 70 s of source
  pause if these had been Full BRANCHes; multi-BRANCH diff totals
  ~4.7 s of pause across the same 5 BRANCHes.

### What was anomalous (RESOLVED in v0.3.4)

BRANCH 1-2 pause was ~280 ms; BRANCH 3-5 jumped to ~1.3-1.5 s on the
same source. After 5 rounds of probing
([`PROBE-multi-branch-anomaly.md`](./PROBE-multi-branch-anomaly.md))
the root cause turned out to be **ext4** — delayed allocation +
writeback throttle (`wbt_wait`) + multi-block allocator + block-bitmap
checksumming, all compounding per BRANCH as each 500 MiB+ memory.bin
write triggered increasing ext4 metadata work.

**Fixed in v0.3.4** by `posix_fallocate`-ing the destination memory.bin
to its full size before either the diff-mode background copy or
Firecracker's `/snapshot/create` writes to it (PR #152). Measured on
the same source / hardware / 10-BRANCH sweep:

| BRANCH | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| before | 350 | 250 | 1300 | 1400 | 1500 | 2700 | 1500 | 1800 | 2700 | 1500 |
| after  | 585 | 286 | 344 |  161 |  369 |  153 |  189 |  162 |  324 |  174 |

BRANCH 6 from 2700 → 153 ms = **17.6×**. Median BRANCH 3-10 from
~1700 → ~200 ms ≈ **8.5×**. The post-fix curve matches a tmpfs
control to within noise, confirming the fix neutralizes the ext4
metadata overhead.

![v0.3.4 before-after: multi-BRANCH pause flattened](./v0.3.4-before-after.png)

#146 closed.

The first-BRANCH-only restriction is gone in v0.3.1.

## Methodology notes

- 5 source memory sizes: 256 / 512 / 1024 / 2048 / 4096 MiB. Built
  via `forkd snapshot --mem-size-mib N --tag mem-N ...` from the
  `langgraph-react` rootfs (Python 3.12 + requests).
- Daemon spawned with `enable_diff_snapshots: true` baked into
  `forkd_vmm::ForkOpts` for daemon-path sources — required by
  Firecracker for the resulting VM to admit Diff `/snapshot/create`
  calls.
- 3 trials per (memory, backend) cell. SETTLE_SECS=3.
- SSD: `--snapshot-root ~/.local/share/forkd/snapshots` on an
  Ubuntu 24.04 host's root filesystem (148 MB/s fsync).
- tmpfs: `--snapshot-root /dev/shm/forkd-snapshots` after copying the
  5 source snapshots into `/dev/shm`.
- Phase 1a sweep script:
  [`sweep-diff.sh`](./sweep-diff.sh) — measure_diff sidecar on top
  of Full BRANCHes.
- Phase 1b sweep script:
  [`sweep-diff-real.sh`](./sweep-diff-real.sh) — `"diff": true` A/B
  against `"diff": false`. Each trial is a fresh source.

## See also

- [`RESULTS-v0.2.md`](./RESULTS-v0.2.md) — v0.2 baseline + prewarm fix.
- [`docs/design/diff-snapshots.md`](../../docs/design/diff-snapshots.md)
  — the phase 1 design.
- [`ROADMAP.md`](../../docs/ROADMAP.md) § "Cut pause-window without
  forking Firecracker" — the v0.3 plan this measurement is the first
  data point of.