---
name: dev-benchmark
description: Capsem benchmarking with capsem-bench. Use when running benchmarks, adding new benchmark categories, interpreting results, or investigating performance regressions. Covers all 7 benchmark categories (disk, rootfs, startup, http, throughput, snapshot, all), the JSON output format, and how to add new benchmarks.
---

# Benchmarking

## Quick start

```bash
just bench                        # Run all benchmarks in VM (~2 min)
just run "capsem-bench snapshot"  # Snapshot benchmarks only
just run "capsem-bench disk"      # Disk I/O only
just test                         # Full validation including benchmarks
```

## capsem-bench

Python tool that runs inside the VM. Rich tables go to stderr (human-readable); structured JSON is saved to `/tmp/capsem-benchmark.json` (machine-readable).

**Location:** `guest/artifacts/capsem_bench/` (Python package, invoked via the `capsem-bench` shell wrapper)

### Benchmark categories

| Category | Command | What it measures |
|----------|---------|------------------|
| disk | `capsem-bench disk` | Sequential/random I/O on scratch disk (write/read throughput, IOPS) |
| rootfs | `capsem-bench rootfs` | Read-only rootfs performance (sequential + random 4K reads) |
| startup | `capsem-bench startup` | Cold-start latency for python3, node, claude, gemini, codex |
| http | `capsem-bench http [URL] [N] [C]` | HTTP throughput through MITM proxy (requests/sec, latency percentiles) |
| throughput | `capsem-bench throughput` | 100MB download through MITM proxy (end-to-end MB/s) |
| snapshot | `capsem-bench snapshot` | Snapshot create/list/changes/revert/delete via MCP (ms per op at 10/100/500 files) |
| all | `capsem-bench` | All of the above |

### Snapshot benchmarks

Tests the full MCP snapshot pipeline end-to-end (guest CLI -> MCP server -> vsock -> host gateway -> filesystem). Measures at 3 workspace sizes (10, 100, 500 files):

- **create**: Populate workspace, create named snapshot via MCP
- **list**: List all snapshots with change diffs
- **changes**: List changed files since checkpoint
- **revert**: Revert a single file from snapshot
- **delete**: Delete the snapshot

Key metric: per-operation latency in ms. Regressions in `create` usually mean the clone or hash stage got slower. Use `RUST_LOG=capsem=debug` to see the per-stage breakdown (clone_ws_ms, clone_sys_ms, hash_ms).

### JSON output format

```json
{
  "version": "0.3.0",
  "timestamp": 1711561234.5,
  "hostname": "capsem",
  "disk": { "seq_write_mbps": 450, ... },
  "rootfs": { ... },
  "startup": { "python3": { "min_ms": 45, "mean_ms": 48, "max_ms": 52 }, ... },
  "http": { "rps": 120, "p50_ms": 42, ... },
  "throughput": { "throughput_mbps": 85, ... },
  "snapshot": {
    "10_files": { "create_ms": 120, "list_ms": 50, ... },
    "100_files": { "create_ms": 250, ... },
    "500_files": { "create_ms": 800, ... }
  }
}
```
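To compare two runs, the JSON can be diffed directly. Below is a minimal sketch, assuming a known-good run has been copied out of the VM as `baseline.json`; the `flatten`/`compare` helpers, the filename, and the 10% threshold are hypothetical conveniences, not part of capsem-bench.

```python
import json


def flatten(d: dict, prefix: str = "") -> dict:
    """Flatten nested result dicts into dotted keys, keeping numeric leaves only."""
    out = {}
    for key, value in d.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, name + "."))
        elif isinstance(value, (int, float)) and key != "timestamp":
            out[name] = float(value)
    return out


def compare(current_path: str, baseline_path: str, threshold_pct: float = 10.0) -> None:
    """Print every metric that moved more than threshold_pct relative to the baseline."""
    with open(current_path) as f:
        current = flatten(json.load(f))
    with open(baseline_path) as f:
        baseline = flatten(json.load(f))
    for name, value in sorted(current.items()):
        ref = baseline.get(name)
        if ref:
            delta = (value - ref) / ref * 100
            if abs(delta) >= threshold_pct:
                print(f"{name}: {ref:g} -> {value:g} ({delta:+.1f}%)")


compare("/tmp/capsem-benchmark.json", "baseline.json")
```

Whether a delta is good or bad depends on the metric (higher throughput is better, lower latency is better), so read the output against the category tables above.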
### Environment variables

- `CAPSEM_BENCH_DIR`: Test directory for disk benchmarks (default: `/root`)
- `CAPSEM_BENCH_SIZE_MB`: Write test size in MB (default: 256)

## Investigating slowness

### Snapshot performance

1. Run snapshot benchmark: `just run "capsem-bench snapshot"`
2. Check per-stage timing: `RUST_LOG=capsem=debug just run "capsem-bench snapshot"` -- look for `snapshot_into_slot timing` log lines showing `clone_ws_ms`, `clone_sys_ms`, `hash_ms`
3. Check session data: `just inspect-session` -- MCP tool usage section shows avg duration per snapshot operation
4. Query detailed durations: `just query-session "SELECT tool_name, duration_ms FROM mcp_calls WHERE tool_name LIKE 'snapshot%' ORDER BY duration_ms DESC LIMIT 20"`

Common causes:

- **clone_ws_ms high**: Large workspace, or APFS clonefile falling back to byte copy
- **hash_ms high**: Many files in workspace (walkdir overhead), or slow filesystem
- **compact slow**: Merging many snapshots with overlapping files

### Disk I/O regression

1. Run: `just run "capsem-bench disk"`
2. Compare sequential write/read throughput against baseline
3. Check if VirtioFS mode changed (block mode has different I/O characteristics)

### Adding a new benchmark

1. Create a new module in `guest/artifacts/capsem_bench/` (e.g., `mytest.py`) with a `mytest_bench()` function that returns a dict and prints a Rich table (see the sketch after this list)
2. Add the mode name to `VALID_MODES` in `__main__.py`
3. Wire it into `main()` with the `if mode in ("name", "all"):` pattern (lazy import)
4. Update this skill and the benchmarking doc page
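A minimal sketch of step 1's module, assuming the existing modules follow the return-a-dict / print-a-Rich-table convention described above; the placeholder workload, the `elapsed_ms` metric name, and the wiring comments are illustrative only, so check an existing module for the exact shape `main()` expects.

```python
# guest/artifacts/capsem_bench/mytest.py -- hypothetical new module
import time

from rich.console import Console
from rich.table import Table


def mytest_bench() -> dict:
    """Illustrative benchmark: time a placeholder workload and report it."""
    start = time.monotonic()
    sum(range(1_000_000))  # placeholder workload; replace with the real operation
    elapsed_ms = (time.monotonic() - start) * 1000
    results = {"elapsed_ms": round(elapsed_ms, 2)}

    table = Table(title="mytest")
    table.add_column("Metric")
    table.add_column("Value", justify="right")
    table.add_row("elapsed_ms", f"{results['elapsed_ms']:.2f}")
    Console(stderr=True).print(table)  # human-readable output goes to stderr

    return results  # numeric results feed the JSON report


# Wiring sketch for __main__.py (exact aggregation may differ -- mirror an existing mode):
#   VALID_MODES = (..., "mytest")
#   if mode in ("mytest", "all"):
#       from .mytest import mytest_bench  # lazy import, matching the existing pattern
#       results["mytest"] = mytest_bench()
```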
## Host-side lifecycle benchmark

Profiles individual VM lifecycle operations from the host. Runs outside the guest via pytest, not via `capsem-bench`.

```bash
uv run pytest tests/capsem-serial/test_lifecycle_benchmark.py -xvs
```

**Location:** `tests/capsem-serial/test_lifecycle_benchmark.py`

### Operations measured

| Operation | What it times |
|-----------|---------------|
| provision | HTTP POST `/provision` to service (VM creation + process spawn) |
| exec_ready | First `echo ready` exec succeeds (VM boot + vsock handshake) |
| exec | Simple `echo ok` on a running VM |
| delete | HTTP DELETE `/delete/{name}` (VM teardown + cleanup) |

### Output

- Per-run breakdown printed to stdout
- Summary table with min/mean/max per operation
- JSON saved to `benchmarks/lifecycle/data_{version}.json` (committed to git for historical tracking)

### Regression gates

Every operation must complete in under 1.2 seconds. The test runs 3 cycles and asserts that each individual operation stays under the gate.

## Host-side fork benchmark

Profiles fork (image creation) and boot-from-image. Same test file, separate test function.

```bash
uv run pytest tests/capsem-serial/test_lifecycle_benchmark.py::test_fork_benchmark -xvs
```

### Operations measured

| Metric | What it measures | Gate |
|--------|------------------|------|
| fork | `POST /fork/{id}` -- APFS clonefile of rootfs overlay + workspace | < 500ms |
| image_size | Actual disk usage of forked image (blocks, not logical size) | < 12MB |
| boot_provision | `POST /provision` with `image` param -- clone image into new session | < 1200ms |
| boot_ready | First exec succeeds on the image-booted VM | < 1200ms |
| pkg_survived | Packages installed via apt survive fork (rootfs overlay) | must pass |
| ws_survived | Files written to /root/ survive fork (VirtioFS workspace) | must pass |

### Output

- Per-run breakdown with timing + survival status
- Summary table with min/mean/max + gate thresholds
- JSON saved to `benchmarks/fork/data_{version}.json` (committed to git for historical tracking)

### When to run (fork)

- After changes to fork/image code (`capsem-core/src/image.rs`)
- After changes to VirtioFS session layout (`capsem-core/src/lib.rs`)
- After changes to disk usage reporting (`session/maintenance.rs`)
- After changes to boot-from-image path in `capsem-service` or `capsem-process`
- Before cutting a release

### When to run (lifecycle)

- After changes to boot path (`capsem-process`, `capsem-init`, `capsem-core/vm/boot.rs`)
- After changes to VM teardown / delete path
- After changes to the service daemon (`capsem-service`)
- Before cutting a release

## Tests

- In-VM benchmark test: `just run "capsem-bench all"`
- In-VM availability: `test_utilities.py::test_utility_available[capsem-bench]`
- Host-side lifecycle: `uv run pytest tests/capsem-serial/test_lifecycle_benchmark.py::test_lifecycle_benchmark -xvs`
- Host-side fork: `uv run pytest tests/capsem-serial/test_lifecycle_benchmark.py::test_fork_benchmark -xvs`
- Both host-side: `uv run pytest tests/capsem-serial/test_lifecycle_benchmark.py -xvs`
- Full run: `just bench` or `just test`

## Benchmark data directory

Host-side benchmarks save versioned JSON to `benchmarks/` (committed to git):

```
benchmarks/
  fork/data_0.16.1.json       # Fork speed, image size, data survival
  lifecycle/data_0.16.1.json  # Provision, exec-ready, exec, delete
```

These data files feed the documentation benchmark page at `docs/src/content/docs/benchmarks/results.md`. Before a release, run both benchmarks and update the results page with the new numbers. See `/release-process` for the full checklist.
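When refreshing the results page, a small throwaway script can dump the newest data file per category so the numbers are easy to copy over. A sketch follows; the `latest_data_files` helper is hypothetical (not part of the repo) and picks "latest" by modification time rather than by parsing the version out of the filename.

```python
import json
from pathlib import Path


def latest_data_files(root: str = "benchmarks") -> dict:
    """Return the most recently written data_*.json per category (fork, lifecycle)."""
    latest = {}
    for path in Path(root).glob("*/data_*.json"):
        category = path.parent.name
        if category not in latest or path.stat().st_mtime > latest[category].stat().st_mtime:
            latest[category] = path
    return latest


for category, path in sorted(latest_data_files().items()):
    print(f"== {category}: {path} ==")
    print(json.dumps(json.loads(path.read_text()), indent=2))
```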