---
name: dev-testing
description: Capsem testing policy and workflow. Use whenever running tests, writing new tests, or verifying changes work. Covers the three test tiers (unit, smoke, full), TDD red-green-refactor, adversarial security testing, coverage policy, and the mandatory end-to-end VM validation. For VM-specific tests see dev-testing-vm, for hypervisor tests see dev-testing-hypervisor, for frontend tests see dev-testing-frontend.
---

# Testing

## Test tiers

Three tiers, fast to thorough. Every change must pass all three before it ships.

| Command | What | VM? |
|---------|------|-----|
| `just test` | Everything: unit tests (llvm-cov, warnings-as-errors for service crates) + cross-compile + frontend + all Python integration tests + injection + benchmarks | Yes |
| `just smoke` | Quick end-to-end: repack + sign + boot + capsem-doctor + MCP + service integration (~30s) | Yes |

`just test` is the single source of truth. There is no "fast" tier that skips integration tests -- that's how the "Connection refused" bug shipped while tests said green. Individual `test-*` recipes exist for targeted debugging, but `just test` is the gate.

## TDD workflow

Write tests first:

1. Write failing tests that capture expected behavior
2. Verify they fail for the right reason
3. Write minimal implementation to pass them
4. Refactor

Without a failing test first, it's easy to write tests that pass by accident or don't actually verify the behavior you intended.

## Parallel tests as dogfooding (n=4 is non-negotiable)

`just test` runs the Python suite under `pytest -n 4 --dist=loadfile`. Four real VMs boot simultaneously.

**This is the canary, not just a speed-up.** We ship Capsem as a multi-VM sandbox for AI agents -- if our own test suite cannot safely boot 4 concurrent VMs, real users running an agent farm will hit the exact same bug. Treat any concurrency flake as a Capsem-side bug, not a test-tuning problem:

- "Suspend timed out" under load -> service IPC handling is racy, not "bump the timeout"
- "Session did not become ready" -> Apple VZ resource serialization, VirtioFS lock contention, or the service mishandling concurrent provisions; investigate, don't suppress
- Two tests both want the same VM name -> name-collision bug in `validate_vm_name` / registry, not "isolate test names better"
- Stale socket between tests -> the service didn't reap a child cleanly, a real production bug

Anti-patterns when a test flakes under `-n 4`:

- Adding `time.sleep()` to "let things settle" -- masking a race
- Bumping the per-test timeout -- buying time for a real bug to manifest in prod instead of CI
- Marking the test `serial` so it runs alone -- defeating the dogfooding signal

The host has plenty of headroom (48 GB RAM, 14 cores; 4 VMs at 2 GB / 2 CPU each = 8 GB / 8 cores). If concurrency surfaces a flake, fix the product, then re-run. Bumping `-n` higher (8, 12) is the natural follow-on once n=4 is stable -- real users will run more.

### Orphan processes across runs are a product bug (not a test bug)

If a previous `just test -n 4` run was interrupted (ctrl-C, pytest-xdist worker death, host crash) and the NEXT run flakes with "vm-ready never asserted", UDS "connection refused", or mysterious HTTP 500s, the cause is companion processes from the interrupted run still alive under PID 1. `pkill -f "target/debug/capsem-(service|process|gateway|tray|mcp)"` will make the flake vanish, but that is cleanup after the fact.

The fix is on the COMPANION side: every spawned companion (gateway, tray, and any new one) must use `capsem-guard::install(parent_pid, lock_path)` to enforce (a) refuse-standalone, (b) singleton, (c) self-exit on parent death. See `/dev-rust-patterns` lesson 18. Regression tests live in `tests/capsem-service/test_companion_lifecycle.py` -- never remove them; when adding a new companion, extend that file.
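A hedged sketch of the shape such a regression test might take. The binary path matches the pkill pattern used in this section; the bare no-argument invocation and the expected exit behavior are assumptions for illustration:

```python
import subprocess

def test_gateway_refuses_standalone_launch():
    # capsem-guard's refuse-standalone guarantee: a companion started without
    # a live parent capsem-service must exit with an error instead of serving.
    # If the guard were missing, the process would serve forever and this
    # subprocess.run would raise TimeoutExpired instead of returning.
    proc = subprocess.run(
        ["target/debug/capsem-gateway"],
        capture_output=True,
        timeout=10,
    )
    assert proc.returncode != 0, "companion must not run standalone"
```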
**Never `pkill -f capsem-` with a broad pattern** during test debugging: `capsem-` matches `--crate-name capsem-core` in running rustc/cargo invocations and will SIGKILL the compiler mid-build. Use a binary-path pattern like `pkill -f "target/debug/capsem-(service|process|gateway|tray|mcp)"` instead.

### When `-n 1` is actually the right answer: multi-service-only gotchas

One narrow class of concurrency bug belongs at `-n 1`, not `-n 4`: **bugs that only exist when two `capsem-service` processes run on the same host**. Apple's Virtualization.framework does not tolerate overlapping `saveMachineStateToURL` / `restoreMachineStateFromURL` calls on sibling VMs, and we serialize with a per-service `tokio::sync::Mutex` (`ServiceState::save_restore_lock`). That lock is in-process, so it only serializes VMs inside one service. Production always has exactly one service per host per user, so the lock is sufficient in real deployments.

`tests/capsem-mcp/test_stress_suspend_resume.py` runs under pytest-xdist, which spawns one `capsem-service` per worker. At `-n 2+`, worker A's service can't see worker B's lock, and you re-expose the bug that never happens in production. This is the one case where the "n=4 dogfoods concurrency" rule doesn't apply -- the concurrency being tested would never happen outside the test harness. Keep this harness at `-n 1`. Full context and the failure signature live in `docs/src/content/docs/gotchas/concurrent-suspend-resume.md`.

This is NOT a blanket license to run any flaky test at `-n 1`. If you're tempted to demote another test, first ask: *"Would this failure occur in production with one capsem-service and N VMs?"* If yes, it belongs at `-n 4`; fix the product.

## Adversarial testing

Capsem is a security product. Every security-relevant feature needs tests that actively try to break invariants. Think like an attacker:

- Can a corp-blocked domain be snuck through another provider's list?
- Does an overlapping wildcard in allow+block always deny?
- Does malformed input (empty strings, unicode, huge payloads, invalid JSON) get rejected?
- Can path traversal escape the VirtioFS sandbox?
- Can a guest process modify its own binaries?

Stress-test boundary conditions. Write tests for the attacks you'd attempt yourself.
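To make this concrete, a hedged sketch of one such adversarial test against VM ID validation. The `service_client` fixture and the `/vms/` route shape are assumptions; the invariant itself (hostile IDs are rejected outright, never partially processed) comes from the invariants table in the next subsection:

```python
from urllib.parse import quote

import pytest

# Hostile VM IDs mirroring the attack list above: traversal, dots, spaces,
# null bytes, unicode, empty input. Illustrative, not exhaustive.
HOSTILE_VM_IDS = [
    "../../etc/passwd",
    "..",
    "vm name with spaces",
    "vm\x00null",
    "🦀" * 64,
    "",
]

@pytest.mark.parametrize("vm_id", HOSTILE_VM_IDS)
def test_vm_id_validation_rejects_hostile_input(service_client, vm_id):
    # service_client is a hypothetical fixture wrapping the service HTTP API.
    # Percent-encode so the hostile bytes reach the server's own validation
    # instead of being mangled or rejected by the HTTP client first.
    resp = service_client.get("/vms/" + quote(vm_id, safe=""))
    assert resp.status_code in (400, 404), f"accepted hostile id: {vm_id!r}"
```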
### Security invariants to verify in tests

When touching security-relevant code, check that these invariants have test coverage:

| Invariant | What to test | Where |
|-----------|--------------|-------|
| VirtioFS share is `guest/` only | `session_dir/guest/` exists, symlinks resolve, host-only files (`session.db`, `serial.log`) are outside the share | `capsem-core::lib::tests` |
| UDS sockets are 0600 | After bind, verify permissions exclude other users | `capsem-process` |
| Process env is cleared | `env_clear()` called, only allowlisted vars passed | `capsem-service` spawn tests |
| No `process::exit` on guest I/O | Control channel close causes loop break, not exit | `capsem-process` |
| Sensitive logs are 0600 | `serial.log` created with restricted permissions | `capsem-process` |
| Gateway auth on all routes | Every route except `GET /` returns 401 without token | `capsem-gateway::auth::tests` |
| Auth rate limiting | 429 after threshold, resets after window | `capsem-gateway::auth::tests` |
| CORS rejects external origins | Only localhost/127.0.0.1/tauri allowed | `capsem-gateway::tests` |
| Body size limit | 413 for >10MB payloads | `capsem-gateway::proxy::tests` |
| VM ID validation | Path traversal (`../`), dots, spaces, null bytes rejected | `capsem-gateway::terminal::tests` |
| Rootfs read-only | squashfs mounted ro, guest binaries 555 | `capsem-doctor` in-VM tests |
| Suspend reports errors | IPC failure and timeout both return 500, not silent success | `capsem-service` tests |

## Test fixture anti-pattern: masking races with polling

If all test fixtures wait/poll before asserting, the tests will never catch server-side race conditions. For every endpoint that talks to a VM socket, write at least one test that calls it IMMEDIATELY after provision (no `wait_exec_ready`, no `ready_vm` fixture). The server must handle readiness internally.

**Pattern to avoid** (masks the bug -- the server never needs wait logic because the client always waits):

```
fixture calls provision -> fixture polls wait_exec_ready -> test calls exec
```

**Required test pattern** (catches the bug -- if the server doesn't wait, the test fails; see the sketch at the end of this section):

```
test calls provision -> test immediately calls exec -> server handles wait
```

See `tests/capsem-service/test_svc_exec_ready.py` for the regression tests that enforce this.

### wait_exec_ready is a single call, not a loop

`wait_exec_ready` (in `tests/helpers/service.py`, `tests/helpers/mcp.py`, `tests/capsem-gateway/test_gw_e2e.py`) makes one exec call with the server-side timeout passed through. The server's `handle_exec` calls `wait_for_vm_ready` internally, which polls until the VM is ready. Do NOT add client-side retry loops -- that creates a double-wait where each retry can block for the full server timeout (30s client retries x 30s server wait = pathological cascade). One wait, one place.

### Exec latency regression gate

`tests/capsem-serial/test_boot_timing.py::test_exec_latency_under_1_5_seconds` asserts that provision-to-first-exec completes in under 1.5s. If this test fails, investigate boot time (the boot_timeline spans in process.log), not the wait mechanism.
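A minimal sketch of the required no-wait pattern referenced above. The `service_client` fixture and its `provision`/`exec`/`delete` helpers are hypothetical; the real regression tests live in `tests/capsem-service/test_svc_exec_ready.py`:

```python
def test_exec_immediately_after_provision(service_client):
    # No ready_vm fixture, no wait_exec_ready call -- the server must handle
    # readiness internally or this test fails.
    vm = service_client.provision(name="no-wait-check")
    try:
        # Exec with zero client-side waiting. If the server does not gate on
        # wait_for_vm_ready internally, this races the boot and fails.
        result = service_client.exec(vm["id"], ["true"])
        assert result["exit_code"] == 0
    finally:
        service_client.delete(vm["id"])
```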
## Where tests live

- **Rust unit: sibling `tests.rs` file, not an inline `mod tests { ... }` block.** See the next subsection.
- Rust integration: `crates/capsem-core/tests/`
- In-VM diagnostics: `guest/artifacts/diagnostics/test_*.py` (see dev-testing-vm)
- Hypervisor: KVM + Apple VZ tests (see dev-testing-hypervisor)
- Frontend: `frontend/src/lib/__tests__/` (see dev-testing-frontend)
- Python (builder): `tests/test_*.py`
- Python integration (service daemon): `tests/capsem-*/` directories, each with its own conftest.py and pytest marker

### Rust unit tests: sibling `tests.rs` pattern

**Every Rust module keeps its unit tests in a sibling `tests.rs`, not an inline `mod tests { ... }` block.** The parent module declares:

```rust
// foo.rs OR foo/mod.rs
// ... production code ...

#[cfg(test)]
mod tests;
```

and the tests go in `tests.rs` in the same directory:

```rust
// tests.rs -- sibling of foo.rs or child of foo/
use super::*;

#[test]
fn roundtrip() { ... }
```

**Why.** Inline `#[cfg(test)] mod tests { ... }` blocks are appended at the bottom of prod files and commonly grow to 50–99% of the file's line count. That means every Read, grep, and scroll to reach production code walks past thousands of test lines first. Several modules in this codebase hit 4,000+ lines that way before extraction. Agents and humans both read faster when prod code isn't buried.

**Mechanics.**

- `tests.rs` is a submodule of the parent file -- `use super::*;` works, private items are visible, and `#[cfg(test)]` on the `mod tests;` declaration still gates compilation.
- For files that don't yet have a sibling directory (e.g. `lib.rs`, `foo.rs`), put `tests.rs` next to them in the same `src/` directory.
- For files that are already `foo/mod.rs`, put `tests.rs` inside `foo/`.
- Attributes on the inline `mod tests` block (e.g. `#[allow(unused_imports)]`) move onto the declaration: `#[cfg(test)]\n#[allow(unused_imports)]\nmod tests;`.

**Extraction recipe** (for any remaining inline `mod tests { ... }`):

1. Move the block body (everything between the outer `{` and `}`) into a new sibling `tests.rs`.
2. Dedent one indentation level so the contents read as top-level items.
3. Replace the old inline block with `#[cfg(test)] mod tests;` (plus any attributes that were on the original).
4. `cargo test -p <crate>` -- should pass identically.

**When to push back.** If you see a new PR or agent output adding an inline `mod tests { ... }` block, request it be moved to `tests.rs` before merge. Exceptions are narrow: tiny helper modules under ~50 lines total where inline tests plus prod code fit on one screen, or a module that's already a test-only helper.
## Integration test suites

All Python integration tests live under `tests/capsem-*/` and use pytest markers. Each suite has a dedicated `just` recipe.

| Suite | Directory | Marker | VM? | What it tests |
|-------|-----------|--------|-----|---------------|
| Service API | `capsem-service/` | `integration` | Yes | HTTP endpoints: provision, list, info, exec, logs, file I/O, delete |
| CLI | `capsem-cli/` | `integration` | Yes | CLI subcommands via subprocess |
| MCP | `capsem-mcp/` | `mcp` | Yes | MCP server black-box (stdio, tool routing) |
| Session DB | `capsem-session/` | `session` | Yes | Telemetry: net/model/tool/mcp/fs/snapshot events |
| Snapshots | `capsem-snapshots/` | `snapshot` | Yes | Auto/manual snapshots, revert |
| Isolation | `capsem-isolation/` | `isolation` | Yes | Multi-VM filesystem + network isolation |
| Security | `capsem-security/` | `security` | Yes | Binary perms, codesigning, asset integrity, env blocklist |
| Config | `capsem-config/` | `config` | Yes | Limits, resource bounds, hot-reload |
| Bootstrap | `capsem-bootstrap/` | `bootstrap` | No | Setup flow, dev tools, asset checks |
| Stress | `capsem-stress/` | `stress` | Yes | 5 concurrent VMs, rapid create/delete |
| Build chain | `capsem-build-chain/` | `build_chain` | Yes | cargo build -> codesign -> pack -> manifest -> boot |
| Guest | `capsem-guest/` | `guest` | Yes | Network, services, filesystem, env inside guest |
| Cleanup | `capsem-cleanup/` | `cleanup` | Yes | Process killed, socket removed, session dir removed |
| Codesign | `capsem-codesign/` | `codesign` | No | All binaries signed, entitlements present (FAIL not skip) |
| Serial | `capsem-serial/` | `serial` | Yes | Console logs, boot timing < 30s |
| Session lifecycle | `capsem-session-lifecycle/` | `session_lifecycle` | Yes | DB exists, schema, events, survives shutdown |
| Config runtime | `capsem-config-runtime/` | `config_runtime` | Yes | CPU/RAM applied in guest, blocked domains |
| Recipes | `capsem-recipes/` | `recipe` | No | just run-service, just doctor, cargo build |
| Recovery | `capsem-recovery/` | `recovery` | Yes | Stale socket/instances, orphaned process, double service |
| Rootfs artifacts | `capsem-rootfs-artifacts/` | `rootfs` | No | Artifact files, build context, doctor consistency |
| Session exhaustive | `capsem-session-exhaustive/` | `session_exhaustive` | Yes | Per-table data validation, cross-table FK integrity |
| Install | `capsem-install/` | `install` | No | Native installer: layout, auto-launch, service install, setup wizard, update, uninstall, lifecycle, reinstall, error paths |

Composite recipes: `just test-vm` runs build-chain + guest + cleanup + codesign + serial + session-lifecycle + config-runtime + recovery. `just test-install` runs the install suite in Docker with systemd. `just test` runs everything.

## Test matrix: what runs where

### Rust crate CI coverage

| Crate | Tests | CI macOS | CI Linux | Smoke | Full |
|-------|------:|:--------:|:--------:|:-----:|:----:|
| capsem-core | ~1695 | Yes | Yes | No | Yes |
| capsem-agent | ~71 | Yes | No | No | Yes |
| capsem-logger | ~47 | Yes | Yes | No | Yes |
| capsem-proto | ~132 | Yes | Yes | No | Yes |
| capsem-gateway | ~38 | Yes | No | No | Yes |
| capsem-service | ~109 | Yes | Yes | No | Yes |
| capsem (CLI) | ~140 | Yes | Yes | No | Yes |
| capsem-mcp | ~67 | Yes | Yes | No | Yes |
| capsem-tray | ~47 | Yes | No | No | Yes |
| capsem-process | ~62 | Yes | No | No | Yes |
| capsem-app | ~35 | Check | No | No | Yes |
### Python integration suite tier map

| Suite | Marker | VM? | CI | Smoke | Full |
|-------|--------|:---:|:--:|:-----:|:----:|
| capsem-bootstrap | `bootstrap` | No | Run | No | Yes |
| capsem-codesign | `codesign` | No | Run | No | Yes |
| capsem-rootfs-artifacts | `rootfs` | No | Run | No | Yes |
| capsem-mcp | `mcp` | Yes | Collect | Yes | Yes |
| capsem-service | `integration` | Yes | Collect | Yes | Yes |
| capsem-cli | `integration` | Yes | Collect | Yes | Yes |
| capsem-gateway | `gateway` | Yes | Collect | Yes | Yes |
| capsem-e2e | `e2e` | Yes | Collect | No | Yes |
| capsem-session | `session` | Yes | Collect | No | Yes |
| capsem-session-lifecycle | `session_lifecycle` | Yes | Collect | No | Yes |
| capsem-session-exhaustive | `session_exhaustive` | Yes | Collect | No | Yes |
| capsem-security | `security` | Yes | Collect | No | Yes |
| capsem-isolation | `isolation` | Yes | Collect | No | Yes |
| capsem-snapshots | `snapshot` | Yes | Collect | No | Yes |
| capsem-config | `config` | Yes | Collect | No | Yes |
| capsem-config-runtime | `config_runtime` | Yes | Collect | No | Yes |
| capsem-guest | `guest` | Yes | Collect | No | Yes |
| capsem-cleanup | `cleanup` | Yes | Collect | No | Yes |
| capsem-stress | `stress` | Yes | Collect | No | Yes |
| capsem-recovery | `recovery` | Yes | Collect | No | Yes |
| capsem-serial | `serial` | Yes | Collect | No | Yes |
| capsem-lifecycle | `integration` | Yes | Collect | No | Yes |
| capsem-build-chain | `build_chain` | Yes | Collect | No | Yes |
| capsem-recipes | `recipe` | No | Run | No | Yes |
| capsem-install | `install` | No | Yes (Docker) | No | Yes |

"Run" = tests execute in CI. "Collect" = imports verified (`--collect-only`) but tests skip (they need a VM). "Yes (Docker)" = runs in a dedicated Docker+systemd CI job.

### Coverage targets

| Component | Floor | Enforced | Where |
|-----------|------:|:--------:|-------|
| Rust workspace | 70% | `--fail-under-lines 70` | CI (`cargo llvm-cov`), `just test` |
| Python builder | 90% | `--cov-fail-under=90` | CI (`pytest`), `just test` |
| capsem-service | 80% | Codecov component | `codecov.yml` |
| capsem-mcp | 80% | Codecov component | `codecov.yml` |
| capsem-gateway | 80% | Codecov component | `codecov.yml` |
| capsem (CLI) | 80% | Codecov component | `codecov.yml` |

## Coverage

- Rust: `cargo llvm-cov` via `just test` (floor: 70% line coverage)
- Python: `--cov-fail-under=90`
- `codecov.yml` maps components to code paths. Update it when files or directories are added, moved, or renamed.

## Fast debug with capsem MCP tools

When the capsem MCP server is configured, Claude Code has direct VM control via MCP tools -- no shell commands or just recipes needed. This is the fastest way to test changes interactively because you stay in the conversation loop: create a VM, run commands, inspect results, fix code, repeat.
### The tools

| Tool | What it does |
|------|--------------|
| `capsem_create` | Spin up a fresh VM (returns the VM id). Named VMs are persistent. |
| `capsem_run` | One-shot: boot a temp VM, exec a command, destroy it, return the output |
| `capsem_exec` | Run a command inside a running guest |
| `capsem_stop` | Stop a VM (persistent: preserve state; ephemeral: destroy) |
| `capsem_resume` | Resume a stopped persistent VM |
| `capsem_read_file` | Read a file from the guest filesystem |
| `capsem_write_file` | Write a file into the guest |
| `capsem_inspect_schema` | Get the session.db table schema |
| `capsem_inspect` | Run SQL against session.db (telemetry) |
| `capsem_list` | Show all VMs (running + stopped persistent) |
| `capsem_info` | VM details (config, status, persistent, PID) |
| `capsem_delete` | Destroy a VM and wipe all its state |
| `capsem_persist` | Convert a running ephemeral VM to persistent |
| `capsem_purge` | Kill all temp VMs (all=true includes persistent) |
| `capsem_fork` | Fork a running/stopped VM into a reusable image |
| `capsem_image_list` | List all user images |
| `capsem_image_inspect` | Inspect a specific image's metadata |
| `capsem_image_delete` | Delete a user image |

### Debug workflow

**Quick one-shot** (no VM management): `capsem_run` with the command you want to test.

**Iterative debugging** (long-lived VM):

1. **Create**: `capsem_create` -- boots a fresh VM in ~10s
2. **Test**: `capsem_exec` with the command you want to verify (e.g., `capsem-doctor -k net`, `cat /etc/resolv.conf`, `curl https://example.com`)
3. **Inspect**: `capsem_read_file` to check config files and logs; `capsem_inspect` to query telemetry tables
4. **Iterate**: fix code on the host, rebuild (`just build`), create a new VM to test again
5. **Cleanup**: `capsem_delete` when done

### When to use MCP tools vs just recipes

| Scenario | Use |
|----------|-----|
| Quick check: "does this command work in the guest?" | `capsem_run` |
| Read a guest file to understand state | `capsem_read_file` |
| Verify telemetry was recorded correctly | `capsem_inspect` with a SQL query |
| Full regression suite | `just test` |
| Build + boot + validate in one shot | `just smoke` |
| Benchmark performance | `just bench` |

MCP tools are for fast, targeted checks during development. Just recipes are for comprehensive validation before committing.

### Common debug queries

```sql
-- Check network events for a domain
SELECT * FROM net_events WHERE domain LIKE '%example%' ORDER BY timestamp DESC LIMIT 10;

-- Verify MCP tool calls were logged
SELECT server_name, tool_name, decision, duration_ms FROM mcp_calls ORDER BY timestamp DESC;

-- Check model API calls
SELECT provider, model, status_code, duration_ms FROM model_calls ORDER BY timestamp DESC;

-- File system events
SELECT operation, path, success FROM fs_events ORDER BY timestamp DESC LIMIT 20;
```

## End-to-end validation is not optional

After any change touching guest binaries, network policy, telemetry, MCP, or VM lifecycle:

1. `just run "capsem-doctor"` -- verifies sandbox integrity inside the VM
2. After telemetry/logging changes: run a real session and verify with `just inspect-session` that all 6 tables (net_events, model_calls, tool_calls, tool_responses, mcp_calls, fs_events) are populated correctly; a sketch of an equivalent programmatic check follows below
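A minimal sketch of such a check against the session database directly. The six table names are the ones listed above; the `db_path` layout is an assumption -- point it at whatever `just inspect-session` resolves to:

```python
import sqlite3

# The six telemetry tables that must be populated after a real session.
TABLES = ("net_events", "model_calls", "tool_calls",
          "tool_responses", "mcp_calls", "fs_events")

def assert_telemetry_populated(db_path: str) -> None:
    con = sqlite3.connect(db_path)
    try:
        for table in TABLES:
            # Any empty table means that telemetry path silently broke.
            (count,) = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
            assert count > 0, f"{table} has no rows"
    finally:
        con.close()
```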
## When tests fail

Never dismiss a test failure as "pre-existing" or "unrelated." Every failure must be investigated. Follow the dev-debugging workflow:

1. **Do not change the test to make it pass.** The test is evidence. Changing the assertion to match broken behavior destroys that evidence.
2. **Reproduce and diagnose first.** Understand *why* it fails before writing any fix. See the dev-debugging skill for the full methodology: reproduce with a test, diagnose the root cause, then fix comprehensively.
3. **Fix the code, not the test.** If the test is genuinely wrong (not the code), explain in detail why the test's expectation is incorrect before changing it.

## Platform gating tests

`cargo test --test platform_gating` scans all `.rs` files under `crates/` for macOS-only and Linux-only symbols (`libc::clonefile`, `AppleVzHypervisor`, `KvmHypervisor`, `FICLONE`, etc.) and verifies they appear inside `#[cfg(target_os = "...")]` blocks. This catches ungated platform APIs before they reach CI. Run this test when adding any platform-specific code. A simplified sketch of the scan idea appears at the end of this document.

## Testable design

Extract logic into `capsem-core` -- never embed business logic in the app layer, where it's coupled to Tauri. If you can't test something without booting a VM or launching the GUI, it belongs in core.
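Returning to platform gating: a simplified, hedged sketch of the scan idea. The real gate is the Rust test above, which verifies each symbol sits *inside* a `#[cfg(target_os = "...")]` block; this standalone script only flags files that mention a platform-only symbol with no `cfg(target_os` attribute anywhere:

```python
#!/usr/bin/env python3
"""Rough illustration of the platform-gating scan (not the real test)."""
from pathlib import Path

# Symbols named in the gating test; macOS-only and Linux-only mixed together.
PLATFORM_SYMBOLS = ["libc::clonefile", "AppleVzHypervisor",
                    "KvmHypervisor", "FICLONE"]

def ungated_files(root: str = "crates") -> list[Path]:
    flagged = []
    for path in Path(root).rglob("*.rs"):
        src = path.read_text(errors="replace")
        # Simplification: a file that uses a platform symbol but contains no
        # target_os cfg at all is definitely ungated; the Rust test is stricter.
        if any(sym in src for sym in PLATFORM_SYMBOLS) and "cfg(target_os" not in src:
            flagged.append(path)
    return flagged

if __name__ == "__main__":
    for path in ungated_files():
        print(f"ungated platform symbol in {path}")
```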