---
name: dev-testing
description: Capsem testing policy and workflow. Use whenever running tests, writing new tests, or verifying changes work. Covers the three test tiers (unit, smoke, full), TDD red-green-refactor, adversarial security testing, coverage policy, and the mandatory end-to-end VM validation. For VM-specific tests see dev-testing-vm, for hypervisor tests see dev-testing-hypervisor, for frontend tests see dev-testing-frontend.
---

# Testing

## Test tiers

Three tiers, fast to thorough. Every change must pass all three before it ships.

| Command | What | VM? |
|---------|------|-----|
| `just test` | Everything: unit tests (llvm-cov, warnings-as-errors for service crates) + cross-compile + frontend + all Python integration tests + injection + benchmarks | Yes |
| `just smoke` | Quick end-to-end: repack + sign + boot + capsem-doctor + MCP + service integration (~30s) | Yes |

`just test` is the single source of truth. There is no "fast" tier that skips integration tests -- that's how the "Connection refused" bug shipped while tests said green. Individual `test-*` recipes exist for targeted debugging, but `just test` is the gate.

## TDD workflow

Write tests first:

1. Write failing tests that capture expected behavior
2. Verify they fail for the right reason
3. Write minimal implementation to pass them
4. Refactor

Without a failing test first, it's easy to write tests that pass by accident or don't actually verify the behavior you intended.

## Parallel tests as dogfooding (n=4 is non-negotiable)

`just test` runs the Python suite under `pytest -n 4 --dist=loadfile`. Four real VMs boot simultaneously.

**This is the canary, not just a speed-up.** We ship Capsem as a multi-VM sandbox for AI agents -- if our own test suite cannot safely boot 4 concurrent VMs, real users running an agent farm will hit the exact same bug. Treat any concurrency flake as a Capsem-side bug, not a test-tuning problem:

- "Suspend timed out" under load -> service IPC handling is racy, not "bump the timeout"
- "Session did not become ready" -> Apple VZ resource serialization, VirtioFS lock contention, or the service mishandling concurrent provisions; investigate, don't suppress
- Two tests both want the same VM name -> name-collision bug in `validate_vm_name` / registry, not "isolate test names better"
- Stale socket between tests -> the service didn't reap a child cleanly, a real production bug

Anti-patterns when a test flakes under `-n 4`:

- Adding `time.sleep()` to "let things settle" -- masking a race
- Bumping the per-test timeout -- buying time for a real bug to manifest in prod instead of CI
- Marking the test `serial` so it runs alone -- defeating the dogfooding signal

The host has plenty of headroom (48 GB RAM, 14 cores; 4 VMs at 2 GB / 2 CPU each = 8 GB / 8 cores). If concurrency surfaces a flake, fix the product, then re-run. Bumping `-n` higher (8, 12) is the natural follow-on once n=4 is stable -- real users will run more.

### Orphan processes across runs are a product bug (not a test bug)

If a previous `just test -n 4` run was interrupted (ctrl-C, pytest-xdist worker death, host crash) and the NEXT run flakes with "vm-ready never asserted", UDS "connection refused", or mysterious HTTP 500s, the cause is companion processes from the interrupted run still alive under PID 1. `pkill -f "target/debug/capsem-(service|process|gateway|tray|mcp)"` will make the flake vanish, but that is cleanup after the fact.

The fix is on the COMPANION side: every spawned companion (gateway, tray, and any new one) must use `capsem-guard::install(parent_pid, lock_path)` to enforce (a) refuse-standalone, (b) singleton, (c) self-exit on parent death. See `/dev-rust-patterns` lesson 18. Regression tests live in `tests/capsem-service/test_companion_lifecycle.py` -- never remove them; when adding a new companion, extend that file.
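A hedged sketch of the shape such a regression test might take. The binary path matches the pkill pattern used in this section; the bare no-argument invocation and the expected exit behavior are assumptions for illustration:

```python
import subprocess

def test_gateway_refuses_standalone_launch():
    # capsem-guard's refuse-standalone guarantee: a companion started without
    # a live parent capsem-service must exit with an error instead of serving.
    # If the guard were missing, the process would serve forever and this
    # subprocess.run would raise TimeoutExpired instead of returning.
    proc = subprocess.run(
        ["target/debug/capsem-gateway"],
        capture_output=True,
        timeout=10,
    )
    assert proc.returncode != 0, "companion must not run standalone"
```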
**Never `pkill -f capsem-` with a broad pattern** during test debugging: `capsem-` matches `--crate-name capsem-core` in running rustc/cargo invocations and will SIGKILL the compiler mid-build. Use a binary-path pattern like `pkill -f "target/debug/capsem-(service|process|gateway|tray|mcp)"` instead.

### When `-n 1` is actually the right answer: multi-service-only gotchas

One narrow class of concurrency bug belongs at `-n 1`, not `-n 4`: **bugs that only exist when two `capsem-service` processes run on the same host**. Apple's Virtualization.framework does not tolerate overlapping `saveMachineStateToURL` / `restoreMachineStateFromURL` calls on sibling VMs, and we serialize with a per-service `tokio::sync::Mutex` (`ServiceState::save_restore_lock`). That lock is in-process, so it only serializes VMs inside one service. Production always has exactly one service per host per user, so the lock is sufficient in real deployments.

`tests/capsem-mcp/test_stress_suspend_resume.py` runs under pytest-xdist, which spawns one `capsem-service` per worker. At `-n 2+`, worker A's service can't see worker B's lock, and you re-expose the bug that never happens in production. This is the one case where the "n=4 dogfoods concurrency" rule doesn't apply -- the concurrency being tested would never happen outside the test harness. Keep this harness at `-n 1`. Full context and the failure signature live in `docs/src/content/docs/gotchas/concurrent-suspend-resume.md`.

This is NOT a blanket license to run any flaky test at `-n 1`. If you're tempted to demote another test, first ask: *"Would this failure occur in production with one capsem-service and N VMs?"* If yes, it belongs at `-n 4`; fix the product.

## Adversarial testing

Capsem is a security product. Every security-relevant feature needs tests that actively try to break invariants. Think like an attacker:

- Can a corp-blocked domain be snuck through another provider's list?
- Does an overlapping wildcard in allow+block always deny?
- Does malformed input (empty strings, unicode, huge payloads, invalid JSON) get rejected?
- Can path traversal escape the VirtioFS sandbox?
- Can a guest process modify its own binaries?

Stress-test boundary conditions. Write tests for the attacks you'd attempt yourself.
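To make this concrete, a hedged sketch of one such adversarial test against VM ID validation. The `service_client` fixture and the `/vms/` route shape are assumptions; the invariant itself (hostile IDs are rejected outright, never partially processed) comes from the invariants table in the next subsection:

```python
from urllib.parse import quote

import pytest

# Hostile VM IDs mirroring the attack list above: traversal, dots, spaces,
# null bytes, unicode, empty input. Illustrative, not exhaustive.
HOSTILE_VM_IDS = [
    "../../etc/passwd",
    "..",
    "vm name with spaces",
    "vm\x00null",
    "🦀" * 64,
    "",
]

@pytest.mark.parametrize("vm_id", HOSTILE_VM_IDS)
def test_vm_id_validation_rejects_hostile_input(service_client, vm_id):
    # service_client is a hypothetical fixture wrapping the service HTTP API.
    # Percent-encode so the hostile bytes reach the server's own validation
    # instead of being mangled or rejected by the HTTP client first.
    resp = service_client.get("/vms/" + quote(vm_id, safe=""))
    assert resp.status_code in (400, 404), f"accepted hostile id: {vm_id!r}"
```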
### Security invariants to verify in tests

When touching security-relevant code, check that these invariants have test coverage:

| Invariant | What to test | Where |
|-----------|--------------|-------|
| VirtioFS share is `guest/` only | `session_dir/guest/` exists, symlinks resolve, host-only files (`session.db`, `serial.log`) are outside the share | `capsem-core::lib::tests` |
| UDS sockets are 0600 | After bind, verify permissions exclude other users | `capsem-process` |
| Process env is cleared | `env_clear()` called, only allowlisted vars passed | `capsem-service` spawn tests |
| No `process::exit` on guest I/O | Control channel close causes loop break, not exit | `capsem-process` |
| Sensitive logs are 0600 | `serial.log` created with restricted permissions | `capsem-process` |
| Gateway auth on all routes | Every route except `GET /` returns 401 without token | `capsem-gateway::auth::tests` |
| Auth rate limiting | 429 after threshold, resets after window | `capsem-gateway::auth::tests` |
| CORS rejects external origins | Only localhost/127.0.0.1/tauri allowed | `capsem-gateway::tests` |
| Body size limit | 413 for >10MB payloads | `capsem-gateway::proxy::tests` |
| VM ID validation | Path traversal (`../`), dots, spaces, null bytes rejected | `capsem-gateway::terminal::tests` |
| Rootfs read-only | squashfs mounted ro, guest binaries 555 | `capsem-doctor` in-VM tests |
| Suspend reports errors | IPC failure and timeout both return 500, not silent success | `capsem-service` tests |

## Test fixture anti-pattern: masking races with polling

If all test fixtures wait/poll before asserting, the tests will never catch server-side race conditions. For every endpoint that talks to a VM socket, write at least one test that calls it IMMEDIATELY after provision (no `wait_exec_ready`, no `ready_vm` fixture). The server must handle readiness internally.

**Pattern to avoid** (masks the bug -- the server never needs wait logic because the client always waits):

```
fixture calls provision -> fixture polls wait_exec_ready -> test calls exec
```

**Required test pattern** (catches the bug -- if the server doesn't wait, the test fails; see the sketch at the end of this section):

```
test calls provision -> test immediately calls exec -> server handles wait
```

See `tests/capsem-service/test_svc_exec_ready.py` for the regression tests that enforce this.

### wait_exec_ready is a single call, not a loop

`wait_exec_ready` (in `tests/helpers/service.py`, `tests/helpers/mcp.py`, `tests/capsem-gateway/test_gw_e2e.py`) makes one exec call with the server-side timeout passed through. The server's `handle_exec` calls `wait_for_vm_ready` internally, which polls until the VM is ready. Do NOT add client-side retry loops -- that creates a double-wait where each retry can block for the full server timeout (30s client retries x 30s server wait = pathological cascade). One wait, one place.

### Exec latency regression gate

`tests/capsem-serial/test_boot_timing.py::test_exec_latency_under_1_5_seconds` asserts that provision-to-first-exec completes in under 1.5s. If this test fails, investigate boot time (the boot_timeline spans in process.log), not the wait mechanism.
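A minimal sketch of the required no-wait pattern referenced above. The `service_client` fixture and its `provision`/`exec`/`delete` helpers are hypothetical; the real regression tests live in `tests/capsem-service/test_svc_exec_ready.py`:

```python
def test_exec_immediately_after_provision(service_client):
    # No ready_vm fixture, no wait_exec_ready call -- the server must handle
    # readiness internally or this test fails.
    vm = service_client.provision(name="no-wait-check")
    try:
        # Exec with zero client-side waiting. If the server does not gate on
        # wait_for_vm_ready internally, this races the boot and fails.
        result = service_client.exec(vm["id"], ["true"])
        assert result["exit_code"] == 0
    finally:
        service_client.delete(vm["id"])
```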
## Where tests live

- **Rust unit: sibling `tests.rs` file, not an inline `mod tests { ... }` block.** See the next subsection.
- Rust integration: `crates/capsem-core/tests/`
- In-VM diagnostics: `guest/artifacts/diagnostics/test_*.py` (see dev-testing-vm)
- Hypervisor: KVM + Apple VZ tests (see dev-testing-hypervisor)
- Frontend: `frontend/src/lib/__tests__/` (see dev-testing-frontend)
- Python (builder): `tests/test_*.py`
- Python integration (service daemon): `tests/capsem-*/` directories, each with its own conftest.py and pytest marker

### Rust unit tests: sibling `tests.rs` pattern

**Every Rust module keeps its unit tests in a sibling `tests.rs`, not an inline `mod tests { ... }` block.** The parent module declares:

```rust
// foo.rs OR foo/mod.rs
// ... production code ...

#[cfg(test)]
mod tests;
```

and the tests go in `tests.rs` in the same directory:

```rust
// tests.rs -- sibling of foo.rs or child of foo/
use super::*;

#[test]
fn roundtrip() { ... }
```

**Why.** Inline `#[cfg(test)] mod tests { ... }` blocks are appended at the bottom of prod files and commonly grow to 50–99% of the file's line count. That means every Read, grep, and scroll to reach production code walks past thousands of test lines first. Several modules in this codebase hit 4,000+ lines that way before extraction. Agents and humans both read faster when prod code isn't buried.

**Mechanics.**

- `tests.rs` is a submodule of the parent file -- `use super::*;` works, private items are visible, and `#[cfg(test)]` on the `mod tests;` declaration still gates compilation.
- For files that don't yet have a sibling directory (e.g. `lib.rs`, `foo.rs`), put `tests.rs` next to them in the same `src/` directory.
- For files that are already `foo/mod.rs`, put `tests.rs` inside `foo/`.
- Attributes on the inline `mod tests` block (e.g. `#[allow(unused_imports)]`) move onto the declaration: `#[cfg(test)]\n#[allow(unused_imports)]\nmod tests;`.

**Extraction recipe** (for any remaining inline `mod tests { ... }`):

1. Move the block body (everything between the outer `{` and `}`) into a new sibling `tests.rs`.
2. Dedent one indentation level so the contents read as top-level items.
3. Replace the old inline block with `#[cfg(test)] mod tests;` (plus any attributes that were on the original).
4. `cargo test -p <crate>` -- should pass identically.

**When to push back.** If you see a new PR or agent output adding an inline `mod tests { ... }` block, request it be moved to `tests.rs` before merge. Exceptions are narrow: tiny helper modules under ~50 lines total where inline tests plus prod code fit on one screen, or a module that's already a test-only helper.
## Integration test suites

All Python integration tests live under `tests/capsem-*/` and use pytest markers. Each suite has a dedicated `just` recipe.

| Suite | Directory | Marker | VM? | What it tests |
|-------|-----------|--------|-----|---------------|
| Service API | `capsem-service/` | `integration` | Yes | HTTP endpoints: provision, list, info, exec, logs, file I/O, delete |
| CLI | `capsem-cli/` | `integration` | Yes | CLI subcommands via subprocess |
| MCP | `capsem-mcp/` | `mcp` | Yes | MCP server black-box (stdio, tool routing) |
| Session DB | `capsem-session/` | `session` | Yes | Telemetry: net/model/tool/mcp/fs/snapshot events |
| Snapshots | `capsem-snapshots/` | `snapshot` | Yes | Auto/manual snapshots, revert |
| Isolation | `capsem-isolation/` | `isolation` | Yes | Multi-VM filesystem + network isolation |
| Security | `capsem-security/` | `security` | Yes | Binary perms, codesigning, asset integrity, env blocklist |
| Config | `capsem-config/` | `config` | Yes | Limits, resource bounds, hot-reload |
| Bootstrap | `capsem-bootstrap/` | `bootstrap` | No | Setup flow, dev tools, asset checks |
| Stress | `capsem-stress/` | `stress` | Yes | 5 concurrent VMs, rapid create/delete |
| Build chain | `capsem-build-chain/` | `build_chain` | Yes | cargo build -> codesign -> pack -> manifest -> boot |
| Guest | `capsem-guest/` | `guest` | Yes | Network, services, filesystem, env inside guest |
| Cleanup | `capsem-cleanup/` | `cleanup` | Yes | Process killed, socket removed, session dir removed |
| Codesign | `capsem-codesign/` | `codesign` | No | All binaries signed, entitlements present (FAIL not skip) |
| Serial | `capsem-serial/` | `serial` | Yes | Console logs, boot timing < 30s |
| Session lifecycle | `capsem-session-lifecycle/` | `session_lifecycle` | Yes | DB exists, schema, events, survives shutdown |
| Config runtime | `capsem-config-runtime/` | `config_runtime` | Yes | CPU/RAM applied in guest, blocked domains |
| Recipes | `capsem-recipes/` | `recipe` | No | just run-service, just doctor, cargo build |
| Recovery | `capsem-recovery/` | `recovery` | Yes | Stale socket/instances, orphaned process, double service |
| Rootfs artifacts | `capsem-rootfs-artifacts/` | `rootfs` | No | Artifact files, build context, doctor consistency |
| Session exhaustive | `capsem-session-exhaustive/` | `session_exhaustive` | Yes | Per-table data validation, cross-table FK integrity |
| Install | `capsem-install/` | `install` | No | Native installer: layout, auto-launch, service install, setup wizard, update, uninstall, lifecycle, reinstall, error paths |

Composite recipes: `just test-vm` runs build-chain + guest + cleanup + codesign + serial + session-lifecycle + config-runtime + recovery. `just test-install` runs the install suite in Docker with systemd. `just test` runs everything.

## Test matrix: what runs where

### Rust crate CI coverage

| Crate | Tests | CI macOS | CI Linux | Smoke | Full |
|-------|------:|:--------:|:--------:|:-----:|:----:|
| capsem-core | ~1695 | Yes | Yes | No | Yes |
| capsem-agent | ~71 | Yes | No | No | Yes |
| capsem-logger | ~47 | Yes | Yes | No | Yes |
| capsem-proto | ~132 | Yes | Yes | No | Yes |
| capsem-gateway | ~38 | Yes | No | No | Yes |
| capsem-service | ~109 | Yes | Yes | No | Yes |
| capsem (CLI) | ~140 | Yes | Yes | No | Yes |
| capsem-mcp | ~67 | Yes | Yes | No | Yes |
| capsem-tray | ~47 | Yes | No | No | Yes |
| capsem-process | ~62 | Yes | No | No | Yes |
| capsem-app | ~35 | Check | No | No | Yes |
### Python integration suite tier map

| Suite | Marker | VM? | CI | Smoke | Full |
|-------|--------|:---:|:--:|:-----:|:----:|
| capsem-bootstrap | `bootstrap` | No | Run | No | Yes |
| capsem-codesign | `codesign` | No | Run | No | Yes |
| capsem-rootfs-artifacts | `rootfs` | No | Run | No | Yes |
| capsem-mcp | `mcp` | Yes | Collect | Yes | Yes |
| capsem-service | `integration` | Yes | Collect | Yes | Yes |
| capsem-cli | `integration` | Yes | Collect | Yes | Yes |
| capsem-gateway | `gateway` | Yes | Collect | Yes | Yes |
| capsem-e2e | `e2e` | Yes | Collect | No | Yes |
| capsem-session | `session` | Yes | Collect | No | Yes |
| capsem-session-lifecycle | `session_lifecycle` | Yes | Collect | No | Yes |
| capsem-session-exhaustive | `session_exhaustive` | Yes | Collect | No | Yes |
| capsem-security | `security` | Yes | Collect | No | Yes |
| capsem-isolation | `isolation` | Yes | Collect | No | Yes |
| capsem-snapshots | `snapshot` | Yes | Collect | No | Yes |
| capsem-config | `config` | Yes | Collect | No | Yes |
| capsem-config-runtime | `config_runtime` | Yes | Collect | No | Yes |
| capsem-guest | `guest` | Yes | Collect | No | Yes |
| capsem-cleanup | `cleanup` | Yes | Collect | No | Yes |
| capsem-stress | `stress` | Yes | Collect | No | Yes |
| capsem-recovery | `recovery` | Yes | Collect | No | Yes |
| capsem-serial | `serial` | Yes | Collect | No | Yes |
| capsem-lifecycle | `integration` | Yes | Collect | No | Yes |
| capsem-build-chain | `build_chain` | Yes | Collect | No | Yes |
| capsem-recipes | `recipe` | No | Run | No | Yes |
| capsem-install | `install` | No | Yes (Docker) | No | Yes |

"Run" = tests execute in CI. "Collect" = imports verified (`--collect-only`) but tests skip (they need a VM). "Yes (Docker)" = runs in a dedicated Docker+systemd CI job.

### Coverage targets

| Component | Floor | Enforced | Where |
|-----------|------:|:--------:|-------|
| Rust workspace | 70% | `--fail-under-lines 70` | CI (`cargo llvm-cov`), `just test` |
| Python builder | 90% | `--cov-fail-under=90` | CI (`pytest`), `just test` |
| capsem-service | 80% | Codecov component | `codecov.yml` |
| capsem-mcp | 80% | Codecov component | `codecov.yml` |
| capsem-gateway | 80% | Codecov component | `codecov.yml` |
| capsem (CLI) | 80% | Codecov component | `codecov.yml` |

## Coverage

- Rust: `cargo llvm-cov` via `just test` (floor: 70% line coverage)
- Python: `--cov-fail-under=90`
- `codecov.yml` maps components to code paths. Update it when files or directories are added, moved, or renamed.

## Fast debug with capsem MCP tools

When the capsem MCP server is configured, Claude Code has direct VM control via MCP tools -- no shell commands or just recipes needed. This is the fastest way to test changes interactively because you stay in the conversation loop: create a VM, run commands, inspect results, fix code, repeat.
### The tools

| Tool | What it does |
|------|--------------|
| `capsem_create` | Spin up a fresh VM (returns the VM id). Named VMs are persistent. |
| `capsem_run` | One-shot: boot a temp VM, exec a command, destroy it, return the output |
| `capsem_exec` | Run a command inside a running guest |
| `capsem_stop` | Stop a VM (persistent: preserve state; ephemeral: destroy) |
| `capsem_resume` | Resume a stopped persistent VM |
| `capsem_read_file` | Read a file from the guest filesystem |
| `capsem_write_file` | Write a file into the guest |
| `capsem_inspect_schema` | Get the session.db table schema |
| `capsem_inspect` | Run SQL against session.db (telemetry) |
| `capsem_list` | Show all VMs (running + stopped persistent) |
| `capsem_info` | VM details (config, status, persistent, PID) |
| `capsem_delete` | Destroy a VM and wipe all its state |
| `capsem_persist` | Convert a running ephemeral VM to persistent |
| `capsem_purge` | Kill all temp VMs (all=true includes persistent) |
| `capsem_fork` | Fork a running/stopped VM into a reusable image |
| `capsem_image_list` | List all user images |
| `capsem_image_inspect` | Inspect a specific image's metadata |
| `capsem_image_delete` | Delete a user image |

### Debug workflow

**Quick one-shot** (no VM management): `capsem_run` with the command you want to test.

**Iterative debugging** (long-lived VM):

1. **Create**: `capsem_create` -- boots a fresh VM in ~10s
2. **Test**: `capsem_exec` with the command you want to verify (e.g., `capsem-doctor -k net`, `cat /etc/resolv.conf`, `curl https://example.com`)
3. **Inspect**: `capsem_read_file` to check config files and logs; `capsem_inspect` to query telemetry tables
4. **Iterate**: fix code on the host, rebuild (`just build`), create a new VM to test again
5. **Cleanup**: `capsem_delete` when done

### When to use MCP tools vs just recipes

| Scenario | Use |
|----------|-----|
| Quick check: "does this command work in the guest?" | `capsem_run` |
| Read a guest file to understand state | `capsem_read_file` |
| Verify telemetry was recorded correctly | `capsem_inspect` with a SQL query |
| Full regression suite | `just test` |
| Build + boot + validate in one shot | `just smoke` |
| Benchmark performance | `just bench` |

MCP tools are for fast, targeted checks during development. Just recipes are for comprehensive validation before committing.

### Common debug queries

```sql
-- Check network events for a domain
SELECT * FROM net_events WHERE domain LIKE '%example%' ORDER BY timestamp DESC LIMIT 10;

-- Verify MCP tool calls were logged
SELECT server_name, tool_name, decision, duration_ms FROM mcp_calls ORDER BY timestamp DESC;

-- Check model API calls
SELECT provider, model, status_code, duration_ms FROM model_calls ORDER BY timestamp DESC;

-- File system events
SELECT operation, path, success FROM fs_events ORDER BY timestamp DESC LIMIT 20;
```

## End-to-end validation is not optional

After any change touching guest binaries, network policy, telemetry, MCP, or VM lifecycle:

1. `just run "capsem-doctor"` -- verifies sandbox integrity inside the VM
2. After telemetry/logging changes: run a real session and verify with `just inspect-session` that all 6 tables (net_events, model_calls, tool_calls, tool_responses, mcp_calls, fs_events) are populated correctly; a sketch of an equivalent programmatic check follows below
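A minimal sketch of such a check against the session database directly. The six table names are the ones listed above; the `db_path` layout is an assumption -- point it at whatever `just inspect-session` resolves to:

```python
import sqlite3

# The six telemetry tables that must be populated after a real session.
TABLES = ("net_events", "model_calls", "tool_calls",
          "tool_responses", "mcp_calls", "fs_events")

def assert_telemetry_populated(db_path: str) -> None:
    con = sqlite3.connect(db_path)
    try:
        for table in TABLES:
            # Any empty table means that telemetry path silently broke.
            (count,) = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
            assert count > 0, f"{table} has no rows"
    finally:
        con.close()
```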
## When tests fail

Never dismiss a test failure as "pre-existing" or "unrelated." Every failure must be investigated. Follow the dev-debugging workflow:

1. **Do not change the test to make it pass.** The test is evidence. Changing the assertion to match broken behavior destroys that evidence.
2. **Reproduce and diagnose first.** Understand *why* it fails before writing any fix. See the dev-debugging skill for the full methodology: reproduce with a test, diagnose the root cause, then fix comprehensively.
3. **Fix the code, not the test.** If the test is genuinely wrong (not the code), explain in detail why the test's expectation is incorrect before changing it.

## Platform gating tests

`cargo test --test platform_gating` scans all `.rs` files under `crates/` for macOS-only and Linux-only symbols (`libc::clonefile`, `AppleVzHypervisor`, `KvmHypervisor`, `FICLONE`, etc.) and verifies they appear inside `#[cfg(target_os = "...")]` blocks. This catches ungated platform APIs before they reach CI. Run this test when adding any platform-specific code. A simplified sketch of the scan idea appears at the end of this document.

## Testable design

Extract logic into `capsem-core` -- never embed business logic in the app layer, where it's coupled to Tauri. If you can't test something without booting a VM or launching the GUI, it belongs in core.
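Returning to platform gating: a simplified, hedged sketch of the scan idea. The real gate is the Rust test above, which verifies each symbol sits *inside* a `#[cfg(target_os = "...")]` block; this standalone script only flags files that mention a platform-only symbol with no `cfg(target_os` attribute anywhere:

```python
#!/usr/bin/env python3
"""Rough illustration of the platform-gating scan (not the real test)."""
from pathlib import Path

# Symbols named in the gating test; macOS-only and Linux-only mixed together.
PLATFORM_SYMBOLS = ["libc::clonefile", "AppleVzHypervisor",
                    "KvmHypervisor", "FICLONE"]

def ungated_files(root: str = "crates") -> list[Path]:
    flagged = []
    for path in Path(root).rglob("*.rs"):
        src = path.read_text(errors="replace")
        # Simplification: a file that uses a platform symbol but contains no
        # target_os cfg at all is definitely ungated; the Rust test is stricter.
        if any(sym in src for sym in PLATFORM_SYMBOLS) and "cfg(target_os" not in src:
            flagged.append(path)
    return flagged

if __name__ == "__main__":
    for path in ungated_files():
        print(f"ungated platform symbol in {path}")
```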