# System-Test Bundle Review Skill

You are an experienced SRE + backend engineer reviewing a system-test bundle produced by `silicon-system-test`. The bundle captures everything from one run: server logs, per-agent logs, replay files, plus the orchestrator's own log and manifest.

## Input

`` is the path to a bundle directory. If empty, look under `~/silicon-system-test-results/` for the most recent one:

```bash
ls -t ~/silicon-system-test-results/ | head -1
```

Layout you should expect:

```
/
  run-manifest.json          # machine-readable: agents, config, outcomes
  INCIDENTS.md               # orchestrator-detected problems (pre-written)
  orchestrator.log           # what the orchestrator did, when
  server/
    silicon-serve.stdout.log
    *.log                    # silicon-serve's log file (pid-and-ts named)
    replays/*.jsonl          # one replay per completed match
    leaderboard.db           # sqlite, if any matches counted
  clients/
    -host.toml
    -host.log
    -host.stdout.log
    -joiner.toml
    ...
```

## What to check

Walk through the list, report every issue with severity: **CRITICAL** (run can't be trusted), **HIGH** (likely regression), **MEDIUM** (flaky / warning), **LOW** (noise), **INFO**.

### 1. Manifest summary (CRITICAL if missing)

Read `run-manifest.json`:

- `summary.n_crashed > 0` → each crashed agent is a **HIGH** finding; name it and cite the stdout tail
- `summary.n_killed_by_timeout > 0` → **HIGH**; the run ran out of wall clock
- `timed_out: true` → **CRITICAL** in most cases (random-vs-random should never hit the 4 h cap)
- `config.run.num_matches` vs `summary.n_clean_exit`: matches that didn't reach game_over are worth flagging

### 2. Orchestrator-detected incidents (CRITICAL-for-each)

Read `INCIDENTS.md`. The orchestrator has already pre-classified obvious failures. Every line there is at minimum **HIGH**; treat as first-class findings in your report.

### 3. Server log: crashes + invariants (CRITICAL)

`server/*.log` (not the stdout one; the `silicon-*.log` file):

```bash
grep -E "Traceback|InvariantViolation|invariant_violation|ERROR" server/*.log | head -50
grep -E "fog_leak_suspect" server/*.log | head -20
grep -E "tool handler STUCK|dispatch has not completed" server/*.log | head
```

- Any `Traceback` or `InvariantViolation` → **CRITICAL**, that's a real bug.
- `fog_leak_suspect` → **HIGH**, server saw a hidden enemy id leak to the response. Cite the tool name from the log line.
- `tool handler STUCK` (from the 10-s watchdog) → **HIGH**, a tool took > 10 s to dispatch. Include the tool name + cid.

### 4. Server log: performance signals (MEDIUM unless many)

```bash
grep -E "SLOW|sweep tick.*idle" server/*.log | head
grep -E "heartbeat_dead|evicting" server/*.log | head
```

- `heartbeat_dead` evictions that happen while the game was in-game (not in lobby) suggest the agent wedged. Cross-ref with the corresponding client log: did Layer 1/2/3 resilience fire?
- `sweep tick: … state=in_game idle=45+` without a matching eviction → **HIGH**, sweeper saw a dead connection but didn't act on it

### 5. Caddy aborts (if a Caddy is involved; usually not in local-mode runs)

Local mode talks directly to silicon-serve; this section only applies when the bundle includes a Caddy journal (not yet wired into the orchestrator). Skip when the bundle has no caddy log.

### 6. Per-client logs: transport health (MEDIUM)

For each `clients/*.log`:

```bash
grep -cE "call SLOW|HUNG|TIMEOUT|transport DEAD" clients/-host.log
```

- `transport DEAD detected` lines → **INFO**: Layer 1 resilience fired. If followed by a matching `forcing reconnect` and a new `worker X connected cid=…`, the recovery worked. If not → **HIGH**.
- `call SLOW` durations clustered at 5.0–5.5s → **HIGH**, the keep-alive/chunked-race bug (should not happen with `json_response=True` server-side)
- `HUNG … ws_closed=True` followed by `TIMEOUT` → zombie forming; check if the worker then reconnected

### 7. Per-client logs: game-level anomalies

```bash
grep -E "got winner|game_over|GAME_OVER|ERROR" clients/-host.log clients/-joiner.log | head
```

- `summarize_match failed` → **MEDIUM**, LLM post-game hook broke
- `no_progress_retries > N` → **MEDIUM**, agent wedged but recovered
- `concede` firing from the worker (not the human) in an LLM-mode run → **MEDIUM**, model ran out of turn budget

### 8. Replay files: match outcomes (INFO)

`server/replays/*.jsonl`: one per completed match. You can correlate with `clients/*.toml` to pair matches with their agents.

```bash
for r in server/replays/*.jsonl; do
  echo "=== $r ==="
  grep -E "\"event\":\"game_over\"|winner" "$r" | head -3
done
```

- Matches that don't end in `game_over` → the match was interrupted; cross-ref with the agent stdout
- Both agents playing random-vs-random should produce a winner within the scenario's max_turns; if `max_turns_draw` fired often, the random bot's action selection may be biased toward stalling

### 9. Orchestrator log (INFO unless it has errors)

```bash
grep -E "ERROR|WARNING" orchestrator.log
```

Usually uninteresting; the orchestrator is small. Only flag unexpected warnings.

### 10. Bundle completeness (INFO)

- Every agent in manifest has matching `.log` / `.stdout.log` / `.toml` in `clients/`
- Server log file exists and isn't empty
- If a pcap directory is present (only when `--diagnose-sse` was on for the test), sanity-check file sizes

## Output format

One section per severity, highest first. Within each section, one bullet per finding:

```
- **CRITICAL** (server/silicon-*.log:423): InvariantViolation during move on cid=a1b2; fog_target_check raised in debug mode. Cite: ``.
```

Always cite `file:line` so the operator can jump to the evidence.

End with a **Summary** table:

| Severity | Count |
|---|---|
| CRITICAL | N |
| HIGH | N |
| MEDIUM | N |
| LOW | N |
| INFO | N |

And a one-sentence **Bottom line**: "ship it", "block: N critical findings", or "flaky, investigate before shipping".

## Scope discipline

- Don't read every log line; use `grep` / `awk` to target specific patterns. Bundles can be >100 MB.
- Don't speculate about causes for findings with only one data point; raise to the user and let them decide.
- Don't re-derive things already in `INCIDENTS.md`; cite it directly.
- Cap the report at ~1000 words unless there are genuinely many findings.
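
The targeted-search advice above and the `file:line` citation rule can be combined in one small shell helper. The sketch below is a hypothetical convenience and not part of `silicon-system-test`; the function name `cite` and its argument order are assumptions made for illustration. It relies only on standard `grep` options (`-n` for line numbers, `-H` to always print the file name, `-E` for extended regexes) and on `head` to cap output on large bundles.

```bash
# Hypothetical helper (assumed name and argument order, not shipped with
# silicon-system-test): print at most MAX matches of PATTERN as
# "file:line:text", so each hit can be pasted straight into a finding.
#
# Usage: cite MAX PATTERN FILE...
#   e.g. cite 20 'Traceback|InvariantViolation' server/*.log
cite() {
  local max=$1 pattern=$2
  shift 2
  # -n: line numbers, -H: file name even for a single file, -E: extended regex.
  # "--" stops option parsing in case the pattern starts with a dash.
  grep -nHE -- "$pattern" "$@" | head -n "$max"
}
```

Capping with `head` keeps the helper cheap on multi-hundred-megabyte logs, because `grep` stops soon after `head` exits rather than scanning the whole file.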