---
version: "1.0.0"
evaluation: programmatic
agent: claude-code
model: claude-sonnet-4-6
model_provider: anthropic
snapshot: prism-playwright
# The headline deliverable — the self-contained QA report.
primary_outputs:
  - report.html
  - replay.py
origin:
  url: "https://github.com/wizenheimer/canary"
  source_host: "github.com"
  source_title: "Canary — QA harness for Claude Code"
  imported_at: "2026-06-07T00:00:00Z"
  imported_by: "skill-to-runbook-converter@1.0.0"
  attribution:
    collection_or_org: "wizenheimer"
    skill_name: "canary"
    author: "Canary contributors (usecanary) — github.com/wizenheimer/canary"
    license: "MIT"
    confidence: "high"
secrets: {}
---

# Canary — Agent-Driven Browser QA — Agent Runbook

> Converted, with attribution, from **Canary** (github.com/wizenheimer/canary, MIT) — a QA
> harness for coding agents. Canary itself ships a CLI + a QuickJS-WASM Playwright sandbox +
> a daemon; this runbook re-implements its core idea with plain **Playwright** so it runs in
> the Jetty sandbox: describe a flow, the agent drives a real browser, and you get back a
> self-contained report **and** a reusable replay script.

> **EXECUTE THIS RUNBOOK NOW.** Drive the browser with tools and write every deliverable to
> `{{results_dir}}`. This is a task to perform, not a document to summarize. Your first
> action is a tool call (Step 1).

## Inputs (already provided)

- **Target URL:** {{target_url}} — where the flow starts.
- **Flow:** {{flow}} — the user journey to QA, in plain language, with the checks that must
  hold (visible text / URL / element state / no console error).
- **Credentials (optional):** {{credentials}} — e.g. `user=...,pass=...` for a login step.

## Objective

QA a described user flow against a live web app the way Canary does: the agent drives a
real (headless) browser through small, intent-named steps, and **captures evidence at every
step** — a screenshot, console messages, and network activity — plus a Playwright
**trace** for the whole session.
Each step that encodes a check is an assertion (visible text, URL, element
state, no console error). The run produces two things Canary insists on having together: a
**report you can just read** (`report.html`, self-contained) and the **exact reusable
script** (`replay.py`) that reproduces the flow in CI with zero agent cost. Don't make the
user choose between an opaque agent run and hand-written Playwright — hand back both.

---

## REQUIRED OUTPUT FILES (MANDATORY)

**You MUST write all of the following to `{{results_dir}}`. The task is NOT complete until
every file exists and is non-empty. No exceptions.**

| File | Description |
|------|-------------|
| `{{results_dir}}/report.html` | Self-contained QA report: per-step status, the inline screenshot of each step, console errors, a network summary, and the overall verdict. Open it, commit it, send it. |
| `{{results_dir}}/replay.py` | The reusable Playwright script that reproduces the flow exactly — re-runnable in CI with no agent cost. |
| `{{results_dir}}/steps.json` | Structured per-step results: name, action, check, status, screenshot path, console errors. |
| `{{results_dir}}/trace.zip` | The Playwright trace for the whole session (open with `playwright show-trace`). |
| `{{results_dir}}/console.log` | All browser console messages captured during the run. |
| `{{results_dir}}/network.har` | The network HAR for the session. |
| `{{results_dir}}/summary.md` | Executive summary: flow, verdict, steps passed/failed, the single most important finding. |
| `{{results_dir}}/validation_report.json` | Stage-by-stage validation with `overall_passed`. See Step 5. |

Screenshots go in `{{results_dir}}/screenshots/`. If you finish but have not written every
file, go back and write it.


---

## Parameters

| Parameter | Template Variable | Default | Description |
|-----------|------------------|---------|-------------|
| Results directory | `{{results_dir}}` | `/app/results` (Jetty) / `./results` (local) | Output directory |
| Target URL | `{{target_url}}` | *(required)* | Where the flow starts |
| Flow | `{{flow}}` | *(required)* | The plain-language flow + the checks that must hold |
| Credentials | `{{credentials}}` | *(optional)* | Login creds if the flow needs them |
| Headless | `{{headless}}` | `true` | Run the browser headless (always true on Jetty) |

---

## Dependencies

| Dependency | Type | Required | Description |
|------------|------|----------|-------------|
| `playwright` (Python) + Chromium | Runtime | Yes | Pre-installed on the `prism-playwright` snapshot |

---

## Step 1: Environment Setup

```bash
mkdir -p "{{results_dir}}/screenshots"
python -c "import playwright; print('playwright', playwright.__version__)" || python -m pip install --quiet playwright
python -m playwright install chromium 2>/dev/null || true
SITE="{{target_url}}"
# Reject an empty or unsubstituted value. Testing for a leading "{" avoids repeating the
# template token here, which would itself be substituted and make the comparison always fail.
case "$SITE" in
  ""|"{"*) echo "ERROR: no target_url provided"; exit 1 ;;
esac
echo "QA target: $SITE"
```

---

## Step 2: Explore, then Drive the Flow

First **observe** the target: fetch the page, note the real selectors for the elements the
flow touches (don't guess — read the DOM). Then translate the plain-language `{{flow}}` into
small, intent-named steps and drive them with the harness below, capturing evidence per step.
Each step is either an **action** (navigate, click, fill) or an **assertion** (a check from
the flow). Re-read selectors from the live page if one fails (max 3 retries per step), the
way Canary's explore-and-record loop does.

```python
# Canary-style harness: one screenshot + console + per-step status, plus a full trace.
import json, pathlib, sys
from playwright.sync_api import sync_playwright

RESULTS = "{{results_dir}}"
TARGET = "{{target_url}}"
pathlib.Path(f"{RESULTS}/screenshots").mkdir(parents=True, exist_ok=True)
console_msgs, steps = [], []

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(record_har_path=f"{RESULTS}/network.har",
                                  viewport={"width": 1280, "height": 800})
    context.tracing.start(screenshots=True, snapshots=True, sources=True)
    page = context.new_page()
    page.on("console", lambda m: console_msgs.append({"type": m.type, "text": m.text}))
    page.on("pageerror", lambda e: console_msgs.append({"type": "pageerror", "text": str(e)}))

    def step(name, action, fn, check=None):
        """Run one intent-named step; screenshot + record status. check() -> bool|None."""
        rec = {"name": name, "action": action, "status": "pass", "console_errors": 0, "error": None}
        seen = len(console_msgs)  # so only errors raised during this step are counted
        try:
            fn(page)
            page.wait_for_timeout(400)
            if check is not None:
                rec["status"] = "pass" if check(page) else "fail"
        except Exception as e:
            rec["status"] = "fail"; rec["error"] = str(e)[:300]
        shot = f"{RESULTS}/screenshots/{len(steps)+1:02d}-{name.replace(' ','_')[:40]}.png"
        try:
            page.screenshot(path=shot, full_page=False)
        except Exception:
            shot = None
        rec["screenshot"] = (shot.split("/")[-1] if shot else None)
        rec["console_errors"] = sum(1 for m in console_msgs[seen:] if m["type"] in ("error", "pageerror"))
        steps.append(rec)
        print(f"  [{rec['status'].upper()}] {name}")
        return rec

    # ---- EXAMPLE shape — REPLACE these with the steps for {{flow}} ----
    step("open the app", "navigate", lambda pg: pg.goto(TARGET, wait_until="domcontentloaded"))
    # step("submit the form", "click", lambda pg: pg.click("button[type=submit]"))
    # step("result is visible", "assert", lambda pg: None,
    #      check=lambda pg: pg.get_by_text("Success").is_visible())
    # ------------------------------------------------------------------

    context.tracing.stop(path=f"{RESULTS}/trace.zip")
    context.close(); browser.close()  # closing the context is what writes network.har

# console.log must be non-empty even when the page logged nothing.
pathlib.Path(f"{RESULTS}/console.log").write_text(
    "\n".join(f"[{m['type']}] {m['text']}" for m in console_msgs) or "(no console messages)\n")
pathlib.Path(f"{RESULTS}/steps.json").write_text(json.dumps(steps, indent=2))
passed = sum(1 for s in steps if s["status"] == "pass")
print(f"verdict: {passed}/{len(steps)} steps passed")
```

Drive **real** assertions from the flow's checks (visible text, URL, element state). A step
with no check is an action; a step that encodes a check is an assertion and must actually
verify the page, not merely avoid throwing.

---

## Step 3: Write the Reusable Replay Script

Write `{{results_dir}}/replay.py` — a standalone Playwright script (no agent, no harness)
that reproduces the exact flow and exits non-zero if any assertion fails. This is the
artifact that runs in CI with zero inference cost.

```python
# replay.py — generated; reproduces the QA flow headless and asserts each check.
from playwright.sync_api import sync_playwright, expect

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_context().new_page()
        page.goto("{{target_url}}")
        # ... the exact steps + expect(...) assertions from Step 2 ...
        browser.close()

if __name__ == "__main__":
    main()
```

It must mirror Step 2's steps one-to-one. Smoke-test it: `python {{results_dir}}/replay.py`
should exit 0 on a passing flow.

---

## Step 4: Build `report.html`

Write a **self-contained** `{{results_dir}}/report.html` (no external assets — inline the
screenshots as base64). Include: the flow + target URL, the overall verdict (pass/fail), and
for each step its name, action, status, inline screenshot, and console-error count; plus a
network summary (total requests, failures) and a footer noting `trace.zip` / `replay.py`.

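The network summary listed above (total requests, failures) is not computed by the report builder that follows, but it can be derived from `network.har`. A minimal sketch, assuming the standard HAR layout (`log.entries[].response.status`); the helper names `summarize_har` and `summarize_har_file`, and the rule that a status of 400 or above (or a non-positive status) counts as a failure, are illustrative assumptions rather than anything Canary prescribes:

```python
import json
import pathlib


def summarize_har(har: dict) -> dict:
    """Count total and failed requests in a parsed HAR document."""
    entries = har.get("log", {}).get("entries", [])
    failed = []
    for entry in entries:
        status = entry.get("response", {}).get("status", 0)
        # Assumption: >= 400 is an HTTP error; <= 0 means no response was recorded.
        if status >= 400 or status <= 0:
            failed.append({"url": entry.get("request", {}).get("url", ""), "status": status})
    return {"total": len(entries), "failed": len(failed), "failures": failed}


def summarize_har_file(path: str) -> dict:
    """Load a HAR file from disk and summarize it; a missing file yields zero counts."""
    p = pathlib.Path(path)
    if not p.exists():
        return {"total": 0, "failed": 0, "failures": []}
    return summarize_har(json.loads(p.read_text(encoding="utf-8")))
```

The returned counts can be interpolated into the report next to the verdict, for example from `summarize_har_file(f"{RESULTS}/network.har")`. Playwright writes the HAR only when the browser context is closed, so this has to run after the harness finishes.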
```python
import base64, json, pathlib
from html import escape  # keeps step names and error text from breaking the markup

RESULTS = "{{results_dir}}"
steps = json.load(open(f"{RESULTS}/steps.json"))

def img(name):
    """Return the screenshot as an inline base64 <img>, or "" if it is missing."""
    p = pathlib.Path(f"{RESULTS}/screenshots/{name}")
    if not (name and p.exists()):
        return ""
    b = base64.b64encode(p.read_bytes()).decode()
    return f'<img src="data:image/png;base64,{b}" alt="{escape(name)}" style="max-width:100%">'

passed = sum(1 for s in steps if s["status"] == "pass"); total = len(steps)
verdict = "PASS" if passed == total and total > 0 else "FAIL"
# Minimal markup; restyle freely. Inline styles only, so the file stays self-contained.
rows = "".join(
    f'<div>'
    f'<h3>{i+1}. {escape(s["name"])} ({escape(s["action"])}) '
    f'<b>{s["status"].upper()}</b></h3>'
    f'<p>console errors: {s.get("console_errors",0)}{" · "+escape(s["error"]) if s.get("error") else ""}</p>'
    f'<div>{img(s.get("screenshot"))}</div>'
    f'</div>'
    for i, s in enumerate(steps))
html = f"""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Canary QA Report</title></head>
<body>
<h1>Canary QA Report</h1>
<p>Target: {{target_url}}</p>
<p>Verdict: <b>{verdict}</b>  ({passed}/{total} steps)</p>
{rows}
<p>Evidence: trace.zip (playwright show-trace) · network.har · replay.py · console.log</p>
</body></html>
"""
pathlib.Path(f"{RESULTS}/report.html").write_text(html, encoding="utf-8")
print("wrote report.html", verdict)
```

---

## Step 5: Evaluate, Validate & Iterate (max 3 rounds)

| Status | Criteria |
|--------|----------|
| `PASS` | The flow drove ≥ 2 steps, evidence was captured for each (screenshot + console + the shared trace.zip + network.har), every assertion step ran a real check, `report.html` and `replay.py` both exist and are non-empty, and `replay.py` smoke-runs without import/syntax errors. |
| `PARTIAL` | The flow ran but a non-blocking step failed (e.g. one optional assertion), or `replay.py` reproduces only part of the flow. Report which step and why. |
| `FAIL` | The browser couldn't drive the flow at all (target unreachable, every step errored), or `report.html`/`replay.py` is missing. |

If a step failed on a brittle selector, re-read the live DOM and fix the selector (max 3
rounds), then re-run. Write `validation_report.json`:

```json
{
  "version": "1.0.0",
  "run_date": "<run date>",
  "parameters": { "target_url": "{{target_url}}" },
  "stages": [
    { "name": "setup", "passed": true, "message": "playwright + chromium ready" },
    { "name": "drive", "passed": true, "message": "N steps driven, evidence captured" },
    { "name": "replay", "passed": true, "message": "replay.py written and smoke-runs" },
    { "name": "report", "passed": true, "message": "report.html + all artifacts written" }
  ],
  "results": { "steps_total": 0, "steps_passed": 0, "verdict": "PASS|FAIL" },
  "overall_passed": true
}
```

`overall_passed` is `true` iff every stage passed and `report.html` + `replay.py` exist.


---

## Step 6: Write Executive Summary

Write `{{results_dir}}/summary.md`:

```markdown
# Canary QA — Results

## Overview
- **Date**: <date> · **Target**: {{target_url}}
- **Flow**: <flow>
- **Verdict**: PASS|FAIL · **Steps**: <passed>/<total>

## Steps
| # | Step | Action | Status | Console errors |
|---|------|--------|--------|----------------|

## Most important finding
<the single most important finding>

## Artifacts
- report.html (self-contained) · replay.py (CI-ready) · trace.zip · network.har · console.log
```

---

## Final Checklist (MANDATORY — do not skip)

### Verification Script

```bash
echo "=== FINAL OUTPUT VERIFICATION ==="
RESULTS_DIR="{{results_dir}}"
for f in "$RESULTS_DIR/report.html" "$RESULTS_DIR/replay.py" "$RESULTS_DIR/steps.json" \
         "$RESULTS_DIR/trace.zip" "$RESULTS_DIR/console.log" "$RESULTS_DIR/network.har" \
         "$RESULTS_DIR/summary.md" "$RESULTS_DIR/validation_report.json"; do
  [ -s "$f" ] && echo "PASS: $f ($(wc -c < "$f") bytes)" || echo "FAIL: $f is missing or empty"
done
SHOTS=$(ls "$RESULTS_DIR"/screenshots/*.png 2>/dev/null | wc -l | tr -d ' ')
[ "$SHOTS" -ge 2 ] && echo "PASS: $SHOTS step screenshots" || echo "FAIL: too few screenshots ($SHOTS)"
python3 -c "import ast; ast.parse(open('$RESULTS_DIR/replay.py').read()); print('PASS: replay.py parses')" || echo "FAIL: replay.py has a syntax error"
echo "=== VERIFICATION COMPLETE ==="
```

### Checklist

- [ ] The flow drove ≥ 2 intent-named steps against the live target
- [ ] Every step has a screenshot; console + network + the shared `trace.zip` were captured
- [ ] Each assertion step ran a real check (visible text / URL / state), not just no-throw
- [ ] `report.html` is self-contained (screenshots inlined, opens with no external assets)
- [ ] `replay.py` reproduces the flow and parses/smoke-runs cleanly
- [ ] `summary.md` states the verdict and the single most important finding
- [ ] `validation_report.json` has `stages`, `results`, `overall_passed`

**If ANY item fails, go back and fix it. Do NOT finish until all items pass.**

---

## Tips

- **Observe before you drive.** Read the live DOM for real selectors; don't guess. Canary's
  edge is exploring the actual page, not replaying a brittle pre-written script.
- **An assertion must assert.** A step that "checks" something has to verify the page (text
  visible, URL changed, element enabled) — a step that merely doesn't throw is an action,
  not a check. Console errors are a check too: a clean flow has zero.
- **Hand back both.** The whole point is a readable `report.html` AND the exact `replay.py`.
  The report is for a human; the script runs in CI with zero agent cost on replay.
- **Small, intent-named steps.** "log in", "add to cart", "cart shows 1" — not
  "click #btn-3". Intent names make the report and the trace readable.
- **Headless, deterministic targets.** Public test apps (TodoMVC, the-internet, saucedemo)
  are ideal: stable selectors, no auth walls, designed to be driven.
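
To make the "an assertion must assert" tip concrete, checks can be written as small factories that return a `check(page) -> bool` callable, the shape the Step 2 harness accepts for its `check` argument. A minimal sketch; the helper names `url_contains`, `text_visible`, and `no_console_errors` are hypothetical, and the only Playwright calls used are `page.url`, `page.get_by_text(...)`, `Locator.first`, and `Locator.is_visible()`:

```python
from typing import Callable, List


def url_contains(fragment: str) -> Callable:
    """Check that the current page URL contains the given fragment."""
    return lambda page: fragment in page.url


def text_visible(text: str) -> Callable:
    """Check that some element with the given text is visible right now."""
    # .first avoids a strict-mode error when several elements match the text.
    return lambda page: page.get_by_text(text).first.is_visible()


def no_console_errors(console_msgs: List[dict], since: int = 0) -> Callable:
    """Check that no console error or page error was recorded from index `since` on."""
    def check(_page) -> bool:
        return not any(m["type"] in ("error", "pageerror") for m in console_msgs[since:])
    return check
```

With the Step 2 harness, a step such as `step("cart shows 1", "assert", lambda pg: None, check=text_visible("1 item"))` then fails whenever the text is absent, instead of passing because nothing threw. The text `"1 item"` is a made-up example.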