---
version: "1.0.0"
evaluation: programmatic
agent: claude-code
model: claude-sonnet-4-6
model_provider: anthropic
snapshot: prism-playwright
# The headline deliverable — the self-contained QA report.
primary_outputs:
- report.html
- replay.py
origin:
url: "https://github.com/wizenheimer/canary"
source_host: "github.com"
source_title: "Canary — QA harness for Claude Code"
imported_at: "2026-06-07T00:00:00Z"
imported_by: "skill-to-runbook-converter@1.0.0"
attribution:
collection_or_org: "wizenheimer"
skill_name: "canary"
author: "Canary contributors (usecanary) — github.com/wizenheimer/canary"
license: "MIT"
confidence: "high"
secrets: {}
---
# Canary — Agent-Driven Browser QA — Agent Runbook
> Converted, with attribution, from **Canary** (github.com/wizenheimer/canary, MIT) — a QA
> harness for coding agents. Canary itself ships a CLI + a QuickJS-WASM Playwright sandbox +
> a daemon; this runbook re-implements its core idea with plain **Playwright** so it runs in
> the Jetty sandbox: describe a flow, the agent drives a real browser, and you get back a
> self-contained report **and** a reusable replay script.
> **EXECUTE THIS RUNBOOK NOW.** Drive the browser with tools and write every deliverable to
> `{{results_dir}}`. This is a task to perform, not a document to summarize. Your first
> action is a tool call (Step 1).
## Inputs (already provided)
- **Target URL:** {{target_url}} — where the flow starts.
- **Flow:** {{flow}} — the user journey to QA, in plain language, with the checks that must
hold (visible text / URL / element state / no console error).
- **Credentials (optional):** {{credentials}} — e.g. `user=...,pass=...` for a login step.
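The `{{credentials}}` value arrives as one string. A minimal sketch of splitting it into fields, assuming the comma-separated `key=value` shape shown above; the function name `parse_credentials` is hypothetical, and values that themselves contain commas are not handled.

```python
def parse_credentials(raw: str) -> dict[str, str]:
    """Split 'user=alice,pass=s3cret' into {'user': 'alice', 'pass': 's3cret'}."""
    creds = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue  # tolerate an empty string or a trailing comma
        key, sep, value = pair.partition("=")
        if not sep:
            # Do not echo the offending fragment: it may be part of a secret.
            raise ValueError("expected comma-separated key=value pairs")
        creds[key.strip()] = value  # keep the value verbatim (it may contain '=')
    return creds
```

Splitting on the first `=` only means a password such as `a=b` survives intact.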
## Objective
QA a described user flow against a live web app the way Canary does: the agent drives a real
(headless) browser through small, intent-named steps, and **captures evidence at every step**
— a screenshot, console messages, and network activity — plus a Playwright **trace** for the
whole session. Each step that encodes a check is an assertion (visible text, URL, element
state, no console error). The run produces two things Canary insists on having together: a
**report you can just read** (`report.html`, self-contained) and the **exact reusable script**
(`replay.py`) that reproduces the flow in CI with zero agent cost. Don't make the user choose
between an opaque agent run and hand-written Playwright — hand back both.
---
## REQUIRED OUTPUT FILES (MANDATORY)
**You MUST write all of the following to `{{results_dir}}`. The task is NOT complete until
every file exists and is non-empty. No exceptions.**
| File | Description |
|------|-------------|
| `{{results_dir}}/report.html` | Self-contained QA report: per-step status, the inline screenshot of each step, console errors, a network summary, and the overall verdict. Open it, commit it, send it. |
| `{{results_dir}}/replay.py` | The reusable Playwright script that reproduces the flow exactly — re-runnable in CI with no agent cost. |
| `{{results_dir}}/steps.json` | Structured per-step results: name, action, check, status, screenshot path, console errors. |
| `{{results_dir}}/trace.zip` | The Playwright trace for the whole session (open with `playwright show-trace`). |
| `{{results_dir}}/console.log` | All browser console messages captured during the run. |
| `{{results_dir}}/network.har` | The network HAR for the session. |
| `{{results_dir}}/summary.md` | Executive summary: flow, verdict, steps passed/failed, the single most important finding. |
| `{{results_dir}}/validation_report.json` | Stage-by-stage validation with `overall_passed`. See Step 5. |
Screenshots go in `{{results_dir}}/screenshots/`. If you finish but have not written every
file, go back and write it.
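The "exists and is non-empty" rule can be checked mechanically before declaring the task done. A small sketch, assuming the file list from the table above; the helper name `missing_outputs` is hypothetical.

```python
import pathlib

# File names taken from the required-outputs table.
REQUIRED = [
    "report.html", "replay.py", "steps.json", "trace.zip",
    "console.log", "network.har", "summary.md", "validation_report.json",
]

def missing_outputs(results_dir: str) -> list[str]:
    """Return the required files that are absent or empty, in table order."""
    root = pathlib.Path(results_dir)
    return [
        name for name in REQUIRED
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]
```

An empty return value means every deliverable is in place; anything else names the files still to write.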
---
## Parameters
| Parameter | Template Variable | Default | Description |
|-----------|------------------|---------|-------------|
| Results directory | `{{results_dir}}` | `/app/results` (Jetty) / `./results` (local) | Output directory |
| Target URL | `{{target_url}}` | *(required)* | Where the flow starts |
| Flow | `{{flow}}` | *(required)* | The plain-language flow + the checks that must hold |
| Credentials | `{{credentials}}` | *(optional)* | Login creds if the flow needs them |
| Headless | `{{headless}}` | `true` | Run the browser headless (always true on Jetty) |
---
## Dependencies
| Dependency | Type | Required | Description |
|------------|------|----------|-------------|
| `playwright` (Python) + Chromium | Runtime | Yes | Pre-installed on the `prism-playwright` snapshot |
---
## Step 1: Environment Setup
```bash
mkdir -p "{{results_dir}}/screenshots"
python -c "from importlib.metadata import version; print('playwright', version('playwright'))" || python -m pip install --quiet playwright
python -m playwright install chromium 2>/dev/null || true
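SITE="{{target_url}}"
# Reject an empty value or a placeholder that was never filled in. Matching on a leading
# brace (rather than the full placeholder text) keeps the guard valid after template
# substitution, which would otherwise rewrite both sides of the comparison.
case "$SITE" in
  ""|"{"*) echo "ERROR: no target_url provided"; exit 1 ;;
esac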
echo "QA target: $SITE"
```
---
## Step 2: Explore, then Drive the Flow
First **observe** the target: fetch the page, note the real selectors for the elements the
flow touches (don't guess — read the DOM). Then translate the plain-language `{{flow}}` into
small, intent-named steps and drive them with the harness below, capturing evidence per step.
Each step is either an **action** (navigate, click, fill) or an **assertion** (a check from
the flow). Re-read selectors from the live page if one fails (max 3 retries per step), the way
Canary's explore-and-record loop does.
```python
# Canary-style harness: one screenshot + console + per-step status, plus a full trace.
import json, pathlib, sys
from playwright.sync_api import sync_playwright

RESULTS = "{{results_dir}}"
TARGET = "{{target_url}}"
pathlib.Path(f"{RESULTS}/screenshots").mkdir(parents=True, exist_ok=True)
console_msgs, steps = [], []

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(record_har_path=f"{RESULTS}/network.har", viewport={"width": 1280, "height": 800})
    context.tracing.start(screenshots=True, snapshots=True, sources=True)
    page = context.new_page()
    page.on("console", lambda m: console_msgs.append({"type": m.type, "text": m.text}))
    page.on("pageerror", lambda e: console_msgs.append({"type": "pageerror", "text": str(e)}))

    def step(name, action, fn, check=None):
        """Run one intent-named step; screenshot + record status. check() -> bool|None."""
        rec = {"name": name, "action": action, "status": "pass", "console_errors": 0, "error": None}
        seen = len(console_msgs)  # so that only errors logged during this step are counted
        try:
            fn(page)
            page.wait_for_timeout(400)
            if check is not None:
                rec["status"] = "pass" if check(page) else "fail"
        except Exception as e:
            rec["status"] = "fail"
            rec["error"] = str(e)[:300]
        shot = f"{RESULTS}/screenshots/{len(steps)+1:02d}-{name.replace(' ','_')[:40]}.png"
        try:
            page.screenshot(path=shot, full_page=False)
        except Exception:
            shot = None
        rec["screenshot"] = (shot.split("/")[-1] if shot else None)
        rec["console_errors"] = sum(1 for m in console_msgs[seen:] if m["type"] in ("error", "pageerror"))
        steps.append(rec)
        print(f" [{rec['status'].upper()}] {name}")
        return rec

    # ---- EXAMPLE shape: REPLACE these with the steps for {{flow}} ----
    step("open the app", "navigate", lambda pg: pg.goto(TARGET, wait_until="domcontentloaded"))
    # step("submit the form", "click", lambda pg: pg.click("button[type=submit]"))
    # step("result is visible", "assert", lambda pg: None,
    #      check=lambda pg: pg.get_by_text("Success").is_visible())
    # ------------------------------------------------------------------

    # Stop tracing and close the context so trace.zip and network.har are flushed to disk.
    context.tracing.stop(path=f"{RESULTS}/trace.zip")
    context.close(); browser.close()

pathlib.Path(f"{RESULTS}/console.log").write_text("\n".join(f"[{m['type']}] {m['text']}" for m in console_msgs))
pathlib.Path(f"{RESULTS}/steps.json").write_text(json.dumps(steps, indent=2))
passed = sum(1 for s in steps if s["status"] == "pass")
print(f"verdict: {passed}/{len(steps)} steps passed")
```
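The retry rule from the start of this step (re-read selectors, at most 3 retries per step) can be sketched as a small helper. This is an illustration, not Canary's actual loop: the name `first_working` and the idea of passing an ordered list of candidate selectors are assumptions.

```python
def first_working(candidates, attempt, max_retries=3):
    """Call attempt(selector) on each candidate in order until one succeeds.

    Makes at most 1 + max_retries attempts. Returns (selector, result) for the
    first attempt that does not raise; re-raises the last error otherwise.
    """
    last_error = None
    for selector in candidates[: 1 + max_retries]:
        try:
            return selector, attempt(selector)
        except Exception as exc:  # Playwright raises on timeouts and missing elements
            last_error = exc
    if last_error is None:
        raise ValueError("no candidate selectors given")
    raise last_error
```

Inside the harness it could wrap an action, for example `step("submit the form", "click", lambda pg: first_working(["button[type=submit]", "text=Submit"], lambda sel: pg.click(sel, timeout=3000)))`, where both selectors are placeholders for ones read from the live DOM.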
Drive **real** assertions from the flow's checks (visible text, URL, element state). A step
with no check is an action; a step that encodes a check is an assertion, and its `check` must
inspect the page and return a boolean, not merely complete without raising.
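One way to keep assertion steps honest is to build each `check` from a small factory that inspects the page. A sketch under assumptions: the factory names are hypothetical, and `page` is the Playwright `Page` that the harness's `step()` passes in. Only `page.url`, `page.get_by_text(...)`, `page.locator(...)`, `Locator.first`, `Locator.is_visible()` and `Locator.is_enabled()` are used.

```python
def url_contains(fragment):
    """Check: the current URL contains the given fragment."""
    return lambda page: fragment in page.url

def text_visible(text):
    """Check: the first element carrying this text is visible."""
    return lambda page: page.get_by_text(text).first.is_visible()

def element_enabled(selector):
    """Check: the first element matched by the selector is enabled."""
    return lambda page: page.locator(selector).first.is_enabled()

def no_new_console_errors(console_msgs, since):
    """Check: no console error or page error was recorded after index `since`."""
    return lambda page: not any(
        m["type"] in ("error", "pageerror") for m in console_msgs[since:]
    )
```

A step then reads `step("dashboard is shown", "assert", lambda pg: None, check=url_contains("/dashboard"))`, with `/dashboard` standing in for whatever the flow actually specifies.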
---
## Step 3: Write the Reusable Replay Script
Write `{{results_dir}}/replay.py` — a standalone Playwright script (no agent, no harness) that
reproduces the exact flow and exits non-zero if any assertion fails. This is the artifact that
runs in CI with zero inference cost.
```python
# replay.py (generated): reproduces the QA flow headless and asserts each check.
from playwright.sync_api import sync_playwright, expect

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_context().new_page()
        page.goto("{{target_url}}")
        # ... the exact steps + expect(...) assertions from Step 2 ...
        browser.close()

if __name__ == "__main__":
    main()
```
It must mirror Step 2's steps one-to-one. Smoke-test it: `python {{results_dir}}/replay.py`
should exit 0 on a passing flow.
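An uncaught `expect(...)` failure raises `AssertionError`, which already ends the script with a non-zero exit code, but it stops at the first failed check. A sketch of an alternative that runs every check and reports all failures; `run_checks` and the `(name, callable)` pairing are assumptions, not part of Canary or Playwright.

```python
import sys

def run_checks(checks):
    """Run (name, callable) pairs in order; return the names of those that raised."""
    failed = []
    for name, fn in checks:
        try:
            fn()
            print(f"[PASS] {name}")
        except Exception as exc:  # includes the AssertionError raised by expect(...)
            failed.append(name)
            print(f"[FAIL] {name}: {exc}", file=sys.stderr)
    return failed

# Inside main(), after the actions have run (the check shown is a placeholder):
#   failed = run_checks([
#       ("result is visible", lambda: expect(page.get_by_text("Success")).to_be_visible()),
#   ])
#   browser.close()
#   sys.exit(1 if failed else 0)
```

Collecting failures before exiting gives CI logs the full list of broken checks in a single run.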
---
## Step 4: Build `report.html`
Write a **self-contained** `{{results_dir}}/report.html` (no external assets — inline the
screenshots as base64). Include: the flow + target URL, the overall verdict (pass/fail), and
for each step its name, action, status, inline screenshot, and console-error count; plus a
network summary (total requests, failures) and a footer noting `trace.zip` / `replay.py`.
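For the network summary, the counts can be read from `network.har`, which Playwright writes when the browser context is closed. A minimal sketch, assuming the standard HAR layout (`log.entries[].response.status`); the helper name `summarize_har` is hypothetical, and treating a status of 0 or below as "no response received" is an assumption about how failed requests are recorded.

```python
import json

def summarize_har(har: dict) -> dict:
    """Count total and failed requests in a parsed HAR document."""
    entries = har.get("log", {}).get("entries", [])

    def failed(entry):
        status = entry.get("response", {}).get("status", 0)
        # Assumption: <= 0 means the request never got a response; >= 400 is an HTTP error.
        return status <= 0 or status >= 400

    return {"total": len(entries), "failed": sum(1 for e in entries if failed(e))}

# Usage (path is illustrative):
#   with open("network.har", encoding="utf-8") as fh:
#       summary = summarize_har(json.load(fh))
```

The two counts can then be rendered as a one-line network summary in the report.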
```python
import base64, json, pathlib
from html import escape

RESULTS = "{{results_dir}}"
steps = json.load(open(f"{RESULTS}/steps.json"))

def img(name):
    # Inline the screenshot as a base64 data URI so the report needs no external assets.
    p = pathlib.Path(f"{RESULTS}/screenshots/{name}")
    if not (name and p.exists()):
        return ""
    b = base64.b64encode(p.read_bytes()).decode()
    return f'<img src="data:image/png;base64,{b}" style="max-width:480px">'

passed = sum(1 for s in steps if s["status"] == "pass"); total = len(steps)
verdict = "PASS" if passed == total and total > 0 else "FAIL"
# Minimal markup; the exact tags and styling here are an assumption, adjust as needed.
rows = "".join(
    f'<tr><td>{escape(s["name"])}</td><td>{escape(s["action"])}</td><td>{s["status"]}</td>'
    f'<td>{s["console_errors"]}</td><td>{img(s["screenshot"])}</td></tr>'
    for s in steps
)
html = f"""<!doctype html>
<html><head><meta charset="utf-8"><title>QA report</title></head><body>
<h1>QA report</h1>
<p>Target: {{target_url}}</p>
<p>Verdict: <b>{verdict}</b> ({passed}/{total} steps)</p>
<table border="1" cellpadding="6">
<tr><th>Step</th><th>Action</th><th>Status</th><th>Console errors</th><th>Screenshot</th></tr>
{rows}
</table>
<p>Evidence: trace.zip (playwright show-trace) · network.har · replay.py · console.log</p>
</body></html>"""
pathlib.Path(f"{RESULTS}/report.html").write_text(html, encoding="utf-8")
print("wrote report.html", verdict)
```
---
## Step 5: Evaluate, Validate & Iterate (max 3 rounds)
| Status | Criteria |
|--------|----------|
| `PASS` | The flow drove ≥ 2 steps, evidence was captured for each (screenshot + console + the shared trace.zip + network.har), every assertion step ran a real check, `report.html` and `replay.py` both exist and are non-empty, and `replay.py` smoke-runs without import/syntax errors. |
| `PARTIAL` | The flow ran but a non-blocking step failed (e.g. one optional assertion), or `replay.py` reproduces only part of the flow. Report which step and why. |
| `FAIL` | The browser couldn't drive the flow at all (target unreachable, every step errored), or `report.html`/`replay.py` is missing. |

If a step failed on a brittle selector, re-read the live DOM and fix the selector (max 3
rounds), then re-run. Write `validation_report.json`:
```json
{
  "version": "1.0.0",
  "run_date": "