---
version: "1.1.0"
evaluation: rubric
agent: claude-code
model: claude-sonnet-4-6
model_provider: anthropic
snapshot: python312-uv
origin:
  attribution:
    collection_or_org: jettyio
    skill_name: ambient-scribe-quality-gate
    confidence: high
secrets:
  ANTHROPIC_API_KEY:
    env: ANTHROPIC_API_KEY
    description: "Anthropic API key for the scribe-under-test and the independent judge (via litellm). Get one at https://console.anthropic.com/settings/keys."
    required: true
  OPENAI_API_KEY:
    env: OPENAI_API_KEY
    description: "Optional. If set, use a cross-vendor judge (openai/...) for stronger judge independence. Fallback scribe model if no Anthropic key."
    required: false
---

# "Ambient Scribe Quality Gate" — SOAP Note Hill-Climb — Agent Runbook

## Objective

Run the eval an ambient AI medical-scribe team actually cares about, and make
the *hill-climb visible*. The agent generates a **panel of synthetic clinical encounters** (a
"clinic day"), and for each one it: synthesizes a realistic, messy doctor–patient **transcript**
from a hidden **ground-truth case sheet**, has a **scribe-under-test** write a **SOAP note from the
transcript only**, then an **independent judge** scores the note against the ground truth on a
5-criterion clinical rubric. The note starts from a deliberately thin baseline prompt and the agent
**hill-climbs the weakest criterion** (≤3 rounds per encounter) until it clears the quality gate —
logging every iteration's scores so the 3.x → 4.x climb is legible in the Jetty trajectory.

The point of the demo is twofold and both are the deliverable:
1. **Long-running agentic workflow** — `panel_size` encounters × (generate → scribe → judge →
   iterate) is dozens of sequential model calls, run durably in a sandbox.
2. **Hill-climb with a runbook** — a declared rubric + pass threshold, self-scored, with the
   weakest dimension re-rolled until the gate is met. The **iteration log is the centerpiece.**

**The honesty mechanic:** the rubric has a **hard hallucination floor** — any clinical claim in the
note not supported by the transcript caps the score. The climb cannot rubber-stamp itself by
padding; it has to *earn* points by grounding every statement.

> **SYNTHETIC DATA ONLY.** Every patient, encounter, name, and value is fictional and generated
> in-workflow. No real PHI, no real MRNs, no real people. This is a documentation-quality eval, not
> clinical advice and not a medical device.

---

## EVAL CONFIG

```yaml
PANEL:        a "clinic day" of synthetic encounters spanning specialties + complexity + a trap
ROLES:
  - CASE AUTHOR:   writes the hidden ground-truth case sheet (structured clinical facts)
  - TRANSCRIPTIONIST: synthesizes a messy ambient transcript that *contains* every ground-truth
                      fact (so completeness is achievable) plus an embedded TRAP (so groundedness
                      is testable) — never in SOAP order, with filler/interruptions/tangents
  - SCRIBE UNDER TEST: writes a SOAP note from the TRANSCRIPT ONLY (never sees the case sheet).
                       Its system prompt is the thing being hill-climbed.
  - JUDGE:         scores the note against BOTH the transcript and the ground-truth case sheet;
                   extracts hallucinated claims + missed facts; must be a DIFFERENT model than the
                   scribe (independence — a same-model judge rubber-stamps its own output).
THE TRAP (each case embeds exactly one, to make the rubric discriminating):
  - a med the patient says they STOPPED taking (note must not list it as active)
  - a symptom the patient explicitly DENIES (a pertinent negative, not a positive)
  - a value stated INDIRECTLY ("my sugar's been running around 200") — capture, don't invent a lab
  - a family-member's condition mentioned in passing (belongs in FH, not the patient's PMH)
THE BASELINE (deliberately thin, so there is room to climb):
  scribe_system = "You are a medical scribe. Write a SOAP note from this visit transcript."
RUBRIC_1: completeness  — captures the clinically salient ground-truth facts into the right section
RUBRIC_2: groundedness  — HARD FLOOR: every claim traceable to the transcript; no invented findings
RUBRIC_3: structure     — correct S/O/A/P placement and format; nothing in the wrong section
RUBRIC_4: coding        — ICD-10 / CPT suggestions are defensible and tied to the documented A/P
RUBRIC_5: conciseness   — summarized, scannable, no transcript-dumping or verbatim bloat
```

---

## REQUIRED OUTPUT FILES (MANDATORY)

**Write all of the following to `{{results_dir}}`. The task is NOT complete until every file
exists and is non-empty.**

| File | Description |
|------|-------------|
| `{{results_dir}}/case_sheets.json` | The N hidden ground-truth case sheets (synthetic facts + the trap) |
| `{{results_dir}}/transcripts/encounter_NN.md` | The messy ambient transcript for each encounter |
| `{{results_dir}}/notes/encounter_NN.md` | The final (best-iteration) SOAP note for each encounter |
| `{{results_dir}}/scorecards/encounter_NN.json` | Per-encounter rubric scores + **iteration_log** (the climb) |
| `{{results_dir}}/leaderboard.md` | Aggregate per-criterion means, pass rate, hallucination incidents across the panel |
| `{{results_dir}}/summary.md` | Executive summary: the climb story, weakest→strongest, the winning scribe prompt |
| `{{results_dir}}/scribe_prompt_final.md` | The hill-climbed scribe system prompt (the artifact a real team would ship) |
| `{{results_dir}}/validation_report.json` | Structured results + `overall_passed` |

If you finish but have not written all files, go back and write them before stopping.

---

## Parameters

| Parameter | Template Variable | Default | Description |
|-----------|------------------|---------|-------------|
| Results directory | `{{results_dir}}` | `/app/results` | Output directory |
| Panel size | `{{panel_size}}` | `8` | Number of synthetic encounters (the "clinic day"). 4 for a quick demo, 8–12 for a real run. |
| Scribe model | `{{scribe_model}}` | `anthropic/claude-sonnet-4-6` | litellm model string for the note writer (the workhorse under test) |
| Judge model | `{{judge_model}}` | `anthropic/claude-opus-4-8` | litellm model string for the judge. **Must differ from scribe** for independence. |
| Max iterations | `{{max_iterations}}` | `3` | Hill-climb rounds per encounter |
| Pass threshold | `{{pass_threshold}}` | `4.0` | Overall average to pass; also: no criterion < 3 AND zero hallucinations on the final note |

---

## Dependencies

| Dependency | Required | Description |
|------------|----------|-------------|
| `ANTHROPIC_API_KEY` | Yes | Auth for scribe + judge (via litellm). |
| `litellm` (pip) | Yes | Provider-agnostic LLM calls (Anthropic / OpenAI / etc.) |
| `OPENAI_API_KEY` | No | Enables a cross-vendor judge for stronger independence. |

---

## Step 1: Environment Setup

```bash
pip install -q litellm
mkdir -p {{results_dir}}/transcripts {{results_dir}}/notes {{results_dir}}/scorecards
# Shared-panel mode (for an apples-to-apples model bake-off): if a panel was uploaded
# (case_sheets.json + encounter_*.md transcripts), reuse it verbatim and SKIP generation so
# every scribe grades IDENTICAL transcripts. Otherwise a fresh panel is generated below.
PANEL="$(find / -name 'case_sheets.json' -not -path '*/results/*' 2>/dev/null | head -1)"
if [ -n "$PANEL" ]; then
  cp "$PANEL" {{results_dir}}/case_sheets.json
  find / -name 'encounter_*.md' -not -path '*/results/*' 2>/dev/null | while read -r f; do cp "$f" {{results_dir}}/transcripts/ 2>/dev/null || true; done
  echo "SHARED PANEL DETECTED ($PANEL): $(ls {{results_dir}}/transcripts 2>/dev/null | wc -l) transcripts copied — generation will be SKIPPED."
else
  echo "No uploaded panel found — a fresh synthetic panel will be generated."
fi
if [ -z "$ANTHROPIC_API_KEY" ] && [ -z "$OPENAI_API_KEY" ]; then
  echo "ERROR: need ANTHROPIC_API_KEY (preferred) or OPENAI_API_KEY"; exit 1
fi
python - <<'PY'
import os
has_anthropic = bool(os.environ.get("ANTHROPIC_API_KEY"))
has_openai = bool(os.environ.get("OPENAI_API_KEY"))
print(f"Anthropic key: {'SET' if has_anthropic else 'missing'} | OpenAI key: {'SET' if has_openai else 'missing'}")
print("litellm import:", end=" ")
import litellm  # noqa
print("OK")
PY
```

Notes on model selection (adapt in the scripts below if a key is missing):
- If only `OPENAI_API_KEY` is set, use `openai/gpt-...` strings for scribe + judge.
- Keep **judge ≠ scribe**. A self-judging model rubber-stamps its own note (a known
  weak-model-self-validation failure mode) and the hill-climb will look like it passes when it
  hasn't moved. Cross-vendor (Anthropic scribe + OpenAI judge, or vice-versa) is strongest.

---

## Step 2: The Shared LLM Helper

Write `{{results_dir}}/_llm.py` once; every later step imports it. It wraps `litellm.completion`,
asks for JSON when needed, and retries on transient errors.

```python
# {{results_dir}}/_llm.py
import json, os, time, litellm

def call(model, system, user, json_mode=False, max_tokens=2000, temperature=0.4, retries=3):
    msgs = [{"role": "system", "content": system}, {"role": "user", "content": user}]
    kwargs = {"model": model, "messages": msgs, "max_tokens": max_tokens, "temperature": temperature}
    if json_mode:
        kwargs["response_format"] = {"type": "json_object"}  # ignored by providers that don't support it
    last = None
    for i in range(retries):
        try:
            r = litellm.completion(**kwargs)
            return r.choices[0].message.content
        except Exception as e:
            last = e; time.sleep(2 * (i + 1))
    raise last

def call_json(model, system, user, **kw):
    raw = call(model, system, user, json_mode=True, **kw)
    raw = raw.strip()
    if raw.startswith("```"):                      # strip markdown fences if a model adds them
        raw = raw.split("```", 2)[1].lstrip("json").strip() if "```" in raw[3:] else raw.strip("`")
    try:
        return json.loads(raw)
    except Exception:
        s, e = raw.find("{"), raw.rfind("}")        # last-resort: slice the outermost object
        return json.loads(raw[s:e + 1])
```

---

## Step 3: Generate the Synthetic Case Panel (hidden ground truth)

Write `{{results_dir}}/case_sheets.json`. Vary specialty + complexity, and embed exactly one
**trap** per case. **Fully fictional** — no real people, no real identifiers.

```python
import json, sys, os
sys.path.insert(0, "{{results_dir}}")
from _llm import call_json

# Shared-panel mode: an uploaded panel was copied in Step 1 — reuse it, skip generation.
if os.path.exists("{{results_dir}}/case_sheets.json") and os.path.getsize("{{results_dir}}/case_sheets.json") > 0:
    print("Shared panel present — skipping case-sheet generation."); raise SystemExit

PANEL_SIZE = int("{{panel_size}}")
CASE_MODEL = "{{scribe_model}}"   # the case author can be the workhorse; ground truth is structured

SEEDS = [  # (specialty, complexity, trap_type) — extended/truncated to PANEL_SIZE
    ("family medicine", "low",      "stopped_med"),
    ("cardiology",      "moderate", "denied_symptom"),
    ("pediatrics",      "low",      "indirect_value"),
    ("endocrinology",   "high",     "indirect_value"),
    ("dermatology",     "low",      "family_history"),
    ("psychiatry",      "moderate", "denied_symptom"),
    ("orthopedics",     "moderate", "stopped_med"),
    ("internal med",    "high",     "family_history"),
    ("urgent care",     "moderate", "stopped_med"),
    ("ob/gyn",          "moderate", "indirect_value"),
    ("gastroenterology","high",     "denied_symptom"),
    ("pulmonology",     "moderate", "family_history"),
]
SEEDS = (SEEDS * ((PANEL_SIZE // len(SEEDS)) + 1))[:PANEL_SIZE]

SYS = (
  "You author HIDDEN GROUND-TRUTH case sheets for a SYNTHETIC medical-scribe eval. Everything is "
  "fictional — invent patients, never use real people or real identifiers. Return ONLY JSON.")

cases = []
for i, (spec, cx, trap) in enumerate(SEEDS, 1):
    user = f"""Author one fictional outpatient encounter as a structured case sheet.
Specialty: {spec}. Complexity: {cx}. Embedded trap type: {trap}.

Trap semantics (embed exactly one, matching the type):
- stopped_med: patient mentions a medication they USED to take but STOPPED. Correct note must NOT list it as a current med.
- denied_symptom: patient explicitly DENIES a symptom. Correct note records it as a pertinent NEGATIVE, not a positive.
- indirect_value: patient states a value indirectly ("sugar around 200", "BP's been high at the pharmacy"). Capture as reported; do NOT fabricate a formal lab/vital.
- family_history: a FAMILY MEMBER's condition is mentioned. Belongs in Family History, NOT the patient's PMH.

Return JSON with keys:
  id ("encounter_{i:02d}"), specialty, complexity, trap_type, trap_detail (one sentence describing the exact trap fact),
  demographics ({{age, sex}} — synthetic, no name needed),
  chief_complaint (string),
  hpi_facts (array of short factual strings the clinician would document),
  ros (array of {{system, finding, pertinent_negative: bool}}),
  pmh (array), current_meds (array of {{name, dose}}), allergies (array),
  family_history (array), social_history (array),
  vitals ({{...}}), exam_findings (array of short strings),
  assessment (array of {{problem, icd10}}),
  plan (array of short strings)."""
    case = call_json(CASE_MODEL, SYS, user, max_tokens=2200, temperature=0.7)
    case["id"] = f"encounter_{i:02d}"
    case.setdefault("trap_type", trap)
    cases.append(case)
    print(f"  case {i}/{PANEL_SIZE}: {spec} / {cx} / trap={trap}")

json.dump(cases, open("{{results_dir}}/case_sheets.json", "w"), indent=2)
print(f"Wrote {len(cases)} case sheets.")
```

Spot-check `case_sheets.json`: each case has a clear `trap_detail`, a non-trivial `hpi_facts` list,
and at least one pertinent-negative ROS entry. Regenerate any case that's too thin to discriminate.

---

## Step 4: Synthesize the Ambient Transcripts

For each case, write `{{results_dir}}/transcripts/encounter_NN.md` — a realistic, **messy**
doctor–patient dialogue. The transcript must **contain every ground-truth fact** (so a perfect
note is achievable) **and the trap, stated the tricky way** — but NOT in SOAP order, with natural
filler, interruptions, and tangents. This is what an ambient scribe actually "hears."

```python
import json, sys, os
sys.path.insert(0, "{{results_dir}}")
from _llm import call

CASE_MODEL = "{{scribe_model}}"
cases = json.load(open("{{results_dir}}/case_sheets.json"))
# Shared-panel mode: transcripts already provided — skip synthesis so every scribe sees the same input.
if cases and all(os.path.getsize(f"{{results_dir}}/transcripts/{c['id']}.md") > 0 for c in cases if os.path.exists(f"{{results_dir}}/transcripts/{c['id']}.md")) \
   and all(os.path.exists(f"{{results_dir}}/transcripts/{c['id']}.md") for c in cases):
    print("Shared transcripts present — skipping transcript synthesis."); raise SystemExit

SYS = ("You write realistic ambient clinic-visit TRANSCRIPTS for a synthetic eval. Output a raw "
       "Doctor/Patient dialogue — natural, messy, with greetings, filler, a tangent or interruption, "
       "and facts revealed out of order. Do NOT structure it as a SOAP note. Do NOT label sections.")

for c in cases:
    user = (f"Turn this case sheet into a natural spoken transcript (~25–45 turns). Every fact below "
            f"must surface somewhere in the dialogue, and the TRAP must be spoken the tricky way "
            f"({c.get('trap_type')}: {c.get('trap_detail','')}). Vitals/exam come out as the clinician "
            f"says them aloud. Keep it conversational.\n\nCASE SHEET:\n{json.dumps(c, indent=2)}")
    txt = call(CASE_MODEL, SYS, user, max_tokens=2600, temperature=0.8)
    open(f"{{results_dir}}/transcripts/{c['id']}.md", "w").write(txt)
    print(f"  transcript: {c['id']} ({len(txt)} chars)")
```

Read one transcript end-to-end: is the trap genuinely embedded (e.g. the stopped med is mentioned
as stopped, not as a current med)? If a transcript leaks SOAP structure or omits the trap,
regenerate it.

---

## Step 5: The Baseline Scribe Prompt + the Hill-Climb Fix Library

The scribe starts thin on purpose. The **fix library** is the menu the climb draws from — each entry
targets one rubric criterion. Write `{{results_dir}}/scribe_prompt_final.md` at the end with the
prompt that won.

```python
BASELINE_SCRIBE = "You are a medical scribe. Write a SOAP note from this visit transcript."

# Each fix is appended to the scribe prompt when its criterion is the weakest unmet one.
FIX_LIBRARY = {
  "completeness": (
    "Capture every clinically salient fact: chief complaint, all HPI elements, pertinent positives "
    "AND negatives from the review of systems, current meds with doses, allergies, vitals, exam "
    "findings, assessment, and plan. Do not drop documented facts."),
  "groundedness": (
    "Every statement MUST be traceable to something said in the transcript. Never infer or invent "
    "labs, vitals, doses, or findings that were not stated. If the patient says they STOPPED a med, "
    "do not list it as current. If a value is given indirectly ('around 200'), record it as the "
    "patient reported it — do not fabricate a formal lab result. If something was not documented, "
    "omit it or write 'not documented' — never guess."),
  "structure": (
    "Use a strict SOAP template. SUBJECTIVE = what the patient reports (HPI, ROS, histories). "
    "OBJECTIVE = vitals + exam findings only. ASSESSMENT = problems/diagnoses. PLAN = orders, "
    "meds, follow-up. Put a family member's condition under Family History, never the patient's PMH."),
  "coding": (
    "In the Assessment, suggest an ICD-10 code for each documented problem and a plausible E/M CPT "
    "code for the visit. Every code must be defensible from documented findings — no codes for "
    "undocumented or ruled-out conditions."),
  "conciseness": (
    "Summarize; do not transcribe. No verbatim patient quotes unless clinically necessary. Each "
    "section scannable. Convey the signal, not the whole conversation."),
}
print("Baseline scribe prompt loaded; fix library has:", list(FIX_LIBRARY))
```

---

## Step 6: The Judge (rubric + hard hallucination floor)

The judge sees the **transcript + the hidden case sheet + the note** and returns structured scores
plus extracted **hallucinated claims** and **missed facts**. The hard floor is then applied
**deterministically in code** (not left to the judge's gestalt) so the gate can't be talked past.

```python
import sys
sys.path.insert(0, "{{results_dir}}")
from _llm import call_json

JUDGE_MODEL = "{{judge_model}}"
PASS = float("{{pass_threshold}}")

JUDGE_SYS = (
  "You are a STRICT clinical-documentation auditor. You are given a ground-truth case sheet, the "
  "visit transcript, and a SOAP note written from the transcript. Score the note. Be adversarial "
  "about hallucinations: list EVERY claim in the note not supported by the transcript. Return ONLY JSON.")

RUBRIC = """Score each 1-5 (5=excellent, 3=acceptable, 1=poor):
- completeness: fraction of ground-truth salient facts captured in the correct section
- groundedness: every claim traceable to the transcript (no invented findings/labs/doses); trap handled correctly
- structure: correct S/O/A/P placement and format
- coding: ICD-10/CPT suggestions defensible and tied to the documented A/P
- conciseness: summarized and scannable; no transcript-dumping"""

def judge_note(case, transcript, note):
    user = (f"{RUBRIC}\n\nGROUND TRUTH CASE SHEET:\n{__import__('json').dumps(case)}\n\n"
            f"TRANSCRIPT:\n{transcript}\n\nSOAP NOTE UNDER REVIEW:\n{note}\n\n"
            "Return JSON: {scores:{completeness,groundedness,structure,coding,conciseness}, "
            "hallucinated_claims:[strings], missed_facts:[strings], notes:{<criterion>:<one line>}}")
    v = call_json(JUDGE_MODEL, JUDGE_SYS, user, max_tokens=1800, temperature=0.0)
    s = {k: int(v["scores"][k]) for k in ["completeness","groundedness","structure","coding","conciseness"]}
    halluc = v.get("hallucinated_claims", []) or []
    # HARD FLOOR: any hallucination caps groundedness at 1 and the overall at 2.0
    if halluc:
        s["groundedness"] = 1
    overall = sum(s.values()) / 5.0
    if halluc:
        overall = min(overall, 2.0)
    passed = (overall >= PASS) and all(x >= 3 for x in s.values()) and not halluc
    return {"scores": s, "overall": round(overall, 2), "hallucinated_claims": halluc,
            "missed_facts": v.get("missed_facts", []), "notes": v.get("notes", {}), "passed": passed}
```

| # | Criterion | 5 (Excellent) | 3 (Acceptable) | 1 (Poor) |
|---|-----------|---------------|-----------------|----------|
| 1 | **Completeness** | All salient ground-truth facts captured in the right section | Core facts present, a few minor omissions | Major omissions; the note is unusable |
| 2 | **Groundedness** | Every claim traceable to the transcript; trap handled right | Mostly grounded, soft inferences | **Any hallucination → 1 (hard floor; caps overall at 2.0)** |
| 3 | **Structure** | Clean S/O/A/P; nothing in the wrong section | Mostly right; a misplaced item or two | Sections jumbled or missing |
| 4 | **Coding** | Defensible ICD-10 + CPT tied to documented A/P | Codes present but loose | Wrong/absent codes, or codes for undocumented dx |
| 5 | **Conciseness** | Summarized, scannable, signal-dense | A little bloated | Transcript dump |

**Pass threshold: overall ≥ {{pass_threshold}}, no criterion below 3, and ZERO hallucinations on the final note.**

---

## Step 7: Run the Panel — Scribe → Judge → Hill-Climb (the centerpiece)

For each encounter: write a note from the baseline prompt, judge it, then while it fails and rounds
remain, **append the fix for the weakest criterion**, regenerate, re-judge — logging every
iteration. Persist the best note + the full `iteration_log`.

```python
import json, sys
sys.path.insert(0, "{{results_dir}}")
from _llm import call

SCRIBE_MODEL = "{{scribe_model}}"
MAX_ITERS = int("{{max_iterations}}")
cases = json.load(open("{{results_dir}}/case_sheets.json"))

def write_note(scribe_prompt, transcript):
    return call(SCRIBE_MODEL, scribe_prompt, f"VISIT TRANSCRIPT:\n{transcript}", max_tokens=1800, temperature=0.3)

CRIT_ORDER = ["groundedness", "completeness", "structure", "coding", "conciseness"]  # fix groundedness first

for c in cases:
    transcript = open(f"{{results_dir}}/transcripts/{c['id']}.md").read()
    prompt = BASELINE_SCRIBE
    applied, log, best = [], [], None
    for it in range(1, MAX_ITERS + 1):
        note = write_note(prompt, transcript)
        verdict = judge_note(c, transcript, note)
        log.append({"iteration": it, "applied_fixes": list(applied),
                    "scores": verdict["scores"], "overall": verdict["overall"],
                    "hallucinations": len(verdict["hallucinated_claims"]), "passed": verdict["passed"]})
        if best is None or verdict["overall"] > best["overall"]:
            best = {"note": note, **verdict}
        print(f"  {c['id']} iter {it}: overall={verdict['overall']} "
              f"halluc={len(verdict['hallucinated_claims'])} passed={verdict['passed']}")
        if verdict["passed"]:
            break
        # pick the weakest UNMET criterion not already patched (groundedness prioritized)
        unmet = sorted([k for k in CRIT_ORDER if verdict["scores"][k] < 5 and k not in applied],
                       key=lambda k: (verdict["scores"][k], CRIT_ORDER.index(k)))
        if not unmet:
            break
        target = unmet[0]; applied.append(target)
        prompt = prompt + "\n\n" + FIX_LIBRARY[target]

    open(f"{{results_dir}}/notes/{c['id']}.md", "w").write(best["note"])
    json.dump({"id": c["id"], "specialty": c.get("specialty"), "trap_type": c.get("trap_type"),
               "final": {k: best[k] for k in ["scores","overall","hallucinated_claims","missed_facts","passed","notes"]},
               "iteration_log": log, "winning_fixes": applied},
              open(f"{{results_dir}}/scorecards/{c['id']}.json", "w"), indent=2)
    # the prompt that won this encounter (longest/strongest); keep the most-patched as the shippable artifact
    open("{{results_dir}}/scribe_prompt_final.md", "w").write(
        "# Hill-climbed scribe system prompt\n\n```\n" + prompt + "\n```\n")
    print(f"{c['id']} done: final overall={best['overall']} fixes={applied}")
```

Each `iteration_log` should show the climb — e.g. iter 1 overall 2.0 (a hallucinated lab),
iter 2 after the groundedness fix → 3.4, iter 3 after completeness → 4.2 (pass). If an encounter
*starts* passing (baseline 4.0+), that case isn't discriminating — note it; the panel's value is the
ones that have to climb.

---

## Step 8: Aggregate Leaderboard

Write `{{results_dir}}/leaderboard.md`: per-criterion means at baseline (iter 1) vs final, the
panel pass rate, total hallucination incidents caught, and the average iterations-to-pass.

```python
import json, glob
rows = [json.load(open(f)) for f in sorted(glob.glob("{{results_dir}}/scorecards/*.json"))]
crits = ["completeness","groundedness","structure","coding","conciseness"]

def mean(xs): return round(sum(xs)/len(xs), 2) if xs else 0.0
base = {k: mean([r["iteration_log"][0]["scores"][k] for r in rows]) for k in crits}
final = {k: mean([r["final"]["scores"][k] for r in rows]) for k in crits}
pass_rate = mean([1 if r["final"]["passed"] else 0 for r in rows]) * 100
halluc_caught = sum(il["hallucinations"] for r in rows for il in r["iteration_log"])
iters = mean([len(r["iteration_log"]) for r in rows])

lines = ["# Ambient Scribe Quality Gate — Leaderboard\n",
         f"- Encounters: {len(rows)}  |  Pass rate: {pass_rate:.0f}%  |  Avg iterations/encounter: {iters}",
         f"- Hallucination incidents caught across the climb: {halluc_caught}\n",
         "| Criterion | Baseline (iter 1) | Final | Δ |", "|---|---|---|---|"]
for k in crits:
    lines.append(f"| {k} | {base[k]} | {final[k]} | {round(final[k]-base[k],2):+} |")
lines += ["", "| Encounter | Specialty | Trap | Iters | Final | Passed | Winning fixes |",
          "|---|---|---|---|---|---|---|"]
for r in rows:
    lines.append(f"| {r['id']} | {r.get('specialty','')} | {r.get('trap_type','')} | "
                 f"{len(r['iteration_log'])} | {r['final']['overall']} | "
                 f"{'✅' if r['final']['passed'] else '❌'} | {', '.join(r.get('winning_fixes',[])) or '—'} |")
open("{{results_dir}}/leaderboard.md","w").write("\n".join(lines))
print("\n".join(lines))
```

---

## Step 9: Write Executive Summary

Write `{{results_dir}}/summary.md`: the climb story in plain language — where the panel started,
which criterion was the biggest lever (usually groundedness — the hallucination floor does the
work), the final pass rate, and the **winning scribe prompt** (the artifact a real scribe team would
ship). Include the per-criterion baseline→final table and 1–2 vivid examples (e.g. "encounter_04
invented an A1c of 7.2 the patient never gave; the groundedness fix replaced it with 'patient
reports home glucose ~200', clearing the floor").

---

## Step 10: Write Validation Report

Write `{{results_dir}}/validation_report.json`:

```json
{
  "version": "1.0.0",
  "eval": "ambient-scribe-soap-hillclimb",
  "parameters": {"panel_size": 8, "scribe_model": "{{scribe_model}}", "judge_model": "{{judge_model}}",
                 "max_iterations": 3, "pass_threshold": 4.0},
  "stages": [
    {"name": "setup", "passed": true, "message": ""},
    {"name": "case_panel", "passed": true, "message": "N synthetic ground-truth case sheets w/ traps"},
    {"name": "transcripts", "passed": true, "message": "messy ambient transcripts containing every fact + the trap"},
    {"name": "scribe_judge_climb", "passed": true, "message": "per-encounter hill-climb; hard hallucination floor"},
    {"name": "leaderboard", "passed": true, "message": "baseline vs final per-criterion + pass rate"}
  ],
  "panel": {"encounters": 0, "pass_rate_pct": 0, "avg_iterations": 0.0, "hallucinations_caught": 0},
  "per_criterion_final_mean": {"completeness": 0, "groundedness": 0, "structure": 0, "coding": 0, "conciseness": 0},
  "pass_threshold": 4.0,
  "overall_passed": false,
  "output_files": [
    "{{results_dir}}/case_sheets.json","{{results_dir}}/leaderboard.md","{{results_dir}}/summary.md",
    "{{results_dir}}/scribe_prompt_final.md","{{results_dir}}/validation_report.json"
  ]
}
```

Set `overall_passed` true when the panel pass rate meets your bar (e.g. ≥ 75% of encounters pass the
gate) — and say so explicitly in `summary.md`.

---

## Final Checklist (MANDATORY)

```bash
echo "=== FINAL OUTPUT VERIFICATION ==="
RESULTS_DIR="{{results_dir}}"; FAIL=0
for f in case_sheets.json leaderboard.md summary.md scribe_prompt_final.md validation_report.json; do
  if [ ! -s "$RESULTS_DIR/$f" ]; then echo "FAIL: $f missing/empty"; FAIL=$((FAIL+1)); else echo "PASS: $f ($(wc -c < "$RESULTS_DIR/$f") bytes)"; fi
done
NT=$(ls "$RESULTS_DIR"/transcripts/*.md 2>/dev/null | wc -l | tr -d ' ')
NN=$(ls "$RESULTS_DIR"/notes/*.md 2>/dev/null | wc -l | tr -d ' ')
NS=$(ls "$RESULTS_DIR"/scorecards/*.json 2>/dev/null | wc -l | tr -d ' ')
echo "transcripts=$NT notes=$NN scorecards=$NS"
[ "$NT" -ge 1 ] && [ "$NT" = "$NN" ] && [ "$NN" = "$NS" ] || { echo "FAIL: per-encounter file counts mismatch"; FAIL=$((FAIL+1)); }
[ "$FAIL" -gt 0 ] && { echo "OVERALL: FAIL ($FAIL)"; exit 1; }; echo "OVERALL: PASS"
```

- [ ] `case_sheets.json` written with `panel_size` synthetic cases, each with a `trap_detail`
- [ ] One transcript, one note, one scorecard per encounter (counts match)
- [ ] Every scorecard has an `iteration_log` showing the climb (scores per round)
- [ ] The hard hallucination floor fired at least once across the panel (the gate has teeth)
- [ ] `judge_model` ≠ `scribe_model` (independence)
- [ ] `leaderboard.md` shows baseline→final per-criterion deltas + pass rate
- [ ] `scribe_prompt_final.md` is the hill-climbed prompt; `summary.md` tells the climb story
- [ ] Verification script printed PASS

**Do not finish until every item passes.**

## Tips

- **Synthetic only.** No real PHI ever. This is a documentation-quality eval, not clinical advice.
- **Judge ≠ scribe** is non-negotiable for a credible climb — same-model judging rubber-stamps.
- **The hallucination floor is the whole point.** It's computed in code from the judge's extracted
  claim list, not from the judge's overall vibe — so the climb can't pad its way past the gate.
- **Start the scribe thin.** A strong baseline prompt leaves no room to climb and kills the demo.
  The visible 2.0 → 4.2 arc per encounter is the story worth showing.
- **Groundedness usually moves first and most** — fix it before chasing completeness/coding.
- For a fast walkthrough use `panel_size=4`; for a real run use `8–12` (and budget the minutes —
  that runtime *is* the "long-running agentic workflow" half of the demo).