# Second Opinion v3 — "LLM Council" Design _Status: implemented (v3). v2 (2026-06-03) added the council mechanics; v3 (2026-06-10) swapped the transport onto the Amicus fanout/JSON engine primitives. v2 history lives in git (`V2-COUNCIL-DESIGN.md`, deleted at v3)._ _Design for `SKILL.md` and `MODEL-NOTES.md` of the `second-opinion` skill._ ## 1. Intent Upgrade `second-opinion` by porting the best mechanics of the **LLM Council** web-app pattern — peer cross-review, anonymized ranking, aggregate scoring, and a designated chairman — into the skill, while keeping its existing strengths (model recommendation, sidecar orchestration, tiered accept/deny, reviewed-copy output, and the MODEL-NOTES self-improvement loop). This is the **"full port" (option C)**: cross-review + anonymization + aggregate scoring + non-Claude chairman + per-model inspectable artifacts. ## 2. Core framing & principles - **Second-opinion is a *secondary* review tool.** By the time it runs, **Claude has already given its opinion** in the main conversation. The skill exists to bring in *independent outside* views. - **The council is the *non-Claude* bench by default.** Council members are models from families other than the orchestrator (Gemini, DeepSeek, GPT, etc.). Claude is **not** a first-opinion council member unless the optional "Claude in the council" toggle is on (§5.4) — and even then it is judged but does not vote or chair. - **Claude's role shrinks to orchestrator:** prep material, recommend the council, anonymize, drive the stages, score, present accept/deny, write files. **Claude does not synthesize the verdict** — a council model chairs that (§5.3). - **Subject of review.** The novel cross-review step has models critique **each other's reviews of your artifact** — *not* re-review the artifact. (LLM Council ranks the models' own answers; here the "answers" are the reviews.) - **This is an executed skill, not an app.** All council logic (anonymize → rank → aggregate → chair) is prose workflow Claude performs while driving the `amicus` CLI. v3 note: the *transport* is now engine-native — each review wave is ONE `amicus fanout --json` call returning structured run documents — but scoring, tallying, anonymization, and synthesis remain Claude's manual work. No backend, no parsing code beyond reading JSON fields. Deterministic arithmetic/formatting/schema helpers under `amicus council` (findings validation, tier tally, street-cred, ledger) are sanctioned; judgment, synthesis, anonymization, and de-anonymization remain Claude's inline work. ## 3. What changes vs. v1 | Area | v1 | v3 (this design) | |---|---|---| | Independent reviews | ✅ Phase 2 parallel sidecars | ✅ Stage 1 — now emits a **structured findings list** | | Cross-review | ❌ none | ⭐ **Stage 2** — anonymized peer ranking **+** per-finding adjudication | | Synthesis | Claude synthesizes | ⭐ **Council-model chair** synthesizes; Claude only presents | | Decision tiers | Claude's consensus/divergence read | ⭐ **Peer-validated** tiers (Disputed / Confirmed / Contested / Singleton) | | Scoring | none | ⭐ Reviewer **street-cred** + per-finding **peer-confidence** | | Artifacts | reviewed copy + report | + per-model raw reviews, cross-review matrix, chair verdict (run folder) | | MODEL-NOTES | per-model quirks | + **reviewer-reliability** rolling table feeding recommendations | | Transport (v3) | N parallel `start` calls + prose-scraping | ⭐ one `fanout --json` wave per stage; briefings via `--prompt-file`; JSON status/summary parsing | Preserved unchanged: intake/criteria intake, the MODEL-NOTES operating rules, reviewed-copy vs standalone-report logic, cost guardrail, and the Stage 6 approval-gated MODEL-NOTES update. ## 4. The council flow Run as ordered phases; track as todos. **Three sequential waves of model calls** (Stage 1 → 2 → 3 each depend on the prior); within each wave, models run in parallel. ### Stage 0 — Intake & prep - Confirm **source material**, **the analysis**, **the criteria** (ask only for what's missing). - Prepare material for council models per MODEL-NOTES (extract clean text from links / large / marked-up sources to a small temp file; small text used as-is). - Pick the council: **non-Claude models, default 3 from different families** (enough voices for a real cross-review + a tie-breaker). Recommend ranked-by-fit (consult MODEL-NOTES reviewer-reliability), state cost, **disclose run shape up front** — "~2N+1 calls across 3 waves, ~X min" — and **wait for confirmation**. Honor the cost guardrail. - **Scale-down is explicit:** 1 model = thorough single pass (skip Stage 2 & chair); 2 = cross- review works but ranking is thin; 3 = default deep council. - **Optional — "Claude in the council" (default off):** offer to add Claude as a *judged* contributor so the bench's verdict on Claude's own take is visible. When on, Claude adds one Stage-1 review to the bundle but does **not** judge (Stage 2) or chair (Stage 3). See §5.4. ### Stage 1 — Independent reviews - All council models review **your artifact** via ONE fanout wave (see SKILL.md Stage 1 for the canonical command). The wave JSON returns every leg's status + review in one parse. A red-team reviewer with a distinct brief runs as a separate concurrent `amicus start --json` call alongside the wave (fanout legs share one prompt by design). - Required structured output: a **findings list**, each finding = `id · claim · severity (blocker/major/minor/nit) · location (section/quote) · rationale`, plus a short overall take. - Save each raw review to the run folder (§6). - If **"Claude in the council"** is on, Claude also produces a **fresh** review in the same findings format (a new structured pass on the artifact, regardless of any upstream feedback) and adds it to the bundle as one more review (§5.4). ### Stage 2 — Cross-review (the headline) - Claude builds **one shared anonymized bundle**: all Stage 1 reviews relabeled **Review A/B/C…**, with a private label↔model map Claude keeps (§5.1). - The **same bundle** goes to **every** council model — exactly fanout's shared-prompt model: one wave call distributes it, and each judge is asked to do two things on the bundle: 1. **Rank** the reviews by accuracy + insight, ending with a parseable block: `FINAL RANKING:` then `1. Review C` / `2. Review A` … (LLM Council's format). 2. **Adjudicate findings** — for each finding in the bundle: `agree | dispute | neutral` + one-line reason (an "I missed this, it's valid" counts as agree). Findings are referenced by **review label + finding id** (e.g. `A2` = Review A's 2nd finding) so Claude can map each verdict back to the originating model and claim when tallying. - Each model **unknowingly ranks/adjudicates its own review too** — this is the anti-favoritism mechanism, not a bug (§5.1). - Claude de-anonymizes for scoring/display only. ### Stage 3 — Council-chair synthesis - A designated **non-Claude** chair (recommended + confirmed in Stage 0/launch) receives all reviews + rankings + adjudications and writes the **synthesized verdict**, weighted by street-cred and peer-confidence. Independent of Claude. - Chair selection & fallback: §5.3. ### Stage 4 — Tiered decisions (peer-validated) - **Consensus tier** = **Confirmed** findings (≥ 2 peer agreements, agrees dominate) → offer one **bulk accept/deny** (user may name exceptions). - **Judgment tier** = **Disputed** (strong peer pushback), **Contested** (live dispute), or **Singleton** (only the raiser) → present **each individually**, showing the adjudication data and which model raised/disputed it. - Record every decision (accepted / denied / modified / deferred). ### Stage 5 — Outputs - **Editable source** → write `-reviewed.` next to the original (accepted changes only; validate structural integrity). **Fixed source** → standalone reviewed report. - Always write the run folder artifacts (§6). ### Stage 6 — Capture lessons (compounding) - Reflect on failures/mitigations and briefing wording, as today. - **Ledger auto-appends** — `ledger.appendRun(record)` writes one row per (run × model) to `council-ledger.jsonl` automatically at finalize (shown in the run summary). No manual reliability-table update needed. - **Qualitative MODEL-NOTES update (approval-gated):** draft per-model quirk/conformance notes; write the proposed diff to `_tmp-proposed-model-notes-update.md`; present its path in the approval prompt; do not write until approved. The approval prompt carries the diff file's path; chat text alone is not sufficient (an approval dialog can hide the chat transcript). Keep it tight. ## 5. Key mechanics ### 5.1 Anonymization (shared bundle) - After Stage 1, Claude assembles **one** bundle with stable labels Review A/B/C… and keeps a private map (e.g., `Review A → deepseek`, `Review B → gemini`, …). - The identical bundle is sent to every judge. Because a model can't tell which review is its own, it ranks/adjudicates all of them honestly; symmetric self-bias washes out across judges. - Claude only de-anonymizes when computing scores and writing the matrix/report for the user. - When "Claude in the council" is on, Claude's own review is anonymized into the **same** bundle and judged blind by the council models. Claude never ranks/adjudicates (it holds the map) — the asymmetry detailed in §5.4. ### 5.2 Scoring (`amicus council tally` computes; Claude may override at the margins) **Street cred** — computed two ways by `amicus council tally`: - **withSelf** = each model's mean rank position across **all** judges' `FINAL RANKING:` blocks (lower = better). - **peersOnly** = mean rank across judges **other than** that model (self-vote excluded). The cross-review matrix shows both; the ledger and Stage-0 bench recommendations consume **peersOnly** only. **Per-finding peer-confidence tier** — determined by a **peers-only** cascade: for a finding raised by model R, peers are all judges except R (the raiser's own adjudication is excluded — consistent with the peers-only street-cred rule). Let `a` = peer agrees, `d` = peer disputes. The cascade is exhaustive and mutually exclusive: | Priority | Tier | Rule | Meaning | |---|---|---|---| | 1 | **Disputed** | `d ≥ 2` and `d > a` | Strong peer pushback — the finding itself is likely wrong | | 2 | **Confirmed** | `a ≥ 2` and `a > d` | ≥ 2 independent corroborations, agrees dominate | | 3 | **Contested** | `d ≥ 1` (whatever remains) | At least one live dispute — in question | | 4 | **Singleton** | else (`d = 0` and `a < 2`) | At most one endorsement, no pushback — thin | `confidence` is `thin` when total engaged peers `a + d ≤ 1` — cells `(0,0)`, `(1,0)`, and `(0,1)`. **Claude may override the tier at `thin` margins** (recorded as `tierOverride: {from, to, reason}` and surfaced in the matrix and `verdict.json`). These four tiers drive Stage 4. `amicus council tally` assigns them deterministically; judgment at the margins remains Claude's. ### 5.3 Chair selection & fallback - Default: Claude **recommends a non-Claude chair** from the council each run (often the strongest reasoner / best reviewer-reliability) and the user confirms at launch. - The chair **may** also be a Stage-1 council member (it sees the anonymized bundle + scores). - **Fallback order if the chair fails:** re-run → promote next-best council model → **Claude chairs only as last resort, with explicit disclosure** that the verdict is no longer fully independent. ### 5.4 Claude in the council (optional, default off) Lets you see how the bench judges Claude's *own* take. - **Asymmetric by necessity.** Claude is the orchestrator and holds the label↔model map, so it cannot judge blind. Therefore Claude **contributes a review to be judged but does not vote (Stage 2) or chair (Stage 3).** The verdict stays independent. - **Which review: always fresh** — Claude does a new structured Stage-1 review on the artifact every time it's enabled (not a formalization of upstream feedback). - **Readout — "How Claude's review fared":** Claude's street-cred rank among peers and the Disputed/Confirmed/Contested/Singleton split of its findings, reported in the matrix and report. - **Integrity:** when Claude presents results, it reports the bench's verdict on its own review at face value — no defending or re-litigating. ## 6. Outputs & naming One tidy run folder: `output/-council/` (or `./second-opinion/-council/` if no `output/` exists): - `review-.md` ×N — raw Stage 1 reviews (plus `review-claude.md` when "Claude in the council" is on) - `crossreview-matrix.md` — adjudication grid + street-cred table (de-anonymized) - `verdict.md` — the chair's synthesis (prose) - `verdict.json` — schema-stamped machine-readable record: tally output + Stage-4 decisions, written via `buildVerdict(record, decisions)` at Stage 5 - `report.md` — synthesis + decision log + what was applied (+ the "How Claude's review fared" readout when "Claude in the council" is on) - `-reviewed.` — written **next to the original**, as today (editable sources only) - Temp extracts get a clearly-temporary name and are cleaned up at the end. ## 7. MODEL-NOTES reviewer-reliability The append-only `council-ledger.jsonl` (consumed via `amicus council stats`) is the **authoritative source of quantitative reviewer-reliability data** — runs, avg peers-only street-cred, confirm-rate, fact-error rate, conformance distribution. `MODEL-NOTES.md` keeps only *qualitative* per-model quirks and structural-conformance notes (`clean` / `repaired` / `unstructured`); it may embed a snapshot generated from `amicus council stats --json` but is no longer hand-edited for numbers. Stage-6 reliability updates are written by the ledger auto-append; the MODEL-NOTES prose update remains approval-gated. ## 8. Gating, cost, degradation & failure handling - **Gating:** council is the default identity but scales down (§ Stage 0). Always disclose run shape/cost and confirm before launching. - **Cost guardrail (unchanged):** never `o3`/`o3-pro` unless the user asks by name; warn on cost. - **Degradation (judge-count thresholds, unchanged):** gating counts **non-Claude judges**. 1 judge → single-pass (no Stage 2/chair); if the bench drops below 2 mid-run, degrade to single-pass and disclose. 2 judges → Stage 2 runs but note the thin ranking. ("Claude in the council" adds a *judged* review but **no** judge.) - **Wave-degrade rules (v3):** leg failures are read from the wave document (`status: partial`, `counts`, per-leg `status`/`error`): - **Stage 1:** a leg ends `error`/`timeout`/`crashed`/`aborted` → proceed when ≥2 reviews survive; below 2, offer a re-run of the dead leg or a disclosed downgrade to single-pass. - **Stage 2:** a judge leg dies → tally rankings/adjudications over the surviving judges and disclose the reduced bench in `crossreview-matrix.md`. Tier definitions are unchanged (they already count "judges engaged"). - **Stage 3:** chair failure uses the same fallback chain (re-run → promote next-best non-Claude → Claude chairs with explicit disclosure). - **Run stats (v3):** `report.md` includes a per-leg table (model, status, durationMs) read from the wave/run documents. `durationMs` and `usage` are copied verbatim from the per-leg run docs; any leg with no run doc gets `durationMs: null` (and `usage: null`) — never invent a value. The schema carries no cost data — never invent cost figures. - **Transient failures:** provider 502s etc. → re-run the affected leg (solo `start --json`) or the wave; never present a half-finished run as an answer. ## 9. Non-goals (YAGNI) - No web UI, API server, or persistent conversation store (LLM Council's app shell). - No code/backend for scoring or parsing — Claude does it inline. Deterministic arithmetic/formatting/schema helpers under `amicus council` (findings validation, tier tally, street-cred, ledger) are sanctioned; judgment, synthesis, anonymization, and de-anonymization remain Claude's inline work. - No automatic MODEL-NOTES writes — always approval-gated. - Claude is **not** a council member by default; it joins only via the opt-in toggle (§5.4), and even then it is judged-but-non-voting/non-chairing. ## 10. Open questions - None blocking. Possible later refinement: a numeric peer-confidence score instead of the four qualitative tiers, if tiers prove too coarse in practice. ## 11. Implementation surface - `SKILL.md` — the Stage 0–6 council flow on the v3 transport (WS-3: findings contract, tally assembly recipe, `amicus council tally/stats`, `verdict.json`, ledger auto-append). - `MODEL-NOTES.md` — qualitative per-model quirks, structural-conformance notes, cost guardrail, Stage-2 briefing tips. Quantitative reliability data now generated by `amicus council stats` (ledger). Engine workarounds that F1/F2/F4 made obsolete were pruned at v3. - `src/council/` — the deterministic helpers (`findings.js`, `tally.js`, `verdict.js`, `ledger.js`). No other files.