--- name: council tier: orchestration description: 'Multi-model consensus council for validation, research, and brainstorming. Spawns parallel judges with configurable perspectives and optional explorer sub-agents. Modes: validate, brainstorm, research. Triggers: council, validate, brainstorm, critique, research, analyze, multi-model, consensus.' dependencies: - standards # optional - loaded for code validation context replaces: judge --- # /council — Multi-Model Consensus Council Spawn parallel judges with different perspectives, consolidate into consensus. Works for any task — validation, research, brainstorming. ## Quick Start ```bash /council --quick validate recent # fast inline check /council validate this plan # validation (2 agents) /council brainstorm caching approaches # brainstorm /council validate the implementation # validation (critique triggers map here) /council research kubernetes upgrade strategies # research /council research the CI/CD pipeline bottlenecks # research (analyze triggers map here) /council --preset=security-audit validate the auth system # preset personas /council --deep --explorers=3 research upgrade automation # deep + explorers /council --debate validate the auth system # adversarial 2-round review /council --deep --debate validate the migration plan # thorough + debate /council # infers from context ``` ## Use Cases Council is a general-purpose multi-model consensus tool. Use it for: | Use Case | Example | Recommended Mode | |----------|---------|-----------------| | Code review | `/council validate recent` | validate | | Plan validation | `/council validate the migration plan` | validate | | Architecture analysis | `/council --preset=architecture research microservices boundaries` | research | | Deep codebase research | `/council --deep --explorers=3 research the auth system` | research | | Decision making | `/council brainstorm caching strategies` | brainstorm | | Risk assessment | `/council --preset=ops validate the deployment pipeline` | validate | | Security audit | `/council --preset=security-audit validate the API` | validate | | Spec feedback | `/council validate the design doc` | validate | | Technology comparison | `/council research Redis vs Memcached for our use case` | research | | Incident investigation | `/council --deep research why deployments are slow` | research | ## Modes | Mode | Agents | Vendors | Use Case | |------|--------|---------|----------| | `--quick` | 0 (inline) | Self | Fast single-agent check, no spawning | | default | 2 | Claude | Independent judges (no perspective labels) | | `--deep` | 3 | Claude | Thorough review | | `--mixed` | 3+3 | Claude + Codex | Cross-vendor consensus | | `--debate` | 2+ | Claude | Adversarial refinement (2 rounds) | ```bash /council --quick validate recent # inline single-agent check, no spawning /council recent # 2 Claude agents /council --deep recent # 3 Claude agents /council --mixed recent # 3 Claude + 3 Codex ``` ## When to Use `--debate` Use `--debate` for high-stakes or ambiguous reviews where judges are likely to disagree: - Security audits, architecture decisions, migration plans - Reviews where multiple valid perspectives exist - Cases where a missed finding has real consequences Skip `--debate` for routine validation where consensus is expected. Debate adds R2 latency (judges stay alive, process a second round via SendMessage). **Incompatibilities:** - `--quick` and `--debate` cannot be combined. `--quick` runs inline with no spawning; `--debate` requires multi-agent rounds. 
If both are passed, exit with error: "Error: --quick and --debate are incompatible."
- `--debate` is only supported with validate mode. Brainstorm and research do not produce PASS/WARN/FAIL verdicts. If combined, exit with error: "Error: --debate is only supported with validate mode."

## Task Types

| Type | Trigger Words | Perspective Focus |
|------|---------------|-------------------|
| **validate** | validate, check, review, assess, critique, feedback, improve | Is this correct? What's wrong? What could be better? |
| **brainstorm** | brainstorm, explore, options, approaches | What are the alternatives? Pros/cons? |
| **research** | research, investigate, deep dive, explore deeply, analyze, examine, evaluate, compare | What can we discover? What are the properties, trade-offs, and structure? |

Natural language works — the skill infers task type from your prompt.

---

## Architecture

### Execution Flow

```
┌─────────────────────────────────────────────────────────────────┐
│ Phase 1: Build Packet (JSON)                                    │
│ - Task type (validate/brainstorm/research)                      │
│ - Target description                                            │
│ - Context (files, diffs, prior decisions)                       │
│ - Perspectives to assign                                        │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│ Phase 1a: Create Team                                           │
│ TeamCreate(team_name="council-YYYYMMDD-<slug>")                 │
│ Team lead = spawner (this agent)                                │
└─────────────────────────────────────────────────────────────────┘
                                │
              ┌─────────────────┴─────────────────┐
              ▼                                   ▼
┌───────────────────────┐           ┌───────────────────────┐
│ CLAUDE AGENTS         │           │ CODEX AGENTS          │
│ (Task tool, teammates │           │ (Bash tool, parallel) │
│ on council team)      │           │                       │
│                       │           │ Agent 1 (independent  │
│ Agent 1 (independent  │           │   or with preset)     │
│   or with preset)     │           │ Agent 2               │
│ Agent 2               │           │ Agent 3               │
│ Agent 3 (--deep/      │           │ (--mixed only)        │
│   --mixed only)       │           │                       │
│                       │           │ Output: JSON + MD     │
│ Output: JSON + MD     │           │ Files: .agents/       │
│ Write files, then     │           │   council/codex-*     │
│ SendMessage to lead   │           └───────────────────────┘
│ Files: .agents/       │                       │
│   council/claude-*    │                       │
└───────────────────────┘                       │
              │                                 │
              └────────────────┬────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│ Phase 2: Consolidation (Team Lead)                              │
│ - Receive completion messages from judges via SendMessage       │
│ - Read all agent output files                                   │
│ - Compute consensus verdict                                     │
│ - Identify shared findings                                      │
│ - Surface disagreements with attribution                        │
│ - Generate Markdown report for human                            │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│ Phase 3: Cleanup                                                │
│ - shutdown_request each judge                                   │
│ - TeamDelete()                                                  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│ Output: Markdown Council Report                                 │
│ - Consensus: PASS/WARN/FAIL                                     │
│ - Shared findings                                               │
│ - Disagreements (if any)                                        │
│ - Recommendations                                               │
└─────────────────────────────────────────────────────────────────┘
```

### Graceful Degradation

| Failure | Behavior |
|---------|----------|
| 1 of N agents times out | Proceed with N-1, note in report |
| All Codex agents fail | Proceed Claude-only, note degradation |
| All agents fail | Return error, suggest retry |
| Codex CLI not installed | Skip Codex agents, Claude-only (warn user) |
| Native teams unavailable | Fall back to `Task(run_in_background=true)` fire-and-forget |
| Output dir missing | Create `.agents/council/` automatically |

Timeout: 120s per agent (configurable via `--timeout=N` in seconds).

**Minimum quorum:** At least 1 agent must respond for a valid council. If 0 agents respond, return error.
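The quorum and degradation rules above reduce to one small decision. A minimal sketch in Python (the `degradation_note` helper and its parameter names are illustrative, not part of the skill):

```python
def degradation_note(responded: int, spawned: int) -> str:
    """Apply the degradation table: proceed with N-1 on partial failure,
    fail only when nobody answered (minimum quorum is 1)."""
    if responded == 0:
        raise RuntimeError("All agents failed. Return error, suggest retry.")
    if responded < spawned:
        return f"{responded}/{spawned} judges responded ({spawned - responded} timed out)"
    return f"{responded}/{spawned} judges responded"
```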
### Pre-Flight Checks

Before spawning agents, verify tools are available:

```bash
# Always available (Task tool is built-in)
# Claude agents: no pre-flight needed
# Native teams: TeamCreate is built-in
# Fallback: if TeamCreate fails, use Task(run_in_background=true)

# Codex agents (--mixed only)
if ! command -v codex > /dev/null 2>&1; then
  echo "⚠️ Codex CLI not found. Falling back to Claude-only."
  # Downgrade --mixed to --deep (3 Claude agents)
else
  # Model availability test — catches account-type restrictions (e.g. gpt-5.3-codex on ChatGPT accounts)
  CODEX_MODEL="${COUNCIL_CODEX_MODEL:-gpt-5.3-codex}"
  if ! codex exec --full-auto -m "$CODEX_MODEL" -C "$(pwd)" "echo model-check-ok" > /dev/null 2>&1; then
    echo "⚠️ Codex model $CODEX_MODEL unavailable. Falling back to Claude-only."
    # Downgrade --mixed to --deep (3 Claude agents)
  fi
fi

# Check agent count (includes --count override)
# COUNT_OVERRIDE / JUDGES_FOR_MODE / explorers come from flag parsing above
judges="${COUNT_OVERRIDE:-$JUDGES_FOR_MODE}"
total_agents=$(( judges * (1 + explorers) ))
if (( total_agents > 12 )); then
  echo "Error: Total agent count ${total_agents} exceeds MAX_AGENTS (12). Reduce --count, --explorers, or remove --mixed."
  exit 1
fi

# Create output directory
mkdir -p .agents/council
```

---

## Quick Mode (`--quick`)

Single-agent inline validation. No subprocess spawning, no Task tool, no Codex. The current agent performs a structured self-review using the same output schema as a full council.

**When to use:** Routine checks, mid-implementation sanity checks, pre-commit quick scan. Use full council for important decisions, final reviews, or when cross-perspective disagreement is valuable.

### Quick Mode Execution

1. **Gather context** (same as full council — read target files, get diffs)
2. **Skip agent spawning** — no Task tool, no background agents
3. **Perform structured self-review inline** using this template:

```
Analyze the target as a single independent reviewer.

Target: {TARGET_DESCRIPTION}
Context: {FILES_AND_DIFFS}

Respond with:
1. A JSON block matching the council output_schema:
{
  "verdict": "PASS | WARN | FAIL",
  "confidence": "HIGH | MEDIUM | LOW",
  "key_insight": "Single sentence summary",
  "findings": [
    {
      "severity": "critical | significant | minor",
      "category": "security | architecture | performance | style",
      "description": "What was found",
      "location": "file:line if applicable",
      "recommendation": "How to address"
    }
  ],
  "recommendation": "Concrete next step"
}
2. A brief Markdown explanation (2-5 paragraphs max)
```

4. **Write report** to `.agents/council/YYYY-MM-DD-quick-<slug>.md`
5. **Label clearly** as `Mode: quick (single-agent)` in the report header

### Quick Mode Report Format

```markdown
# Council Quick Check: <target>

**Date:** YYYY-MM-DD
**Mode:** quick (single-agent, no multi-perspective spawning)
**Target:** <target>

## Verdict: PASS | WARN | FAIL

## Analysis

---

*Quick check — for thorough multi-perspective review, run `/council validate` (default mode).*
```

### Quick Mode Limitations

- No cross-perspective disagreement (single viewpoint)
- No cross-vendor insights (no Codex)
- Lower confidence ceiling than full council
- Not suitable for security audits or architecture decisions — use `--deep` or `--mixed` for those
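Because quick mode reuses the council `output_schema`, the inline self-review can be sanity-checked the same way as a judge file. A minimal conformance check, assuming the schema shown in the template above (`schema_problems` is a hypothetical helper):

```python
REQUIRED_FIELDS = {"verdict", "confidence", "key_insight", "findings", "recommendation"}

def schema_problems(verdict: dict) -> list[str]:
    """Cheap output_schema conformance check; intentionally not exhaustive."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - verdict.keys())]
    if verdict.get("verdict") not in {"PASS", "WARN", "FAIL"}:
        problems.append("verdict must be PASS, WARN, or FAIL")
    if verdict.get("confidence") not in {"HIGH", "MEDIUM", "LOW"}:
        problems.append("confidence must be HIGH, MEDIUM, or LOW")
    return problems
```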
---

## Packet Format (JSON)

The packet sent to each agent. **File contents are included inline** — agents receive the actual code/plan text in the packet, not just paths. This ensures both Claude and Codex agents can analyze without needing file access.

```json
{
  "council_packet": {
    "version": "1.0",
    "mode": "validate | brainstorm | research",
    "target": "Implementation of user authentication system",
    "context": {
      "files": [
        { "path": "src/auth/jwt.py", "content": "<full file contents inline>" },
        { "path": "src/auth/middleware.py", "content": "<full file contents inline>" }
      ],
      "diff": "git diff output if applicable",
      "spec": {
        "source": "bead na-0042 | plan doc | none",
        "content": "The spec/bead description text (optional — included when wrapper provides it)"
      },
      "prior_decisions": [
        "Using JWT, not sessions",
        "Refresh tokens required"
      ]
    },
    "perspective": "skeptic (only when --preset or --perspectives used)",
    "perspective_description": "What could go wrong? (only when --preset or --perspectives used)",
    "output_schema": {
      "verdict": "PASS | WARN | FAIL",
      "confidence": "HIGH | MEDIUM | LOW",
      "key_insight": "Single sentence summary",
      "findings": [
        {
          "severity": "critical | significant | minor",
          "category": "security | architecture | performance | style",
          "description": "What was found",
          "location": "file:line if applicable",
          "recommendation": "How to address"
        }
      ],
      "recommendation": "Concrete next step"
    }
  }
}
```

---

## Perspectives

### Default: Independent Judges (No Perspectives)

When no `--preset` or `--perspectives` flag is provided, all judges get the **same prompt** with no perspective label. Diversity comes from independent sampling, not personality labels.

| Judge | Prompt | Assigned To |
|-------|--------|-------------|
| **Judge 1** | Independent judge — same prompt as all others | Agent 1 |
| **Judge 2** | Independent judge — same prompt as all others | Agent 2 |
| **Judge 3** | Independent judge — same prompt as all others | Agent 3 (--deep/--mixed) |

The default judge prompt (no perspective labels):

```
You are Council Judge {N}. You are one of {TOTAL} independent judges evaluating the same target.

{JSON_PACKET}

Instructions:
1. Analyze the target thoroughly
2. Write your analysis to: .agents/council/{OUTPUT_FILENAME}
   - Start with a JSON code block matching the output_schema
   - Follow with Markdown explanation
3. Send verdict to team lead

Your job is to find problems. A PASS with caveats is less valuable than a specific FAIL.
```

When `--preset` or `--perspectives` is used, judges receive the perspective-labeled prompt instead (see Agent Prompts section).

### Auto-Escalation Rule

**When `--preset` or `--perspectives` specifies more perspectives than the current judge count, automatically escalate judge count to match.**

For example:

- `/council --preset=security-audit validate X` → 3 perspectives → auto-escalate to 3 judges (equivalent to `--deep`)
- `/council --perspectives="a,b,c,d" validate X` → 4 perspectives → auto-escalate to 4 judges (equivalent to `--count=4`)

This prevents silently dropping perspectives. The `--count` flag overrides auto-escalation (user explicitly chose a count).
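The escalation rule is small enough to pin down exactly. An illustrative sketch (function and table names are ours; preset sizes follow the Built-in Presets table below):

```python
MODE_JUDGES = {"default": 2, "deep": 3, "mixed": 3}  # Claude judges per mode
PRESET_SIZES = {"security-audit": 3, "architecture": 3, "research": 3, "ops": 3,
                "code-review": 3, "plan-review": 3, "retrospective": 3}

def resolve_judge_count(mode: str, preset: str | None = None,
                        perspectives: list[str] | None = None,
                        count: int | None = None) -> int:
    """Auto-escalation: never silently drop a perspective."""
    if count is not None:
        return count  # --count is an explicit user choice and wins
    wanted = len(perspectives) if perspectives else PRESET_SIZES.get(preset or "", 0)
    return max(MODE_JUDGES[mode], wanted)

assert resolve_judge_count("default", preset="security-audit") == 3    # like --deep
assert resolve_judge_count("default", perspectives=list("abcd")) == 4  # like --count=4
assert resolve_judge_count("default", perspectives=list("abcd"), count=2) == 2
```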
### Custom Perspectives

Simple name-based:

```bash
/council --perspectives="security,performance,ux" validate the API
```

### Built-in Presets

Use `--preset=<name>` for common persona configurations:

| Preset | Perspectives | Best For |
|--------|-------------|----------|
| `default` | (none — independent judges) | General validation |
| `security-audit` | attacker, defender, compliance | Security review |
| `architecture` | scalability, maintainability, simplicity | System design |
| `research` | breadth, depth, contrarian | Deep investigation |
| `ops` | reliability, observability, incident-response | Operations review |
| `code-review` | error-paths, api-surface, spec-compliance | Code validation (used by /vibe) |
| `plan-review` | missing-requirements, feasibility, scope | Plan validation (used by /pre-mortem) |
| `retrospective` | plan-compliance, tech-debt, learnings | Post-implementation review (used by /post-mortem) |

```bash
/council --preset=security-audit validate the auth system
/council --preset=research --explorers=3 research upgrade automation
/council --preset=architecture research microservices boundaries
```

**Preset perspective details** (built-in configurations):

```
security-audit:
  attacker: "How would I exploit this? What's the weakest link?"
  defender: "How do we detect and prevent attacks? What's our blast radius?"
  compliance: "Does this meet regulatory requirements? What's our audit trail?"

architecture:
  scalability: "Will this handle 10x load? Where are the bottlenecks?"
  maintainability: "Can a new engineer understand this in a week? Where's the complexity?"
  simplicity: "What can we remove? Is this the simplest solution?"

research:
  breadth: "What's the full landscape? What options exist? What's adjacent?"
  depth: "What are the deep technical details? What's under the surface?"
  contrarian: "What's the conventional wisdom wrong about? What's overlooked?"

ops:
  reliability: "What fails first? What's our recovery time? Where are SPOFs?"
  observability: "Can we see what's happening? What metrics/logs/traces do we need?"
  incident-response: "When this breaks at 3am, what do we need? What's our runbook?"

code-review:
  error-paths: "Trace every error handling path. What's uncaught? What fails silently?"
  api-surface: "Review every public interface. Is the contract clear? Breaking changes?"
  spec-compliance: "Compare implementation against the spec/bead. What's missing? What diverges?"
  # Note: spec-compliance gracefully degrades to general correctness review when no spec
  # is present in context.spec. The judge reviews code on its own merits.

plan-review:
  missing-requirements: "What's not in the spec that should be? What questions haven't been asked?"
  feasibility: "What's technically hard or impossible here? What will take 3x longer than estimated?"
  scope: "What's unnecessary? What's missing? Where will scope creep?"

retrospective:
  plan-compliance: "What was planned vs what was delivered? What's missing? What was added?"
  tech-debt: "What shortcuts were taken? What will bite us later? What needs cleanup?"
  learnings: "What patterns emerged? What should be extracted as reusable knowledge?"
```
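Expanding a preset into per-judge packet fields is mechanical. A sketch, assuming preset details live in a dict mirroring the block above (`judges_for_preset` is illustrative, not part of the skill's surface):

```python
PRESET_DETAILS = {
    "security-audit": {
        "attacker": "How would I exploit this? What's the weakest link?",
        "defender": "How do we detect and prevent attacks? What's our blast radius?",
        "compliance": "Does this meet regulatory requirements? What's our audit trail?",
    },
    # ... remaining presets as listed above
}

def judges_for_preset(preset: str) -> list[dict]:
    """One judge per perspective; the name feeds team member naming (judge-{perspective})."""
    return [
        {"name": f"judge-{p}", "perspective": p, "perspective_description": d}
        for p, d in PRESET_DETAILS[preset].items()
    ]
```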
---

## Explorer Sub-Agents

Judges can spawn explorer sub-agents for parallel deep-dive research. This is the key differentiator for `research` mode — massive parallel exploration.

### Flag

| Flag | Default | Max | Description |
|------|---------|-----|-------------|
| `--explorers=N` | 0 | 5 | Number of explorer sub-agents per judge |

### Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│ Judge (independent or with perspective)                         │
│                                                                 │
│ 1. Receive packet + perspective                                 │
│ 2. Identify N sub-questions to explore                          │
│ 3. Spawn N explorers in parallel (Task tool, background)        │
│ 4. Collect explorer results                                     │
│ 5. Synthesize into final judge response                         │
└─────────────────────────────────────────────────────────────────┘
        │                    │                    │
        ▼                    ▼                    ▼
  ┌──────────┐         ┌──────────┐         ┌──────────┐
  │Explorer 1│         │Explorer 2│         │Explorer 3│
  │Sub-Q: "A"│         │Sub-Q: "B"│         │Sub-Q: "C"│
  │          │         │          │         │          │
  │Codebase  │         │Codebase  │         │Codebase  │
  │search +  │         │search +  │         │search +  │
  │analysis  │         │analysis  │         │analysis  │
  └──────────┘         └──────────┘         └──────────┘
```

**Total agents:** `judges * (1 + explorers)`

**MAX_AGENTS = 12** (hard limit). If total agents (judges x (1 + explorers)) exceeds 12, exit with error: "Error: Total agent count {N} exceeds MAX_AGENTS (12). Reduce --count, --explorers, or remove --mixed."

| Example | Judges | Explorers | Total Agents | Status |
|---------|--------|-----------|--------------|--------|
| `/council research X` | 2 | 0 | 2 | Valid |
| `/council --explorers=3 research X` | 2 | 3 | 8 | Valid |
| `/council --deep --explorers=3 research X` | 3 | 3 | 12 | Valid (at cap) |
| `/council --mixed --explorers=3 research X` | 6 | 3 | 24 | BLOCKED (exceeds 12) |
| `/council --mixed research X` | 6 | 0 | 6 | Valid |
| `/council --mixed --explorers=1 research X` | 6 | 1 | 12 | Valid (at cap) |

### Explorer Prompt

```
You are Explorer {M} for Council Judge {N}{PERSPECTIVE_SUFFIX}.
(PERSPECTIVE_SUFFIX is " — THE {PERSPECTIVE}" when using presets, or empty for independent judges)

## Your Sub-Question

{SUB_QUESTION}

## Context

Working directory: {CWD}
Target: {TARGET}

## Instructions

1. Use available tools (Glob, Grep, Read, Bash) to investigate the sub-question
2. Search the codebase, documentation, and any relevant sources
3. Be thorough — your findings feed directly into the judge's analysis
4. Return a structured summary:

### Findings

### Evidence

### Assessment
```

### Explorer Execution

Explorers are spawned as `Explore`-type subagents for speed:

```
Task(
  description="Explorer for Judge {N}: {SUB_QUESTION_SHORT}",
  subagent_type="Explore",
  model="sonnet",
  run_in_background=true,
  prompt="{EXPLORER_PROMPT}"
)
```

**Model selection:** Explorers use `sonnet` by default (fast, good at search). Judges use `opus` (thorough analysis). Override with `--explorer-model=<model>`.

### Sub-Question Generation

When `--explorers=N` is set, the judge prompt includes:

```
Before analyzing, identify {N} specific sub-questions that would help you answer thoroughly.
For each sub-question, spawn an explorer agent to investigate it.
Use the explorer findings to inform your final analysis.

Sub-questions should be:
- Specific and searchable (not vague)
- Complementary (cover different aspects)
- Relevant to your analysis angle (perspective if assigned, or general if independent)
```

### Timeout

Explorer timeout: 60s (half of judge timeout). Judge timeout starts after all explorers complete.

---

## Debate Phase (`--debate`)

When `--debate` is passed, council runs two rounds instead of one. Round 1 produces independent verdicts. Round 2 lets judges review each other's work and revise.
**Native teams unlock the key advantage:** Judges stay alive after R1. Instead of re-spawning fresh R2 judges with truncated R1 verdicts, the team lead sends other judges' full R1 verdicts via `SendMessage`. Judges wake from idle, process R2 with full context (their own R1 analysis + others' verdicts), and write R2 files. Result: no truncation loss, no spawn overhead, richer debate.

### Execution Flow (with --debate)

```
Phase 1: Build Packet + Create Team + Spawn R1 judges as teammates
   │
   Collect all R1 verdicts (via SendMessage)
   Judges go idle after R1 (stay alive)
   │
Phase 1.5: Prepare R2 context (--debate only)
   - For each OTHER judge's R1 verdict, extract full JSON verdict
   - Team lead sends R2 instructions to each judge via SendMessage
   - Each judge already has its own R1 in context (no truncation needed)
   - Each judge receives other judges' verdicts (full JSON, not truncated)
   │
Phase 2: Judges wake up for Round 2 (--debate only)
   - Same judge instances as R1 (not re-spawned)
   - Each judge processes via SendMessage:
     - Other judges' full R1 JSON verdicts
     - Steel-manning rebuttal prompt
     - Branch: disagreed OR agreed
   - Judges write R2 files + send completion message
   │
   Collect all R2 verdicts
   │
Phase 3: Consolidation (uses R2 verdicts when --debate)
   │
Phase 4: shutdown_request each judge, TeamDelete()
```

### Round 2 via SendMessage

**Branch selection (team lead responsibility):**

```
r1_verdicts = [extract JSON verdict from each R1 output file]
r1_unanimous = all verdicts have same verdict value (PASS/WARN/FAIL)

For each judge in [judge-1, judge-2, judge-3...] (or [judge-{perspective}...] with presets):
    other_verdicts = [v for v in r1_verdicts if v.judge != this_judge]
    branch = "agreed" if r1_unanimous else "disagreed"
    SendMessage(
        type="message",
        recipient="judge-{perspective}",
        content=build_r2_message(other_verdicts, branch),
        summary="Debate R2: review other verdicts"
    )
```

**R2 message content:** See "Debate Round 2 Message" in Agent Prompts section below.

**R2 output files:** Use `-r2` suffix to preserve R1 files:

```
.agents/council/YYYY-MM-DD-<slug>-claude-{perspective}-r2.md
```

**With --explorers:** Explorers run in R1 only. R2 judges do not spawn explorers. Explorer findings from R1 are already in the judge's context (no truncation loss).

**With --mixed:** Only Claude judges participate in R2 (they stay alive on the team). Codex agents run once in R1 (Bash-spawned, cannot join teams). For consolidation, use Claude R2 verdicts + Codex R1 verdicts.

### R1 Verdict Injection for R2

Since judges stay alive, truncation is no longer needed for a judge's **own** R1 verdict — it's already in their context. For **other judges' verdicts** sent via SendMessage, include the full JSON verdict block:

```json
{
  "judge": "judge-1 (or perspective name when using presets)",
  "verdict": "WARN",
  "confidence": "HIGH",
  "key_insight": "Rate limiting missing on auth endpoints",
  "findings": [
    {"severity": "significant", "description": "No rate limiting on /login"},
    {"severity": "significant", "description": "JWT expiry too long (1h)"},
    {"severity": "minor", "description": "Missing request ID in error responses"}
  ],
  "recommendation": "Add rate limiting to auth endpoints"
}
```

Full Markdown analysis remains in `.agents/council/YYYY-MM-DD-<slug>-claude-{perspective}.md` files and can be referenced by the team lead during consolidation.
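Both R2 message construction and the timeout fallback need the JSON verdict pulled out of a judge's Markdown file. A minimal sketch, assuming the judge honored the contract of starting the file with a fenced JSON block (`extract_verdict` is illustrative):

```python
import json
import re

def extract_verdict(report_md: str) -> dict:
    """Return the first fenced ```json block in a judge output file as a dict."""
    m = re.search(r"```json\s*(.*?)```", report_md, re.DOTALL)
    if m is None:
        raise ValueError("no JSON verdict block found in report")
    return json.loads(m.group(1))
```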
### Timeout and Failure Handling | Scenario | Behavior | |----------|----------| | No R2 completion message within `COUNCIL_R2_TIMEOUT` (default 90s) | Read their R1 output file, use R1 verdict for consolidation | | All judges fail R2 | Fall back to R1-only consolidation (note in report) | | R1 judge timed out | No R2 message for that perspective (N-1 in R2) | | Mixed R2 timeout | Consolidate with available R2 verdicts + R1 fallbacks | ### Cost and Latency `--debate` adds R2 latency but **reduces spawn overhead** vs the old re-spawn approach: - **Agents spawned:** N judges total (same instances for both rounds, not 2N) - **Wall time:** R1 time + R2 time (sequential rounds, but R2 is faster — no spawn delay) - **With --mixed:** Only Claude judges get R2. Codex agents run once (Bash-spawned, cannot join teams). For consolidation, use Claude R2 verdicts + Codex R1 verdicts for consensus computation. - **With --explorers:** Explorers run in R1 only. R2 cost = judge processing time (no explorer multiplication). - **Non-verdict modes:** `--debate` is only supported with validate mode. If combined with brainstorm or research, exit with error: "Error: --debate is only supported with validate mode. Debate requires PASS/WARN/FAIL verdicts." --- ## Agent Prompts ### Judge Agent Prompt — Default (Independent, No Perspectives) Used when no `--preset` or `--perspectives` flag is provided: ``` You are Council Judge {N}. You are one of {TOTAL} independent judges evaluating the same target. You are a teammate on team "{TEAM_NAME}". {JSON_PACKET} Instructions: 1. Analyze the target thoroughly 2. Write your analysis to: .agents/council/{OUTPUT_FILENAME} - Start with a JSON code block matching the output_schema - Follow with Markdown explanation 3. Send a message to the team lead with your verdict, confidence, and key insight 4. You may receive follow-up messages (e.g., debate round 2). Process and respond. Your job is to find problems. A PASS with caveats is less valuable than a specific FAIL. When sending your verdict to the team lead, use the structured envelope format. Your message MUST start with a JSON code block: \`\`\`json { "type": "verdict", "verdict": { "value": "PASS | WARN | FAIL", "confidence": "HIGH | MEDIUM | LOW", "key_insight": "One sentence summary" } } \`\`\` Rules: - Do NOT message other judges — all communication through team lead - Do NOT access TaskList — team lead manages task flow ``` ### Judge Agent Prompt — With Perspectives (Preset or Custom) Used when `--preset` or `--perspectives` flag is provided: ``` You are Council Member {N} — THE {PERSPECTIVE}. You are a teammate on team "{TEAM_NAME}". {JSON_PACKET} Your angle: {PERSPECTIVE_DESCRIPTION} Instructions: 1. Analyze the target from your perspective 2. Write your analysis to: .agents/council/{OUTPUT_FILENAME} - Start with a JSON code block matching the output_schema - Follow with Markdown explanation 3. Send a message to the team lead with your verdict, confidence, and key insight 4. You may receive follow-up messages (e.g., debate round 2). Process and respond. When sending your verdict to the team lead, use the structured envelope format. 
Your message MUST start with a JSON code block: \`\`\`json { "type": "verdict", "verdict": { "value": "PASS | WARN | FAIL", "confidence": "HIGH | MEDIUM | LOW", "key_insight": "One sentence summary" } } \`\`\` Rules: - Do NOT message other judges — all communication through team lead - Do NOT access TaskList — team lead manages task flow ``` ### Debate Round 2 Message (via SendMessage) When `--debate` is active, the team lead sends this message to each judge after R1 completes. The judge already has its own R1 analysis in context (no truncation needed). ``` ## Debate Round 2 ## Anti-Anchoring Protocol Before reviewing other judges' verdicts: 1. **RESTATE your R1 position** — Write 2-3 sentences summarizing your own R1 verdict and the key evidence that led to it. This anchors you to YOUR OWN reasoning before exposure to others. 2. **Then review other verdicts** — Only after restating your position, read the other judges' JSON verdicts below. 3. **Evidence bar for changing verdict** — You may only change your verdict if you can cite a SPECIFIC technical detail, code location, or factual error that you missed in R1. "Judge 2 made a good point" is NOT sufficient. "Judge 2 found an unchecked error path at auth.py:45 that I missed" IS sufficient. Other judges' R1 verdicts: ### {OTHER_JUDGE_PERSPECTIVE} {FULL_JSON_VERDICT} (repeat for each other judge) ## Debate Instructions You MUST follow this structure: **IF judges disagreed in R1 (different verdicts):** 1. **STEEL-MAN**: State the strongest version of an argument from another judge that you initially disagree with. Show you understand it fully before responding to it. 2. **CHALLENGE**: Identify at least one specific claim from another judge that you believe is wrong or incomplete. Cite evidence. 3. **ACKNOWLEDGE**: Identify at least one point from another judge that strengthens, modifies, or adds to your analysis. 4. **REVISE OR CONFIRM**: State your final verdict with specific reasoning. If changing from R1, explain exactly what new evidence changed your mind. If confirming, explain why the opposing arguments did not persuade you. **IF all judges agreed in R1 (same verdict):** Do NOT invent disagreement. Instead, stress-test the consensus: 1. **DEVIL'S ADVOCATE**: What is the strongest argument AGAINST the consensus? 2. **BLIND SPOT**: What perspective or risk did all judges overlook? 3. **CONFIRM OR REVISE**: Does the consensus hold under scrutiny? Do NOT change your verdict merely because others disagree. Do NOT defensively maintain without engaging with opposing arguments. Write revised verdict to: .agents/council/{R2_OUTPUT_FILENAME} Include "debate_notes" field in your JSON. Send completion message when done. Required JSON format: { "verdict": "PASS | WARN | FAIL", "confidence": "HIGH | MEDIUM | LOW", "key_insight": "...", "findings": [...], "recommendation": "...", "debate_notes": { "revised_from": "original verdict if changed, or null if unchanged", "steel_man": "strongest opposing argument I considered", "challenges": [ { "target_judge": "judge-2", "claim": "what they claimed", "response": "why I agree or disagree" } ], "acknowledgments": [ { "source_judge": "judge-1", "point": "what they found", "impact": "how it affected my analysis" } ] } } Then provide a Markdown explanation of your debate reasoning. ``` ### Consolidation Prompt ``` You are the Council Chairman. You have received {N} judge reports from {VENDORS}. ## Judge Reports {JUDGE_OUTPUTS_JSON} ## Your Task Synthesize into a final council report. For validate mode: 1. 
**Consensus Verdict**: PASS if all PASS, FAIL if any FAIL, else WARN
2. **Shared Findings**: Points all judges agree on
3. **Disagreements**: Where judges differ (with attribution)
4. **Cross-Vendor Insights**: (if --mixed) Unique findings per vendor
5. **Final Recommendation**: Concrete next step

For brainstorm mode:
1. **Options Explored**: Each option with multi-perspective assessment
2. **Trade-offs**: Pros/cons matrix
3. **Recommendation**: Synthesized best approach

For research mode:
1. **Facets Explored**: What each judge investigated
2. **Synthesized Findings**: Merged findings organized by theme
3. **Open Questions**: What remains unknown
4. **Recommendation**: Next steps for further investigation or action

Output format: Markdown report for human consumption.
```

### Consolidation Prompt — Debate Additions

When `--debate` is used, append this to the consolidation prompt:

```
## Additional Instructions (Debate Mode)

You have received TWO rounds of judge reports.
Round 1 (independent assessment): Each judge evaluated independently.
Round 2 (post-debate revision): Each judge reviewed all other judges' findings and revised.

When synthesizing:
1. Use Round 2 verdicts for the CONSENSUS VERDICT computation (PASS/WARN/FAIL)
2. Use Round 1 verdicts for FINDING COMPLETENESS — a finding in R1 but dropped in R2
   without explanation deserves mention
3. Compare R1 and R2 to identify position shifts
4. Flag judges who changed verdict without citing at least one of:
   - A specific file:line or code location
   - A factual error or misinterpretation in their R1 analysis
   - A missing test case or edge case
   These are "weak flips" — potential anchoring, not genuine persuasion.
5. If R1 had at least 2 judges with different verdicts AND R2 is unanimous, note
   "Convergence detected — review reasoning for anchoring risk"
6. In the report, include the Verdict Shifts table showing R1→R2 changes per judge
7. Detect whether debate ran via native teams (judges stayed alive between rounds) or
   fallback (R2 judges were re-spawned with truncated R1 verdicts). Include the
   `**Fidelity:**` field in the report header: "full" for native teams, "degraded" for fallback.

When a Round 2 verdict is unavailable (timeout fallback):
- Read the full R1 output file (.agents/council/YYYY-MM-DD-<slug>-claude-{perspective}.md)
- Extract the JSON verdict block (first JSON code block in the file)
- Use this as the judge's verdict for consolidation
- Mark in report: "Judge {perspective}: R1 verdict (R2 timeout)"
```

---

## Consensus Rules

| Condition | Verdict |
|-----------|---------|
| All PASS | PASS |
| Any FAIL | FAIL |
| Mixed PASS/WARN | WARN |
| All WARN | WARN |

Disagreement handling:

- If Claude says PASS and Codex says FAIL → DISAGREE (surface both)
- Severity-weighted: Security FAIL outweighs style WARN

**DISAGREE resolution:** When vendors disagree, the spawner presents both positions with reasoning and defers to the user. No automatic tie-breaking — cross-vendor disagreement is a signal worth human attention.
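The consensus table translates directly to code. An illustrative sketch (severity weighting and cross-vendor DISAGREE surfacing sit on top of this and are not shown):

```python
def consensus(verdicts: list[str]) -> str:
    """Consensus rules: any FAIL wins; all PASS passes; mixed or all-WARN lands on WARN."""
    if any(v == "FAIL" for v in verdicts):
        return "FAIL"
    if all(v == "PASS" for v in verdicts):
        return "PASS"
    return "WARN"

assert consensus(["PASS", "PASS"]) == "PASS"
assert consensus(["PASS", "WARN", "PASS"]) == "WARN"
assert consensus(["WARN", "FAIL"]) == "FAIL"
```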
---

## Output Format

### Council Report (Markdown)

```markdown
## Council Consensus: WARN

**Target:** Implementation of user authentication
**Mode:** validate, --mixed
**Judges:** 3 Claude (Opus 4.6) + 3 Codex (GPT-5.3-Codex)

---

### Verdicts

| Vendor | Judge 1 | Judge 2 | Judge 3 |
|--------|---------|---------|---------|
| Claude | PASS | WARN | PASS |
| Codex | WARN | WARN | WARN |

*(With `--preset`, column headers reflect perspective names instead of Judge N)*

---

### Shared Findings

- JWT implementation follows best practices
- Refresh token rotation is correctly implemented
- Test coverage is adequate

### Disagreements

| Issue | Claude | Codex |
|-------|--------|-------|
| Rate limiting | Optional for internal APIs | Required per OWASP |
| Token expiry | 1 hour acceptable | Should be 15 minutes |

### Cross-Vendor Insights

**Claude-only:** Noted UX friction in token refresh flow
**Codex-only:** Flagged potential timing attack in token comparison

---

### Recommendation

Add rate limiting to auth endpoints. Consider reducing token expiry to 30 minutes as compromise.

---

*Council completed in 45s. 6/6 judges responded.*
```

### Brainstorm Report

```markdown
## Council Brainstorm: <topic>

**Target:** <target>
**Judges:** <judges>

### Options Explored

| Option | Judge 1 | Judge 2 | Judge 3 |
|--------|---------|---------|---------|
| Option A | Assessment | Assessment | Assessment |
| Option B | Assessment | Assessment | Assessment |

### Recommendation

*Council completed in Ns. N/N judges responded.*
```

**Write to:** `.agents/council/YYYY-MM-DD-brainstorm-<slug>.md`

### Research Report

```markdown
## Council Research: <topic>

**Target:** <target>
**Judges:** <judges>

### Facets Explored

Each judge investigated a different aspect of the topic:

| Facet | Judge | Key Findings |
|-------|-------|-------------|
| <facet> | Judge 1 | <key findings> |
| <facet> | Judge 2 | <key findings> |
| <facet> | Judge 3 | <key findings> |

### Synthesized Findings

### Open Questions

- <question>

### Recommendation

*Council completed in Ns. N/N judges responded.*
```

**Write to:** `.agents/council/YYYY-MM-DD-research-<slug>.md`

### Debate Report Additions

When `--debate` is used, add these sections to any report format:

**Header addition:**

```markdown
**Mode:** {task_type}, --debate
**Rounds:** 2 (independent assessment + adversarial debate)
**Fidelity:** full (native teams — judges retained full R1 context for R2)
```

If debate ran in fallback mode (re-spawned with truncated R1 verdicts), use instead:

```markdown
**Mode:** {task_type}, --debate
**Rounds:** 2 (independent assessment + adversarial debate)
**Fidelity:** degraded (fallback — R1 verdicts truncated for R2 re-spawn)
```

**After the Verdicts table, add:**

```markdown
### Verdict Shifts (R1 → R2)

| Judge | R1 Verdict | R2 Verdict | Changed? | Reason |
|-------|-----------|-----------|----------|--------|
| Judge 1 (or Perspective) | PASS | WARN | Yes | Accepted Judge 2's finding on rate limiting |
| Judge 2 (or Perspective) | WARN | WARN | No | Confirmed after reviewing counterarguments |
| Judge 3 (or Perspective) | PASS | PASS | No | Maintained — challenged Judge 2's scope concern |

### Debate Notes

**Key Exchanges:**
- **Judge 1 ← Judge 2:** [what was exchanged and its impact]
- **Judge 3 vs Judge 2:** [where they disagreed and why]

**Steel-Man Highlights:**
- Judge 1 steel-manned: "[strongest opposing argument they engaged with]"
- Judge 2 steel-manned: "[strongest opposing argument they engaged with]"
```

**Convergence Detection:** If Round 1 had at least 2 judges with different verdicts AND Round 2 is unanimous, add this flag:

```markdown
> **⚠ Convergence Detected:** Judges who disagreed in Round 1 now agree in Round 2.
> Review debate reasoning to verify this reflects genuine persuasion, not anchoring.
> Round 1 verdicts preserved above for comparison.
```

**Footer update:**

```markdown
*Council completed in {R1_time + R2_time}. {N}/{N} judges responded in R1, {M}/{N} in R2.*
```

---

## Configuration

### Partial Completion

**Minimum quorum:** At least 1 agent must respond for a valid council (already documented).

**Recommended quorum:** two-thirds of judges (e.g., 2 of 3 in `--deep` mode).

**Timeout behavior:** If a judge does not respond within COUNCIL_TIMEOUT:
1. Log warning: "Judge {name} timed out after {N}s"
2. Proceed with remaining judges
3. Note in report: "N/M judges responded (J judge(s) timed out)"
4. If below minimum quorum (1), return error

**User cancellation:** If the user cancels mid-council:
1. Send shutdown_request to all judges
2. Read any output files already written
3. Generate partial report with INCOMPLETE marker
4. Note: "Council cancelled by user. Partial results from N/M judges."

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `COUNCIL_TIMEOUT` | 120 | Agent timeout in seconds |
| `COUNCIL_CODEX_MODEL` | gpt-5.3-codex | Default Codex model for --mixed |
| `COUNCIL_CLAUDE_MODEL` | opus | Claude model for agents |
| `COUNCIL_EXPLORER_MODEL` | sonnet | Model for explorer sub-agents |
| `COUNCIL_EXPLORER_TIMEOUT` | 60 | Explorer timeout in seconds |
| `COUNCIL_R2_TIMEOUT` | 90 | Maximum wait time for R2 debate completion after sending debate messages. Shorter than R1 since judges already have context. |

### Flags

| Flag | Description |
|------|-------------|
| `--deep` | 3 Claude agents instead of 2 |
| `--mixed` | Add 3 Codex agents |
| `--debate` | Enable adversarial debate round (2 rounds via SendMessage, same agents). Incompatible with `--quick`. |
| `--timeout=N` | Override timeout in seconds (default: 120) |
| `--perspectives="a,b,c"` | Custom perspective names |
| `--preset=<name>` | Built-in persona preset (security-audit, architecture, research, ops, code-review, plan-review, retrospective) |
| `--count=N` | Override agent count per vendor (e.g., `--count=4` = 4 Claude, or 4+4 with --mixed). Subject to MAX_AGENTS=12 cap. |
| `--explorers=N` | Explorer sub-agents per judge (default: 0, max: 5). Max effective value depends on judge count. Total agents capped at 12. |
| `--explorer-model=M` | Override explorer model (default: sonnet) |

---

## CLI Spawning Commands

### Team Setup

**Create the council team before spawning judges:**

```
TeamCreate(team_name="council-YYYYMMDD-<slug>")
```

Team naming convention: `council-YYYYMMDD-<slug>` (e.g., `council-20260206-auth-system`).

### Claude Agents (via Native Teams)

**Spawn judges as teammates on the council team:**

Default (independent judges, no perspectives):

```
Task(
  description="Council judge 1",
  subagent_type="general-purpose",
  model="opus",
  team_name="council-YYYYMMDD-<slug>",
  name="judge-1",
  prompt="{JUDGE_DEFAULT_PROMPT}"
)
```

With perspectives (--preset or --perspectives):

```
Task(
  description="Council judge: Error-Paths",
  subagent_type="general-purpose",
  model="opus",
  team_name="council-YYYYMMDD-<slug>",
  name="judge-error-paths",
  prompt="{JUDGE_PERSPECTIVE_PROMPT}"
)
```

Judges join the team, write output files, and send completion messages to the team lead via `SendMessage`.

**Fallback (if native teams unavailable):**

```
Task(
  description="Council judge 1",
  subagent_type="general-purpose",
  model="opus",
  run_in_background=true,
  prompt="{JUDGE_PACKET}"
)
```

### Codex Agents (via Codex CLI)

**Canonical Codex command form (unchanged — Codex cannot join teams):**

```bash
codex exec --full-auto -m gpt-5.3-codex -C "$(pwd)" -o .agents/council/codex-{perspective}.md "{PACKET}"
```

Always use this exact flag order: `--full-auto` → `-m` → `-C` → `-o` → prompt.

**Codex CLI flags (ONLY these are valid):**
- `--full-auto` — No approval prompts (REQUIRED, always first)
- `-m <model>` — Model override (default: gpt-5.3-codex)
- `-C <dir>` — Working directory
- `-o <file>` — Output file (use `-o` not `--output`)

**DO NOT USE:** `-q` (doesn't exist), `--quiet` (doesn't exist)

### Parallel Spawning

**Spawn all agents in parallel:**

```
# Step 1: Create team
TeamCreate(team_name="council-YYYYMMDD-<slug>")

# Step 2: Spawn Claude judges as teammates (parallel)
# Default (independent — no perspectives):
Task(description="Judge 1", team_name="council-...", name="judge-1", ...)
Task(description="Judge 2", team_name="council-...", name="judge-2", ...)
Task(description="Judge 3", team_name="council-...", name="judge-3", ...)
# With --preset or --perspectives:
# Task(description="Judge: Error-Paths", team_name="council-...", name="judge-error-paths", ...)

# Step 3: Spawn Codex agents (Bash tool, parallel — cannot join teams)
Bash(command="codex exec --full-auto -m gpt-5.3-codex -C \"$(pwd)\" -o .agents/council/codex-1.md ...", run_in_background=true)
Bash(command="codex exec --full-auto -m gpt-5.3-codex -C \"$(pwd)\" -o .agents/council/codex-2.md ...", run_in_background=true)
Bash(command="codex exec --full-auto -m gpt-5.3-codex -C \"$(pwd)\" -o .agents/council/codex-3.md ...", run_in_background=true)
```

**Wait for completion:** Judges send completion messages to the team lead via `SendMessage`. These arrive automatically as conversation turns. For Codex agents, use `TaskOutput(task_id="...", block=true)`.

### Debate Round 2 (via SendMessage)

**After R1 completes, send R2 instructions to existing judges (no re-spawn):**

```
# Determine branch
r1_unanimous = all R1 verdicts have same value

# Send to each judge
SendMessage(
  type="message",
  recipient="judge-1",  # or "judge-{perspective}" with presets
  content="## Debate Round 2\n\nOther judges' R1 verdicts:\n\n{OTHER_VERDICTS_JSON}\n\n{DEBATE_INSTRUCTIONS_FOR_BRANCH}",
  summary="Debate R2: review other verdicts"
)
```

Judges wake from idle, process R2, write R2 files, send completion message.

**R2 completion wait:** After sending R2 debate messages to all judges, wait up to `COUNCIL_R2_TIMEOUT` (default 90s) for each judge's completion message via `SendMessage`. If a judge does not respond within the timeout, read their R1 output file (`.agents/council/YYYY-MM-DD-<slug>-claude-{perspective}.md`) and use the R1 verdict for consolidation. Log: `Judge R2 timeout — using R1 verdict.`
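The R2-or-R1 file choice during consolidation is a two-line decision. A sketch, assuming `completed` holds the judges whose R2 completion message arrived in time (the helper name and prefix convention are illustrative):

```python
from pathlib import Path

def verdict_file(judge: str, prefix: str, completed: set[str]) -> Path:
    """Prefer the R2 file; fall back to the judge's R1 file on timeout."""
    if judge in completed:
        return Path(f"{prefix}-claude-{judge}-r2.md")
    print(f"Judge {judge} R2 timeout — using R1 verdict.")  # log line from above
    return Path(f"{prefix}-claude-{judge}.md")

# prefix is e.g. ".agents/council/2026-02-06-auth-system"
```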
### Team Cleanup

**After consolidation:**

```
# Shutdown each judge
SendMessage(type="shutdown_request", recipient="judge-1", content="Council complete")
SendMessage(type="shutdown_request", recipient="judge-2", content="Council complete")
SendMessage(type="shutdown_request", recipient="judge-3", content="Council complete")
# With presets: use judge-{perspective} names instead (e.g., judge-error-paths)

# Delete team
TeamDelete()
```

> **Note:** `TeamDelete()` deletes the team associated with this session's `TeamCreate()` call. If running concurrent teams (e.g., council inside crank), each team is cleaned up in the session that created it. No team name parameter is needed — the API tracks the current session's team context automatically.

### Reaper Cleanup Pattern

Team cleanup MUST succeed even on partial failures. Follow this sequence:

1. **Attempt graceful shutdown:** Send shutdown_request to each judge
2. **Wait up to 30s** for shutdown_approved responses
3. **If any judge doesn't respond:** Log warning, proceed anyway
4. **Always call TeamDelete()** — even if some judges are unresponsive
5. **TeamDelete cleans up** the team regardless of member state

**Failure modes and recovery:**

| Failure | Behavior |
|---------|----------|
| Judge hangs (no response) | 30s timeout → proceed to TeamDelete |
| shutdown_request fails | Log warning → proceed to TeamDelete |
| TeamDelete fails | Log error → team orphaned (manual cleanup: delete ~/.claude/teams/<team-name>/) |
| Lead crashes mid-council | Team orphaned until session ends or manual cleanup |

**Never skip TeamDelete.** A lingering team config pollutes future sessions.

### Team Timeout Configuration

| Timeout | Default | Description |
|---------|---------|-------------|
| Judge timeout | 120s | Max time for judge to complete (per round) |
| Shutdown grace period | 30s | Time to wait for shutdown_approved |
| R2 debate timeout | 90s | Max time for R2 completion after sending debate messages |

### Model Selection

| Vendor | Default | Override |
|--------|---------|----------|
| Claude | opus | `--claude-model=sonnet` |
| Codex | gpt-5.3-codex | `--codex-model=<model>` |

### Output Collection

All council outputs go to `.agents/council/`:

```bash
# Ensure directory exists
mkdir -p .agents/council

# Claude output (R1) — independent judges
.agents/council/YYYY-MM-DD-<slug>-claude-1.md

# Claude output (R1) — with presets
.agents/council/YYYY-MM-DD-<slug>-claude-error-paths.md

# Claude output (R2, when --debate)
.agents/council/YYYY-MM-DD-<slug>-claude-1-r2.md

# Codex output (R1 only, even with --debate)
.agents/council/YYYY-MM-DD-<slug>-codex-1.md

# Final consolidated report
.agents/council/YYYY-MM-DD-<slug>-report.md
```

---

## Examples

### Validate Recent Changes

```bash
/council validate recent
```

2 independent Claude judges validate recent commits (no perspective labels).

### Deep Architecture Review

```bash
/council --deep --preset=architecture research the authentication system
```

3 Claude agents (scalability, maintainability, simplicity) research auth design.

### Cross-Vendor Validation

```bash
/council --mixed validate this plan
```

3 Claude + 3 Codex agents, cross-vendor synthesis.
### Deep Research with Explorers ```bash /council --deep --explorers=3 research upgrade automation patterns ``` 3 judges each spawn 3 explorers = 12 parallel research threads. Each judge explores a different facet of the topic with sub-agent support. ### Security Audit ```bash /council --preset=security-audit --deep validate the API endpoints ``` 3 judges (attacker, defender, compliance) review security posture. ### Brainstorm Approaches ```bash /council brainstorm caching strategies for the API ``` 2 independent Claude judges explore options, pros/cons, recommend one. ### Research Trade-offs ```bash /council research Redis vs Memcached for session storage ``` 2 independent judges assess properties, trade-offs, and gaps between options. ### Validate a Spec ```bash /council validate the implementation plan in PLAN.md ``` 2 independent Claude judges provide structured feedback on the plan. --- ## Migration from /judge `/council` replaces `/judge`. Migration: | Old | New | |-----|-----| | `/judge recent` | `/council validate recent` | | `/judge 2 opus` | `/council recent` (default) | | `/judge 3 opus` | `/council --deep recent` | The `/judge` skill is deprecated. Use `/council`. --- ## Native Teams Architecture Council uses Claude Code native teams (`TeamCreate`, `SendMessage`, shared `TaskList`) as the primary spawning method for Claude judges. ### Deliberation Protocol The `--debate` flag implements the **deliberation protocol** pattern: > Independent assessment → evidence exchange → position revision → convergence analysis Native teams make this pattern first-class: - **R1:** Judges spawn as teammates, assess independently, send verdicts to team lead - **R2:** Team lead sends other judges' verdicts via `SendMessage`. Judges wake from idle with full R1 context — no truncation, no re-spawn overhead - **Consolidation:** Team lead reads all output files, computes consensus - **Cleanup:** `shutdown_request` each judge, `TeamDelete()` ### Communication Rules - **Judges → team lead only.** Judges never message each other directly. This prevents anchoring (a judge being swayed by another's framing before forming their own view). - **Team lead → judges.** Only the team lead sends messages to judges (R2 debate instructions, shutdown requests). - **No TaskList access.** Judges do not use `TaskList` or `TaskUpdate` — the team lead manages all coordination. ### Ralph Wiggum Compliance Council maintains fresh-context isolation (Ralph Wiggum pattern) with one documented exception: **`--debate` reuses judge context across R1 and R2.** This is intentional. Judges persist within a single atomic council invocation — they do NOT persist across separate council calls. The rationale: - Judges benefit from their own R1 analytical context (reasoning chain, not just the verdict JSON) when evaluating other judges' positions in R2 - Re-spawning with only the verdict summary (~200 tokens) would lose the judge's working memory of WHY they reached their verdict - The exception is bounded: max 2 rounds, within one invocation, with explicit cleanup (shutdown_request + TeamDelete) Without `--debate`, council is fully Ralph-compliant: each judge is a fresh spawn, executes once, writes output, and terminates. ### Fallback If `TeamCreate` is unavailable (API error, environment constraint), fall back to `Task(run_in_background=true)` fire-and-forget. 
In fallback mode:

- `--debate` reverts to R2 re-spawning with truncated R1 verdicts
- The debate report must include `**Fidelity:** degraded (fallback — R1 verdicts truncated for R2 re-spawn)` in the header so users know results may be lower fidelity
- Non-debate mode works identically (judges write files, team lead reads them)

### Team Naming

Convention: `council-YYYYMMDD-<slug>` (e.g., `council-20260206-auth-system`).

Judge names: `judge-{N}` for independent judges (e.g., `judge-1`, `judge-2`), or `judge-{perspective}` when using presets/perspectives (e.g., `judge-error-paths`, `judge-feasibility`).

---

## See Also

- `skills/vibe/SKILL.md` — Complexity + council for code validation (uses `--preset=code-review` when spec found)
- `skills/pre-mortem/SKILL.md` — Plan validation (uses `--preset=plan-review`, always 3 judges)
- `skills/post-mortem/SKILL.md` — Work wrap-up (uses `--preset=retrospective`, always 3 judges + retro)
- `skills/swarm/SKILL.md` — Multi-agent orchestration
- `skills/standards/SKILL.md` — Language-specific coding standards
- `skills/research/SKILL.md` — Codebase exploration (complementary to council research mode)