---
name: sadd:judge-with-debate
description: Evaluate solutions through multi-round debate between independent judges until consensus
argument-hint: Solution path(s) and evaluation criteria
---
# judge-with-debate
Evaluate solutions through multi-agent debate where independent judges analyze, challenge each other's assessments, and iteratively refine their evaluations until reaching consensus or maximum rounds.
This command implements the Multi-Agent Debate pattern for high-quality evaluation where multiple perspectives and rigorous argumentation improve assessment accuracy. Unlike single-pass evaluation, debate forces judges to defend their positions with evidence and consider counter-arguments.
## Pattern: Debate-Based Evaluation
This command implements iterative multi-judge debate:
```
Phase 0: Setup
mkdir -p .specs/reports
│
Phase 1: Independent Analysis
┌─ Judge 1 → {name}.1.md ─┐
Solution ┼─ Judge 2 → {name}.2.md ─┼─┐
└─ Judge 3 → {name}.3.md ─┘ │
│
Phase 2: Debate Round (iterative) │
Each judge reads others' reports │
↓ │
Argue + Defend + Challenge │
↓ │
Revise if convinced ─────────────┤
↓ │
Check consensus │
├─ Yes → Final Report │
└─ No → Next Round ─────────┘
```
## Process
### Setup: Create Reports Directory
Before starting evaluation, ensure the reports directory exists:
```bash
mkdir -p .specs/reports
```
**Report naming convention:** `.specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md`
Where:
- `{solution-name}` - Derived from solution filename (e.g., `users-api` from `src/api/users.ts`)
- `{YYYY-MM-DD}` - Current date
- `[1|2|3]` - Judge number
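As an illustrative sketch, the naming convention can be derived programmatically. Python is used here only for illustration; joining the file stem with its parent directory name is one possible way to get `users-api` from `src/api/users.ts`, not a fixed rule:

```python
from datetime import date
from pathlib import Path

def report_path(solution: str, judge: int, reports_dir: str = ".specs/reports") -> Path:
    # Assumption: solution name = "{stem}-{parent dir}", e.g. "users-api"
    # from "src/api/users.ts"; adjust to taste for your repo layout.
    p = Path(solution)
    name = f"{p.stem}-{p.parent.name}" if p.parent.name else p.stem
    return Path(reports_dir) / f"{name}-{date.today().isoformat()}.{judge}.md"

print(report_path("src/api/users.ts", 1))
# e.g. .specs/reports/users-api-2025-01-15.1.md (date varies)
```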
### Phase 1: Independent Analysis
Launch **3 independent judge agents in parallel** (recommended: Opus for rigor):
1. Each judge receives:
- Path to solution(s) being evaluated
- Evaluation criteria with weights
- Clear rubric for scoring
2. Each produces **independent assessment** saved to `.specs/reports/{solution-name}-{date}.[1|2|3].md`
3. Reports must include:
- Per-criterion scores with evidence
- Specific quotes/examples supporting ratings
- Overall weighted score
- Key strengths and weaknesses
**Key principle:** Independence in initial analysis prevents groupthink.
**Prompt template for initial judges:**
```markdown
You are Judge {N} evaluating a solution independently.
Solution: {path to solution file(s)}
Task: {what the solution was supposed to accomplish}
Criteria: {criteria with descriptions and weights}
Output file: .specs/reports/{solution-name}-{date}.{N}.md
Read ${CLAUDE_PLUGIN_ROOT}/tasks/judge.md for evaluation methodology and execute using the following criteria.
Instructions:
1. Read the solution thoroughly
2. For each criterion:
- Find specific evidence (quote exact text)
- Score on the defined scale
- Justify with concrete examples
3. Calculate weighted overall score
4. Write comprehensive report to {output_file}
5. Generate 5 verification questions about your evaluation
6. Answer verification questions:
- Re-examine solutions for each question
- Find counter-evidence if it exists
- Check for systematic bias (length, confidence, etc.)
7. Revise your report file and update it accordingly.
Add `Done by Judge {N}` at the beginning of the report
```
### Phase 2: Debate Rounds (Iterative)
For each debate round (max 3 rounds):
Launch **3 debate agents in parallel**:
1. Each judge agent receives:
- Path to their own previous report (`.specs/reports/{solution-name}-{date}.[1|2|3].md`)
- Paths to other judges' reports (`.specs/reports/{solution-name}-{date}.[1|2|3].md`)
- The original solution
2. Each judge:
- Identifies disagreements with other judges (>1 point score gap on any criterion)
- Defends their own ratings with evidence
- Challenges other judges' ratings they disagree with
- Considers counter-arguments
- Revises their assessment if convinced
3. Updates their report file with new section: `## Debate Round {R}`
4. After they reply, if they have reached agreement, move to Phase 3: Consensus Report
**Key principle:** Judges communicate only through the filesystem. The orchestrator doesn't mediate and doesn't read the report files itself, since doing so can overflow its context.
**Prompt template for debate judges:**
```markdown
You are Judge {N} in debate round {R}.
Your previous report: .specs/reports/{solution-name}-{date}.{N}.md
Other judges' reports:
Judge 1: .specs/reports/{solution-name}-{date}.1.md
...
Task: {what the solution was supposed to accomplish}
Solution: {path to solution}
Output file: .specs/reports/{solution-name}-{date}.{N}.md (append to existing file)
Read ${CLAUDE_PLUGIN_ROOT}/tasks/judge.md for evaluation methodology principles.
Instructions:
1. Read your previous assessment from {your_previous_report}
2. Read all other judges' reports
3. Identify disagreements (where your scores differ by >1 point)
4. For each major disagreement:
- State the disagreement clearly
- Defend your position with evidence
- Challenge the other judge's position with counter-evidence
- Consider whether their evidence changes your view
5. Update your report file by APPENDING:
6. Reply stating whether you have reached agreement, and with which judges. Include your revised overall and per-criterion scores.
---
## Debate Round {R}
### Disagreements Identified
**Disagreement with Judge {X} on Criterion "{Name}"**
- My score: {my_score}/5
- Their score: {their_score}/5
- My defense: [quote evidence supporting my score]
- My challenge: [what did they miss or misinterpret?]
[Repeat for each disagreement]
### Revised Assessment
After considering other judges' arguments:
- **Criterion "{Name}"**: [Maintained {X}/5 | Revised from {X} to {Y}/5]
- Reason for change: [what convinced me] OR
- Reason maintained: [why I stand by original score]
[Repeat for changed/maintained scores]
**New Weighted Score**: {updated_total}/5.0
## Evidence
[specific quotes]
---
CRITICAL:
- Only revise if you find their evidence compelling
- Defend your original scores if you still believe them
- Quote specific evidence from the solution
```
### Consensus Check
After each debate round, check for consensus:
**Consensus achieved if:**
- All judges' overall scores within 0.5 points of each other
- No criterion has >1 point disagreement across any two judges
- All judges explicitly state they accept the consensus
**If no consensus after 3 rounds:**
- Report persistent disagreements
- Provide all judge reports for human review
- Flag that automated evaluation couldn't reach consensus
**Orchestration Instructions:**
**Step 1: Run Independent Analysis (Round 1)**
1. Launch 3 judge agents in parallel (Judge 1, 2, 3)
2. Each writes their independent assessment to `.specs/reports/{solution-name}-{date}.[1|2|3].md`
3. Wait for all 3 agents to complete
**Step 2: Check for Consensus**
Let's work through this systematically to ensure accurate consensus detection.
Read all three reports and extract:
- Each judge's overall weighted score
- Each judge's score for every criterion
Check consensus step by step:
1. First, extract all overall scores from each report and list them explicitly
2. Calculate the difference between the highest and lowest overall scores
- If difference ≤ 0.5 points → overall consensus achieved
- If difference > 0.5 points → no consensus yet
3. Next, for each criterion, list all three judges' scores side by side
4. For each criterion, calculate the difference between highest and lowest scores
- If any criterion has difference > 1.0 point → no consensus on that criterion
5. Finally, verify consensus is achieved only if BOTH conditions are met:
- Overall scores within 0.5 points
- All criterion scores within 1.0 point
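The two conditions can be sketched as a small helper. Python is used here purely for illustration; the score structures (a list of overall scores, a dict of per-criterion score lists) are assumptions about how the orchestrator might hold parsed report data, not something the command prescribes:

```python
def check_consensus(overall: list[float], criteria: dict[str, list[float]]) -> bool:
    """Consensus requires overall scores within 0.5 AND every criterion within 1.0."""
    if max(overall) - min(overall) > 0.5:
        return False
    return all(max(s) - min(s) <= 1.0 for s in criteria.values())

# Round-1 scores from the worked example below: security spans 3-5, so no consensus yet.
print(check_consensus(
    overall=[4.1, 4.5, 4.4],
    criteria={"correctness": [4, 4, 5], "security": [3, 5, 4]},
))  # → False
```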
**Step 3: Decision Point**
- **If consensus achieved**: Go to Step 5 (Reply with Report)
- **If no consensus AND round < 3**: Go to Step 4 (Run Debate Round)
- **If no consensus AND round = 3**: Go to Step 5 (Reply with Report, flagging no consensus)
**Step 4: Run Debate Round**
1. Increment round counter (round = round + 1)
2. Launch 3 judge agents in parallel
3. Each agent reads:
- Their own previous report from filesystem
- Other judges' reports from filesystem
- Original solution
4. Each agent appends "Debate Round {R}" section to their own report file
5. Wait for all 3 agents to complete
6. Go back to Step 2 (Check for Consensus)
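Steps 1-4 form a loop, sketched below in illustrative Python. `run_judges` and `check_consensus` are hypothetical stand-ins: in the real command the former launches 3 judge agents in parallel and waits for them, and the latter parses their report files; here scores are simulated to converge so the loop terminates:

```python
MAX_ROUNDS = 3

def run_judges(round_num: int) -> list[float]:
    # Hypothetical stand-in for launching 3 judge agents in parallel;
    # simulated overall scores that converge across debate rounds.
    simulated = {1: [4.1, 5.0, 4.4], 2: [4.2, 4.8, 4.4], 3: [4.2, 4.5, 4.4]}
    return simulated[round_num]

def check_consensus(overall: list[float]) -> bool:
    # Criterion-level check omitted for brevity.
    return max(overall) - min(overall) <= 0.5

round_num = 1
scores = run_judges(round_num)                 # Step 1: independent analysis
while not check_consensus(scores) and round_num < MAX_ROUNDS:
    round_num += 1                             # Step 4: debate round
    scores = run_judges(round_num)             # then back to Step 2
if check_consensus(scores):
    print(f"Consensus after round {round_num}: {scores}")   # Step 5
else:
    print(f"No consensus after {MAX_ROUNDS} rounds")        # flag for human review
```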
**Step 5: Reply with Report**
Let's synthesize the evaluation results step by step.
1. Read all final reports carefully
2. Before generating the report, analyze the following:
- What is the consensus status (achieved or not)?
- What were the key points of agreement across all judges?
- What were the main areas of disagreement, if any?
- How did the debate rounds change the evaluations?
3. Reply to user with a report that contains:
- If there is consensus:
- Consensus scores (average of all judges)
- Consensus strengths/weaknesses
- Number of rounds to reach consensus
- Final recommendation with clear justification
- If there is no consensus:
- All judges' final scores showing disagreements
- Specific criteria where consensus wasn't reached
- Analysis of why consensus couldn't be reached
- Flag for human review
4. Command complete
### Phase 3: Consensus Report
If consensus achieved, synthesize the final report by working through each section methodically:
```markdown
# Consensus Evaluation Report
Let's compile the final consensus by analyzing each component systematically.
## Consensus Scores
First, let's consolidate all judges' final scores:
| Criterion | Judge 1 | Judge 2 | Judge 3 | Final |
|-----------|---------|---------|---------|-------|
| {Name} | {X}/5 | {X}/5 | {X}/5 | {X}/5 |
...
**Consensus Overall Score**: {avg}/5.0
## Consensus Strengths
[Review each judge's identified strengths and extract the common themes that all judges agreed upon]
## Consensus Weaknesses
[Review each judge's identified weaknesses and extract the common themes that all judges agreed upon]
## Debate Summary
Let's trace how consensus was reached:
- Rounds to consensus: {N}
- Initial disagreements: {list with specific criteria and score gaps}
- How resolved: {for each disagreement, explain what evidence or argument led to resolution}
## Final Recommendation
Based on the consensus scores and the key strengths/weaknesses identified:
{Pass/Fail/Needs Revision with clear justification tied to the evidence}
```
## Best Practices
### Evaluation Criteria
Choose 3-5 weighted criteria relevant to the solution type:
**Code evaluation:**
- Correctness (30%) - Does it work? Handles edge cases?
- Design Quality (25%) - Clean architecture? Maintainable?
- Efficiency (20%) - Performance considerations?
- Code Quality (15%) - Readable? Well-documented?
- Testing (10%) - Test coverage? Test quality?
**Design/Architecture evaluation:**
- Completeness (30%) - All requirements addressed?
- Feasibility (25%) - Can it actually be built?
- Scalability (20%) - Handles growth?
- Simplicity (15%) - Appropriately simple?
- Documentation (10%) - Clear and comprehensive?
**Documentation evaluation:**
- Accuracy (35%) - Technically correct?
- Completeness (30%) - Covers all necessary topics?
- Clarity (20%) - Easy to understand?
- Usability (15%) - Helpful examples? Good structure?
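Whatever criteria set is chosen, the weighted overall score each judge reports is the weight-normalized sum. A minimal sketch in Python, using the code-evaluation weights above with made-up per-criterion scores:

```python
# Weights from the code-evaluation criteria (percentages sum to 100).
weights = {"correctness": 30, "design": 25, "efficiency": 20, "code_quality": 15, "testing": 10}
# Hypothetical per-criterion scores on the /5 scale.
scores = {"correctness": 4, "design": 5, "efficiency": 4, "code_quality": 3, "testing": 4}

overall = sum(weights[c] * scores[c] for c in weights) / sum(weights.values())
print(f"{overall:.2f}/5.0")  # → 4.10/5.0
```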
### Common Pitfalls
❌ **Judges create new reports instead of appending** - Loses debate history
❌ **Orchestrator passes reports between judges** - Violates filesystem communication principle
❌ **Weak initial assessments** - Garbage in, garbage out
❌ **Too many debate rounds** - Diminishing returns after 3 rounds
❌ **Sycophancy in debate** - Judges agree too easily without real evidence
✅ **Judges append to their own report file**
✅ **Judges read other reports from filesystem directly**
✅ **Strong evidence-based initial assessments**
✅ **Maximum 3 debate rounds**
✅ **Require evidence for changing positions**
## Example Usage
### Evaluating an API Implementation
```bash
/judge-with-debate \
--solution "src/api/users.ts" \
--task "Implement REST API for user management" \
--criteria "correctness:30,design:25,security:20,performance:15,docs:10"
```
**Round 1 outputs** (assuming date 2025-01-15):
- `.specs/reports/users-api-2025-01-15.1.md` - Judge 1 scores correctness 4/5, security 3/5
- `.specs/reports/users-api-2025-01-15.2.md` - Judge 2 scores correctness 4/5, security 5/5
- `.specs/reports/users-api-2025-01-15.3.md` - Judge 3 scores correctness 5/5, security 4/5
**Disagreement detected:** Security scores range from 3-5
**Round 2 debate:**
- Judge 1 defends 3/5: "Missing rate limiting, input validation incomplete"
- Judge 2 challenges: "Rate limiting exists in middleware (line 45)"
- Judge 1 revises to 4/5: "Missed middleware, but input validation still weak"
- Judge 3 defends 4/5: "Input validation adequate for requirements"
**Round 2 outputs:**
- All judges now 4-5/5 on security (within 1 point)
- Disagreement on input validation remains
**Round 3 debate:**
- Judges examine specific validation code
- Judge 2 revises to 4/5: "Upon re-examination, email validation regex is weak"
- Consensus: Security = 4/5
**Final consensus:**
```
Correctness: 4.3/5
Design: 4.5/5
Security: 4.0/5 (3 rounds to consensus)
Performance: 4.7/5
Documentation: 4.0/5
Overall: 4.3/5 - PASS
```
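The final overall score is consistent with the weights given in the command invocation, as a quick check shows (illustrative Python):

```python
# Weights from the /judge-with-debate invocation above.
weights = {"correctness": 30, "design": 25, "security": 20, "performance": 15, "docs": 10}
# Consensus per-criterion scores from the final report.
final = {"correctness": 4.3, "design": 4.5, "security": 4.0, "performance": 4.7, "docs": 4.0}

overall = sum(weights[c] * final[c] for c in weights) / 100
print(round(overall, 1))  # → 4.3
```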