---
name: sadd:judge
description: Launch a sub-agent judge to evaluate results produced in the current conversation
argument-hint: "[evaluation-focus]"
---

# Judge Command

You are a coordinator launching a specialized judge sub-agent to evaluate work produced earlier in this conversation. The judge operates with isolated context, provides a structured evaluation with evidence-based scoring, and returns actionable feedback.

This command implements the LLM-as-Judge pattern with context isolation:

- **Context Isolation**: The judge operates with fresh context, preventing confirmation bias from accumulated session state
- **Chain-of-Thought Scoring**: Justification BEFORE the score, for a 15-25% reliability improvement
- **Evidence-Based**: Every score requires specific citations from the work (file locations, line numbers)
- **Multi-Dimensional Rubric**: Weighted criteria with clear level descriptions
- **Self-Verification**: Dynamic verification questions with documented adjustments

The evaluation is **report-only** - findings are presented without automatic changes.

## Your Workflow

### Phase 1: Context Extraction

Before launching the judge, identify what needs evaluation:

1. **Identify the work to evaluate**:
   - Review the conversation history for completed work
   - If arguments were provided: use them to focus on specific aspects
   - If unclear: ask the user "What work should I evaluate? (code changes, analysis, documentation, etc.)"

2. **Extract evaluation context**:
   - The original task or request that prompted the work
   - The actual output/result produced
   - Files created or modified (with brief descriptions)
   - Any constraints, requirements, or acceptance criteria mentioned

3. **Present the scope to the user**:

   ```
   Evaluation Scope:
   - Original request: [summary]
   - Work produced: [description]
   - Files involved: [list]
   - Evaluation focus: [from arguments or "general quality"]

   Launching judge sub-agent...
   ```

**IMPORTANT**: Pass only the extracted context to the judge - not the entire conversation.
This prevents context pollution and enables a focused assessment.

### Phase 2: Launch Judge Sub-Agent

Use the Task tool to spawn a single judge agent with the following prompt and context. Adjust the criteria rubric and weights to match the solution type and complexity, for example:

- Code Quality
- Documentation Quality
- Test Coverage
- Security
- Performance
- Usability
- Reliability
- Maintainability
- Scalability
- Cost-effectiveness
- Compliance
- Accessibility

**Judge Agent Prompt:**

```markdown
You are an Expert Judge evaluating the quality of work produced in a development session.

## Work Under Evaluation

[ORIGINAL TASK]
{paste the original request/task}
[/ORIGINAL TASK]

[WORK OUTPUT]
{summary of what was created/modified}
[/WORK OUTPUT]

[FILES INVOLVED]
{list of files with brief descriptions}
[/FILES INVOLVED]

[EVALUATION FOCUS]
{from arguments, or "General quality assessment"}
[/EVALUATION FOCUS]

Read ${CLAUDE_PLUGIN_ROOT}/tasks/judge.md and execute.

## Evaluation Criteria

### Criterion 1: Instruction Following (weight: 0.30)

Does the work follow all explicit instructions and requirements?

**Guiding Questions**:
- Does the output fulfill the original request?
- Were all explicit requirements addressed?
- Are there gaps or unexpected deviations?

| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | All instructions followed precisely, no deviations |
| Good | 4 | Minor deviations that do not affect outcome |
| Adequate | 3 | Major instructions followed, minor ones missed |
| Poor | 2 | Significant instructions ignored |
| Failed | 1 | Fundamentally misunderstood the task |

### Criterion 2: Output Completeness (weight: 0.25)

Are all requested aspects thoroughly covered?

**Guiding Questions**:
- Are all components of the request addressed?
- Is there appropriate depth for each component?
- Are there obvious gaps or missing pieces?
| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | All aspects thoroughly covered with appropriate depth |
| Good | 4 | Most aspects covered with minor gaps |
| Adequate | 3 | Key aspects covered, some notable gaps |
| Poor | 2 | Major aspects missing |
| Failed | 1 | Fundamental aspects not addressed |

### Criterion 3: Solution Quality (weight: 0.25)

Is the approach appropriate and well-implemented?

**Guiding Questions**:
- Is the chosen approach sound and appropriate?
- Does the implementation follow best practices?
- Are there correctness issues or errors?

| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | Optimal approach, clean implementation, best practices followed |
| Good | 4 | Good approach with minor issues |
| Adequate | 3 | Reasonable approach, some quality concerns |
| Poor | 2 | Problematic approach or significant quality issues |
| Failed | 1 | Fundamentally flawed approach |

### Criterion 4: Reasoning Quality (weight: 0.10)

Is the reasoning clear, logical, and well-documented?

**Guiding Questions**:
- Is the decision-making transparent?
- Were appropriate methods/tools used?
- Can someone understand why this approach was taken?

| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | Clear, logical reasoning throughout |
| Good | 4 | Generally sound reasoning with minor gaps |
| Adequate | 3 | Basic reasoning present |
| Poor | 2 | Reasoning unclear or flawed |
| Failed | 1 | No apparent reasoning |

### Criterion 5: Response Coherence (weight: 0.10)

Is the output well-structured and easy to understand?

**Guiding Questions**:
- Is the output organized logically?
- Can someone unfamiliar with the task understand it?
- Is it professionally presented?
| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | Well-structured, clear, professional |
| Good | 4 | Generally coherent with minor issues |
| Adequate | 3 | Understandable but could be clearer |
| Poor | 2 | Difficult to follow |
| Failed | 1 | Incoherent or confusing |
```

### Phase 3: Process and Present Results

After receiving the judge's evaluation:

1. **Validate the evaluation**:
   - Check that all criteria have scores in the valid range (1-5)
   - Verify each score has supporting justification with evidence
   - Confirm the weighted total calculation is correct
   - Check for contradictions between justification and score
   - Verify self-verification was completed with documented adjustments

2. **If validation fails**:
   - Note the specific issue
   - Request clarification or a re-evaluation if needed

3. **Present results to the user**:
   - Display the full evaluation report
   - Highlight the verdict and key findings
   - Offer follow-up options:
     - Address specific improvements
     - Request clarification on any judgment
     - Proceed with the work as-is

## Scoring Interpretation

| Score Range | Verdict | Interpretation | Recommendation |
|-------------|---------|----------------|----------------|
| 4.50 - 5.00 | EXCELLENT | Exceptional quality, exceeds expectations | Ready as-is |
| 4.00 - 4.49 | GOOD | Solid quality, meets professional standards | Minor improvements optional |
| 3.50 - 3.99 | ACCEPTABLE | Adequate but has room for improvement | Improvements recommended |
| 3.00 - 3.49 | NEEDS IMPROVEMENT | Below standard, requires work | Address issues before use |
| 1.00 - 2.99 | INSUFFICIENT | Does not meet basic requirements | Significant rework needed |

## Important Guidelines

1. **Context Isolation**: Pass only relevant context to the judge - not the entire conversation
2. **Justification First**: Always require evidence and reasoning BEFORE the score
3. **Evidence-Based**: Every score must cite specific evidence (file paths, line numbers, quotes)
4. **Bias Mitigation**: Explicitly warn against length bias, verbosity bias, and authority bias
5. **Be Objective**: Base assessments on evidence and rubric definitions, not preferences
6. **Be Specific**: Cite exact locations, not vague observations
7. **Be Constructive**: Frame criticism as opportunities for improvement with impact context
8. **Consider Context**: Account for stated constraints, complexity, and requirements
9. **Report Confidence**: Lower confidence when evidence is ambiguous or criteria are unclear
10. **Single Judge**: This command uses one focused judge for context isolation

## Notes

- This is a **report-only** command - it evaluates but does not modify work
- The judge operates with fresh context for unbiased assessment
- Scores are calibrated to professional development standards
- Low scores indicate improvement opportunities, not failures
- Use the evaluation to inform next steps and iterations
- The pass threshold (3.5/5.0) represents acceptable quality for general use
- Adjust the threshold based on criticality (4.0+ for critical operations)
- Low-confidence evaluations may warrant human review
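
## Appendix: Weighted-Total Validation Sketch

The weighted total that Phase 3 asks the coordinator to verify, and the verdict bands from the Scoring Interpretation table, can be sketched as follows. This is a minimal illustration using the rubric's weights; the function and variable names are not part of the command itself.

```python
# Rubric weights from the Evaluation Criteria section (they sum to 1.0).
WEIGHTS = {
    "instruction_following": 0.30,
    "output_completeness": 0.25,
    "solution_quality": 0.25,
    "reasoning_quality": 0.10,
    "response_coherence": 0.10,
}

# Verdict bands from the Scoring Interpretation table:
# (inclusive lower bound, verdict label), checked highest first.
VERDICT_BANDS = [
    (4.50, "EXCELLENT"),
    (4.00, "GOOD"),
    (3.50, "ACCEPTABLE"),
    (3.00, "NEEDS IMPROVEMENT"),
    (1.00, "INSUFFICIENT"),
]


def weighted_total(scores: dict[str, int]) -> float:
    """Validate per-criterion scores and return the weighted total."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the rubric criteria")
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} score {score} is outside the 1-5 range")
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 2)


def verdict(total: float) -> str:
    """Map a weighted total to its verdict band."""
    for lower, label in VERDICT_BANDS:
        if total >= lower:
            return label
    raise ValueError("total is below the minimum score of 1.0")
```

For example, scores of 5/4/4/3/4 across the five criteria give a weighted total of 0.30·5 + 0.25·4 + 0.25·4 + 0.10·3 + 0.10·4 = 4.2, which falls in the GOOD band.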