# Quality Assurance Agents

Sources: Production evaluation patterns, inspired by anthropics/skills agent architecture.
Covers: grader agent prompt, blind comparator agent prompt, spawning patterns, output schemas.
Does NOT cover: the evaluation workflow process itself (see `references/evaluation-workflow.md`).

## Overview

Two reusable agent prompts for evaluating skill output quality:

| Agent | Purpose | When to Use |
|-------|---------|-------------|
| Grader | Evaluate assertions against outputs | After every test run |
| Blind Comparator | Judge two outputs without bias | When comparing skill versions |

Spawn these via delegation, using the relevant prompt section as the instructions.

## Grader Agent

### Role

The Grader reviews execution transcripts and output files, then determines whether each assertion passes or fails, citing evidence for every judgment. It has two jobs: grade the outputs AND critique the assertions themselves. A passing grade on a weak assertion creates false confidence.

### Delegation Template

```
TASK: Grade the outputs for eval "{eval_name}" against its assertions.

EXPECTED OUTCOME: grading.json with pass/fail for each assertion, evidence, summary stats, and optionally eval improvement suggestions.

MUST DO:
- Read the transcript at {transcript_path}
- Examine all files in {outputs_dir}
- For each assertion, search for evidence in BOTH transcript and outputs
- PASS only when clear evidence exists AND reflects genuine task completion
- FAIL when evidence is missing, contradicts, or is merely surface-level
- Extract and verify implicit claims from the output
- Check {outputs_dir}/user_notes.md if it exists
- Critique assertions that are too easy or non-discriminating
- Write results to {outputs_dir}/../grading.json

MUST NOT DO:
- Give partial credit (pass or fail, no middle ground)
- Pass assertions based on surface compliance (right filename but wrong content)
- Skip reading the actual output files
- Assume transcript claims are true without verification
- Nitpick every assertion; flag only genuinely problematic ones

CONTEXT:
- Assertions to evaluate: {assertions_list}
- Transcript path: {transcript_path}
- Outputs directory: {outputs_dir}
```

### Grading Criteria

**PASS when:**
- Transcript or outputs clearly demonstrate the assertion is true
- Specific evidence can be cited
- Evidence reflects genuine substance, not surface compliance

**FAIL when:**
- No evidence is found
- Evidence contradicts the assertion
- The assertion cannot be verified from available information
- Evidence is superficial (technically satisfied but the outcome is wrong)
- The output meets the assertion by coincidence, not by doing the work

**When uncertain:** The burden of proof is on the assertion to pass.

### Claim Extraction

Beyond predefined assertions, extract implicit claims from outputs:

| Claim Type | Example | How to Verify |
|-----------|---------|---------------|
| Factual | "The form has 12 fields" | Count in output file |
| Process | "Used pypdf to fill the form" | Check transcript tool calls |
| Quality | "All fields filled correctly" | Inspect output vs input |

Flag unverifiable claims. They may reveal issues the assertions miss.
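Factual claims like the first row of the table can often be checked mechanically rather than by re-reading the transcript. The sketch below is a minimal illustration, assuming the output under grading is a fillable PDF; the file name, the claimed count, and the idea of feeding the result into the grader's `claims` array are hypothetical, and the only library call used is pypdf's `PdfReader.get_fields()`.

```python
from pypdf import PdfReader

def verify_field_count_claim(pdf_path: str, claimed_count: int) -> dict:
    """Check a factual claim such as 'The form has 12 fields' against the output file."""
    fields = PdfReader(pdf_path).get_fields() or {}  # get_fields() returns None when the PDF has no form
    actual = len(fields)
    return {
        "claim": f"The form has {claimed_count} fields",
        "type": "factual",
        "verified": actual == claimed_count,
        "evidence": f"pypdf reports {actual} form fields in {pdf_path}",
    }

# Hypothetical usage; the path and count come from the claim being checked.
print(verify_field_count_claim("outputs/filled_form.pdf", claimed_count=12))
```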
### Eval Self-Critique

After grading, assess whether the assertions themselves are good:

**Flag when:**
- An assertion passed but would also pass for a clearly wrong output
- An important outcome has no assertion covering it
- An assertion can't actually be verified from available outputs

**Don't flag when:**
- The assertion is working as intended
- The suggestion would be nitpicking without real value

### Grader Output Schema

```json
{
  "expectations": [
    {
      "text": "The output includes a summary table",
      "passed": true,
      "evidence": "Found 3-column table in output.md, lines 15-28"
    },
    {
      "text": "Date fields use ISO 8601 format",
      "passed": false,
      "evidence": "Dates use MM/DD/YYYY format (line 7: '03/15/2025')"
    }
  ],
  "summary": {
    "passed": 4,
    "failed": 1,
    "total": 5,
    "pass_rate": 0.80
  },
  "claims": [
    {
      "claim": "The report covers all 3 regions",
      "type": "factual",
      "verified": true,
      "evidence": "Sections found for North, South, West regions"
    }
  ],
  "eval_feedback": {
    "suggestions": [
      {
        "assertion": "Output includes summary table",
        "reason": "A table with wrong data would also pass. Consider checking column headers match expected schema."
      }
    ],
    "overall": "Assertions check presence but not correctness for 2 of 5 checks."
  }
}
```

## Blind Comparator Agent

### Role

The Blind Comparator compares two outputs WITHOUT knowing which skill produced them, which prevents bias toward a particular approach. It judges purely on output quality and task completion.

### When to Use

- Comparing a new skill version against the previous version
- Deciding between two alternative skill approaches
- Validating that improvements actually improved quality

Not needed for every iteration. The grader + human review loop is usually sufficient. Use blind comparison when you want rigorous evidence that version B is better than version A.

### Delegation Template

```
TASK: Compare two outputs for the task "{eval_prompt}" and determine which is better. You do NOT know which approach produced which output.

EXPECTED OUTCOME: comparison.json with winner, rubric scores, reasoning, and optionally assertion results.

MUST DO:
- Read output A at {output_a_path}
- Read output B at {output_b_path}
- Generate a task-specific rubric (content + structure dimensions)
- Score each output 1-5 on each criterion
- Determine a winner based on overall quality
- If assertions are provided, check them for both outputs (secondary signal)
- Write results to {output_path}

MUST NOT DO:
- Try to guess which skill produced which output
- Favor outputs based on style preferences over correctness
- Declare a tie unless outputs are genuinely equivalent
- Use assertion scores as the primary decision factor

CONTEXT:
- Task prompt: {eval_prompt}
- Output A: {output_a_path}
- Output B: {output_b_path}
- Assertions (optional): {assertions_list}
```
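To keep the comparison genuinely blind, the spawning code (not the comparator) should decide at random which output becomes "A" and only un-blind after the verdict is written. A minimal sketch under stated assumptions: `spawn_agent` stands for whatever delegation primitive your harness provides, `template` is the delegation text above, and the version-to-path map and `comparison.json` location are hypothetical.

```python
import json
import random
from pathlib import Path

def run_blind_comparison(template: str, eval_prompt: str,
                         outputs: dict[str, str], spawn_agent) -> dict:
    """`outputs` maps skill version -> output path, e.g. {"v1": "runs/v1/form.pdf", "v2": "runs/v2/form.pdf"}."""
    versions = list(outputs)
    random.shuffle(versions)                          # hide which version becomes "A"
    blinding = {"A": versions[0], "B": versions[1]}

    comparison_path = "comparison.json"
    spawn_agent(template.format(                      # comparator writes comparison.json
        eval_prompt=eval_prompt,
        output_a_path=outputs[blinding["A"]],
        output_b_path=outputs[blinding["B"]],
        output_path=comparison_path,
        assertions_list="(none)",
    ))

    result = json.loads(Path(comparison_path).read_text())
    # Un-blind only after the verdict is recorded; a TIE maps to no version.
    result["winner_version"] = blinding.get(result.get("winner"))
    return result
```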
### Rubric Generation

The comparator generates a task-specific rubric with two dimensions:

**Content Rubric** (what the output contains):

| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|-----------|----------|----------------|---------------|
| Correctness | Major errors | Minor errors | Fully correct |
| Completeness | Missing key elements | Mostly complete | All present |
| Accuracy | Significant issues | Minor inaccuracies | Accurate |

**Structure Rubric** (how the output is organized):

| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|-----------|----------|----------------|---------------|
| Organization | Disorganized | Reasonable | Clear, logical |
| Formatting | Inconsistent | Mostly consistent | Professional |
| Usability | Difficult to use | Usable with effort | Easy to use |

Adapt criteria to the task. PDF form → "Field alignment", "Data placement". Document → "Section structure", "Heading hierarchy". Data → "Schema correctness".

### Decision Priority

1. **Primary**: Overall rubric score (content + structure)
2. **Secondary**: Assertion pass rates (if provided)
3. **Tiebreaker**: If truly equal, declare TIE (should be rare)

### Comparator Output Schema

In the example below, `content_score` and `structure_score` are the means of their criteria and `overall_score` is their sum.

```json
{
  "winner": "A",
  "reasoning": "Output A provides complete solution with proper formatting. Output B is missing the date field and has inconsistencies.",
  "rubric": {
    "A": {
      "content": {"correctness": 5, "completeness": 5, "accuracy": 4},
      "structure": {"organization": 4, "formatting": 5, "usability": 4},
      "content_score": 4.7,
      "structure_score": 4.3,
      "overall_score": 9.0
    },
    "B": {
      "content": {"correctness": 3, "completeness": 2, "accuracy": 3},
      "structure": {"organization": 3, "formatting": 2, "usability": 3},
      "content_score": 2.7,
      "structure_score": 2.7,
      "overall_score": 5.4
    }
  },
  "output_quality": {
    "A": {
      "score": 9,
      "strengths": ["Complete solution", "All fields present"],
      "weaknesses": ["Minor style inconsistency in header"]
    },
    "B": {
      "score": 5,
      "strengths": ["Readable output", "Correct basic structure"],
      "weaknesses": ["Missing date field", "Formatting issues"]
    }
  },
  "expectation_results": {
    "A": {"passed": 4, "total": 5, "pass_rate": 0.80},
    "B": {"passed": 3, "total": 5, "pass_rate": 0.60}
  }
}
```

## Post-Comparison Analysis

After blind comparison identifies a winner, optionally analyze WHY it won. Read both skills and transcripts to extract actionable improvements.

### Analysis Delegation Template

```
TASK: Analyze why the winning output was better and generate improvement suggestions for the losing skill.

MUST DO:
- Read the comparison result at {comparison_path}
- Read both skills (winner at {winner_skill_path}, loser at {loser_skill_path})
- Read both transcripts if available
- Identify instruction-following differences
- Generate prioritized improvement suggestions
- Focus on changes that would have changed the outcome

MUST NOT DO:
- Suggest cosmetic changes that wouldn't affect quality
- Speculate without evidence from transcripts
- Recommend changes that would only help this specific test case
```
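Chaining comparison and analysis is straightforward once the blinding map from the earlier sketch is available. The sketch below is illustrative only: `spawn_agent`, `analysis_template` (the delegation text above), and the `skill_paths` map are assumptions, and a TIE simply skips the analysis step.

```python
import json
from pathlib import Path

def maybe_run_analysis(analysis_template: str, comparison_path: str,
                       skill_paths: dict[str, str], blinding: dict[str, str],
                       spawn_agent) -> bool:
    """Spawn the analysis agent only when the comparison produced a clear winner.

    `skill_paths` maps skill version -> skill directory;
    `blinding` maps the labels "A"/"B" back to skill versions."""
    comparison = json.loads(Path(comparison_path).read_text())
    winner = comparison.get("winner")
    if winner not in ("A", "B"):                     # a TIE leaves nothing to analyze
        return False

    loser = "B" if winner == "A" else "A"
    spawn_agent(analysis_template.format(
        comparison_path=comparison_path,
        winner_skill_path=skill_paths[blinding[winner]],
        loser_skill_path=skill_paths[blinding[loser]],
    ))
    return True
```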
### Analysis Output Schema

```json
{
  "comparison_summary": {
    "winner": "A",
    "winner_skill": "path/to/winner",
    "loser_skill": "path/to/loser"
  },
  "winner_strengths": [
    "Clear step-by-step instructions for multi-page documents",
    "Included validation script that caught formatting errors"
  ],
  "loser_weaknesses": [
    "Vague instruction 'process appropriately' led to inconsistent behavior",
    "No validation step, so errors went uncaught"
  ],
  "improvement_suggestions": [
    {
      "priority": "high",
      "category": "instructions",
      "suggestion": "Replace 'process appropriately' with explicit steps",
      "expected_impact": "Eliminates ambiguity causing inconsistent behavior"
    }
  ]
}
```

### Suggestion Categories

| Category | What to Change |
|----------|---------------|
| `instructions` | Prose instructions in SKILL.md or references |
| `tools` | Scripts or utilities to add/modify |
| `examples` | Input/output examples to include |
| `error_handling` | Guidance for handling failures |
| `structure` | Reorganization of skill content |
| `references` | Additional reference docs to add |

### Priority Levels

- **high**: Would likely change the outcome of the comparison
- **medium**: Improves quality but may not change win/loss
- **low**: Nice to have, marginal improvement
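One natural consumer of this schema is an iteration loop that applies suggestions in priority order. A small sketch, assuming the analysis agent wrote its output to a hypothetical `analysis.json`:

```python
import json
from pathlib import Path

PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}

def suggestion_worklist(analysis_path: str) -> list[dict]:
    """Order improvement suggestions so outcome-changing fixes are applied first."""
    analysis = json.loads(Path(analysis_path).read_text())
    suggestions = analysis.get("improvement_suggestions", [])
    return sorted(suggestions, key=lambda s: PRIORITY_ORDER.get(s.get("priority"), 3))

# Hypothetical usage: edit the losing skill one suggestion at a time, highest priority first.
for item in suggestion_worklist("analysis.json"):
    print(f'[{item["priority"]}] ({item["category"]}) {item["suggestion"]}')
```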