---
name: evaluation
description: "Evaluate agent systems with quality gates and LLM-as-judge. Use when you need to measure component quality or implement quality gates. Not for simple unit testing or binary pass/fail checks without nuance."
---

# Evaluation Methods for Agent Systems

Build quality gates and measure component quality using outcome-focused evaluation that accounts for non-determinism and multiple valid paths.

Multi-dimensional rubrics are implemented with weighted scoring, evidence requirements, and threshold-based quality gates.

Use when building quality gates, measuring component quality, or implementing LLM-as-judge. Not for simple unit testing or binary pass/fail checks without nuance.

DEFINE_RUBRIC → BUILD_TEST_SET → IMPLEMENT_EVALUATION → TRACK_METRICS

Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. A robust framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve their intended effects.

## Core Concepts

### The 95% Finding

Research on BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of performance variance:

| Factor               | Variance Explained | Implication                        |
| -------------------- | ------------------ | ---------------------------------- |
| Token usage          | 80%                | More tokens = better performance   |
| Number of tool calls | ~10%               | More exploration helps             |
| Model choice         | ~5%                | Better models multiply efficiency  |

**Critical Insight**: Model upgrades often provide larger gains than doubling token budgets. Upgrading to Claude Sonnet 4.5 yielded a larger gain than giving the previous Sonnet twice the token budget.

### Evaluation Challenges

**Non-Determinism and Multiple Valid Paths**
- Agents may take different valid paths to the same goal
- Traditional evaluations that check for specific steps fail
- Solution: Outcome-focused evaluation that judges results, not paths

**Context-Dependent Failures**
- Success on simple queries ≠ success on complex ones
- Some failures emerge only after extended interaction
- Solution: Test across complexity levels, include extended interactions

**Composite Quality Dimensions**
- Agent quality is multi-dimensional
- Includes factual accuracy, completeness, coherence, and tool efficiency
- Solution: Multi-dimensional rubrics with appropriate weighting

## Evaluation Framework

### Multi-Dimensional Rubrics

**Design Principles**:
- Cover key quality dimensions
- Use descriptive levels (excellent, good, fair, poor, failed)
- Convert levels to numeric scores (0.0 to 1.0)
- Weight dimensions based on use case

**Core Dimensions**:

**Factual Accuracy** - Claims match ground truth
- 1.0: All facts correct, no hallucinations
- 0.7: Mostly correct, minor inaccuracies
- 0.5: Mixed accuracy, some errors
- 0.3: Many errors, significant inaccuracies
- 0.0: Mostly false, major hallucinations

**Completeness** - Output covers all requested aspects
- 1.0: Addresses all requirements comprehensively
- 0.7: Covers most requirements with minor gaps
- 0.5: Partial coverage, missing some aspects
- 0.3: Minimal coverage, many gaps
- 0.0: Fails to address core requirements

**Portability** (Seed System Specific) - Component works without external dependencies
- 1.0: Zero dependencies, self-contained, portable
- 0.7: Minimal dependencies, mostly portable
- 0.5: Some dependencies, requires configuration
- 0.3: Many dependencies, limited portability
- 0.0: Tightly coupled, non-portable

**Context Efficiency** (Seed System Specific) - Uses context optimally (progressive disclosure)
- 1.0: Excellent use of progressive disclosure, minimal context
- 0.7: Good context management, some optimization possible
- 0.5: Adequate context usage, could be improved
- 0.3: Inefficient context usage, verbose
- 0.0: Wasteful context usage, bloats prompts

**Tool Efficiency** - Uses appropriate tools a reasonable number of times
- 1.0: Optimal tool selection, minimal calls
- 0.7: Good tool usage, slightly inefficient
- 0.5: Adequate tool usage, some redundancy
- 0.3: Inefficient tool usage, many redundant calls
- 0.0: Poor tool selection, excessive calls
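These dimensions can be captured as plain data so that a single rubric definition drives judge prompts, scoring, and quality gates. A minimal sketch: dimension names match the rubric above, the weights mirror the component-evaluation example later in this skill, and only the top and bottom level anchors are shown for brevity.

```python
# Sketch: the rubric above expressed as data. Weights mirror the
# component-evaluation example later in this skill and are illustrative.
RUBRIC = {
    "factual_accuracy": {
        "weight": 0.25,
        "levels": {1.0: "All facts correct, no hallucinations",
                   0.0: "Mostly false, major hallucinations"},
    },
    "completeness": {
        "weight": 0.25,
        "levels": {1.0: "Addresses all requirements comprehensively",
                   0.0: "Fails to address core requirements"},
    },
    "portability": {
        "weight": 0.25,
        "levels": {1.0: "Zero dependencies, self-contained, portable",
                   0.0: "Tightly coupled, non-portable"},
    },
    "context_efficiency": {
        "weight": 0.15,
        "levels": {1.0: "Excellent use of progressive disclosure, minimal context",
                   0.0: "Wasteful context usage, bloats prompts"},
    },
    "tool_efficiency": {
        "weight": 0.10,
        "levels": {1.0: "Optimal tool selection, minimal calls",
                   0.0: "Poor tool selection, excessive calls"},
    },
}

# Weights should sum to 1.0 so the overall score stays on the 0.0-1.0 scale
assert abs(sum(d["weight"] for d in RUBRIC.values()) - 1.0) < 1e-9
```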
### Scoring System

**Individual Dimension Scores**: 0.0 to 1.0 for each dimension

**Weighted Overall Score**:

```python
overall_score = sum(score[dim] * weight[dim] for dim in dimensions)
```

**Pass Threshold**: Set based on use case
- Production components: ≥ 0.8
- Development components: ≥ 0.7
- Experimental: ≥ 0.6

### LLM-as-Judge Pattern

**Direct Scoring**
- Evaluate against weighted criteria with rubrics
- Provide clear task description
- Include agent output and ground truth (if available)
- Request structured judgment with evidence

**Prompt Template**:

```
Task: [Description]
Agent Output: [Output]
Evaluation Criteria: [Rubric]

Evaluate the agent output on each dimension:
1. Factual Accuracy (0.0-1.0): [Score] - [Evidence]
2. Completeness (0.0-1.0): [Score] - [Evidence]
3. Portability (0.0-1.0): [Score] - [Evidence]
4. Context Efficiency (0.0-1.0): [Score] - [Evidence]
5. Tool Efficiency (0.0-1.0): [Score] - [Evidence]

Overall Score: [Weighted average]
Pass/Fail: [Threshold-based]
```

**Pairwise Comparison**
- Compare two outputs with position bias mitigation
- Automatically swap positions to reduce bias
- Ask judge to choose better overall output

**Position Swapping**:

```python
def evaluate_pairwise(output_a, output_b):
    # First comparison: A vs B
    result_1 = judge_evaluate(output_a, output_b)
    # Second comparison: B vs A (swapped)
    result_2 = judge_evaluate(output_b, output_a)
    # Combine results
    return reconcile_comparisons(result_1, result_2)
```
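The snippet above leaves `reconcile_comparisons` undefined. One plausible reconciliation, sketched under the assumption that each judgment reports a presentation-order winner (`"first"`, `"second"`, or `"tie"`): accept a verdict only when both orderings agree, and otherwise treat the comparison as a tie (or escalate to human review).

```python
# Sketch: one way reconcile_comparisons could combine the two judgments.
# Assumes judge_evaluate returns {"winner": "first" | "second" | "tie"},
# referring to presentation order -- an assumption, not a fixed API.
def reconcile_comparisons(result_1, result_2):
    # Map presentation-order verdicts back to the underlying outputs:
    # result_1 saw (A, B); result_2 saw (B, A).
    map_1 = {"first": "a", "second": "b", "tie": "tie"}
    map_2 = {"first": "b", "second": "a", "tie": "tie"}
    verdict_1 = map_1[result_1["winner"]]
    verdict_2 = map_2[result_2["winner"]]
    if verdict_1 == verdict_2:
        return {"winner": verdict_1, "position_consistent": True}
    # Disagreement across orderings signals position bias; call it a tie
    return {"winner": "tie", "position_consistent": False}
```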
## Evaluation Methods

### Test Set Design

**Sample Selection**
- Start small during development (changes have dramatic impact early)
- Sample from real usage patterns
- Add known edge cases
- Ensure coverage across complexity levels

**Complexity Stratification**
- **Simple**: Single tool call, clear requirements
- **Medium**: Multiple tool calls, some ambiguity
- **Complex**: Many tool calls, significant ambiguity
- **Very Complex**: Extended interaction, deep reasoning

### Context Engineering Evaluation

**Testing Context Strategies**
- Run different context strategies on the same test set
- Compare quality scores, token usage, efficiency metrics
- Validate progressive disclosure effectiveness

**Degradation Testing**
- Test at different context sizes
- Identify performance cliffs
- Establish safe operating limits

### Continuous Evaluation

**Evaluation Pipeline**
- Run automatically on component changes
- Track results over time
- Compare versions to identify improvements/regressions

**Production Monitoring**
- Sample interactions in production
- Evaluate randomly
- Set alerts for quality drops

## Practical Implementation

### Building Evaluation Frameworks

**Step 1**: Define quality dimensions relevant to the use case
**Step 2**: Create rubrics with clear level descriptions
**Step 3**: Build test sets from real patterns and edge cases
**Step 4**: Implement automated evaluation pipelines
**Step 5**: Establish baseline metrics before changes
**Step 6**: Run evaluations on all significant changes
**Step 7**: Track metrics over time
**Step 8**: Supplement with human review

### Example: Component Evaluation

```python
def evaluate_component(component, test_set):
    """Evaluate a Seed System component against the weighted rubric."""
    rubric = {
        "factual_accuracy": {"weight": 0.25},
        "completeness": {"weight": 0.25},
        "portability": {"weight": 0.25},
        "context_efficiency": {"weight": 0.15},
        "tool_efficiency": {"weight": 0.10},
    }

    # Collect per-dimension scores across all tests, then average them
    collected = {dim: [] for dim in rubric}
    for test in test_set:
        result = run_test(component, test)
        for dimension in rubric:
            collected[dimension].append(assess_dimension(result, dimension))

    scores = {dim: sum(vals) / len(vals) for dim, vals in collected.items()}
    overall = sum(scores[dim] * rubric[dim]["weight"] for dim in rubric)
    passed = overall >= 0.7

    return {
        "passed": passed,
        "scores": scores,
        "overall": overall,
        "threshold": 0.7,
    }
```

## Avoiding Evaluation Pitfalls

❌ **Overfitting to specific paths** - Evaluate outcomes, not specific steps
❌ **Ignoring edge cases** - Include diverse test scenarios
❌ **Single-metric obsession** - Use multi-dimensional rubrics
❌ **Neglecting context effects** - Test with realistic context sizes
❌ **Skipping human evaluation** - Automated evaluation misses subtle issues

## Enhanced Validation Workflow

**Phase 1: Component Generation**
- Generate component using meta-skills
- Apply progressive disclosure principles
- Optimize context usage

**Phase 2: Multi-Dimensional Evaluation**
- Run through the evaluation framework
- Score on all 5 dimensions
- Calculate weighted overall score

**Phase 3: Quality Gate**
- Block if below threshold (e.g., 0.7)
- Provide detailed feedback
- Suggest improvements

**Phase 4: Evidence Collection**
- Store evaluation results
- Track metrics over time
- Enable regression detection (see the sketch after the report example below)

### Example: Validation Report

```yaml
# Validation Report
component: my-skill
timestamp: 2026-01-26
overall_score: 0.82
threshold: 0.70
passed: true
dimensions:
  factual_accuracy:
    score: 0.90
    evidence: "All technical claims verified"
  completeness:
    score: 0.85
    evidence: "Covers all requirements with minor gaps"
  portability:
    score: 0.80
    evidence: "Self-contained, zero external dependencies"
  context_efficiency:
    score: 0.75
    evidence: "Good progressive disclosure, some optimization possible"
  tool_efficiency:
    score: 0.85
    evidence: "Optimal tool selection, minimal redundant calls"
recommendations:
  - "Consider further context optimization for large components"
  - "Add more examples to references/"
```
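Stored reports in this format can feed the regression detection called for in Phase 4: keep one report per evaluation run and diff the latest against a baseline. A minimal sketch, assuming reports are YAML files with the layout shown above; PyYAML, the file-path arguments, and the 0.05 tolerance are assumptions.

```python
# Sketch: compare the latest validation report against a stored baseline
# and flag dimensions whose scores dropped. The report layout matches the
# YAML example above; the tolerance value is illustrative.
import yaml

def detect_regressions(baseline_path, current_path, tolerance=0.05):
    with open(baseline_path) as f:
        baseline = yaml.safe_load(f)
    with open(current_path) as f:
        current = yaml.safe_load(f)

    regressions = []
    for dim, info in current["dimensions"].items():
        drop = baseline["dimensions"][dim]["score"] - info["score"]
        if drop > tolerance:
            regressions.append((dim, round(drop, 3)))

    return {
        "overall_drop": round(baseline["overall_score"] - current["overall_score"], 3),
        "regressed_dimensions": regressions,
    }
```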
## Guidelines

1. **Judge outcomes, not paths** - Multiple valid routes to goals
2. **Use multi-dimensional rubrics** - Quality is composite
3. **Test across complexity levels** - Simple ≠ Complex
4. **Implement position swapping** - Mitigate pairwise bias
5. **Require evidence** - Justify all scores
6. **Track over time** - Detect regressions
7. **Combine automated and human** - Catch what automation misses
8. **Test context strategies** - Validate progressive disclosure

## References

Skills referenced for related evaluation capabilities.

| When You Need To...                   | Use This Skill       | Routing Command                |
| ------------------------------------- | -------------------- | ------------------------------ |
| Implement progressive disclosure      | `filesystem-context` | `Skill("filesystem-context")`  |
| Validate component quality            | `quality-standards`  | `Skill("quality-standards")`   |
| Build evaluation rubrics and scoring  | (this skill)         | (current context)              |

**Research References**:
- BrowseComp evaluation on performance drivers
- Eugene Yan on LLM-evaluators
- Position bias in pairwise comparison

**Key Principle**: Evaluation should be outcome-focused, multi-dimensional, and continuously validated. Judge whether components achieve the right outcomes while following reasonable processes.

---

## Genetic Code

This component carries essential Seed System principles for context fork isolation:

MANDATORY: All components MUST be self-contained (zero .claude/rules dependency)
MANDATORY: Achieve 80-95% autonomy (0-5 AskUserQuestion rounds per session)
MANDATORY: Description MUST use What-When-Not format in third person
MANDATORY: No component references another component by name in its description
MANDATORY: Progressive disclosure - references/ for detailed content
MANDATORY: Use XML for control (mission_control, critical_constraint), Markdown for data

No exceptions. The portability invariant must be maintained.

**Delta Standard**: Good Component = Expert Knowledge − What Claude Already Knows

**Recognition Questions**:
- "Would Claude know this without being told?" → Delete (zero delta)
- "Can this work standalone?" → Fix if no (non-self-sufficient)
- "Did I read the actual file, or just see it in grep?" → Verify before claiming

MANDATORY: Use multi-dimensional rubrics (not single metrics)
MANDATORY: Require evidence for all scores
MANDATORY: Implement position swapping for pairwise comparisons
MANDATORY: Block below threshold (≥0.7 for production)

No exceptions. Evaluation without evidence is opinion, not assessment.
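A minimal sketch of how the last two rules might be enforced as a gate; the report structure follows the validation report example above, and the function name and use of exceptions are illustrative.

```python
# Sketch: enforce "evidence for all scores" and "block below threshold".
# Report structure follows the validation report example; names are illustrative.
def enforce_quality_gate(report, threshold=0.7):
    for dim, info in report["dimensions"].items():
        if not info.get("evidence"):
            raise ValueError(f"Score for '{dim}' lacks evidence - opinion, not assessment")
    if report["overall_score"] < threshold:
        raise ValueError(f"Blocked: overall {report['overall_score']:.2f} < {threshold}")
    return True
```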