---
name: eval-harness
description: Evaluation harness for testing agent and skill quality through structured benchmarks, regression tests, and quality scoring.
allowed-tools: Read, Write, Edit, Bash, Grep, Glob
---

# Eval Harness

## Overview

Evaluation harness methodology adapted from the Everything Claude Code project. Provides structured frameworks for benchmarking agent performance, testing skill quality, and running regression suites.

## Evaluation Types

### 1. Agent Performance Benchmark

- Define test cases with known-correct outputs
- Run the agent against each test case
- Score: accuracy, completeness, relevance
- Compare against baseline performance
- Track performance over time

### 2. Skill Quality Testing

- Verify skill instructions produce expected outcomes
- Test edge cases and boundary conditions
- Measure consistency across multiple runs
- Check for harmful or incorrect outputs
- Validate against ground truth

### 3. Regression Suite

- Collection of previously passing test cases
- Run after any agent/skill modification
- Flag regressions with before/after comparison
- Maintain pass rate threshold (>= 95%)

### 4. Process Verification

- End-to-end process execution with known inputs
- Verify each phase produces expected outputs
- Check task ordering and dependency satisfaction
- Measure total execution time

## Quality Scoring

### Accuracy Score (0-100)

- Correctness of output vs. expected
- Partial credit for partially correct outputs
- Penalty for hallucinated or fabricated content

### Completeness Score (0-100)

- Coverage of required output elements
- Missing sections flagged and scored
- Bonus for useful additional context

### Consistency Score (0-100)

- Run the same input 3 times
- Compare outputs for semantic similarity
- Flag inconsistencies

### Composite Score

- accuracy * 0.4 + completeness * 0.3 + consistency * 0.3
- Threshold: 80 to pass
- See the sketches under Example Sketches below

## When to Use

- After creating new agents or skills
- After modifying existing agents or skills
- Periodic quality audits
- Before promoting skills to production

## Agents Used

- Used by process-level evaluation orchestrators
- No specific agent dependency (evaluates other agents)
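
## Example Sketches

The composite rule above maps directly to code. A minimal sketch in Python, assuming the three per-dimension scores (0-100) have already been produced by whatever judge the harness uses; `EvalScores` and its field names are illustrative, not an existing API:

```python
from dataclasses import dataclass

# Weights and pass threshold from the Composite Score definition above.
WEIGHTS = {"accuracy": 0.4, "completeness": 0.3, "consistency": 0.3}
PASS_THRESHOLD = 80.0


@dataclass
class EvalScores:
    accuracy: float      # 0-100: correctness vs. expected output
    completeness: float  # 0-100: coverage of required elements
    consistency: float   # 0-100: agreement across repeated runs

    def composite(self) -> float:
        return (self.accuracy * WEIGHTS["accuracy"]
                + self.completeness * WEIGHTS["completeness"]
                + self.consistency * WEIGHTS["consistency"])

    def passed(self) -> bool:
        return self.composite() >= PASS_THRESHOLD


scores = EvalScores(accuracy=90, completeness=75, consistency=80)
print(f"composite={scores.composite():.1f} pass={scores.passed()}")
# 90*0.4 + 75*0.3 + 80*0.3 = 36 + 22.5 + 24 = 82.5 -> pass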
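
The consistency score (run the same input 3 times, compare outputs) can be sketched the same way. Here `difflib`'s ratio stands in as a crude lexical proxy; a real harness would substitute a semantic-similarity measure such as embedding cosine, and `run_agent` is a stand-in for whatever function actually drives the agent:

```python
import difflib
from itertools import combinations
from statistics import mean
from typing import Callable


def consistency_score(run_agent: Callable[[str], str],
                      prompt: str, n_runs: int = 3) -> float:
    """Run the same input n times; score pairwise output similarity (0-100)."""
    outputs = [run_agent(prompt) for _ in range(n_runs)]
    ratios = [difflib.SequenceMatcher(None, a, b).ratio()
              for a, b in combinations(outputs, 2)]
    return 100.0 * mean(ratios)
```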
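
A regression suite runner follows the same shape: re-score every previously passing case, flag each regression with a before/after comparison, and fail the suite when the pass rate drops below 95%. The sketch assumes the agent can be invoked as a plain function over text; `run_agent` and `score` are placeholders for the real harness hooks:

```python
from typing import Callable


def run_regression_suite(
    cases: list[tuple[str, str]],            # (input, expected output) pairs
    run_agent: Callable[[str], str],         # wraps the agent under test
    score: Callable[[str, str], float],      # composite score, 0-100
    case_threshold: float = 80.0,            # per-case pass bar
    suite_threshold: float = 0.95,           # >= 95% of cases must pass
) -> bool:
    passed = 0
    for inp, expected in cases:
        actual = run_agent(inp)
        s = score(actual, expected)
        if s >= case_threshold:
            passed += 1
        else:
            # Flag the regression with a before/after comparison.
            print(f"REGRESSION ({s:.1f}): input={inp!r}")
            print(f"  expected: {expected!r}")
            print(f"  actual:   {actual!r}")
    rate = passed / len(cases) if cases else 1.0
    print(f"pass rate: {rate:.1%} (threshold {suite_threshold:.0%})")
    return rate >= suite_threshold
```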
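
Process verification can be sketched by modeling each phase as a function over a shared state dict: phases run in declared order, each must produce its expected output keys before later phases read them (the dependency check), and total execution time is measured. The `Phase` tuple shape is an assumption for illustration, not part of the harness:

```python
import time
from typing import Callable

# (name, run function, expected output keys)
Phase = tuple[str, Callable[[dict], dict], list[str]]


def verify_process(phases: list[Phase], initial_state: dict) -> dict:
    """Run phases in order; check each produces the outputs later phases need."""
    state = dict(initial_state)
    start = time.monotonic()
    for name, run_phase, expected_keys in phases:
        state.update(run_phase(state))  # a phase reads prior outputs from state
        missing = [k for k in expected_keys if k not in state]
        if missing:
            raise AssertionError(f"phase {name!r} missing outputs: {missing}")
    elapsed = time.monotonic() - start
    print(f"all {len(phases)} phases verified in {elapsed:.2f}s")
    return state
```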