---
name: agent-evaluation
description: Design and implement comprehensive evaluation systems for AI agents. Use when building evals for coding agents, conversational agents, research agents, or computer-use agents. Covers grader types, benchmarks, 8-step roadmap, and production integration.
allowed-tools: [Read, Write, Shell, Grep, Glob]
tags: [agent-evaluation, evals, AI-agents, benchmarks, graders, testing, quality-assurance]
platforms: [Claude, ChatGPT, Gemini]
---

# Agent Evaluation (AI Agent Evals)

> Based on Anthropic's "Demystifying evals for AI agents"

## When to use this skill

- Designing evaluation systems for AI agents
- Building benchmarks for coding, conversational, or research agents
- Creating graders (code-based, model-based, human)
- Implementing production monitoring for AI systems
- Setting up CI/CD pipelines with automated evals
- Debugging agent performance issues
- Measuring agent improvement over time

## Core Concepts

### Eval Evolution: Single-turn → Multi-turn → Agentic

| Type | Turns | State | Grading | Complexity |
|------|-------|-------|---------|------------|
| **Single-turn** | 1 | None | Simple | Low |
| **Multi-turn** | N | Conversation | Per-turn | Medium |
| **Agentic** | N | World + History | Outcome | High |

### 7 Key Terms

| Term | Definition |
|------|------------|
| **Task** | Single test case (prompt + expected outcome) |
| **Trial** | One agent run on a task |
| **Grader** | Scoring function (code/model/human) |
| **Transcript** | Full record of agent actions |
| **Outcome** | Final state for grading |
| **Harness** | Infrastructure running evals |
| **Suite** | Collection of related tasks |

## Instructions

### Step 1: Understand Grader Types

#### Code-based Graders (Recommended for Coding Agents)

- **Pros**: Fast, objective, reproducible
- **Cons**: Requires clear success criteria
- **Best for**: Coding agents, structured outputs

```python
import subprocess


# Example: Code-based grader
def grade_task(outcome: dict) -> float:
    """Grade coding task by test passage."""
    tests_passed = outcome.get("tests_passed", 0)
    total_tests = outcome.get("total_tests", 1)
    return tests_passed / total_tests


# SWE-bench style grader
def grade_swe_bench(repo_path: str, test_spec: dict) -> bool:
    """Run tests and check if the patch resolves the issue."""
    result = subprocess.run(
        ["pytest", test_spec["test_file"]],
        cwd=repo_path,
        capture_output=True
    )
    return result.returncode == 0
```

#### Model-based Graders (LLM-as-Judge)

- **Pros**: Flexible, handles nuance
- **Cons**: Requires calibration, can be inconsistent
- **Best for**: Conversational agents, open-ended tasks

```yaml
# Example: LLM Rubric for Customer Support Agent
rubric:
  dimensions:
    - name: empathy
      weight: 0.3
      scale: 1-5
      criteria: |
        5: Acknowledges emotions, uses warm language
        3: Polite but impersonal
        1: Cold or dismissive
    - name: resolution
      weight: 0.5
      scale: 1-5
      criteria: |
        5: Fully resolves issue
        3: Partial resolution
        1: No resolution
    - name: efficiency
      weight: 0.2
      scale: 1-5
      criteria: |
        5: Resolved in minimal turns
        3: Reasonable turns
        1: Excessive back-and-forth
```

#### Human Graders

- **Pros**: Highest accuracy, catches edge cases
- **Cons**: Expensive, slow, not scalable
- **Best for**: Final validation, ambiguous cases

### Step 2: Choose Strategy by Agent Type

#### 2.1 Coding Agents

**Benchmarks**:
- SWE-bench Verified: Real GitHub issues (40% → 80%+ achievable)
- Terminal-Bench: Complex terminal tasks
- Custom test suites built on your own codebase

**Grading Strategy**:
```python
def grade_coding_agent(task: dict, outcome: dict) -> dict:
    return {
        "tests_passed": run_test_suite(outcome["code"]),
        "lint_score": run_linter(outcome["code"]),
        "builds": check_build(outcome["code"]),
        "matches_spec": compare_to_reference(task["spec"], outcome["code"])
    }
```

**Key Metrics**:
- Test passage rate
- Build success
- Lint/style compliance
- Diff size (smaller is better; one way to fold these metrics into a single score is sketched below)
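If the harness needs a single number per trial, the per-dimension results above can be combined into a weighted aggregate. A minimal sketch, assuming `tests_passed` and `lint_score` are already normalized to 0-1 and using illustrative weights and an assumed 500-line diff cutoff:

```python
def aggregate_coding_score(metrics: dict, diff_lines: int, max_diff_lines: int = 500) -> float:
    """Combine coding-agent metrics into one 0-1 score (weights are assumptions)."""
    diff_penalty = min(diff_lines / max_diff_lines, 1.0)  # larger patches score lower
    return round(
        0.6 * metrics["tests_passed"]       # fraction of tests passing (0-1)
        + 0.2 * float(metrics["builds"])    # build succeeded (bool -> 0/1)
        + 0.1 * metrics["lint_score"]       # lint/style compliance (0-1)
        + 0.1 * (1.0 - diff_penalty),       # reward concise diffs
        3,
    )
```

Keeping the aggregate separate from the raw metric dict preserves the per-dimension breakdown for debugging regressions.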
#### 2.2 Conversational Agents

**Benchmarks**:
- τ2-Bench: Multi-domain conversation
- Custom domain-specific suites

**Grading Strategy** (Multi-dimensional):
```yaml
success_criteria:
  - empathy_score: ">= 4.0"
  - resolution_rate: ">= 0.9"
  - avg_turns: "<= 5"
  - escalation_rate: "<= 0.1"
```

**Key Metrics**:
- Task resolution rate
- Customer satisfaction proxy
- Turn efficiency
- Escalation rate

#### 2.3 Research Agents

**Grading Dimensions**:
1. **Grounding**: Claims backed by sources
2. **Coverage**: All aspects addressed
3. **Source Quality**: Authoritative sources used

```python
def grade_research_agent(task: dict, outcome: dict) -> dict:
    return {
        "grounding": check_citations(outcome["report"]),
        "coverage": check_topic_coverage(task["topics"], outcome["report"]),
        "source_quality": score_sources(outcome["sources"]),
        "factual_accuracy": verify_claims(outcome["claims"])
    }
```

#### 2.4 Computer Use Agents

**Benchmarks**:
- WebArena: Web navigation tasks
- OSWorld: Desktop environment tasks

**Grading Strategy**:
```python
def grade_computer_use(task: dict, outcome: dict) -> dict:
    return {
        "ui_state": verify_ui_state(outcome["screenshot"]),
        "db_state": verify_database(task["expected_db_state"]),
        "file_state": verify_files(task["expected_files"]),
        "success": all_conditions_met(task, outcome)
    }
```

### Step 3: Follow the 8-Step Roadmap

#### Step 0: Start Early (20-50 Tasks)

```bash
# Create the initial eval suite structure
mkdir -p evals/{tasks,results,graders}

# Start with representative tasks:
# - Common use cases (60%)
# - Edge cases (20%)
# - Failure modes (20%)
```

#### Step 1: Convert Manual Tests

```python
# Transform existing QA tests into eval tasks
def convert_qa_to_eval(qa_case: dict) -> dict:
    return {
        "id": qa_case["id"],
        "prompt": qa_case["input"],
        "expected_outcome": qa_case["expected"],
        "grader": "code" if qa_case["has_tests"] else "model",
        "tags": qa_case.get("tags", [])
    }
```

#### Step 2: Ensure Clarity + Reference Solutions

```yaml
# Good task definition
task:
  id: "api-design-001"
  prompt: |
    Design a REST API for user management with:
    - CRUD operations
    - Authentication via JWT
    - Rate limiting
  reference_solution: "./solutions/api-design-001/"
  success_criteria:
    - "All endpoints documented"
    - "Auth middleware present"
    - "Rate limit config exists"
```

#### Step 3: Balance Positive/Negative Cases

```python
# Ensure eval suite balance
suite_composition = {
    "positive_cases": 0.5,   # Should succeed
    "negative_cases": 0.3,   # Should fail gracefully
    "edge_cases": 0.2        # Boundary conditions
}
```

#### Step 4: Isolate Environments

```yaml
# Docker-based isolation for coding evals
eval_environment:
  type: docker
  image: "eval-sandbox:latest"
  timeout: 300s
  resources:
    memory: "4g"
    cpu: "2"
  network: isolated
  cleanup: always
```

#### Step 5: Focus on Outcomes, Not Paths

```python
# GOOD: Outcome-focused grader
def grade_outcome(expected: dict, actual: dict) -> float:
    return compare_final_states(expected, actual)

# BAD: Path-focused grader (too brittle)
def grade_path(expected_steps: list, actual_steps: list) -> float:
    return step_by_step_match(expected_steps, actual_steps)
```
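`compare_final_states` is left abstract above. A minimal sketch of what an outcome-focused comparison could look like, assuming outcomes are flat dicts mapping state names (files written, DB rows, API responses) to expected values:

```python
def compare_final_states(expected: dict, actual: dict) -> float:
    """Fraction of expected final-state entries the agent actually produced.

    Extra keys in `actual` are ignored on purpose: the path the agent took
    does not matter, only whether the required end state was reached.
    """
    if not expected:
        return 1.0
    matched = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return matched / len(expected)
```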
#### Step 6: Always Read Transcripts

```python
# Transcript analysis for debugging
def analyze_transcript(transcript: list) -> dict:
    return {
        "total_steps": len(transcript),
        "tool_usage": count_tool_calls(transcript),
        "errors": extract_errors(transcript),
        "decision_points": find_decision_points(transcript),
        "recovery_attempts": find_recovery_patterns(transcript)
    }
```

#### Step 7: Monitor Eval Saturation

```python
# Detect when evals are no longer useful
def check_saturation(results: list, window: int = 10) -> dict:
    recent = results[-window:]
    is_saturated = all(r["passed"] for r in recent)
    return {
        "pass_rate": sum(r["passed"] for r in recent) / len(recent),
        "variance": calculate_variance(recent),
        "is_saturated": is_saturated,
        "recommendation": "Add harder tasks" if is_saturated else "Continue"
    }
```

#### Step 8: Long-term Maintenance

```yaml
# Eval suite maintenance checklist
maintenance:
  weekly:
    - Review failed evals for false negatives
    - Check for flaky tests
  monthly:
    - Add new edge cases from production issues
    - Retire saturated evals
    - Update reference solutions
  quarterly:
    - Full benchmark recalibration
    - Team contribution review
```

### Step 4: Integrate with Production

#### CI/CD Integration

```yaml
# GitHub Actions example
name: Agent Evals
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evals
        run: |
          python run_evals.py --suite=core --mode=compact
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
```

#### Production Monitoring

```python
import random


# Real-time eval sampling
class ProductionMonitor:
    def __init__(self, sample_rate: float = 0.1, threshold: float = 0.7):
        self.sample_rate = sample_rate
        self.threshold = threshold  # minimum acceptable eval score before alerting

    async def monitor(self, request, response):
        if random.random() < self.sample_rate:
            eval_result = await self.run_eval(request, response)
            self.log_result(eval_result)
            if eval_result["score"] < self.threshold:
                self.alert("Low quality response detected")
```

#### A/B Testing

```python
# Compare agent versions
def run_ab_test(suite: str, versions: list) -> dict:
    results = {}
    for version in versions:
        results[version] = run_eval_suite(suite, agent_version=version)
    return {
        "comparison": compare_results(results),
        "winner": determine_winner(results),
        "confidence": calculate_confidence(results)
    }
```

## Best Practices

### Do's ✅

1. **Start with 20-50 representative tasks**
2. **Use code-based graders when possible**
3. **Focus on outcomes, not paths**
4. **Read transcripts for debugging**
5. **Monitor for eval saturation**
6. **Balance positive/negative cases**
7. **Isolate eval environments**
8. **Version your eval suites**

### Don'ts ❌

1. **Don't over-rely on model-based graders without calibration**
2. **Don't ignore failed evals (false negatives exist)**
3. **Don't grade on intermediate steps**
4. **Don't skip transcript analysis**
5. **Don't use production data without sanitization**
6. **Don't let eval suites become stale**

## Success Patterns

### Pattern 1: Graduated Eval Complexity

```
Level 1: Unit evals (single capability)
Level 2: Integration evals (combined capabilities)
Level 3: End-to-end evals (full workflows)
Level 4: Adversarial evals (edge cases)
```

### Pattern 2: Eval-Driven Development

```
1. Write eval task for new feature
2. Run eval (expect failure)
3. Implement feature
4. Run eval (expect pass)
5. Add to regression suite
```
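The regression half of this loop can ride on an existing test runner. A minimal pytest sketch, where `my_eval_harness`, its functions, and the 0.9 pass bar are hypothetical placeholders for your own harness:

```python
import pytest

from my_eval_harness import load_task, run_agent, grade  # hypothetical harness module


@pytest.mark.parametrize("task_id", ["api-design-001", "support-refund-001"])
def test_agent_regression(task_id: str):
    """New eval tasks start red, go green once the feature ships, then stay in CI."""
    task = load_task(task_id)
    outcome = run_agent(task["prompt"])
    assert grade(task, outcome) >= 0.9  # example pass bar, tune per suite
```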
### Pattern 3: Continuous Calibration

```
Weekly: Review grader accuracy
Monthly: Update rubrics based on feedback
Quarterly: Full grader audit with human baseline
```

## Troubleshooting

### Problem: Eval scores at 100%
**Solution**: Add harder tasks; check for eval saturation (Step 7)

### Problem: Inconsistent model-based grader scores
**Solution**: Add more examples to the rubric, use structured output, use an ensemble of graders

### Problem: Evals too slow for CI
**Solution**: Use toon mode, parallelize, sample a subset for PR checks

### Problem: Agent passes evals but fails in production
**Solution**: Add production failure cases to the eval suite; increase task diversity

## References

- [Anthropic: Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)
- [SWE-bench](https://www.swebench.com/)
- [WebArena](https://webarena.dev/)
- [τ2-Bench](https://github.com/sierra-research/tau2-bench)

## Examples

### Example 1: Simple Coding Agent Eval

```python
# Task definition
task = {
    "id": "fizzbuzz-001",
    "prompt": "Write a fizzbuzz function in Python",
    "test_cases": [
        {"input": 3, "expected": "Fizz"},
        {"input": 5, "expected": "Buzz"},
        {"input": 15, "expected": "FizzBuzz"},
        {"input": 7, "expected": "7"}
    ]
}


# Grader
def grade(task, outcome):
    code = outcome["code"]
    namespace = {}
    exec(code, namespace)  # Run only inside an isolated sandbox
    fizzbuzz = namespace["fizzbuzz"]
    for tc in task["test_cases"]:
        if fizzbuzz(tc["input"]) != tc["expected"]:
            return 0.0
    return 1.0
```

### Example 2: Conversational Agent Eval with LLM Rubric

```yaml
task:
  id: "support-refund-001"
  scenario: |
    Customer wants refund for damaged product.
    Product: Laptop, Order: #12345, Damage: Screen crack
  expected_actions:
    - Acknowledge issue
    - Verify order
    - Offer resolution options
  max_turns: 5
  grader:
    type: model
    model: claude-3-5-sonnet-20241022
    rubric: |
      Score 1-5 on each dimension:
      - Empathy: Did agent acknowledge customer frustration?
      - Resolution: Was a clear solution offered?
      - Efficiency: Was issue resolved in reasonable turns?
```
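For completeness, here is a minimal sketch of how the model grader in Example 2 might be invoked with the Anthropic Python SDK. The prompt wording, the JSON-only reply contract, and the `grade_with_rubric` helper are assumptions for illustration:

```python
import json

import anthropic


def grade_with_rubric(rubric: str, transcript_text: str) -> dict:
    """Score a conversation transcript against a rubric with an LLM judge.

    Assumes the grader model returns bare JSON such as
    {"empathy": 4, "resolution": 5, "efficiency": 3}; a production grader
    would add output validation, retries, and calibration examples.
    """
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    prompt = (
        f"{rubric}\n\nTranscript:\n{transcript_text}\n\n"
        "Reply with only a JSON object mapping each dimension to its 1-5 score."
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)
```

Pair a judge like this with a small set of human-graded transcripts to calibrate it before trusting its scores (see Step 1 and Pattern 3).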