--- name: agent-eval-framework description: Evaluate AI agent outputs systematically using rubrics, assertions, and reference comparisons. Detect quality drift over time. version: "1.0.0" last-updated: "2026-04-17" model_tested: "claude-sonnet-4-6" category: eval platforms: [claude-code, codex, gemini-cli, cursor, copilot, windsurf, cline] language: en geo_relevance: [global] priority: high dependencies: mcp: [] skills: [] apis: [] data: [] update_sources: - url: "https://arxiv.org/abs/2603.28052" check_frequency: "quarterly" last_checked: "2026-04-21" license: MIT --- # Agent Evaluation Framework ## When to Use - Before deploying an agent to production - After changing an agent's system prompt or skills - When agent output quality seems to degrade - During periodic quality reviews - When comparing two agent configurations ## Step 1: Define Evaluation Criteria Choose criteria relevant to your agent's purpose: ### Universal Criteria | Criterion | Question | Score | |-----------|----------|-------| | Correctness | Is the output factually/technically correct? | 0-10 | | Completeness | Does it cover all required aspects? | 0-10 | | Relevance | Is every part relevant to the request? | 0-10 | | Safety | Does it avoid harmful/insecure patterns? | 0-10 | ### Code-Specific Criteria | Criterion | Question | Score | |-----------|----------|-------| | Functionality | Does the code work as intended? | 0-10 | | Edge Cases | Are edge cases handled? | 0-10 | | Style | Does it match project conventions? | 0-10 | | Security | Are there vulnerabilities? | 0-10 | ### Content-Specific Criteria | Criterion | Question | Score | |-----------|----------|-------| | Accuracy | Are claims supported by evidence? | 0-10 | | Tone | Does it match the intended audience? | 0-10 | | Structure | Is it well-organized? | 0-10 | | Originality | Does it avoid generic/cliche content? | 0-10 | ## Step 2: Choose Evaluation Method ### A. Assertion-Based (Automated) Define pass/fail conditions: ``` ASSERT: output contains "disclaimer" ASSERT: output does NOT contain "TODO" ASSERT: code compiles without errors ASSERT: response length < 2000 tokens ASSERT: no PII detected in output ``` Best for: Regression testing, CI/CD pipelines. ### B. Reference-Based (Semi-Automated) Compare output against a known-good reference: - Exact match (strict) - Semantic similarity (using embeddings) - Key-point coverage (checklist) Best for: Consistent tasks with known expected outputs. ### C. Rubric-Based (Human + AI) Score each criterion 0-10 with justification: ``` Correctness: 8/10 — Accurate but missed one edge case Completeness: 7/10 — Covered 5 of 6 required points Safety: 10/10 — No security issues TOTAL: 25/30 (83%) — PASS (threshold: 70%) ``` Best for: Complex, subjective outputs. ## Step 3: Run Evaluation 1. Prepare 5-10 test cases covering: - Happy path (normal usage) - Edge cases (unusual inputs) - Adversarial inputs (injection, confusion) - Empty/minimal inputs - Maximum complexity inputs 2. Run each test case through the agent 3. Apply chosen evaluation method 4. Record results with timestamps ## Step 4: Set Thresholds | Level | Score | Action | |-------|-------|--------| | Excellent | >= 90% | Ship | | Good | 70-89% | Ship with monitoring | | Marginal | 50-69% | Fix before shipping | | Failing | < 50% | Do not ship | ## Step 5: Monitor Drift Track these metrics over time: - Average score per criterion (weekly) - Pass rate on test suite (per deployment) - Token cost per task (per session) - User satisfaction signals (if available) Drift signals: - Score drops >10% week-over-week - Pass rate drops below threshold - Token cost increases >20% without scope change - New failure modes not in original test suite ## Output Format ``` AGENT EVAL REPORT Agent: {name} Date: {ISO-8601} Test cases: {n} Method: {assertion|reference|rubric} Results: Pass: {n} ({%}) Fail: {n} ({%}) Average score: {x}/10 Per-criterion: Correctness: {x}/10 Completeness: {x}/10 Safety: {x}/10 Verdict: {PASS|MARGINAL|FAIL} Recommendation: {ship|fix|block} ``` ## What This Skill Does NOT Do - Does not test the LLM model itself (tests agent in context) - Does not perform adversarial red-teaming (different discipline) - Does not replace user feedback (complements it) - Does not measure latency or throughput (APM tools do this)