---
name: llm-evals
description: Use when building LLM evaluations, testing prompts, comparing prompt versions, optimizing prompts, setting up CI gates for LLM outputs, or when the user mentions eval, benchmark, prompt testing, regression detection, or scoring LLM responses. Also use when someone says "test my prompt", "is this prompt better", "eval suite", or "prompt optimization".
---

# LLM Evals

## Overview

The `evals` framework at `/Users/rshah/evals` is a unified LLM eval + optimization engine. Use it to measure prompt quality with statistical rigor, detect regressions, auto-optimize prompts, and gate deployments in CI.

**Core principle:** Eval with N trials and statistical tests, not vibes.

## When to Use

- Setting up evaluation for any LLM prompt or pipeline
- Comparing two prompt versions to see which is better
- Optimizing a prompt automatically
- Adding CI gates that block bad prompt changes
- Scoring LLM outputs with custom or built-in metrics
- Testing RAG pipelines, agents, or any async LLM workflow

## Quick Reference

### Installation

```bash
cd /Users/rshah/evals
pip install -e .   # or: uv sync
```

### CLI Commands

| Command | Purpose |
|---------|---------|
| `evals run suite.yaml` | Run evaluation suite |
| `evals run suite.yaml --trials 5 --baseline main` | Run with baseline comparison |
| `evals run suite.yaml --ci junit --output report.xml` | Generate JUnit for CI |
| `evals compare ID_A ID_B` | Statistical comparison of two experiments |
| `evals optimize prompt.yaml --dataset data.yaml --metric exact_match --strategy opro` | Auto-optimize |
| `evals server start --port 3000` | Start REST API |

### Built-in Metrics

| Metric | What |
|--------|------|
| `exact_match` | `output == expected` |
| `contains` | `expected in output` |
| `json_schema` | Valid JSON matching schema |
| `regex_match` | `re.fullmatch(expected, output)` |
| `cosine_similarity` | Token-overlap cosine [0, 1] |
| `levenshtein_similarity` | `1 - edit_distance / max_len` |

### Python API — Fastest Path

```python
from evals.engine import Eval
from evals.models.core import Dataset, Prompt, TestCase
from evals.metrics.builtin import exact_match, contains

experiment = Eval(
    name="my_eval",
    prompt=Prompt(
        name="qa",
        template="Answer: {{ question }}",
        model="claude-sonnet-4-20250514",
    ),
    dataset=Dataset(name="test", cases=[
        TestCase(input={"question": "Capital of France?"}, expected="Paris"),
    ]),
    metrics=[exact_match, contains],
    trials=3,
).run()

for m in experiment.summary.metrics:
    print(f"{m.metric}: {m.mean:.4f} CI=[{m.ci.lower:.4f}, {m.ci.upper:.4f}]")
```

### Task-Based Eval (no prompt — for RAG, agents, pipelines)

```python
async def my_pipeline(input_dict: dict) -> str:
    # Your RAG/agent/pipeline logic here
    return answer

experiment = Eval(
    name="pipeline_eval",
    task=my_pipeline,   # replaces prompt=
    dataset=dataset,
    metrics=[exact_match, relevancy],
).run()
```

### Custom Metric

```python
from evals.models.core import Metric

@Metric.code("my_metric", threshold=0.8)
def my_metric(output: str, expected: str) -> float:
    return 1.0 if some_condition(output, expected) else 0.0
```

### Suite YAML Format

```yaml
name: suite_name
prompt: path/to/prompt.yaml   # or inline Prompt object
dataset:
  name: test_data
  cases:
    - input: { question: "..." }
      expected: "..."
metrics:
  - exact_match
  - contains
trials: 3
concurrency: 10
```

### Prompt YAML Format

```yaml
name: my_prompt
template: "Answer: {{ question }}"
system: "Be concise."
model: claude-sonnet-4-20250514
parameters:
  temperature: 0.3
  max_tokens: 256
```

### Statistical Comparison

```python
from evals.engine.statistical import compare_experiments

comparison = compare_experiments(baseline_exp, current_exp, "exact_match")
# Returns: delta_mean, p_value (Wilcoxon), significant, effect_size (Cohen's d)
# Plus per-case regressions and improvements
```
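The returned comparison can drive a simple regression gate. A minimal sketch, assuming the fields listed in the comment above are exposed as attributes on the result object (attribute names are taken from that comment and are not verified against the actual return type):

```python
# Sketch only: assumes `comparison` exposes the fields above as attributes.
if comparison.significant and comparison.delta_mean < 0:
    # Current prompt is statistically worse than the baseline on this metric.
    raise SystemExit(
        f"Regression: delta={comparison.delta_mean:.4f}, "
        f"p={comparison.p_value:.4f}, d={comparison.effect_size:.2f}"
    )
```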
### Optimization Strategies

| Strategy | How |
|----------|-----|
| `opro` | Meta-prompt with history, LLM proposes candidates |
| `textgrad` | Per-failure critiques, aggregated into edits |
| `bootstrap` | Mines passing examples as few-shot demos |

### MCP Server

```bash
# Add to Claude Code:
claude mcp add --transport stdio evals-mcp -- uv run --directory /Users/rshah/evals evals-mcp
# Or run standalone:
uv run evals-mcp
```

Tools exposed: `run_eval`, `compare_experiments`, `list_experiments`, `get_experiment`, `list_metrics`, `score_output`, `create_test_cases`, `optimize_prompt`, `generate_junit`, `generate_pr_comment`.

## Key Architecture Decisions

- **Content-hash versioning**: `Prompt.version` = SHA-256(template + system + model + params). Changes when content changes.
- **N-trial statistical rigor**: Bootstrap CI (10K resamples), Wilcoxon signed-rank (non-parametric), Cohen's d effect size, pass@k.
- **Async-first**: All I/O is async (httpx, aiosqlite). EvalEngine uses asyncio.Semaphore for concurrency.
- **Provider auto-detect**: `claude-*` → Anthropic, `gpt-*` → OpenAI, `llama*` → Ollama.
- **Cache**: Content-addressed file cache. Only caches temperature=0 calls.

## Project Structure

```
/Users/rshah/evals/src/evals/
├── models/core.py   # 20+ Pydantic models
├── gateway/         # OpenAI, Anthropic, Ollama providers
├── metrics/         # Builtin + LLM judge + registry
├── engine/          # EvalEngine, statistical, cache
├── optimize/        # OPRO, TextGrad, BootstrapFewShot
├── storage/         # SQLite + Postgres backends
├── ci/              # JUnit XML + GitHub PR comments
├── cli/             # Click CLI (run, compare, optimize, server)
├── server/          # FastAPI REST API (15 endpoints)
└── mcp/             # MCP server (10 tools)
```

## Common Mistakes

| Mistake | Fix |
|---------|-----|
| Running 1 trial and trusting the result | Use trials=3+ for statistical validity |
| Comparing prompts by eyeballing | Use `evals compare` with Wilcoxon test |
| Optimizing without a val split | The optimizer auto-splits 80/20 if no splits defined |
| Using temperature>0 and expecting cache hits | Cache only stores temperature=0 calls |
| Forgetting to set API key env vars | CLI auto-detects ANTHROPIC_API_KEY, OPENAI_API_KEY |

## Troubleshooting

| Problem | Solution |
|---------|----------|
| `ModuleNotFoundError: evals` | Run `cd /Users/rshah/evals && pip install -e .` |
| API key not found | Set `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` env var |
| Eval hangs | Check concurrency setting, reduce if rate-limited |
| Cache not working | Only `temperature=0` calls are cached |
| Low statistical power | Increase `trials` (minimum 5 for Wilcoxon) |

## Step-by-Step: Your First Eval

1. Install: `cd /Users/rshah/evals && pip install -e .`
2. Create a prompt YAML:
   ```yaml
   name: my_prompt
   template: "Answer: {{ question }}"
   model: claude-sonnet-4-20250514
   ```
3. Create a dataset YAML:
   ```yaml
   name: test_data
   cases:
     - input: { question: "Capital of France?" }
       expected: "Paris"
   ```
4. Create a suite YAML:
   ```yaml
   name: my_eval
   prompt: prompt.yaml
   dataset: dataset.yaml
   metrics: [exact_match, contains]
   trials: 5
   ```
5. Run: `evals run suite.yaml --trials 5`
6. Compare two runs: `evals compare ID_A ID_B`
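Once the suite runs cleanly on your machine, the same command can gate changes in CI via the `--ci junit` flag. A minimal GitHub Actions sketch, assuming the evals package is installable inside the job and `ANTHROPIC_API_KEY` is configured as a repository secret; the workflow name, trigger, and install path are illustrative, not part of the framework:

```yaml
# Hypothetical workflow: trigger, paths, and secret names are assumptions.
name: prompt-eval-gate
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # Assumes the evals repo is vendored at ./evals or otherwise installable.
      - run: pip install -e ./evals
      - name: Run eval suite and emit JUnit for the CI gate
        run: evals run suite.yaml --trials 5 --ci junit --output report.xml
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-report
          path: report.xml
```

Whether the run itself fails the job on a regression depends on the suite's metric thresholds; the `ci/` module can also turn results into GitHub PR comments.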
## When NOT to Use

- Simple string matching that doesn't need statistical rigor
- One-off manual prompt testing (just test in the Claude UI)
- Evaluating non-text outputs (images, audio)
- When you don't have test cases or expected outputs yet (write those first)