--- name: phoenix-evals description: Build and run evaluators for AI/LLM applications using Phoenix. license: Apache-2.0 metadata: author: oss@arize.com version: "1.0.0" languages: Python, TypeScript --- # Phoenix Evals Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans. ## Quick Reference | Task | Files | | ---- | ----- | | Setup | `setup-python`, `setup-typescript` | | Build code evaluator | `evaluators-code-{python\|typescript}` | | Build LLM evaluator | `evaluators-llm-{python\|typescript}`, `evaluators-custom-templates` | | Run experiment | `experiments-running-{python\|typescript}` | | Create dataset | `experiments-datasets-{python\|typescript}` | | Validate evaluator | `validation`, `validation-calibration-{python\|typescript}` | | Analyze errors | `error-analysis`, `axial-coding` | | RAG evals | `evaluators-rag` | | Production | `production-overview`, `production-guardrails` | ## Workflows **Starting Fresh:** `observe-tracing-setup` → `error-analysis` → `axial-coding` → `evaluators-overview` **Building Evaluator:** `fundamentals` → `evaluators-{code\|llm}-{python\|typescript}` → `validation-calibration-{python\|typescript}` **RAG Systems:** `evaluators-rag` → `evaluators-code-*` (retrieval) → `evaluators-llm-*` (faithfulness) **Production:** `production-overview` → `production-guardrails` → `production-continuous` ## Rule Categories | Prefix | Description | | ------ | ----------- | | `fundamentals-*` | Types, scores, anti-patterns | | `observe-*` | Tracing, sampling | | `error-analysis-*` | Finding failures | | `axial-coding-*` | Categorizing failures | | `evaluators-*` | Code, LLM, RAG evaluators | | `experiments-*` | Datasets, running experiments | | `validation-*` | Calibrating judges | | `production-*` | CI/CD, monitoring | ## Key Principles | Principle | Action | | --------- | ------ | | Error analysis first | Can't automate what you haven't observed | | Custom > generic | Build from your failures | | Code first | Deterministic before LLM | | Validate judges | >80% TPR/TNR | | Binary > Likert | Pass/fail, not 1-5 |