---
name: mlflow-evaluation
description: "MLflow 3 GenAI evaluation for agent development. Use when (1) writing mlflow.genai.evaluate() code, (2) creating @scorer functions, (3) building evaluation datasets from traces, (4) using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), (5) analyzing traces for latency/errors/architecture, (6) optimizing agent context/prompts/token usage, (7) debugging evaluation failures. Covers the full eval workflow: trace analysis -> dataset building -> scorer creation -> evaluation execution."
---

# MLflow 3 GenAI Evaluation

## Before Writing Any Code

1. **Read GOTCHAS.md** - 15+ common mistakes that cause failures
2. **Read CRITICAL-interfaces.md** - Exact API signatures and data schemas

## End-to-End Workflows

Follow these workflows based on your goal. Each step indicates which reference files to read.

### Workflow 1: First-Time Evaluation Setup

For users new to MLflow GenAI evaluation or setting up evaluation for a new agent.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Understand what to evaluate | `user-journeys.md` (Journey 0: Strategy) |
| 2 | Learn API patterns | `GOTCHAS.md` + `CRITICAL-interfaces.md` |
| 3 | Build initial dataset | `patterns-datasets.md` (Patterns 1-4) |
| 4 | Choose/create scorers | `patterns-scorers.md` + `CRITICAL-interfaces.md` (built-in list) |
| 5 | Run evaluation | `patterns-evaluation.md` (Patterns 1-3) |

### Workflow 2: Production Trace -> Evaluation Dataset

For building evaluation datasets from production traces.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Search and filter traces | `patterns-trace-analysis.md` (MCP tools section) |
| 2 | Analyze trace quality | `patterns-trace-analysis.md` (Patterns 1-7) |
| 3 | Tag traces for inclusion | `patterns-datasets.md` (Patterns 16-17) |
| 4 | Build dataset from traces | `patterns-datasets.md` (Patterns 6-7) |
| 5 | Add expectations/ground truth | `patterns-datasets.md` (Pattern 2) |

### Workflow 3: Performance Optimization

For debugging slow or expensive agent execution.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Profile latency by span | `patterns-trace-analysis.md` (Patterns 4-6) |
| 2 | Analyze token usage | `patterns-trace-analysis.md` (Pattern 9) |
| 3 | Detect context issues | `patterns-context-optimization.md` (Section 5) |
| 4 | Apply optimizations | `patterns-context-optimization.md` (Sections 1-4, 6) |
| 5 | Re-evaluate to measure impact | `patterns-evaluation.md` (Patterns 6-7) |

### Workflow 4: Regression Detection

For comparing agent versions and finding regressions.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Establish baseline | `patterns-evaluation.md` (Pattern 4: named runs) |
| 2 | Run current version | `patterns-evaluation.md` (Pattern 1) |
| 3 | Compare metrics | `patterns-evaluation.md` (Patterns 6-7) |
| 4 | Analyze failing traces | `patterns-trace-analysis.md` (Pattern 7) |
| 5 | Debug specific failures | `patterns-trace-analysis.md` (Patterns 8-9) |

### Workflow 5: Custom Scorer Development

For creating project-specific evaluation metrics.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Understand scorer interface | `CRITICAL-interfaces.md` (Scorer section) |
| 2 | Choose scorer pattern | `patterns-scorers.md` (Patterns 4-11) |
| 3 | For multi-agent scorers | `patterns-scorers.md` (Patterns 13-16) |
| 4 | Test with evaluation | `patterns-evaluation.md` (Pattern 1) |

## Reference Files Quick Lookup

| Reference | Purpose | When to Read |
|-----------|---------|--------------|
| `GOTCHAS.md` | Common mistakes | **Always read first** before writing code |
| `CRITICAL-interfaces.md` | API signatures, schemas | When writing any evaluation code |
| `patterns-evaluation.md` | Running evals, comparing | When executing evaluations |
| `patterns-scorers.md` | Custom scorer creation | When built-in scorers aren't enough |
| `patterns-datasets.md` | Dataset building | When preparing evaluation data |
| `patterns-trace-analysis.md` | Trace debugging | When analyzing agent behavior |
| `patterns-context-optimization.md` | Token/latency fixes | When agent is slow or expensive |
| `user-journeys.md` | High-level workflows | When starting a new evaluation project |

## Critical API Facts

- **Use:** `mlflow.genai.evaluate()` (NOT `mlflow.evaluate()`)
- **Data format:** `{"inputs": {"query": "..."}}` (nested structure required)
- **predict_fn:** Receives `**unpacked kwargs` (not a dict)

See `GOTCHAS.md` for complete list.
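
The three facts above can be sketched together as follows. This is a minimal illustration, not a reference implementation: the `predict_fn` body, the sample queries, and the `dry_run` helper are hypothetical and exist only to show the calling convention locally. The `run_with_mlflow` function is defined but not called here, since it assumes MLflow 3 is installed and that the built-in `Safety` scorer has a judge model available in your environment.

```python
# Each record nests the agent's arguments under "inputs" (the required shape).
eval_data = [
    {"inputs": {"query": "What is MLflow?"}},
    {"inputs": {"query": "How do I log a trace?"}},
]


def predict_fn(query: str) -> dict:
    """Hypothetical stand-in agent; `query` arrives as a keyword argument."""
    return {"response": f"Echo: {query}"}


def dry_run(data, fn):
    """Hypothetical local check: unpack each record's inputs as kwargs."""
    return [fn(**record["inputs"]) for record in data]


def run_with_mlflow():
    """The real call. Assumes MLflow 3 and a configured judge model."""
    import mlflow
    from mlflow.genai.scorers import Safety

    # Note: mlflow.genai.evaluate(), not mlflow.evaluate().
    return mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=predict_fn,
        scorers=[Safety()],
    )


# Verify that the dataset keys match the predict_fn signature before evaluating.
outputs = dry_run(eval_data, predict_fn)
print(outputs[0])
```

Running a local dry run like this first can surface a mismatch between the keys under `"inputs"` and the parameter names of `predict_fn` (a `TypeError` on an unexpected keyword) before any scorer is invoked.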