---
name: run-eval
description: Run EvalView regression checks against golden baselines to detect regressions in AI agent behavior after code, prompt, or model changes.
---

# Run Eval

Use this skill after making changes to an AI agent (prompt edits, model swaps, tool changes, code refactors) to verify nothing broke.

## What this does

EvalView compares current agent behavior against saved golden baselines. It runs your test cases, evaluates the outputs, and reports a diff status for each test:

- **PASSED**: behavior matches the baseline
- **OUTPUT_CHANGED**: output shifted but may be intentional
- **TOOLS_CHANGED**: different tools were called
- **REGRESSION**: score dropped significantly (blocking failure)

## Steps

1. **Locate the test directory.** Look for `tests/evalview/` in the project. If it exists, use that. Otherwise check for a `tests/` directory with `.yaml` test files.

2. **Run a regression check** using the `run_check` MCP tool:
   - If checking all tests: call `run_check` with the detected `test_path`
   - If checking a specific test: also pass the `test` parameter with the test name

3. **Interpret results:**
   - If all tests pass, confirm to the user that no regressions were found
   - If REGRESSION is reported, show the diff (score delta, tool changes, output similarity) and offer to help fix it
   - If OUTPUT_CHANGED or TOOLS_CHANGED is reported, flag it as a warning; the user should decide if the change is intentional

4. **If changes are intentional**, offer to update the baseline by calling `run_snapshot` with an explanatory `notes` parameter.

5. **Generate a visual report** (optional) by calling `generate_visual_report` for a detailed HTML breakdown of traces, diffs, scores, and timelines.

## CLI equivalent

```
evalview check tests/evalview/
evalview check tests/evalview/ --test "my-test"
evalview snapshot tests/evalview/ --notes "updated after prompt refactor"
```

## Tips

- Use `run_check` frequently; it calls the Python API directly with no subprocess overhead.
- A score delta near zero with TOOLS_CHANGED often means the agent found an equivalent path.
- Always snapshot after confirming intentional changes so future checks compare against the new baseline.
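
The interpretation rules in step 3 can be sketched as a small triage function. This is an illustrative sketch only, not EvalView's API: the four status names come from the list above, while `triage`, `next_action`, and the assumed input shape (a mapping of test name to status string) are hypothetical.

```python
from typing import Dict, List

# Status names are taken from this skill; the grouping below is a hypothetical helper.
BLOCKING = {"REGRESSION"}
WARNING = {"OUTPUT_CHANGED", "TOOLS_CHANGED"}


def triage(statuses: Dict[str, str]) -> Dict[str, List[str]]:
    """Group test names by how step 3 says to treat their diff status."""
    groups: Dict[str, List[str]] = {"passed": [], "warnings": [], "blocking": []}
    for test_name, status in sorted(statuses.items()):
        if status in BLOCKING:
            groups["blocking"].append(test_name)
        elif status in WARNING:
            groups["warnings"].append(test_name)
        elif status == "PASSED":
            groups["passed"].append(test_name)
        else:
            # Fail loudly rather than silently treating an unknown status as a pass.
            raise ValueError(f"unknown diff status: {status!r}")
    return groups


def next_action(groups: Dict[str, List[str]]) -> str:
    """Pick the follow-up from step 3, with blocking failures taking priority."""
    if groups["blocking"]:
        return "show the diff and offer to help fix it"
    if groups["warnings"]:
        return "ask the user whether the change is intentional"
    return "confirm that no regressions were found"
```

Checking blocking failures first means a single REGRESSION is never masked by warnings or passes in the same run.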