---
name: promptfoo
description: |
  Promptfoo evaluation framework for testing and comparing LLM outputs.
  Use when writing eval configs, creating test cases, debugging eval runs, or working with assertions.
allowed-tools:
  - Bash(npx promptfoo:*)
  - Bash(npm run evals:*)
  - WebFetch(domain:www.promptfoo.dev)
---

# Promptfoo

[Promptfoo](https://www.promptfoo.dev/) is a CLI tool for testing and comparing LLM outputs.

## Config File

The CLI auto-discovers `promptfooconfig.yaml` in the current directory. Use `-c path` for other locations.

Supported extensions: `.yaml`, `.json`, `.js`

## Configuration

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "What this eval tests"

prompts:
  - file://prompt.txt
  - |
    Inline prompt with {{variable}} substitution

# Settings for the model under test go in the provider's own `config` block.
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    config:
      temperature: 0.0
      max_tokens: 4096

tests:
  - description: "What this case tests"
    vars:
      variable: "value"
      from_file: file://data/input.txt
    assert:
      - type: contains
        value: "expected substring"

# Or load tests from files instead (a config may contain only one `tests` key):
# tests: file://cases/all.yaml

outputPath: ./results.json

evaluateOptions:
  maxConcurrency: 4
```

## Provider IDs

| Model | ID |
|-------|----|
| Opus 4.5 | `anthropic:messages:claude-opus-4-5-20251101` |
| Sonnet 4.5 | `anthropic:messages:claude-sonnet-4-5-20250929` |
| Haiku 4.5 | `anthropic:messages:claude-haiku-4-5-20251001` |

Provider config: `temperature`, `max_tokens`, `top_p`, `top_k`, `tools`, `tool_choice`

## Prompts

- `file://path.txt`: load from file (path relative to config)
- Inline string with `{{variable}}` Nunjucks substitution
- Chat format via JSON: `[{"role": "system", "content": "..."}, {"role": "user", "content": "{{input}}"}]`

## Assertion Types

| Type | Use | Value |
|------|-----|-------|
| `contains` | Substring match | `"expected text"` |
| `icontains` | Case-insensitive substring | `"expected text"` |
| `equals` | Exact match | `"exact value"` |
| `regex` | Pattern match | `"\\d{4}-\\d{2}-\\d{2}"` |
| `is-json` | Valid JSON output | (none) |
| `contains-json` | Output contains JSON | (none) |
| `starts-with` | Prefix match | `"prefix"` |
| `cost` | Max cost | `threshold: 0.01` |
| `latency` | Max response time (ms) | `threshold: 5000` |
| `javascript` | Custom JS expression | `output.includes('x')` |
| `python` | Custom Python | `file://check.py:fn_name` |
| `llm-rubric` | LLM-as-judge | rubric text |
| `similar` | Semantic similarity | `value: "text"`, `threshold: 0.8` |
| `model-graded-factuality` | Fact checking | (none) |

Prefix any assertion with `not-` to negate it (e.g., `not-contains`).

### llm-rubric

Uses an LLM to grade output against a rubric:

```yaml
assert:
  - type: llm-rubric
    value: |
      The response should:
      - Mention at least 3 factors
      - Include specific examples
    threshold: 0.7
    provider: anthropic:messages:claude-sonnet-4-5-20250929
```

### javascript

Inline expressions or function bodies. Both have access to `output` (string) and `context` (with `vars`, `prompt`):

```yaml
assert:
  # Single expression: the result is the pass/fail value
  - type: javascript
    value: output.length > 100 && output.includes('route')
  # Multi-line body: must `return` the result explicitly
  - type: javascript
    value: |
      const data = JSON.parse(output);
      return data.calories >= 200 && data.calories <= 300;
```

## Test Organization

Split cases into separate files and reference them:

```yaml
tests:
  - file://cases/basic.yaml
  - file://cases/edge-cases.yaml
```

Each case file contains a YAML array of test objects.

## CLI

```bash
npx promptfoo eval                          # Run with auto-discovered config
npx promptfoo eval -c path/to/config.yaml   # Specific config
npx promptfoo eval --filter-metadata key=v  # Filter tests
npx promptfoo view                          # Web UI for results
npx promptfoo cache clear                   # Clear result cache
```

## References

Consult the [configuration reference](https://www.promptfoo.dev/docs/configuration/reference/) and [Anthropic provider docs](https://www.promptfoo.dev/docs/providers/anthropic/) for full details.
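
### Sketch: a `python` assertion file

The Assertion Types table lists `file://check.py:fn_name` for the `python` type without showing what the file might contain. The following is a minimal sketch, not a definitive implementation. It assumes the named function receives the model output as a string plus a `context` dict, and that it may return a dict with `pass`, `score`, and `reason` keys; confirm both against the Promptfoo documentation for your version. The calorie check is a hypothetical rule that mirrors the `javascript` example.

```python
# check.py -- hypothetical assertion file referenced as file://check.py:fn_name
import json
from typing import Any, Dict


def fn_name(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when the output is a JSON object with 200 <= calories <= 300.

    The (output, context) signature and the returned keys are assumptions
    about the Python assertion contract, not taken from this document.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"pass": False, "score": 0.0, "reason": f"Output is not valid JSON: {exc}"}

    # Guard against JSON that parses but is not an object (e.g. a list).
    calories = data.get("calories") if isinstance(data, dict) else None
    if not isinstance(calories, (int, float)):
        return {"pass": False, "score": 0.0, "reason": "Missing numeric 'calories' field"}

    ok = 200 <= calories <= 300
    return {"pass": ok, "score": 1.0 if ok else 0.0, "reason": f"calories={calories}"}
```

Returning a `reason` alongside the boolean tends to make failed cases easier to diagnose than a bare `True`/`False`. The function can also be called directly in a unit test without running an eval.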