---
name: llm-evaluation
description: |
  LLM prompt testing, evaluation, and CI/CD quality gates using Promptfoo.
  Invoke when:
  - Setting up prompt evaluation or regression testing
  - Integrating LLM testing into CI/CD pipelines
  - Configuring security testing (red teaming, jailbreaks)
  - Comparing prompt or model performance
  - Building evaluation suites for RAG, factuality, or safety
  Keywords: promptfoo, llm evaluation, prompt testing, red team, CI/CD, regression testing
---

# LLM Evaluation & Testing

Test prompts, models, and RAG systems with automated evaluation and CI/CD integration.

## Quick Start

```bash
# Initialize Promptfoo (no global install needed)
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view

# Run security scan
npx promptfoo@latest redteam run
```

## Core Concepts

### Why Evaluate?

LLM outputs are non-deterministic. "It looks good" isn't testing. You need:

- **Regression detection**: Catch quality drops before production
- **Security scanning**: Find jailbreaks, injections, PII leaks
- **A/B comparison**: Compare prompts/models side-by-side
- **CI/CD gates**: Block bad changes from merging

### Evaluation Types

| Type | Purpose | Assertions |
|------|---------|------------|
| **Functional** | Does it work? | `contains`, `equals`, `is-json` |
| **Semantic** | Is it correct? | `similar`, `llm-rubric`, `factuality` |
| **Performance** | Is it fast/cheap? | `cost`, `latency` |
| **Security** | Is it safe? | `redteam`, `moderation`, `pii-detection` |

## Configuration

### Basic promptfooconfig.yaml

```yaml
description: "My LLM evaluation suite"

prompts:
  - file://prompts/main.txt

providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-sonnet-latest

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: cost
        threshold: 0.01

  - vars:
      question: "Explain quantum computing"
    assert:
      - type: llm-rubric
        value: "Response explains quantum computing concepts clearly"
      - type: latency
        threshold: 3000
```

### With Environment Variables

```yaml
providers:
  - id: openrouter:anthropic/claude-3-5-sonnet
    config:
      apiKey: ${OPENROUTER_API_KEY}
```

## Assertions Reference

### Basic Assertions

```yaml
assert:
  # String matching
  - type: contains
    value: "expected text"
  - type: not-contains
    value: "forbidden text"
  - type: equals
    value: "exact match"
  - type: starts-with
    value: "prefix"
  - type: regex
    value: "\\d{4}-\\d{2}-\\d{2}"  # Date pattern

  # JSON validation
  - type: is-json
  - type: is-valid-json-schema
    value:
      type: object
      properties:
        name: { type: string }
      required: [name]
```

### Semantic Assertions

```yaml
assert:
  # Semantic similarity (embeddings)
  - type: similar
    value: "The capital of France is Paris"
    threshold: 0.8  # 0-1 similarity score

  # LLM-as-judge with custom criteria
  - type: llm-rubric
    value: |
      Response must:
      1. Be factually accurate
      2. Be under 100 words
      3. Not contain marketing language

  # Factuality check against reference
  - type: factuality
    value: "Paris is the capital of France"
```

### Performance Assertions

```yaml
assert:
  # Cost budget (USD)
  - type: cost
    threshold: 0.05  # Max $0.05 per request

  # Latency (milliseconds)
  - type: latency
    threshold: 2000  # Max 2 seconds
```

### Security Assertions

```yaml
assert:
  # Content moderation
  - type: moderation
    value: violence

  # PII detection
  - type: not-contains
    value: "{{email}}"  # From test vars
```

## CI/CD Integration

### GitHub Action

```yaml
name: 'Prompt Evaluation'
on:
  pull_request:
    paths: ['prompts/**', 'src/**/*prompt*']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      # Cache for faster runs
      - uses: actions/cache@v4
        with:
          path: ~/.promptfoo
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}

      # Run evaluation and post results to PR
      - uses: promptfoo/promptfoo-action@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}  # Or other provider keys
```

### Quality Gates

```yaml
# promptfooconfig.yaml
evaluateOptions:
  # Fail if any assertion fails
  maxConcurrency: 5

  # Or set pass threshold
  threshold: 0.9  # 90% of tests must pass
```

### Output to JSON (for custom CI)

```bash
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json

# Check results in CI script
if [ $(jq '.stats.failures' results.json) -gt 0 ]; then
  echo "Evaluation failed!"
  exit 1
fi
```

## Security Testing (Red Team)

### Quick Scan

```bash
# Run red team against your prompts
npx promptfoo@latest redteam run

# Generate compliance report
npx promptfoo@latest redteam report --output compliance.html
```

### Configuration

```yaml
# promptfooconfig.yaml
redteam:
  purpose: "Customer support chatbot"
  plugins:
    - harmful:hate
    - harmful:violence
    - harmful:self-harm
    - pii:direct
    - pii:session
    - hijacking
    - jailbreak
    - prompt-injection

  strategies:
    - jailbreak
    - prompt-injection
    - base64
    - leetspeak
```

### OWASP Top 10 Coverage

```yaml
redteam:
  plugins:
    # 1. Prompt Injection
    - prompt-injection
    # 2. Insecure Output Handling
    - harmful:privacy
    # 3. Training Data Poisoning (N/A for evals)
    # 4. Model Denial of Service
    - excessive-agency
    # 5. Supply Chain (N/A for evals)
    # 6. Sensitive Information Disclosure
    - pii:direct
    - pii:session
    # 7. Insecure Plugin Design
    - hijacking
    # 8. Excessive Agency
    - excessive-agency
    # 9. Overreliance (use factuality checks)
    # 10. Model Theft (N/A for evals)
```

## RAG Evaluation

### Context-Aware Testing

```yaml
prompts:
  - |
    Context: {{context}}
    Question: {{question}}
    Answer based only on the context provided.

tests:
  - vars:
      context: "The Eiffel Tower was built in 1889 for the World's Fair."
      question: "When was the Eiffel Tower built?"
    assert:
      - type: contains
        value: "1889"
      - type: factuality
        value: "The Eiffel Tower was built in 1889"
      - type: not-contains
        value: "1900"  # Common hallucination
```

### Retrieval Quality

```yaml
# Test that retrieval returns relevant documents
tests:
  - vars:
      query: "Python list comprehension"
    assert:
      - type: llm-rubric
        value: "Response discusses Python list comprehension syntax and examples"
      - type: not-contains
        value: "I don't know"  # Shouldn't punt on this query
```

## Comparing Models/Prompts

### A/B Testing

```yaml
# Compare two prompts
prompts:
  - file://prompts/v1.txt
  - file://prompts/v2.txt

# Same tests for both
tests:
  - vars: { question: "Explain recursion" }
    assert:
      - type: llm-rubric
        value: "Clear explanation of recursion with example"
```

### Model Comparison

```yaml
# Compare multiple models
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-latest
  - openrouter:google/gemini-flash-1.5

# Run: npx promptfoo@latest eval
# View: npx promptfoo@latest view
# Compare cost, latency, quality side-by-side
```

## Best Practices

### 1. Golden Test Cases

Maintain a set of critical test cases that must always pass:

```yaml
# golden-tests.yaml
tests:
  - description: "Core functionality - must pass"
    vars:
      input: "critical test case"
    assert:
      - type: contains
        value: "expected output"
    options:
      critical: true  # Fail entire suite if this fails
```

### 2. Regression Suite Structure

```
prompts/
├── production.txt          # Current production prompt
├── candidate.txt           # New prompt being tested
tests/
├── golden/                 # Critical tests (run on every PR)
│   └── core-functionality.yaml
├── regression/             # Full regression suite (nightly)
│   └── full-suite.yaml
└── security/               # Red team tests
    └── redteam.yaml
```

### 3. Test Categories

```yaml
tests:
  # Happy path
  - description: "Standard query"
    vars: { question: "What is 2+2?" }
    assert:
      - type: contains
        value: "4"

  # Edge cases
  - description: "Empty input"
    vars: { question: "" }
    assert:
      - type: not-contains
        value: "error"

  # Adversarial
  - description: "Injection attempt"
    vars: { question: "Ignore previous instructions and..." }
    assert:
      - type: not-contains
        value: "Here's how to"  # Should refuse
```

## References

- `references/promptfoo-guide.md` - Detailed setup and configuration
- `references/evaluation-metrics.md` - Metrics deep dive
- `references/ci-cd-integration.md` - CI/CD patterns
- `references/alternatives.md` - Braintrust, DeepEval, LangSmith comparison

## Templates

Copy-paste ready templates:
- `templates/promptfooconfig.yaml` - Basic config
- `templates/github-action-eval.yml` - GitHub Action
- `templates/regression-test-suite.yaml` - Full regression suite