---
name: testing-llm
license: MIT
compatibility: "Claude Code 2.1.148+."
description: LLM and AI testing patterns — mock responses, evaluation with DeepEval/RAGAS, structured output validation, and agentic test patterns (generator, healer, planner). Use when testing AI features, validating LLM outputs, or building evaluation pipelines.
tags: [testing, llm, ai, deepeval, ragas, evaluation, mocking]
context: fork
agent: test-generator
version: 2.1.0
author: OrchestKit
user-invocable: false
disable-model-invocation: false
complexity: medium
persuasion-type: reference
targets:
  - library: "deepeval"
    version: ">=4.0.0"
  - library: "ragas"
    version: ">=0.4.0"
metadata:
  category: document-asset-creation
allowed-tools:
  - Read
  - Glob
  - Grep
  - WebFetch
  - WebSearch
---

# LLM & AI Testing Patterns

Patterns and tools for testing LLM integrations, evaluating AI output quality, mocking responses for deterministic CI, and applying agentic test workflows (planner, generator, healer).

## Quick Reference

| Area | File | Purpose |
|------|------|---------|
| **Rules** | `rules/llm-evaluation.md` | DeepEval quality metrics, Pydantic schema validation, timeout testing |
| **Rules** | `rules/llm-mocking.md` | Mock LLM responses, VCR.py recording, custom request matchers |
| **Reference** | `references/deepeval-ragas-api.md` | Full API reference for DeepEval and RAGAS metrics |
| **Reference** | `references/generator-agent.md` | Transforms Markdown specs into Playwright tests |
| **Reference** | `references/healer-agent.md` | Auto-fixes failing tests (selectors, waits, dynamic content) |
| **Reference** | `references/planner-agent.md` | Explores app and produces Markdown test plans |
| **Checklist** | `checklists/llm-test-checklist.md` | Complete LLM testing checklist (setup, coverage, CI/CD) |
| **Example** | `examples/llm-test-patterns.md` | Full examples: mocking, structured output, DeepEval, VCR, golden datasets |

## When to Use This Skill

- Testing code that calls LLM APIs (OpenAI, Anthropic, etc.)
- Validating RAG pipeline output quality
- Setting up deterministic LLM tests in CI
- Building evaluation pipelines with quality gates
- Applying agentic test patterns (plan -> generate -> heal)

## LLM Mock Quick Start

Mock LLM responses for fast, deterministic unit tests:

```python
from unittest.mock import AsyncMock, patch
import pytest

@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock

@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
    assert result["summary"] is not None
```

**Key rule:** NEVER call live LLM APIs in CI. Use mocks for unit tests, VCR.py for integration tests.

## DeepEval Quality Quick Start

Validate LLM output quality with multi-dimensional metrics:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

assert_test(test_case, [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
])
```

## Library notes (DeepEval, RAGAS)

**DeepEval** metrics expose a `reason` field alongside the numeric score when `include_reason=True`, so a failing CI build gets a human-readable explanation without a second LLM call:

```python
metric = AnswerRelevancyMetric(threshold=0.7, include_reason=True)
metric.measure(test_case)
print(metric.score, metric.reason)
# 0.62  "Response addresses the topic but omits the date asked for."
```

**RAGAS** uses a class-based metric API — instantiate metric classes and pass an `EvaluationDataset`. `llm=` is optional; omit it to use the configured default grader:

```python
from ragas import evaluate
from ragas.metrics import Faithfulness, LLMContextRecall

result = evaluate(
    dataset,
    metrics=[Faithfulness(), LLMContextRecall()],
)
```

> Bump floors: `deepeval >= 4.0`, `ragas >= 0.4`.

## Quality Metrics Thresholds

| Metric | Threshold | Purpose |
|--------|-----------|---------|
| Answer Relevancy | >= 0.7 | Response addresses question |
| Faithfulness | >= 0.8 | Output matches context |
| Hallucination | <= 0.3 | No fabricated facts |
| Context Precision | >= 0.7 | Retrieved contexts relevant |
| Context Recall | >= 0.7 | All relevant contexts retrieved |

## Structured Output Validation

Always validate LLM output with Pydantic schemas:

```python
from pydantic import BaseModel, Field

class LLMResponse(BaseModel):
    answer: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)
    sources: list[str] = Field(default_factory=list)

async def test_structured_output():
    result = await get_llm_response("test query")
    parsed = LLMResponse.model_validate(result)
    assert 0 <= parsed.confidence <= 1.0
```

## VCR.py for Integration Tests

Record and replay LLM API calls for deterministic integration tests:

```python
@pytest.fixture(scope="module")
def vcr_config():
    import os
    return {
        "record_mode": "none" if os.environ.get("CI") else "new_episodes",
        "filter_headers": ["authorization", "x-api-key"],
    }

@pytest.mark.vcr()
async def test_llm_integration():
    response = await llm_client.complete("Say hello")
    assert "hello" in response.content.lower()
```

## Agentic Test Workflow

The three-agent pattern for end-to-end test automation:

```
Planner -> specs/*.md -> Generator -> tests/*.spec.ts -> Healer (auto-fix)
```

1. **Planner** (`references/planner-agent.md`): Explores your app, produces Markdown test plans from PRDs or natural language requests. Requires `seed.spec.ts` for app context.

2. **Generator** (`references/generator-agent.md`): Converts Markdown specs into Playwright tests. Actively validates selectors against the running app. Uses semantic locators (getByRole, getByLabel, getByText).

3. **Healer** (`references/healer-agent.md`): Automatically fixes failing tests by replaying failures, inspecting the DOM, and patching locators/waits. Max 3 healing attempts per test.

## Edge Cases to Always Test

For every LLM integration, cover these paths:

- **Empty/null inputs** -- empty strings, None values
- **Long inputs** -- truncation behavior near token limits
- **Timeouts** -- fail-open vs fail-closed behavior
- **Schema violations** -- invalid structured output
- **Prompt injection** -- adversarial input resistance
- **Unicode** -- non-ASCII characters in prompts and responses

See `checklists/llm-test-checklist.md` for the complete checklist.

## Anti-Patterns

| Anti-Pattern | Correct Approach |
|-------------|-----------------|
| Live LLM calls in CI | Mock for unit, VCR for integration |
| Random seeds | Fixed seeds or mocked responses |
| Single metric evaluation | 3-5 quality dimensions |
| No timeout handling | Always set < 1s timeout in tests |
| Hardcoded API keys | Environment variables, filtered in VCR |
| Asserting only `is not None` | Schema validation + quality metrics |

## Related Skills

- `ork:testing-unit` — Unit testing fundamentals, AAA pattern
- `ork:testing-integration` — Integration testing for AI pipelines
- `ork:golden-dataset` — Evaluation dataset management