---
name: llm-testing
description: Testing patterns for LLM-based applications. Use when testing AI/ML integrations, mocking LLM responses, testing async timeouts, or validating structured outputs from LLMs.
context: fork
agent: test-generator
version: 2.0.0
tags: [testing, llm, ai, deepeval, ragas, 2026]
author: OrchestKit
user-invocable: false
---

# LLM Testing Patterns

Test AI applications with deterministic patterns using DeepEval and RAGAS.

## Quick Reference

### Mock LLM Responses

```python
from unittest.mock import AsyncMock, patch

@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock

@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
    assert result["summary"] is not None
```

### DeepEval Quality Testing

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
]

assert_test(test_case, metrics)
```

### Timeout Testing

```python
import asyncio
import pytest

@pytest.mark.asyncio
async def test_respects_timeout():
    with pytest.raises(asyncio.TimeoutError):
        async with asyncio.timeout(0.1):
            await slow_llm_call()
```

## Quality Metrics (2026)

| Metric | Threshold | Purpose |
|--------|-----------|---------|
| Answer Relevancy | ≥ 0.7 | Response addresses question |
| Faithfulness | ≥ 0.8 | Output matches context |
| Hallucination | ≤ 0.3 | No fabricated facts |
| Context Precision | ≥ 0.7 | Retrieved contexts relevant |

## Anti-Patterns (FORBIDDEN)

```python
# ❌ NEVER test against live LLM APIs in CI
response = await openai.chat.completions.create(...)

# ❌ NEVER use random seeds (non-deterministic)
model.generate(seed=random.randint(0, 100))

# ❌ NEVER skip timeout handling
await llm_call()  # No timeout!

# ✅ ALWAYS mock LLM in unit tests
with patch("app.llm", mock_llm):
    result = await function_under_test()

# ✅ ALWAYS use VCR.py for integration tests
@pytest.mark.vcr()
async def test_llm_integration():
    ...
```

## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Mock vs VCR | VCR for integration, mock for unit |
| Timeout | Always test with < 1s timeout |
| Schema validation | Test both valid and invalid |
| Edge cases | Test all null/empty paths |
| Quality metrics | Use multiple dimensions (3-5) |

## Detailed Documentation

| Resource | Description |
|----------|-------------|
| [references/deepeval-ragas-api.md](references/deepeval-ragas-api.md) | DeepEval & RAGAS API reference |
| [examples/test-patterns.md](examples/test-patterns.md) | Complete test examples |
| [checklists/llm-test-checklist.md](checklists/llm-test-checklist.md) | Setup and review checklists |
| [scripts/llm-test-template.py](scripts/llm-test-template.py) | Starter test template |

## Related Skills

- `vcr-http-recording` - Record LLM responses
- `llm-evaluation` - Quality assessment
- `unit-testing` - Test fundamentals

## Capability Details

### llm-response-mocking
**Keywords:** mock LLM, fake response, stub LLM, mock AI
**Solves:**
- Mock LLM responses in tests
- Create deterministic AI test fixtures
- Avoid live API calls in CI

### async-timeout-testing
**Keywords:** timeout, async test, wait for, polling
**Solves:**
- Test async LLM operations
- Handle timeout scenarios
- Implement polling assertions

### structured-output-validation
**Keywords:** structured output, JSON validation, schema validation, output format
**Solves:**
- Validate structured LLM output
- Test JSON schema compliance
- Assert output structure

### deepeval-assertions
**Keywords:** DeepEval, assert_test, LLMTestCase, metric assertion
**Solves:**
- Use DeepEval for LLM assertions
- Implement metric-based tests
- Configure quality thresholds

### golden-dataset-testing
**Keywords:** golden dataset, golden test, reference output, expected output
**Solves:**
- Test against golden datasets
- Compare with reference outputs
- Implement regression testing

### vcr-recording
**Keywords:** VCR, cassette, record, replay, HTTP recording
**Solves:**
- Record LLM API responses
- Replay recordings in tests
- Create deterministic test suites