---
name: llm-safety-patterns
description: Security patterns for LLM integrations including prompt injection defense and hallucination prevention. Use when implementing context separation, validating LLM outputs, or protecting against prompt injection attacks.
tags: [ai, safety, guardrails, security, llm]
context: fork
agent: security-auditor
version: 1.0.0
author: OrchestKit
user-invocable: false
---

# LLM Safety Patterns

## The Core Principle

> **Identifiers flow AROUND the LLM, not THROUGH it.**
> **The LLM sees only content. Attribution happens deterministically.**

## Why This Matters

When identifiers appear in prompts, bad things happen:

1. **Hallucination:** LLM invents IDs that don't exist
2. **Confusion:** LLM mixes up which ID belongs where
3. **Injection:** Attacker manipulates IDs via prompt injection
4. **Leakage:** IDs appear in logs, caches, traces
5. **Cross-tenant:** LLM could reference other users' data

## The Architecture

```
┌──────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  SYSTEM CONTEXT (flows around LLM)                                   │
│  ┌────────────────────────────────────────────────────────────┐      │
│  │ user_id │ tenant_id │ analysis_id │ trace_id │ permissions │      │
│  └────────────────────────────────────────────────────────────┘      │
│        │                                               │             │
│        ▼                                               ▼             │
│  ┌───────────┐      ┌─────────────────────┐      ┌───────────┐       │
│  │  PRE-LLM  │      │         LLM         │      │ POST-LLM  │       │
│  │  FILTER   │─────▶│                     │─────▶│ ATTRIBUTE │       │
│  │           │      │  Sees ONLY:         │      │           │       │
│  │  Returns  │      │  - content text     │      │  Adds:    │       │
│  │  CONTENT  │      │  - context text     │      │  - IDs    │       │
│  │  (no IDs) │      │    (NO IDs!)        │      │  - refs   │       │
│  └───────────┘      └─────────────────────┘      └───────────┘       │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

## What NEVER Goes in Prompts

### OrchestKit Forbidden Parameters

| Parameter | Type | Why Forbidden |
|-----------|------|---------------|
| `user_id` | UUID | Can be hallucinated, enables cross-user access |
| `tenant_id` | UUID | Critical for multi-tenant isolation |
| `analysis_id` | UUID | Job tracking, not for LLM |
| `document_id` | UUID | Source tracking, not for LLM |
| `artifact_id` | UUID | Output tracking, not for LLM |
| `chunk_id` | UUID | RAG reference, not for LLM |
| `session_id` | str | Auth context, not for LLM |
| `trace_id` | str | Observability, not for LLM |
| Any UUID | UUID | Pattern: `[0-9a-f]{8}-...` |

### Detection Pattern

```python
import re

FORBIDDEN_PATTERNS = [
    r'user[_-]?id',
    r'tenant[_-]?id',
    r'analysis[_-]?id',
    r'document[_-]?id',
    r'artifact[_-]?id',
    r'chunk[_-]?id',
    r'session[_-]?id',
    r'trace[_-]?id',
    r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
]


def audit_prompt(prompt: str) -> list[str]:
    """Check for forbidden patterns in prompt"""
    violations = []
    for pattern in FORBIDDEN_PATTERNS:
        if re.search(pattern, prompt, re.IGNORECASE):
            violations.append(pattern)
    return violations
```

## The Three-Phase Pattern
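All three phases hand identifiers around the LLM inside two small carrier types, `RequestContext` and `SourceRefs`. Neither is defined in this skill, so here is a minimal sketch inferred from how the snippets below use them — the exact field set is an assumption:

```python
from dataclasses import dataclass, field
from uuid import UUID


@dataclass(frozen=True)
class RequestContext:
    """System-provided identity and tracing context.

    Flows AROUND the LLM: threaded through retrieval and attribution,
    never serialized into prompt text.
    """
    user_id: UUID
    tenant_id: UUID
    resource_id: UUID  # e.g. the analysis being produced (assumption)
    trace_id: str


@dataclass
class SourceRefs:
    """Deterministic record of what retrieval returned, captured
    BEFORE the LLM call so attribution never depends on LLM output."""
    document_ids: list[UUID] = field(default_factory=list)
    chunk_ids: list[UUID] = field(default_factory=list)
```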
### Phase 1: Pre-LLM (Filter & Extract)

```python
async def prepare_for_llm(
    query: str,
    ctx: RequestContext,
) -> tuple[str, list[str], SourceRefs]:
    """
    Filter data and extract content for the LLM.

    Returns: (content, context_texts, source_references)
    """
    # 1. Retrieve with tenant filter
    documents = await semantic_search(
        query_embedding=embed(query),
        ctx=ctx,  # Filters by tenant_id, user_id
    )

    # 2. Save references for attribution
    # (chunks are assumed to hang off each retrieved document)
    source_refs = SourceRefs(
        document_ids=[d.id for d in documents],
        chunk_ids=[c.id for d in documents for c in d.chunks],
    )

    # 3. Extract content only (no IDs)
    content_texts = [d.content for d in documents]

    return query, content_texts, source_refs
```

### Phase 2: LLM Call (Content Only)

```python
def build_prompt(content: str, context_texts: list[str]) -> str:
    """
    Build a prompt with ONLY content, no identifiers.
    """
    prompt = f"""
Analyze the following content and provide insights.

CONTENT:
{content}

RELEVANT CONTEXT:
{chr(10).join(f"- {text}" for text in context_texts)}

Provide analysis covering:
1. Key concepts
2. Prerequisites
3. Learning objectives
"""

    # AUDIT: Verify no IDs leaked
    violations = audit_prompt(prompt)
    if violations:
        raise SecurityError(f"IDs leaked to prompt: {violations}")

    return prompt


async def call_llm(prompt: str) -> dict:
    """LLM only sees content, never IDs"""
    response = await llm.generate(prompt)
    return parse_response(response)
```

### Phase 3: Post-LLM (Attribute)

```python
async def save_with_attribution(
    llm_output: dict,
    ctx: RequestContext,
    source_refs: SourceRefs,
) -> Analysis:
    """
    Attach context and references to LLM output.
    Attribution is deterministic, not LLM-generated.
    """
    return await Analysis.create(
        # Generated
        id=uuid4(),

        # From RequestContext (system-provided)
        user_id=ctx.user_id,
        tenant_id=ctx.tenant_id,
        analysis_id=ctx.resource_id,
        trace_id=ctx.trace_id,

        # From Pre-LLM refs (deterministic)
        source_document_ids=source_refs.document_ids,
        source_chunk_ids=source_refs.chunk_ids,

        # From LLM (content only)
        content=llm_output["analysis"],
        key_concepts=llm_output["key_concepts"],
        difficulty=llm_output["difficulty"],

        # Metadata
        created_at=datetime.now(timezone.utc),
        model_used=MODEL_NAME,
    )
```

## Output Validation

After the LLM returns, validate:

1. **Schema:** Response matches the expected structure
2. **Guardrails:** No toxic/harmful content
3. **Grounding:** Claims are supported by the provided context
4. **No IDs:** The LLM didn't hallucinate any IDs

```python
async def validate_output(
    llm_output: dict,
    context_texts: list[str],
) -> ValidationResult:
    """Validate LLM output before use"""

    # 1. Schema validation
    try:
        parsed = AnalysisOutput.model_validate(llm_output)
    except ValidationError as e:
        return ValidationResult(valid=False, reason=f"Schema error: {e}")

    # 2. Guardrails
    if await contains_toxic_content(parsed.content):
        return ValidationResult(valid=False, reason="Toxic content detected")

    # 3. Grounding check
    if not is_grounded(parsed.content, context_texts):
        return ValidationResult(valid=False, reason="Ungrounded claims")

    # 4. No hallucinated IDs
    if contains_uuid_pattern(parsed.content):
        return ValidationResult(valid=False, reason="Hallucinated IDs")

    return ValidationResult(valid=True)
```
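The validation above calls two helpers this skill never defines. A minimal sketch follows: `contains_uuid_pattern` reuses the UUID regex from the detection pattern, while `is_grounded` is only a naive lexical-overlap placeholder (an assumption — production grounding checks typically use an NLI or citation-verification model):

```python
import re

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def contains_uuid_pattern(text: str) -> bool:
    """True if the text contains anything shaped like a UUID."""
    return UUID_RE.search(text) is not None


def is_grounded(
    claim_text: str,
    context_texts: list[str],
    threshold: float = 0.5,
) -> bool:
    """Naive grounding check: a minimum share of the claim's tokens
    must appear in the retrieved context. A stand-in for a real
    entailment-based grounding model, not a production check."""
    claim_tokens = set(claim_text.lower().split())
    if not claim_tokens:
        return True
    context_tokens = set(" ".join(context_texts).lower().split())
    return len(claim_tokens & context_tokens) / len(claim_tokens) >= threshold
```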
## Integration Points in OrchestKit

### Content Analysis Workflow

```
backend/app/workflows/
├── agents/
│   ├── execution.py            # Add context separation
│   └── prompts/                # Audit all prompts
├── tasks/
│   └── generate_artifact.py    # Add attribution
```

### Services

```
backend/app/services/
├── embeddings/                 # Pre-LLM filtering
└── analysis/                   # Post-LLM attribution
```

## Checklist Before Any LLM Call

- [ ] RequestContext available
- [ ] Data filtered by tenant_id and user_id
- [ ] Content extracted without IDs
- [ ] Source references saved
- [ ] Prompt passes audit (no forbidden patterns)
- [ ] Output validated before use
- [ ] Attribution uses context, not LLM output

---

## Related Skills

- `input-validation` - Input sanitization patterns that complement LLM safety
- `rag-retrieval` - RAG pipeline patterns requiring tenant-scoped retrieval
- `llm-evaluation` - Output quality assessment including hallucination detection
- `security-scanning` - Automated security scanning for LLM integrations

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| ID handling | Flow around LLM, never through | Prevents hallucination, injection, and cross-tenant leakage |
| Output validation | Schema + guardrails + grounding | Defense-in-depth for LLM outputs |
| Attribution approach | Deterministic post-LLM | System context provides IDs, not LLM |
| Prompt auditing | Regex pattern matching | Fast detection of forbidden identifiers |

**Version:** 1.0.0 (December 2025)

## Capability Details

### context-separation

**Keywords:** context separation, prompt context, id in prompt, parameterized

**Solves:**
- How do I prevent IDs from leaking into prompts?
- How do I separate system context from prompt content?
- What should never appear in LLM prompts?

### pre-llm-filtering

**Keywords:** pre-llm, rag filter, data filter, tenant filter

**Solves:**
- How do I filter data before sending to LLM?
- How do I ensure tenant isolation in RAG?
- How do I scope retrieval to current user?

### post-llm-attribution

**Keywords:** attribution, source tracking, provenance, citation

**Solves:**
- How do I track which sources the LLM used?
- How do I attribute results correctly?
- How do I avoid LLM-generated IDs?

### output-guardrails

**Keywords:** guardrail, output validation, hallucination, toxicity

**Solves:**
- How do I validate LLM output?
- How do I detect hallucinations?
- How do I prevent toxic content generation?

### prompt-audit

**Keywords:** prompt audit, prompt security, prompt injection

**Solves:**
- How do I verify no IDs leaked to prompts?
- How do I audit prompts for security?
- How do I prevent prompt injection?
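## Putting It Together

A minimal end-to-end sketch composing the three phases and the checklist above, using only the functions defined in this skill (`AnalysisError` is a hypothetical error type — substitute whatever your service raises):

```python
async def analyze_content(query: str, ctx: RequestContext) -> Analysis:
    """End-to-end flow: filter -> prompt -> validate -> attribute.

    Identifiers travel only in ctx and source_refs; the prompt and
    the LLM response carry content alone.
    """
    # Phase 1: tenant-scoped retrieval; IDs captured in source_refs
    content, context_texts, source_refs = await prepare_for_llm(query, ctx)

    # Phase 2: audited, content-only prompt (build_prompt raises if IDs leak)
    prompt = build_prompt(content, context_texts)
    llm_output = await call_llm(prompt)

    # Validate before anything is persisted
    result = await validate_output(llm_output, context_texts)
    if not result.valid:
        raise AnalysisError(result.reason)  # hypothetical error type

    # Phase 3: deterministic attribution from system context
    return await save_with_attribution(llm_output, ctx, source_refs)
```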