---
name: drift-detection
description: Statistical and quality drift detection for LLM applications. Use when monitoring model quality degradation, input distribution shifts, or output pattern changes over time.
context: fork
agent: metrics-architect
version: 1.0.0
author: OrchestKit
tags: [drift, monitoring, quality, statistical, psi, langfuse, evidently, 2026]
user-invocable: false
---

# Drift Detection

Monitor LLM quality degradation and input/output distribution shifts in production.

## Overview

- Detecting input distribution drift (data drift)
- Monitoring output quality degradation (concept drift)
- Implementing statistical methods (PSI, KS, KL divergence)
- Setting up dynamic thresholds with moving averages
- Integrating Langfuse scores with drift analysis

## Quick Reference

### Population Stability Index (PSI)

```python
import numpy as np

def calculate_psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """
    Calculate Population Stability Index.

    Thresholds:
    - PSI < 0.1: No significant drift
    - 0.1 <= PSI < 0.25: Moderate drift, investigate
    - PSI >= 0.25: Significant drift, action needed
    """
    expected_pct = np.histogram(expected, bins=bins)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=bins)[0] / len(actual)

    # Avoid division by zero
    expected_pct = np.clip(expected_pct, 0.0001, None)
    actual_pct = np.clip(actual_pct, 0.0001, None)

    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return psi

# Usage
psi_score = calculate_psi(baseline_scores, current_scores)
if psi_score >= 0.25:
    alert("Significant quality drift detected!")
```

### EWMA Dynamic Threshold

```python
class EWMADriftDetector:
    """Exponential Weighted Moving Average for drift detection."""

    def __init__(self, lambda_param: float = 0.2, L: float = 3.0):
        self.lambda_param = lambda_param  # Smoothing factor
        self.L = L  # Control limit multiplier
        self.ewma = None

    def update(self, value: float, baseline_mean: float, baseline_std: float) -> dict:
        if self.ewma is None:
            self.ewma = value
        else:
            self.ewma = self.lambda_param * value + (1 - self.lambda_param) * self.ewma

        # Calculate control limits
        factor = np.sqrt(self.lambda_param / (2 - self.lambda_param))
        ucl = baseline_mean + self.L * baseline_std * factor
        lcl = baseline_mean - self.L * baseline_std * factor

        return {
            "ewma": self.ewma,
            "ucl": ucl,
            "lcl": lcl,
            "drift_detected": self.ewma > ucl or self.ewma < lcl
        }
```

### Langfuse Score Trend Monitoring

```python
from langfuse import Langfuse

langfuse = Langfuse()

def check_quality_drift(days: int = 7, threshold_drop: float = 0.1):
    """Compare recent quality scores against baseline."""

    # Fetch recent scores
    current_scores = langfuse.fetch_scores(
        name="quality_overall",
        from_timestamp=datetime.now() - timedelta(days=1)
    )

    # Fetch baseline scores
    baseline_scores = langfuse.fetch_scores(
        name="quality_overall",
        from_timestamp=datetime.now() - timedelta(days=days),
        to_timestamp=datetime.now() - timedelta(days=1)
    )

    current_mean = np.mean([s.value for s in current_scores])
    baseline_mean = np.mean([s.value for s in baseline_scores])

    drift_pct = (baseline_mean - current_mean) / baseline_mean

    if drift_pct > threshold_drop:
        return {"drift": True, "drop_pct": drift_pct}
    return {"drift": False, "drop_pct": drift_pct}
```

## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Statistical method | PSI for production (stable), KS for small samples |
| Threshold strategy | Dynamic (95th percentile of historical) over static |
| Baseline window | 7-30 days rolling window |
| Alert priority | Performance metrics > distribution metrics |
| Tool stack | Langfuse (traces) + Evidently/Phoenix (drift analysis) |

## PSI Threshold Guidelines

| PSI Value | Interpretation | Action |
|-----------|----------------|--------|
| < 0.1 | No significant drift | Monitor |
| 0.1 - 0.25 | Moderate drift | Investigate |
| >= 0.25 | Significant drift | Alert + Action |

## Anti-Patterns

```python
# ❌ NEVER use static thresholds without context
if psi > 0.2:  # May cause alert fatigue
    alert()

# ❌ NEVER retrain on time schedule alone
schedule.every(7).days.do(retrain)  # Wasteful if no drift

# ✅ ALWAYS use dynamic thresholds
threshold = np.percentile(historical_psi, 95)
if psi > threshold:
    alert()

# ✅ ALWAYS correlate with performance metrics
if psi > threshold AND quality_score < baseline:
    trigger_evaluation()
```

## Detailed Documentation

| Resource | Description |
|----------|-------------|
| [references/statistical-methods.md](references/statistical-methods.md) | PSI, KS, KL divergence, Wasserstein comparison |
| [references/embedding-drift.md](references/embedding-drift.md) | Arize Phoenix, cluster monitoring, semantic drift |
| [references/ewma-baselines.md](references/ewma-baselines.md) | Moving averages, dynamic thresholds, control charts |
| [references/langfuse-evidently-integration.md](references/langfuse-evidently-integration.md) | Combined pipeline pattern |
| [checklists/drift-detection-setup-checklist.md](checklists/drift-detection-setup-checklist.md) | Implementation checklist |

## Related Skills

- `langfuse-observability` - Score tracking for drift analysis
- `llm-evaluation` - Quality metrics that feed drift detection
- `quality-gates` - Threshold enforcement
- `observability-monitoring` - General monitoring patterns

## Capability Details

### psi-drift
**Keywords:** psi, population stability index, distribution drift, histogram comparison
**Solves:**
- Detect distribution shifts in LLM inputs/outputs
- Production-grade drift monitoring
- Stable drift metric for large datasets

### embedding-drift
**Keywords:** embedding drift, semantic drift, cluster, centroid, arize phoenix
**Solves:**
- Detect semantic changes in text data
- Monitor RAG retrieval quality
- Track embedding space shifts

### quality-regression
**Keywords:** quality drift, score degradation, trend, moving average
**Solves:**
- Detect LLM quality degradation over time
- Compare against historical baselines
- Early warning for model issues

### dynamic-thresholds
**Keywords:** ewma, dynamic threshold, adaptive, control chart
**Solves:**
- Reduce alert fatigue with adaptive thresholds
- Statistical process control for LLMs
- Context-aware drift alerting

### canary-monitoring
**Keywords:** canary prompt, fixed test, regression test, behavioral drift
**Solves:**
- Track consistency with fixed test inputs
- Detect behavioral changes in LLMs
- Regression testing for model updates