--- name: silent-failure-detection description: Detect quiet failures in LLM agents - tool skipping, gibberish outputs, infinite loops, and degraded quality. Use when agents appear to work but produce incorrect results. context: fork agent: monitoring-engineer version: 1.0.0 author: OrchestKit tags: [monitoring, alerting, anomaly, silent-failure, observability, agents, 2026] user-invocable: false --- # Silent Failure Detection Detect when LLM agents fail silently - appearing to work while producing incorrect results. ## Overview - Detecting when agents skip expected tool calls - Identifying gibberish or degraded output quality - Monitoring for infinite loops and token consumption spikes - Setting up statistical baselines for anomaly detection - Alerting on non-error failures (service up but logic broken) ## Quick Reference ### Tool Skipping Detection ```python from langfuse import Langfuse def check_tool_usage(trace_id: str, expected_tools: list[str]) -> dict: """ Detect when agent skips expected tool calls. Based on Akamai's middleware bug: agents stopped using tools when hidden middleware injected unexpected instructions. """ langfuse = Langfuse() trace = langfuse.fetch_trace(trace_id) # Extract tool calls from trace actual_tools = [ span.name for span in trace.observations if span.type == "tool" ] missing_tools = set(expected_tools) - set(actual_tools) if missing_tools: return { "alert": True, "type": "tool_skipping", "missing": list(missing_tools), "message": f"Agent skipped expected tools: {missing_tools}" } return {"alert": False} ``` ### Gibberish/Quality Detection ```python from langfuse.decorators import observe, langfuse_context @observe(name="quality_check") async def detect_gibberish(response: str) -> dict: """ Detect low-quality or gibberish outputs using LLM-as-judge. """ # Quick heuristics first if len(response) < 10: return {"alert": True, "type": "too_short"} if len(set(response.split())) / len(response.split()) < 0.3: return {"alert": True, "type": "repetitive"} # LLM-as-judge for quality judge_prompt = f""" Rate this response quality (0-1): - 0: Gibberish, nonsensical, or completely wrong - 0.5: Partially correct but missing key information - 1: High quality, accurate, complete Response: {response[:1000]} Score (just the number): """ score = await llm.generate(judge_prompt) score_value = float(score.strip()) langfuse_context.score(name="quality_check", value=score_value) if score_value < 0.5: return {"alert": True, "type": "low_quality", "score": score_value} return {"alert": False, "score": score_value} ``` ### Loop Detection ```python class LoopDetector: """Detect infinite loops and token consumption spikes.""" def __init__( self, max_iterations: int = 10, token_spike_multiplier: float = 3.0, baseline_tokens: int = 2000 ): self.max_iterations = max_iterations self.token_spike_multiplier = token_spike_multiplier self.baseline_tokens = baseline_tokens self.iteration_count = 0 self.total_tokens = 0 def check(self, tokens_used: int) -> dict: self.iteration_count += 1 self.total_tokens += tokens_used # Check iteration count if self.iteration_count > self.max_iterations: return { "alert": True, "type": "max_iterations", "iterations": self.iteration_count, "message": f"Agent exceeded {self.max_iterations} iterations" } # Check token spike expected_tokens = self.baseline_tokens * self.iteration_count if self.total_tokens > expected_tokens * self.token_spike_multiplier: return { "alert": True, "type": "token_spike", "tokens": self.total_tokens, "expected": expected_tokens, "message": f"Token consumption spike: {self.total_tokens} vs expected {expected_tokens}" } return {"alert": False} ``` ### Statistical Baseline Anomaly Detection ```python import numpy as np class BaselineAnomalyDetector: """Detect anomalies vs statistical baseline.""" def __init__(self, window_size: int = 100, z_threshold: float = 3.0): self.window_size = window_size self.z_threshold = z_threshold self.history = [] def add_observation(self, value: float) -> dict: self.history.append(value) if len(self.history) > self.window_size: self.history = self.history[-self.window_size:] if len(self.history) < 10: return {"alert": False, "reason": "insufficient_data"} mean = np.mean(self.history[:-1]) std = np.std(self.history[:-1]) if std == 0: return {"alert": False} z_score = abs(value - mean) / std if z_score > self.z_threshold: return { "alert": True, "type": "statistical_anomaly", "z_score": z_score, "value": value, "mean": mean, "std": std } return {"alert": False, "z_score": z_score} ``` ## Key Decisions | Decision | Recommendation | |----------|----------------| | Detection priority | Tool skipping > Gibberish > Loops > Anomalies | | Quality check | LLM-as-judge with heuristic pre-filter | | Loop threshold | 10 iterations or 3x baseline tokens | | Anomaly threshold | Z-score > 3.0 (99.7% confidence) | | Alert strategy | Alert on silent failure, not just errors | ## Silent Failure Types | Type | Detection Method | Alert Priority | |------|------------------|----------------| | Tool Skipping | Expected vs actual tool calls | Critical | | Gibberish Output | LLM-as-judge + heuristics | High | | Infinite Loop | Iteration count + token spike | Critical | | Quality Degradation | Score < baseline | Medium | | Latency Spike | p99 > threshold | Medium | ## Anti-Patterns ```python # ❌ NEVER assume success if no error raised result = await agent.run() # Missing: quality check, tool usage check # ❌ NEVER ignore abnormal patterns if len(response) > 0: # "Not empty" is not "correct" return response # ✅ ALWAYS validate tool usage expected_tools = ["search", "calculate"] tool_check = check_tool_usage(trace_id, expected_tools) if tool_check["alert"]: alert(tool_check) # ✅ ALWAYS check output quality quality = await detect_gibberish(response) if quality["alert"]: fallback_to_human_review() ``` ## Detailed Documentation | Resource | Description | |----------|-------------| | [references/tool-skipping-detection.md](references/tool-skipping-detection.md) | Agent tool usage monitoring patterns | | [references/gibberish-detection.md](references/gibberish-detection.md) | Output quality scoring, LLM-as-judge | | [references/loop-detection.md](references/loop-detection.md) | Token spikes, retry patterns, circuit breakers | | [references/baseline-comparison.md](references/baseline-comparison.md) | Statistical anomaly detection | | [checklists/silent-failure-setup-checklist.md](checklists/silent-failure-setup-checklist.md) | Implementation checklist | ## Related Skills - `langfuse-observability` - Trace analysis for tool usage - `quality-gates` - Quality threshold enforcement - `observability-monitoring` - General alerting patterns - `advanced-guardrails` - LLM output safety checks ## Capability Details ### tool-skipping **Keywords:** tool skip, missing tool, agent tools, expected behavior **Solves:** - Detect when agents don't use expected tools - Monitor agent behavior consistency - Debug middleware interference (Akamai scenario) ### gibberish-detection **Keywords:** gibberish, nonsense, quality check, llm judge **Solves:** - Detect low-quality LLM outputs - Identify repetitive or nonsensical responses - Quality gate for production outputs ### loop-detection **Keywords:** infinite loop, retry loop, token spike, stuck agent **Solves:** - Detect agents stuck in loops - Monitor token consumption anomalies - Prevent runaway costs ### baseline-anomaly **Keywords:** anomaly, baseline, z-score, statistical, deviation **Solves:** - Detect deviations from normal behavior - Statistical anomaly detection - Early warning for silent failures ### latency-monitoring **Keywords:** latency, slow, p99, degraded, performance **Solves:** - Detect degraded but non-failing service - Monitor response time anomalies - SLO compliance for LLM calls