---
name: langgraph-error-handling
description: Implement LangGraph error handling with current v1 patterns. Use when users need to classify failures, add RetryPolicy for transient issues, build LLM recovery loops with Command routing, add human-in-the-loop with interrupt()/resume, handle ToolNode errors, or choose a safe strategy between retry, recovery, and escalation.
---

# LangGraph Error Handling

## Use This Skill For
- Adding `RetryPolicy` to flaky nodes (API, DB, model/tool calls)
- Designing LLM recovery loops (`Command` + error state + retry counters)
- Adding human approval/escalation with `interrupt()` and resume
- Handling prebuilt `ToolNode` failures
- Debugging transactional failure behavior in parallel supersteps

## Strategy Selection

Use this order:

1. Transient/infrastructure issue (`429`, timeout, `5xx`, temporary DB lock) -> `RetryPolicy`
2. Recoverable by model/tool args correction -> store error in state and route back with `Command`
3. Needs user approval or missing info -> `interrupt()` + resume
4. Unknown/programming bug -> let it bubble up and debug

| Error Type | Owner | Primary Mechanism |
|---|---|---|
| Transient | System | `RetryPolicy` |
| LLM-recoverable | LLM | State update + `Command(goto=...)` |
| User-fixable | Human | `interrupt()` + `Command(resume=...)` |
| Unexpected | Developer | Raise/log/debug |

For full taxonomy, load [references/error-types.md](references/error-types.md).

## Minimal Patterns

### 1) Retry Transient Failures

```python
from langgraph.types import RetryPolicy

builder.add_node(
    "call_api",
    call_api,
    retry_policy=RetryPolicy(max_attempts=3, initial_interval=1.0),
)
```

```ts
builder.addNode("callApi", callApi, {
  retryPolicy: { maxAttempts: 3, initialInterval: 1.0 },
});
```

Notes:
- Python and JS default retry behavior differs by exception type.
- Prefer targeted `retry_on`/`retryOn` for non-transient domains.

### 2) LLM Recovery Loop

Use `MessagesState` in Python for message state.

```python
from typing import Literal
from typing_extensions import NotRequired
from langgraph.graph import MessagesState
from langgraph.types import Command

class State(MessagesState):
    error: NotRequired[str]
    retry_count: NotRequired[int]

def agent(state: State) -> Command[Literal["tool", "__end__"]]:
    if state.get("retry_count", 0) >= 3:
        return Command(goto="__end__")
    if state.get("error"):
        return Command(goto="tool")
    return Command(goto="tool")
```

```ts
import { StateGraph, Command, END } from "@langchain/langgraph";

// If a node returns Command in JS, add `ends` on addNode.
builder.addNode("agent", agentNode, { ends: ["tool", END] });
```

### 3) Human-In-The-Loop Escalation

```python
from langgraph.types import interrupt, Command

def human_review(state):
    approved = interrupt({
        "question": "Proceed?",
        "payload": state["pending_action"],
    })
    return Command(goto="execute" if approved else "cancel")

# resume
graph.invoke(Command(resume=True), config={"configurable": {"thread_id": "t-1"}})
```

```ts
import { Command, interrupt } from "@langchain/langgraph";

const approved = interrupt({ question: "Proceed?" });
// later
await graph.invoke(new Command({ resume: true }), {
  configurable: { thread_id: "t-1" },
});
```

Requirements:
- Compile with a checkpointer for interrupt flows.
- Reuse the same `thread_id` on resume.

For deep HITL patterns, load [references/human-escalation.md](references/human-escalation.md).

## ToolNode Error Handling

```python
from langgraph.prebuilt import ToolNode

tool_node = ToolNode(tools, handle_tool_errors=True)
tool_node = ToolNode(tools, handle_tool_errors="Please try again.")
tool_node = ToolNode(tools, handle_tool_errors=(ValueError, TypeError))
```

Use custom handlers when you need deterministic error shaping for model recovery.
For broader tool-recovery design, load [references/llm-recovery.md](references/llm-recovery.md).

## Critical Behavior (Do Not Skip)

1. **Supersteps are transactional**: one failing parallel branch fails the whole superstep state update.
2. **RetryPolicy retries failing branches**, not successful siblings.
3. **`interrupt()` re-runs the node on resume**: side effects before interrupt must be idempotent, or moved after interrupt / separate node.
4. **JS `Command` routing requires `ends` metadata** on `addNode(...)`.
5. **Use explicit retry limits** (`max_attempts`, plus state counters for recovery loops).

## Local Assets In This Skill

### Scripts
- `scripts/classify_error.py`: classify exception category and recommended handling
- `scripts/wrap_with_retry.py`: generate boilerplate node wrappers with retry/recovery/escalation options

Run from repo root:

```bash
uv run skills/langgraph-error-handling/scripts/classify_error.py TimeoutError --verbose
uv run skills/langgraph-error-handling/scripts/wrap_with_retry.py call_llm --with-llm-recovery
```

### Examples
- `assets/examples/retry-example/`: retry + recovery loop (Python and JS)
- `assets/examples/human-loop-example/`: interrupt/resume approval flow (Python and JS)

## Load References On Demand

- `references/error-types.md`: error taxonomy and classification rules
- `references/retry-strategies.md`: retry tuning, backoff, circuit-breaker-style patterns
- `references/llm-recovery.md`: recovery-loop and ToolNode strategies
- `references/human-escalation.md`: human approval, interrupts, and escalation patterns

## Common Failure Modes

| Symptom | Root Cause | Fix |
|---|---|---|
| `interrupt()` fails at runtime | no checkpointer | compile with checkpointer |
| Resume starts new run | different `thread_id` | reuse same `thread_id` |
| JS Command route not taken | missing `ends` | add `ends` to `addNode` |
| Infinite loop | no termination counter/condition | add retry counter + terminal branch |
| Retry never triggers | exception excluded by retry filter | set explicit `retry_on`/`retryOn` |