---
name: resilience-patterns
description: Production-grade fault tolerance for distributed systems. Use when implementing circuit breakers, retry with exponential backoff, bulkhead isolation patterns, or building resilience into LLM API integrations.
context: fork
agent: backend-system-architect
version: 1.0.0
author: OrchestKit AI Agent Hub
tags: [resilience, circuit-breaker, bulkhead, retry, fault-tolerance]
user-invocable: false
---

# Resilience Patterns Skill

Production-grade resilience patterns for distributed systems and LLM-based workflows. Covers circuit breakers, bulkheads, retry strategies, and LLM-specific resilience techniques.

## Overview

Use this skill when:

- Building fault-tolerant multi-agent systems
- Implementing LLM API integrations with proper error handling
- Designing distributed workflows that need graceful degradation
- Adding observability to failure scenarios
- Protecting systems from cascade failures

## Core Patterns

### 1. Circuit Breaker Pattern (reference: circuit-breaker.md)

Prevents cascade failures by "tripping" (rejecting calls immediately instead of waiting on a failing dependency) once a service exceeds its failure threshold.

```
+-------------------------------------------------------------------+
|                    Circuit Breaker States                         |
+-------------------------------------------------------------------+
|                                                                   |
|    +----------+     failures >= threshold    +----------+         |
|    |  CLOSED  | ---------------------------> |   OPEN   |         |
|    | (normal) |                              | (reject) |         |
|    +----+-----+                              +----+-----+         |
|         |                                         |               |
|         | success                    timeout      |               |
|         |                            expires      |               |
|         |         +------------+                  |               |
|         |         | HALF_OPEN  |<-----------------+               |
|         +---------+  (probe)   |                                  |
|                   +------------+                                  |
|                                                                   |
|   CLOSED:    Allow requests, count failures                       |
|   OPEN:      Reject immediately, return fallback                  |
|   HALF_OPEN: Allow probe; success -> CLOSED, failure -> OPEN      |
|                                                                   |
+-------------------------------------------------------------------+
```

**Key Configuration:**
- `failure_threshold`: Failures counted in CLOSED before the circuit opens (default: 5)
- `recovery_timeout`: Seconds to stay OPEN before moving to HALF_OPEN (default: 30)
- `half_open_requests`: Probe requests allowed while HALF_OPEN (default: 1)
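
The three states and the configuration keys above can be sketched as a small class. This is a minimal, single-threaded illustration, not the contents of `scripts/circuit-breaker.py`. The names `CircuitBreaker`, `State`, and `CircuitOpenError` and the injectable `clock` parameter are assumptions made for the example, and it counts consecutive failures because the text above does not specify a counting rule.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the service."""


class CircuitBreaker:
    """Illustrative sketch; no locking, so guard it before sharing across threads."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 half_open_requests=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self._clock = clock  # injectable so tests do not need to sleep
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._probes = 0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Recovery timeout expired: start probing.
            self.state = State.HALF_OPEN
            self._probes = 0
        if self.state is State.HALF_OPEN:
            # The probe limit only matters when calls can overlap.
            if self._probes >= self.half_open_requests:
                raise CircuitOpenError("probe limit reached")
            self._probes += 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success, including a probe, closes the circuit and resets the count.
        self.state = State.CLOSED
        self._failures = 0

    def _on_failure(self):
        self._failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
```

A caller would wrap each outbound request, for example `breaker.call(fetch, url)` with a hypothetical `fetch` function, and catch `CircuitOpenError` to return a fallback.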

### 2. Bulkhead Pattern (reference: bulkhead-pattern.md)

Isolates failures by partitioning resources into independent pools.

```
+-------------------------------------------------------------------+
|                      Bulkhead Isolation                           |
+-------------------------------------------------------------------+
|                                                                   |
|   +------------------+  +------------------+                      |
|   | TIER 1: Critical |  | TIER 2: Standard |                      |
|   |  (5 workers)     |  |  (3 workers)     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  |#| |#| | |     |  |  |#| | | | |     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  | | | |         |  |  Queue: 2        |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  Queue: 0        |  +------------------+                      |
|   +------------------+                                            |
|                                                                   |
|   +------------------+                                            |
|   | TIER 3: Optional |   # = Active request                       |
|   |  (2 workers)     |     = Available slot                       |
|   |  +-+ +-+         |                                            |
|   |  |#| |#| FULL!   |   Tier 1: synthesis, quality_gate          |
|   |  +-+ +-+         |   Tier 2: analysis agents                  |
|   |  Queue: 3        |   Tier 3: enrichment, optional features    |
|   +------------------+                                            |
|                                                                   |
+-------------------------------------------------------------------+
```

**Tier Configuration (OrchestKit):**

| Tier | Max Workers | Max Queue | Timeout | Use Case |
|------|-------------|-----------|---------|----------|
| 1 (Critical) | 5 | 10 | 300s | Synthesis, quality gate |
| 2 (Standard) | 3 | 5 | 120s | Content analysis agents |
| 3 (Optional) | 2 | 3 | 60s | Enrichment, caching |
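
One way to read the tier table is as one semaphore per tier, with a bounded number of waiters and a per-call timeout. The sketch below assumes an `asyncio` application. The `Bulkhead` class, `BulkheadFullError`, and `make_tiers` are hypothetical names for illustration and are not taken from `scripts/bulkhead.py`. It also assumes the timeout applies to execution time rather than time spent queued.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when every worker slot is busy and the wait queue is full."""


class Bulkhead:
    def __init__(self, name, max_workers, max_queue, timeout):
        self.name = name
        self.max_workers = max_workers
        self.max_queue = max_queue
        self.timeout = timeout
        self._slots = asyncio.Semaphore(max_workers)
        self._waiting = 0  # callers currently queued for a slot

    async def run(self, coro_func, *args):
        # Reject instead of queueing without bound.
        if self._slots.locked() and self._waiting >= self.max_queue:
            raise BulkheadFullError(f"{self.name}: workers and queue are full")
        self._waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self._waiting -= 1
        try:
            # Assumption: the tier timeout bounds execution, not queue wait.
            return await asyncio.wait_for(coro_func(*args), timeout=self.timeout)
        finally:
            self._slots.release()


def make_tiers():
    """Build the three tiers from the table; call this inside a running event loop."""
    return {
        1: Bulkhead("critical", max_workers=5, max_queue=10, timeout=300),
        2: Bulkhead("standard", max_workers=3, max_queue=5, timeout=120),
        3: Bulkhead("optional", max_workers=2, max_queue=3, timeout=60),
    }
```

Because each tier owns its own semaphore, a backlog of optional work in tier 3 cannot consume the slots reserved for tier 1.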

### 3. Retry Strategies (reference: retry-strategies.md)

Retries transient failures with exponential backoff and jitter, and skips errors that cannot succeed on a retry.

```
+-------------------------------------------------------------------+
|                   Exponential Backoff + Jitter                    |
+-------------------------------------------------------------------+
|                                                                   |
|   Attempt 1:  --> X (fail)                                        |
|               wait: 1s +/- 0.5s                                   |
|                                                                   |
|   Attempt 2:  --> X (fail)                                        |
|               wait: 2s +/- 1s                                     |
|                                                                   |
|   Attempt 3:  --> X (fail)                                        |
|               wait: 4s +/- 2s                                     |
|                                                                   |
|   Attempt 4:  --> OK (success)                                    |
|                                                                   |
|   Formula: delay = min(base * 2^(attempt-1), max_delay) * jitter  |
|   Jitter:  random(0.5, 1.5) to prevent thundering herd            |
|                                                                   |
+-------------------------------------------------------------------+
```
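
The delay calculation in the diagram can be written as a small function. This sketch assumes attempts are 1-indexed as in the diagram, so the exponent is `attempt - 1`, which reproduces the 1s, 2s, 4s waits shown when `base` is 1. The `backoff_delay` name, the `max_delay` default of 60 seconds, and the injectable `rng` parameter are assumptions for the example.

```python
import random


def backoff_delay(attempt, base=1.0, max_delay=60.0, rng=random.random):
    """Seconds to wait after failed attempt number `attempt` (1-indexed)."""
    if attempt < 1:
        raise ValueError("attempt is 1-indexed")
    capped = min(base * 2 ** (attempt - 1), max_delay)
    jitter = 0.5 + rng()  # uniform in [0.5, 1.5)
    return capped * jitter
```

Note that the jitter is applied after the cap, as in the formula, so an individual delay can reach up to 1.5 times `max_delay`.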

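**Error Classification for Retries:**

```python
# Errors that are usually transient and worth retrying with backoff.
RETRYABLE_ERRORS = {
    # HTTP status codes: request timeout, rate limit, server-side failures
    408, 429, 500, 502, 503, 504,
    # Network errors (exception classes; match caught instances with isinstance())
    ConnectionError, TimeoutError,

    # LLM-specific error codes
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",  # Retry only after truncating the input
}

# Errors that will fail again if resent unchanged; surface them to the caller.
NON_RETRYABLE_ERRORS = {
    400, 401, 403, 404,  # Client errors
    "invalid_api_key",
    "content_policy_violation",
    "invalid_request_error",
}
```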
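
The classification above can drive a retry loop along these lines. Because the sets hold exception classes, a plain membership test on a caught exception instance will not match, so this sketch uses `isinstance` instead. `ApiError`, `is_retryable`, and `call_with_retry` are hypothetical names, and `context_length_exceeded` is left out of the automatic path because resending the same prompt would fail again until it is truncated.

```python
import time

RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}
RETRYABLE_CODES = {"rate_limit_exceeded", "model_overloaded"}
RETRYABLE_EXCEPTIONS = (ConnectionError, TimeoutError)


class ApiError(Exception):
    """Hypothetical error carrying an HTTP status and a provider error code."""

    def __init__(self, status=None, code=None):
        super().__init__(f"status={status} code={code}")
        self.status = status
        self.code = code


def is_retryable(exc):
    if isinstance(exc, RETRYABLE_EXCEPTIONS):
        return True
    if isinstance(exc, ApiError):
        return exc.status in RETRYABLE_STATUS or exc.code in RETRYABLE_CODES
    return False  # unknown errors are treated as permanent


def call_with_retry(func, max_attempts=4, base=1.0, sleep=time.sleep):
    """Call func(), retrying retryable errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts or not is_retryable(exc):
                raise
            sleep(base * 2 ** (attempt - 1))  # jitter omitted for brevity
```

Treating unknown errors as permanent is a conservative default; a real integration may prefer to log them and decide per provider.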

### 4. LLM-Specific Resilience (reference: llm-resilience.md)

Patterns specific to LLM API integrations: model fallback chains and token budget guards.

```
+-------------------------------------------------------------------+
|                    LLM Fallback Chain                             |
+-------------------------------------------------------------------+
|                                                                   |
|   Request --> [Primary Model] --success--> Response               |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Fallback Model] --success--> Response              |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Cached Response] --hit--> Response                 |
|                     |                                             |
|                   miss                                            |
|                     v                                             |
|               [Default Response] --> Graceful Degradation         |
|                                                                   |
|   Example Chain:                                                  |
|   1. claude-sonnet-4-20250514 (primary)                           |
|   2. gpt-4o-mini (fallback)                                       |
|   3. Semantic cache lookup                                        |
|   4. "Analysis unavailable" + partial results                     |
|                                                                   |
+-------------------------------------------------------------------+
```
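
The chain in the diagram reduces to trying each stage in order and recording which one answered. The sketch below is a rough illustration under stated assumptions: `run_fallback_chain` is a hypothetical function, each model is passed in as a `(name, callable)` pair so no real provider SDK is implied, and `cache_lookup` stands in for the semantic cache.

```python
import logging

logger = logging.getLogger(__name__)


def run_fallback_chain(prompt, models, cache_lookup, default="Analysis unavailable"):
    """Return (response, source), where source names the stage that answered.

    models:       ordered list of (name, callable) pairs, primary first
    cache_lookup: callable returning a cached response or None on a miss
    """
    for name, call in models:
        try:
            return call(prompt), name
        except Exception as exc:
            # Log and move on to the next model in the chain.
            logger.warning("model %s failed: %s", name, exc)
    cached = cache_lookup(prompt)
    if cached is not None:
        return cached, "cache"
    # Last resort: degrade gracefully with a default response.
    return default, "default"
```

Returning the source alongside the response lets callers and metrics distinguish a primary answer from a degraded one. Catching every `Exception` keeps the sketch short; a production chain would likely skip the fallback for errors a second model cannot fix, such as a content policy violation.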

**Token Budget Management:**
```
+-------------------------------------------------------------------+
|                     Token Budget Guard                            |
+-------------------------------------------------------------------+
|                                                                   |
|   Input: 8,000 tokens                                             |
|   +---------------------------------------------+                 |
|   |#################################            |                 |
|   +---------------------------------------------+                 |
|                                          ^                        |
|                                          |                        |
|                                    Context Limit (16K)            |
|                                                                   |
|   Strategy when approaching limit:                                |
|   1. Summarize earlier context (compress 4:1)                     |
|   2. Drop low-priority content (optional fields)                  |
|   3. Split into multiple requests                                 |
|   4. Fail fast with "content too large" error                     |
|                                                                   |
+-------------------------------------------------------------------+
```
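
Strategies 2 and 4 from the diagram (drop low-priority content, then fail fast) can be sketched as a guard that runs before each model call. The token estimate of roughly four characters per token is a crude assumption; a real integration would use the provider's tokenizer. `fit_to_budget`, `estimate_tokens`, `ContentTooLargeError`, and the `reserved_output` allowance are hypothetical names and values, not taken from `scripts/token-budget.py`.

```python
class ContentTooLargeError(Exception):
    """Raised when required content alone exceeds the input budget."""


def estimate_tokens(text):
    """Crude estimate: about 4 characters per token (assumption)."""
    return (len(text) + 3) // 4


def fit_to_budget(sections, context_limit=16_000, reserved_output=2_000,
                  count=estimate_tokens):
    """Return the section texts that fit within the input budget.

    sections: ordered list of (text, optional) pairs. Optional sections are
    dropped, last first, until the input fits; required ones are never dropped.
    """
    budget = context_limit - reserved_output
    kept = list(sections)
    total = sum(count(text) for text, _ in kept)
    for i in range(len(kept) - 1, -1, -1):
        if total <= budget:
            break
        text, optional = kept[i]
        if optional:
            total -= count(text)
            del kept[i]
    if total > budget:
        raise ContentTooLargeError(
            f"content too large: {total} tokens exceeds budget of {budget}"
        )
    return [text for text, _ in kept]
```

Summarizing earlier context or splitting into multiple requests (strategies 1 and 3) would slot in before the final `raise`.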

## Quick Reference

| Pattern | When to Use | Key Benefit |
|---------|-------------|-------------|
| Circuit Breaker | External service calls | Prevent cascade failures |
| Bulkhead | Multi-tenant/multi-agent | Isolate failures |
| Retry + Backoff | Transient failures | Automatic recovery |
| Fallback Chain | Critical operations | Graceful degradation |
| Token Budget | LLM calls | Cost control, prevent failures |

## OrchestKit Integration Points

1. **Workflow Agents**: Each agent is wrapped with a circuit breaker and assigned a bulkhead tier
2. **LLM Calls**: All model invocations go through the fallback chain and retry logic
3. **External APIs**: Circuit breakers guard calls to the YouTube, arXiv, and GitHub APIs
4. **Database Ops**: Separate bulkheads isolate read operations from write operations

## Files in This Skill

### References (Conceptual Guides)
- `references/circuit-breaker.md` - Deep dive on circuit breaker pattern
- `references/bulkhead-pattern.md` - Bulkhead isolation strategies
- `references/retry-strategies.md` - Retry algorithms and error classification
- `references/llm-resilience.md` - LLM-specific patterns
- `references/error-classification.md` - How to categorize errors

### Templates (Code Patterns)
- `scripts/circuit-breaker.py` - Ready-to-use circuit breaker class
- `scripts/bulkhead.py` - Semaphore-based bulkhead implementation
- `scripts/retry-handler.py` - Configurable retry decorator
- `scripts/llm-fallback-chain.py` - Multi-model fallback pattern
- `scripts/token-budget.py` - Token budget guard implementation

### Examples
- `examples/orchestkit-workflow-resilience.md` - Full OrchestKit integration example

### Checklists
- `checklists/pre-deployment-resilience.md` - Production readiness checklist
- `checklists/circuit-breaker-setup.md` - Circuit breaker configuration guide

## 2026 Best Practices

1. **Adaptive Thresholds**: Trip on the failure rate over a sliding window rather than on a fixed failure count
2. **Observability First**: Every circuit trip should emit an alert, a metric, and a trace
3. **Graceful Degradation**: Always have a fallback, even if it returns only partial results
4. **Health Endpoints**: Report service health separately from circuit state
5. **Chaos Testing**: Regularly inject failures in staging to verify these patterns
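
The first practice, a sliding window instead of a fixed counter, might look like the following time-based tracker. It is a sketch under assumptions: the `SlidingWindowFailureRate` name, the 60-second window, the `min_calls` floor, and the 0.5 trip threshold are illustrative values, not OrchestKit defaults.

```python
import time
from collections import deque


class SlidingWindowFailureRate:
    """Tracks call outcomes over the last `window_seconds` seconds."""

    def __init__(self, window_seconds=60.0, min_calls=10, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_calls = min_calls  # avoid tripping on a handful of samples
        self._clock = clock
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        self._events.append((self._clock(), succeeded))
        self._evict()

    def _evict(self):
        # Discard outcomes that have aged out of the window.
        cutoff = self._clock() - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def failure_rate(self):
        self._evict()
        if len(self._events) < self.min_calls:
            return 0.0  # too few samples to judge
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_trip(self, threshold=0.5):
        return self.failure_rate() >= threshold
```

A circuit breaker could call `record()` after every request and open when `should_trip()` returns true, so old failures age out on their own instead of accumulating toward a fixed count.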

---

## Related Skills

- `observability-monitoring` - Metrics and alerting for circuit breaker state changes
- `caching-strategies` - Cache as fallback layer in degradation scenarios
- `error-handling-rfc9457` - Structured error responses for resilience failures
- `background-jobs` - Async processing with retry and failure handling

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Circuit breaker recovery | Half-open probe | Gradual recovery, prevents immediate re-failure |
| Retry algorithm | Exponential backoff + jitter | Prevents thundering herd, respects rate limits |
| Bulkhead isolation | Semaphore-based tiers | Simple, efficient, prioritizes critical operations |
| LLM fallback | Model chain with cache | Graceful degradation, cost optimization, availability |

---

## Capability Details

### circuit-breaker
**Keywords:** circuit breaker, failure threshold, cascade failure, trip, half-open
**Solves:**
- Prevent cascade failures when external services fail
- Automatically recover when services come back online
- Fail fast instead of waiting for timeouts

### bulkhead
**Keywords:** bulkhead, isolation, semaphore, thread pool, resource pool, tier
**Solves:**
- Isolate failures to prevent entire system crashes
- Prioritize critical operations over optional ones
- Limit concurrent requests to protect resources

### retry-strategies
**Keywords:** retry, backoff, exponential, jitter, thundering herd
**Solves:**
- Handle transient failures automatically
- Avoid overwhelming recovering services
- Classify errors as retryable vs non-retryable

### llm-resilience
**Keywords:** LLM, fallback, model, token budget, rate limit, context length
**Solves:**
- Handle LLM API rate limits gracefully
- Fall back to alternative models when primary fails
- Manage token budgets to prevent context overflow

### error-classification
**Keywords:** error, retryable, transient, permanent, classification
**Solves:**
- Determine which errors should be retried
- Categorize errors by severity and recoverability
- Map HTTP status codes to resilience actions