---
name: api-resilience-patterns
description: Implement API resilience patterns — circuit breakers, retry with backoff, rate limiting, bulkhead isolation, timeout management, and graceful degradation.
version: "1.0.0"
last-updated: "2026-04-17"
model_tested: "claude-sonnet-4-6"
category: resilience
platforms: [claude-code, codex, gemini-cli, cursor, copilot, windsurf, cline]
language: en
geo_relevance: [global]
priority: medium
dependencies:
  mcp: []
  skills: []
  apis: []
  data: []
update_sources:
  - url: "https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker"
    check_frequency: "yearly"
    last_checked: "2026-04-21"
license: MIT
---

# API Resilience Patterns

## When to Use

- Calling external APIs that might fail or slow down
- Designing microservice communication
- Building agents that call multiple tools/APIs
- Handling rate limits from LLM providers
- Preventing cascade failures

## Pattern 1: Circuit Breaker

Stops sending requests to a failing service, so callers fail fast and the service gets room to recover.

**States**: Closed (normal) → Open (failing) → Half-Open (testing)

| State | Behavior | Transition |
|-------|----------|-----------|
| Closed | Forward requests normally | → Open after N consecutive failures |
| Open | Reject immediately (fail fast) | → Half-Open after cooldown period |
| Half-Open | Allow 1 test request | → Closed if success, → Open if fail |

**Example config**: `threshold=3` consecutive failures, `cooldown=30s`, `half-open-max=1` test request.
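The state table above can be sketched as a small class. This is a minimal, single-threaded illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are assumptions made for the example, and because there is no locking, the one-probe limit in Half-Open only holds for sequential callers.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold  # consecutive failures before opening
        self.cooldown = cooldown    # seconds to stay open before probing
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "half-open"  # cooldown elapsed: allow one probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a half-open probe) closes the circuit.
        self.failures = 0
        self.state = "closed"

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state == "half-open" or self.failures >= self.threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Wrapping each outbound call as `breaker.call(fetch, url)` keeps the state machine in one place; callers only need to handle `CircuitOpenError` as a fast failure.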

## Pattern 2: Retry with Exponential Backoff

```
Attempt 1: immediate
Attempt 2: wait 1s + random(0-500ms)
Attempt 3: wait 2s + random(0-500ms)
Attempt 4: wait 4s + random(0-500ms)
Attempt 5: wait 8s + random(0-500ms)
Max: 5 attempts; cap any single backoff wait at 16s
```

**Rules**:
- Only retry on transient errors (429, 500, 502, 503, timeout)
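- Never retry non-transient client errors (400, 401, 403, 404); 429 is the one 4xx worth retrying, ideally after the `Retry-After` delay when the provider sends one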
- Always add jitter to prevent thundering herd
- Set a total timeout budget (not just per-attempt)
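The schedule and rules above could look roughly like this in Python. It is a sketch under assumptions: `HttpError` is a hypothetical exception type standing in for whatever the HTTP client raises, the 16s figure is read as a cap on any single wait, and the 60s `budget` default is illustrative.

```python
import random
import time

TRANSIENT_STATUS = {429, 500, 502, 503}


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delay(attempt, base=1.0, cap=16.0, jitter=0.5, rng=random.random):
    """Wait before `attempt` (attempt 2 is the first retry): 1s, 2s, 4s, ... plus jitter."""
    return min(cap, base * 2 ** (attempt - 2)) + rng() * jitter


def retry(fn, max_attempts=5, budget=60.0, sleep=time.sleep, clock=time.monotonic):
    deadline = clock() + budget  # total budget across all attempts
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except HttpError as err:
            if err.status not in TRANSIENT_STATUS:
                raise  # client errors such as 400/401/403/404: do not retry
            last = err
        except TimeoutError as err:
            last = err
        if attempt == max_attempts:
            break
        delay = backoff_delay(attempt + 1)
        if clock() + delay > deadline:
            break  # the next wait would exceed the total budget
        sleep(delay)
    raise last
```

Passing `sleep` and `clock` as parameters is a design choice for testability; in normal use the defaults apply and `retry(lambda: call_api())` is enough.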

## Pattern 3: Rate Limiting (Client-Side)

Throttle requests on the client so they stay under the provider's limits, rather than waiting for 429 responses:

| Strategy | When | How |
|----------|------|-----|
| Token bucket | Steady rate with bursts | Refill N tokens/sec, consume per request |
| Sliding window | Strict per-minute limits | Keep timestamps of requests inside the window; hold new requests once the count reaches the limit |
| Queue-based | Ordered processing | FIFO queue with configurable concurrency |
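As one illustration of the strategies in the table, a token bucket might be sketched as follows. The class name, the `try_acquire` and `wait_time` methods, and the injectable `clock` are assumptions for this example; the sketch is not thread-safe.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable so tests can control time
        self.tokens = float(capacity)
        self.updated = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last update, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self, cost=1.0):
        """Consume `cost` tokens if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def wait_time(self, cost=1.0):
        """Seconds until `cost` tokens will be available (0.0 if available now)."""
        self._refill()
        return max(0.0, (cost - self.tokens) / self.rate)
```

A caller can check `try_acquire()` before each request and sleep for `wait_time()` when it returns `False`, which spreads requests out instead of bursting into the provider's limit.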

## Pattern 4: Bulkhead Isolation

Give each dependency its own resources so one failure cannot cascade:
- Separate connection pools per service
- Separate thread/worker pools per dependency
- Result: if service A fails or hangs, calls to services B and C still have capacity
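One lightweight way to approximate the isolation described above is a concurrency limit per dependency. The sketch below uses one semaphore per service name; `Bulkhead`, `BulkheadFullError`, and the service names in the usage note are hypothetical, and a real system might use separate pools instead.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a dependency has no free slot."""


class Bulkhead:
    def __init__(self, limits):
        # One independent semaphore per dependency, e.g. {"service_a": 10}.
        self._slots = {
            name: threading.BoundedSemaphore(n) for name, n in limits.items()
        }

    @contextmanager
    def slot(self, name, timeout=0.0):
        sem = self._slots[name]
        # timeout=0.0 means "reject immediately" rather than queue behind a slow service.
        if timeout > 0:
            acquired = sem.acquire(timeout=timeout)
        else:
            acquired = sem.acquire(blocking=False)
        if not acquired:
            raise BulkheadFullError(f"no free slot for {name}")
        try:
            yield
        finally:
            sem.release()
```

With `bulkhead = Bulkhead({"service_a": 10, "service_b": 10})`, a call wrapped in `with bulkhead.slot("service_a"):` can exhaust only service A's slots, so a hang there leaves service B's capacity untouched.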

## Pattern 5: Timeout Management

| Tier | Timeout | Purpose |
|------|---------|---------|
| Connection | 5s | Detect unreachable host |
| Request | 30s | Detect slow response |
| Total operation | 60s | Budget for all attempts, including retries and backoff waits |

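**Rule**: Always set all three. Treat the total operation timeout as the outer budget: either size it to cover `max_attempts × request_timeout` plus all backoff waits, or cap each attempt's timeout at the time remaining. With the values above, 60s covers at most two full 30s attempts, so later attempts must run with a shorter timeout or be skipped.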
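Keeping per-attempt timeouts inside the total budget can be sketched with a small deadline helper. `Deadline` and `attempt_timeout` are hypothetical names for this example; the 30s and 60s values come from the table above.

```python
import time


class Deadline:
    def __init__(self, total, clock=time.monotonic):
        self.clock = clock                    # injectable so tests can control time
        self.expires_at = clock() + total     # fixed when the operation starts

    def remaining(self):
        """Seconds left in the total operation budget (never negative)."""
        return max(0.0, self.expires_at - self.clock())

    def attempt_timeout(self, request_timeout=30.0):
        """Timeout for the next attempt: the request tier, capped by what is left."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("total operation budget exhausted")
        return min(request_timeout, left)
```

Creating `Deadline(60.0)` once per operation and asking it for `attempt_timeout()` before every attempt means no single attempt can outlive the overall budget, whichever HTTP client receives the value.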

## Pattern 6: Graceful Degradation

| Scenario | Fallback |
|----------|---------|
| Search API down | Return cached results + "results may not be current" |
| Payment API slow | Queue payment, confirm later |
| AI API rate-limited | Switch to cheaper/faster model |
| Database read replica down | Read from primary (accept perf hit) |
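The first row of the table (cached results with a staleness notice) might be sketched like this. `CachedFallback`, the `stale` flag, and the `max_stale` limit are assumptions introduced for illustration; the source does not specify how old a cached result may be.

```python
import time


class CachedFallback:
    def __init__(self, fetch, max_stale=300.0, clock=time.monotonic):
        self.fetch = fetch          # function that calls the real API
        self.max_stale = max_stale  # assumed limit on cache age, in seconds
        self.clock = clock          # injectable so tests can control time
        self._cache = {}            # key -> (value, stored_at)

    def get(self, key):
        try:
            value = self.fetch(key)
        except Exception:
            entry = self._cache.get(key)
            # No usable cache entry: surface the original failure.
            if entry is None or self.clock() - entry[1] > self.max_stale:
                raise
            # Degraded path: flag the result so the UI can warn the user.
            return {"results": entry[0], "stale": True}
        self._cache[key] = (value, self.clock())
        return {"results": value, "stale": False}
```

The `stale` flag gives the caller what it needs to show a "results may not be current" notice instead of silently serving old data.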

## Anti-Patterns

| Anti-Pattern | Problem | Fix |
|-------------|---------|-----|
| Retry without backoff | Amplifies load on failing service | Exponential backoff + jitter |
| No timeout | Thread/connection leak | Always set timeouts |
| Retry on all errors | Retrying 401 wastes time | Only retry transient errors |
| Sync retry in UI thread | Blocks user interface | Async retry with status feedback |
| Cascading timeouts | Inner timeout > outer timeout, so inner work keeps running after the caller has given up | Budget timeouts from outside in |

## What This Skill Does NOT Do

- Does not prescribe or wrap specific libraries (it describes patterns)
- Does not monitor uptime (use APM tools)
- Does not manage API keys or authentication
- Does not handle business logic fallbacks (only infrastructure patterns)