# Harness Resilience

> Circuit breakers, rate limiting, bulkheads, retry patterns, and fault tolerance analysis. Detects missing resilience patterns, evaluates failure modes, and recommends concrete configurations for production-grade fault tolerance.

## When to Use

- When adding new external service integrations (APIs, databases, message queues) that need fault tolerance
- On PRs that modify service-to-service communication, HTTP clients, or middleware chains
- To audit existing resilience patterns for correctness, completeness, and observability
- NOT for load testing or capacity planning (use harness-load-testing)
- NOT for incident response after a failure has occurred (use harness-incident-response)
- NOT for security-focused rate limiting like DDoS protection (use harness-security-review)

## Process

### Phase 1: DETECT -- Identify Dependencies and Existing Patterns

1. **Inventory external dependencies.** Scan the codebase for outbound connections:
   - HTTP clients: `axios`, `fetch`, `got`, `HttpClient`, `RestTemplate`, `reqwest`
   - Database connections: connection pool configs, ORM initialization, query builders
   - Message queues: RabbitMQ, Kafka, SQS, Redis pub/sub client initialization
   - gRPC channels: proto client stubs, channel creation, dial options
   - Third-party SDKs: Stripe, Twilio, SendGrid, AWS SDK calls
2. **Map existing resilience patterns.** For each dependency found, check for:
   - Circuit breakers: `opossum`, `cockatiel`, `Polly`, `resilience4j`, `hystrix` usage
   - Retry logic: exponential backoff, jitter, max attempts configuration
   - Timeouts: connection and request timeout settings
   - Rate limiters: token bucket, sliding window, or fixed window implementations
   - Bulkheads: thread pool isolation, semaphore limits, connection pool sizing
   - Fallbacks: cache-aside patterns, default values, degraded responses
3. **Detect anti-patterns.** Flag common resilience mistakes (see the sketch after this phase):
   - Unbounded retries without backoff or max attempts
   - Missing timeouts on HTTP clients or database queries
   - Circuit breaker without a fallback handler
   - Retry on non-idempotent operations (POST, DELETE without idempotency keys)
   - Rate limiter with no monitoring or alerting on limit hits
4. **Build the dependency map.** Produce a structured inventory:
   - Dependency name, type (HTTP, gRPC, database, queue), criticality (critical, degraded, optional)
   - Current resilience patterns applied (or "none")
   - Identified gaps and anti-patterns
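To make the first two anti-patterns concrete, here is a minimal TypeScript sketch of the kind of retry loop this phase should flag, next to a corrected form. The endpoint, payload shape, and delay values are hypothetical, and the corrected version assumes the upstream accepts an `Idempotency-Key` header (as Stripe does):

```typescript
import axios from "axios";
import { randomUUID } from "node:crypto";

// FLAGGED by Phase 1: unbounded retry, fixed delay, no timeout, and a
// non-idempotent POST that can duplicate charges on retry.
export async function chargeNaive(payload: object): Promise<unknown> {
  while (true) {
    try {
      return (await axios.post("https://api.example.test/charges", payload)).data;
    } catch {
      await new Promise((r) => setTimeout(r, 1000)); // retries forever, same delay
    }
  }
}

// CORRECTED: bounded attempts, exponential backoff with full jitter,
// a request timeout, and an idempotency key so retries are safe.
export async function chargeSafe(payload: object, maxAttempts = 3): Promise<unknown> {
  const idempotencyKey = randomUUID(); // same key across all attempts
  for (let attempt = 1; ; attempt++) {
    try {
      const res = await axios.post("https://api.example.test/charges", payload, {
        timeout: 8_000,
        headers: { "Idempotency-Key": idempotencyKey },
      });
      return res.data;
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      const base = 250 * 2 ** (attempt - 1); // 250ms, 500ms, 1s, ...
      await new Promise((r) => setTimeout(r, base + Math.random() * base));
    }
  }
}
```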
---

### Phase 2: ANALYZE -- Evaluate Failure Modes

1. **Classify failure modes per dependency.** For each external dependency:
   - **Timeout:** The dependency responds too slowly or not at all
   - **Error burst:** The dependency returns errors at a rate above normal
   - **Partial degradation:** The dependency responds but with reduced functionality
   - **Total outage:** The dependency is completely unreachable
   - **Data inconsistency:** The dependency returns stale or incorrect data
2. **Assess blast radius.** For each failure mode:
   - Which features become unavailable?
   - Which downstream services are affected?
   - What is the user-visible impact?
   - Can the system continue to serve other requests?
3. **Evaluate current coverage.** Score each dependency on resilience coverage:
   - **Full:** Circuit breaker + retry + timeout + fallback + monitoring
   - **Partial:** Some patterns present but gaps exist (e.g., retry without circuit breaker)
   - **None:** No resilience patterns applied
4. **Prioritize gaps by risk.** Combine criticality and coverage:
   - Critical dependency with no resilience = P0 (immediate)
   - Critical dependency with partial resilience = P1 (next sprint)
   - Optional dependency with no resilience = P2 (backlog)
   - Any dependency with anti-patterns = P0 (anti-patterns are active risks)
5. **Check observability.** For existing patterns, verify they emit metrics:
   - Circuit breaker state changes (open/half-open/closed)
   - Retry attempt counts and final outcomes
   - Rate limiter rejection counts
   - Timeout occurrences

---

### Phase 3: DESIGN -- Recommend Resilience Patterns

1. **Select patterns per dependency.** Based on the failure mode analysis:
   - **HTTP APIs:** Circuit breaker (opossum/cockatiel) + exponential backoff with jitter + request timeout + fallback
   - **Databases:** Connection pool sizing + query timeout + read replica fallback + bulkhead isolation
   - **Message queues:** Dead letter queue + retry with backoff + idempotent consumers + circuit breaker on publish
   - **gRPC services:** Deadline propagation + retry policy + load balancing + circuit breaker
2. **Provide concrete configurations.** For each recommended pattern, specify:
   - Library and version to use
   - Configuration values with rationale (e.g., "timeout: 3000ms based on p99 latency of 1200ms with 2.5x headroom")
   - Threshold values for circuit breakers (failure rate, sample window, reset timeout)
   - Retry parameters (max attempts, base delay, max delay, jitter factor)
   - Rate limits (requests per window, window size, burst allowance)
3. **Design fallback strategies.** For each critical dependency:
   - **Cache fallback:** Serve stale data from Redis/memory cache with a staleness indicator
   - **Default fallback:** Return a safe default value with a degraded flag
   - **Queue fallback:** Accept the request and process it asynchronously when the dependency recovers
   - **Feature flag fallback:** Disable the feature entirely via feature flag
4. **Generate implementation templates.** Produce code snippets for (see the circuit breaker sketch after this phase):
   - Circuit breaker wrapping an existing HTTP client
   - Retry middleware with exponential backoff and jitter
   - Rate limiter middleware for Express/Fastify/NestJS
   - Bulkhead pattern using semaphore or connection pool limits
5. **Define health check contracts.** Specify how each dependency should be health-checked:
   - Endpoint or query to use for liveness check
   - Timeout for the health check itself
   - Frequency and failure threshold before marking unhealthy
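As one instance of the circuit breaker template, a minimal sketch using `opossum` around an axios call, with a cache fallback and logged state transitions. The service URL, cache helper, and threshold values are illustrative and should be tuned per the configuration guidance above:

```typescript
import CircuitBreaker from "opossum";
import axios from "axios";

// Hypothetical upstream call: fetch inventory for a product.
async function fetchInventory(productId: string) {
  const res = await axios.get(`https://inventory.internal/items/${productId}`, {
    timeout: 3_000, // per-request timeout, kept below the breaker's own timeout
  });
  return res.data as { available: boolean };
}

const breaker = new CircuitBreaker(fetchInventory, {
  timeout: 4_000,                // calls slower than 4s count as failures
  errorThresholdPercentage: 50,  // open when >=50% of requests fail...
  rollingCountTimeout: 10_000,   // ...within a 10s rolling window
  resetTimeout: 30_000,          // attempt a half-open probe after 30s
});

// Hypothetical cache helper; in practice this would read Redis or an LRU.
const staleCache = new Map<string, { available: boolean }>();

// Gate: every circuit breaker needs a fallback. Here, serve stale data
// with an explicit staleness indicator, or a safe default.
breaker.fallback((productId: string) => {
  const cached = staleCache.get(productId);
  return cached ? { ...cached, stale: true } : { available: false, stale: true };
});

// Gate: every pattern needs observability. Log state transitions at minimum.
breaker.on("open", () => console.warn("inventory breaker opened"));
breaker.on("halfOpen", () => console.info("inventory breaker half-open"));
breaker.on("close", () => console.info("inventory breaker closed"));

export const getInventory = (productId: string) => breaker.fire(productId);
```

Keeping the per-request timeout below the breaker's timeout means slow calls fail fast and are counted by the breaker, rather than racing its timer.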
---

### Phase 4: VALIDATE -- Verify Implementation and Observability

1. **Check pattern correctness.** For each implemented pattern:
   - Circuit breaker: Verify threshold configuration, half-open behavior, and reset timeout
   - Retry: Verify idempotency of retried operations, backoff curve, and max attempts
   - Timeout: Verify timeout values are set on both client and server sides
   - Rate limiter: Verify limit values, window type, and rejection response format
2. **Verify test coverage.** Check that resilience patterns are tested:
   - Circuit breaker tests: closed-to-open transition, open rejection, half-open recovery
   - Retry tests: successful retry, max attempts exhaustion, non-retryable error bypass
   - Timeout tests: timeout triggers fallback, timeout does not leak connections
   - Rate limiter tests: under-limit passes, at-limit rejects, window reset behavior
3. **Verify observability.** Confirm that metrics are emitted (see the sketch after this phase):
   - Check for Prometheus counters/histograms or StatsD calls on pattern events
   - Verify structured logging includes circuit breaker state, retry attempt number, and rate limit headers
   - Confirm dashboard or alert configurations reference the new metrics
4. **Produce the resilience report.** Output a summary:
   - Number of dependencies analyzed
   - Coverage before and after (percentage with full/partial/none resilience)
   - Anti-patterns found and resolved
   - Remaining gaps with priority and recommended timeline
5. **Run integration verification.** If integration tests exist:
   - Execute tests that exercise the resilience patterns (chaos test stubs, fault injection)
   - Verify graceful degradation under simulated failure conditions
   - Confirm that fallbacks produce acceptable user-facing responses
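To ground the observability check, a sketch wiring `opossum` breaker events to `prom-client` counters; the metric name and labels are illustrative, not a harness convention:

```typescript
import { Counter } from "prom-client";
import type CircuitBreaker from "opossum";

// One counter, labeled by breaker name and the state entered, so an alert
// can fire on any increase where state="open".
const breakerTransitions = new Counter({
  name: "circuit_breaker_transitions_total",
  help: "Circuit breaker state transitions by breaker and new state",
  labelNames: ["breaker", "state"],
});

// Attach to any opossum breaker; open/half-open/close become alertable.
export function instrumentBreaker(name: string, breaker: CircuitBreaker): void {
  breaker.on("open", () => breakerTransitions.inc({ breaker: name, state: "open" }));
  breaker.on("halfOpen", () => breakerTransitions.inc({ breaker: name, state: "half_open" }));
  breaker.on("close", () => breakerTransitions.inc({ breaker: name, state: "closed" }));
}
```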
---

## Harness Integration

- **`harness skill run harness-resilience`** -- Primary CLI entry point. Runs all four phases.
- **`harness validate`** -- Run after implementing recommended patterns to verify project integrity.
- **`harness check-deps`** -- Verify that new resilience libraries are properly declared and within boundary rules.
- **`emit_interaction`** -- Used at pattern selection (checkpoint:decision) when multiple valid patterns exist and trade-offs require human judgment.
- **`Glob`** -- Discover HTTP clients, middleware chains, and existing resilience pattern files.
- **`Grep`** -- Search for timeout configurations, retry logic, circuit breaker initialization, and anti-patterns.
- **`Write`** -- Generate implementation templates and resilience configuration files.
- **`Edit`** -- Add resilience wrappers to existing service clients.

## Success Criteria

- All external dependencies are inventoried with their resilience coverage level
- Anti-patterns are identified with specific file locations and line numbers
- Recommendations include concrete library versions and configuration values, not just pattern names
- Fallback strategies are defined for every critical dependency
- Implementation templates compile and follow the project's existing code style
- Observability is addressed: every pattern emits metrics or structured logs

## Examples

### Example: Express.js API with Stripe and PostgreSQL

```
Phase 1: DETECT
Dependencies found:
- Stripe API (HTTP, critical): axios client in src/payments/stripe-client.ts
  Resilience: timeout=5000ms, no retry, no circuit breaker, no fallback
- PostgreSQL (database, critical): pg pool in src/db/pool.ts
  Resilience: pool max=20, no query timeout, no read replica fallback
- SendGrid (HTTP, optional): @sendgrid/mail in src/notifications/email.ts
  Resilience: none
Anti-patterns:
- src/payments/stripe-client.ts:45 — retry on POST /charges without idempotency key
- src/db/pool.ts — no statement_timeout configured

Phase 2: ANALYZE
Stripe failure modes:
- Timeout: Payment page hangs, user retries, duplicate charges possible
- Outage: All payments fail, revenue impact immediate
- Blast radius: checkout flow, subscription renewal, refund processing
Risk: P0 (critical + partial coverage + anti-pattern)

Phase 3: DESIGN
Stripe recommendations:
- Add opossum circuit breaker: failureThreshold=50%, resetTimeout=30s
- Add idempotency key to all Stripe charge requests
- Set timeout to 8000ms (Stripe p99 is ~3s, 2.5x headroom)
- Fallback: queue payment for async retry via Bull queue
PostgreSQL recommendations:
- Set statement_timeout=5000 in pool config
- Add pg-pool error handler with connection retry
- Configure read replica for GET endpoints via pgBouncer

Phase 4: VALIDATE
Resilience coverage: 33% -> 100% (3/3 dependencies covered)
Anti-patterns resolved: 2/2
Tests needed: circuit breaker state transitions, idempotency key generation
```

### Example: NestJS Microservices with gRPC and Redis

```
Phase 1: DETECT
Dependencies found:
- user-service (gRPC, critical): @grpc/grpc-js in src/clients/user.client.ts
  Resilience: deadline=5s, no retry, no circuit breaker
- inventory-service (gRPC, critical): no resilience configured
- Redis (cache, degraded): ioredis in src/cache/redis.ts
  Resilience: reconnectOnError, no bulkhead, no fallback

Phase 2: ANALYZE
inventory-service outage:
- Product pages return 503, search results empty
- Blast radius: catalog, search, cart validation
- Risk: P0 (critical + no coverage)

Phase 3: DESIGN
inventory-service recommendations:
- Add cockatiel circuit breaker with ConsecutiveBreaker(5)
- Add retry with exponentialBackoff(1000, 2) maxAttempts=3
- Add deadline propagation from gateway timeout
- Fallback: serve cached inventory from Redis with staleness header
Redis recommendations:
- Add bulkhead: maxPoolSize=50, separate pools for cache vs sessions
- Add fallback: in-memory LRU cache (lru-cache, max 1000 items)
- Monitor: emit redis.command.duration histogram

Phase 4: VALIDATE
Coverage: 33% -> 100%
Tests verified: gRPC circuit breaker opens after 5 failures, Redis fallback serves from LRU when Redis is down
```

## Rationalizations to Reject

| Rationalization | Reality |
| --- | --- |
| "That third-party API has 99.99% uptime — we don't need a circuit breaker" | 99.99% uptime means 52 minutes of downtime per year. That downtime will not occur as one predictable window — it will happen as degraded responses and timeouts during a traffic spike. Without a circuit breaker, every caller blocks for the full timeout duration, exhausting thread pools and cascading across the system. |
| "We have retry logic, so failures are handled" | Retry logic without a circuit breaker amplifies failures. When the downstream service is degraded, retries multiply the load on an already struggling system. Circuit breakers and retries are complementary controls, not alternatives. |
| "The fallback adds complexity — we'll add it if the circuit breaker actually opens" | A circuit breaker without a fallback is a different kind of failure mode, not resilience. When the circuit opens, users see an error instead of a degraded-but-functional experience. Fallbacks must be designed and tested before the circuit ever opens in production. |
| "Our database connection pool is 100 connections — that's plenty" | Connection pool size without query timeouts means slow queries hold connections indefinitely. A single slow query spike can exhaust the pool, causing every subsequent request to wait. Pool sizing and query timeouts are both required. |
| "The service is internal — it doesn't need rate limiting" | Internal services are often called by automated processes, CI pipelines, and batch jobs that can spike traffic in ways user-facing services do not. Missing rate limiting on internal services is a common cause of self-inflicted outages during deployments and data migrations. A minimal middleware sketch follows this table. |
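The last row comes up often enough to ground in code: a minimal in-memory token bucket as Express middleware. Capacity and refill rate are illustrative, and a shared store (e.g., Redis) would be needed once more than one instance serves traffic:

```typescript
import type { Request, Response, NextFunction } from "express";

// One token bucket per client key (per IP here), refilled continuously.
interface Bucket { tokens: number; lastRefill: number; }

const CAPACITY = 20;        // burst allowance
const REFILL_PER_SEC = 10;  // sustained requests per second
const buckets = new Map<string, Bucket>();

export function rateLimit(req: Request, res: Response, next: NextFunction): void {
  const key = req.ip ?? "unknown";
  const now = Date.now();
  const bucket = buckets.get(key) ?? { tokens: CAPACITY, lastRefill: now };

  // Refill in proportion to elapsed time, capped at capacity.
  const elapsedSec = (now - bucket.lastRefill) / 1000;
  bucket.tokens = Math.min(CAPACITY, bucket.tokens + elapsedSec * REFILL_PER_SEC);
  bucket.lastRefill = now;
  buckets.set(key, bucket);

  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    next();
  } else {
    // Gate: rejections must be observable -- emit a metric or log here.
    res.status(429).set("Retry-After", "1").json({ error: "rate_limited" });
  }
}
```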
## Gates

- **No retry on non-idempotent operations without idempotency keys.** Retrying a POST or DELETE that lacks an idempotency mechanism can cause data duplication or data loss. This is a blocking finding. The operation must be made idempotent before retry logic is added.
- **No circuit breaker without a fallback.** A circuit breaker that opens and returns a raw error to the user is not resilience -- it is a different kind of failure. Every circuit breaker must have a defined fallback behavior (cache, default, queue, or feature flag).
- **No unbounded retries.** Retry logic must have a max attempt limit and use exponential backoff with jitter. Unbounded retries with fixed delays cause thundering herd problems and amplify failures.
- **No resilience pattern without observability.** A circuit breaker that opens silently is invisible to operations. Every pattern must emit metrics or structured logs that can trigger alerts.

## Escalation

- **When a dependency has no documentation on failure behavior:** Report: "The [dependency] has no documented error codes or failure modes. Recommend contacting the provider for SLA details, or instrumenting the client to collect failure statistics over a 2-week baseline period."
- **When resilience patterns conflict with latency requirements:** Adding retries and circuit breakers increases tail latency. Report: "The recommended retry configuration adds up to [N]ms to worst-case latency. If the latency budget is [M]ms, consider reducing max attempts or using a hedged request pattern instead." (A minimal hedged request sketch follows this section.)
- **When the team has no experience with the recommended library:** Report: "The team has not used [library] before. Recommend starting with a single non-critical dependency as a pilot, with a production bake time of 2 weeks before rolling out to critical paths."
- **When existing resilience patterns use a different library than recommended:** Do not recommend switching libraries mid-project. Report: "The project already uses [existing library] for resilience. Recommend continuing with [existing library] for consistency, adapting the configuration recommendations to its API."
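For the latency-budget escalation, a hedged request sends a backup attempt after a short delay and takes whichever settles first, trading extra load for lower tail latency. A minimal sketch; the 200ms hedge delay is illustrative, and the wrapped call must be idempotent:

```typescript
// Fire a backup request if the primary has not settled within hedgeDelayMs;
// whichever settles first wins. Only safe for idempotent calls, for the
// same reason retries are.
export async function hedged<T>(call: () => Promise<T>, hedgeDelayMs = 200): Promise<T> {
  const primary = call();
  const backup = new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => call().then(resolve, reject), hedgeDelayMs);
    // If the primary settles first, cancel the pending backup.
    primary.finally(() => clearTimeout(timer)).catch(() => {});
  });
  return Promise.race([primary, backup]);
}

// Usage (hypothetical idempotent fetch): await hedged(() => fetchUser(id));
```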