--- name: sre-reliability-engineering user-invocable: false description: Use when building reliable and scalable distributed systems. allowed-tools: [] --- # SRE Reliability Engineering Building reliable and scalable distributed systems. ## Service Level Objectives (SLOs) ### Defining SLOs ``` SLI: Availability = successful requests / total requests SLO: 99.9% availability (measured over 30 days) Error Budget: 0.1% = 43 minutes downtime per month ``` ### SLO Document Template ```markdown # API Service SLO ## Availability SLO **Target**: 99.9% of requests succeed (measured over 30 days) **SLI Definition**: - Success: HTTP 200-399 responses - Failure: HTTP 500-599 responses, timeouts - Excluded: HTTP 400-499 (client errors) **Measurement**: ```prometheus sum(rate(http_requests_total{status=~"[23].."}[30d])) / sum(rate(http_requests_total{status!~"4.."}[30d])) ``` **Error Budget**: 0.1% = ~43 minutes/month **Consequences**: - Budget remaining > 0: Ship features fast - Budget exhausted: Feature freeze, focus on reliability - Budget at 50%: Increase caution ``` ## Error Budgets ### Tracking ```prometheus # Error budget remaining error_budget_remaining = 1 - ( (1 - current_sli) / (1 - slo_target) ) # Example: 99.9% SLO, currently at 99.95% # Error budget remaining = 1 - ((1 - 0.9995) / (1 - 0.999)) # = 1 - (0.0005 / 0.001) = 0.5 (50% remaining) ``` ### Burn Rate ```prometheus # How fast are we consuming error budget? error_budget_burn_rate = (1 - current_sli_1h) / (1 - slo_target) # Alert if burning budget 10x faster than sustainable - alert: FastErrorBudgetBurn expr: error_budget_burn_rate > 10 for: 1h ``` ### Policy ``` Error Budget > 75%: Ship aggressively Error Budget 25-75%: Normal velocity Error Budget < 25%: Slow down, increase testing Error Budget = 0%: Feature freeze, reliability only ``` ## Reliability Patterns ### Circuit Breaker ```javascript class CircuitBreaker { constructor({ threshold = 5, timeout = 60000 }) { this.state = 'CLOSED'; this.failures = 0; this.threshold = threshold; this.timeout = timeout; } async call(fn) { if (this.state === 'OPEN') { if (Date.now() - this.openedAt > this.timeout) { this.state = 'HALF_OPEN'; } else { throw new Error('Circuit breaker is OPEN'); } } try { const result = await fn(); this.onSuccess(); return result; } catch (error) { this.onFailure(); throw error; } } onSuccess() { this.failures = 0; this.state = 'CLOSED'; } onFailure() { this.failures++; if (this.failures >= this.threshold) { this.state = 'OPEN'; this.openedAt = Date.now(); } } } ``` ### Retry with Exponential Backoff ```javascript async function retryWithBackoff(fn, maxRetries = 3) { for (let i = 0; i < maxRetries; i++) { try { return await fn(); } catch (error) { if (i === maxRetries - 1) throw error; const delay = Math.min(1000 * Math.pow(2, i), 10000); const jitter = Math.random() * 1000; await sleep(delay + jitter); } } } ``` ### Rate Limiting ```javascript class TokenBucket { constructor({ capacity, refillRate }) { this.capacity = capacity; this.tokens = capacity; this.refillRate = refillRate; this.lastRefill = Date.now(); } tryConsume(tokens = 1) { this.refill(); if (this.tokens >= tokens) { this.tokens -= tokens; return true; } return false; } refill() { const now = Date.now(); const elapsed = (now - this.lastRefill) / 1000; const tokensToAdd = elapsed * this.refillRate; this.tokens = Math.min( this.capacity, this.tokens + tokensToAdd ); this.lastRefill = now; } } ``` ### Bulkhead ```javascript class Bulkhead { constructor({ maxConcurrent }) { this.maxConcurrent = maxConcurrent; this.current = 0; this.queue = []; } async execute(fn) { while (this.current >= this.maxConcurrent) { await new Promise(resolve => this.queue.push(resolve)); } this.current++; try { return await fn(); } finally { this.current--; if (this.queue.length > 0) { const resolve = this.queue.shift(); resolve(); } } } } ``` ## Graceful Degradation ```javascript async function getRecommendations(userId) { try { // Try personalized recommendations return await recommendationService.getPersonalized(userId, { timeout: 500, // Fail fast }); } catch (error) { logger.warn('Personalized recommendations failed, falling back', { userId, error: error.message, }); try { // Fall back to popular items return await cache.get('popular_items'); } catch (fallbackError) { // Final fallback return DEFAULT_RECOMMENDATIONS; } } } ``` ## Capacity Planning ### Utilization Tracking ```prometheus # Current utilization current_utilization = sum(rate(http_requests_total[5m])) / capacity_requests_per_second # Alert when approaching capacity - alert: HighUtilization expr: current_utilization > 0.80 for: 10m ``` ### Growth Projection ``` Current QPS: 1,000 Growth rate: 20% per month Capacity per instance: 100 QPS Current instances: 12 In 6 months: Projected QPS: 1,000 * (1.20)^6 = 2,986 Instances needed: 2,986 / 100 = 30 ``` ### Load Testing ```javascript // k6 load test import http from 'k6/http'; import { check, sleep } from 'k6'; export const options = { stages: [ { duration: '2m', target: 100 }, // Ramp up { duration: '5m', target: 100 }, // Steady state { duration: '2m', target: 200 }, // Spike { duration: '5m', target: 200 }, // Higher steady { duration: '2m', target: 0 }, // Ramp down ], thresholds: { http_req_duration: ['p(95)<500'], // 95% under 500ms http_req_failed: ['rate<0.01'], // Less than 1% errors }, }; export default function () { const res = http.get('https://api.example.com/endpoint'); check(res, { 'status is 200': (r) => r.status === 200, 'response time < 500ms': (r) => r.timings.duration < 500, }); sleep(1); } ``` ## Chaos Engineering ### Fault Injection ```javascript // Inject latency function withLatencyInjection(fn, { probability = 0.1, delayMs = 1000 }) { return async (...args) => { if (Math.random() < probability) { await sleep(delayMs); } return fn(...args); }; } // Inject failures function withFailureInjection(fn, { probability = 0.05 }) { return async (...args) => { if (Math.random() < probability) { throw new Error('Injected failure'); } return fn(...args); }; } ``` ## Best Practices ### Design for Failure - Assume all dependencies can fail - Have fallback options - Fail fast and timeout quickly - Implement retries with backoff ### Measure User Impact - SLOs should reflect user experience - Don't alert on internal metrics alone - Track real user monitoring (RUM) ### Balance Velocity and Reliability - Use error budgets to make decisions - Don't target 100% reliability - Spend error budget on innovation ### Automate Everything - Automate deployments - Automate rollbacks - Automate capacity scaling - Automate incident response