--- name: rate-limiting-patterns description: Use when implementing rate limiting, throttling, or API quotas. Covers algorithms like token bucket and sliding window, plus distributed rate limiting patterns. allowed-tools: Read, Glob, Grep --- # Rate Limiting Patterns Patterns for protecting APIs and services through rate limiting, throttling, and quota management. ## When to Use This Skill - Implementing API rate limiting - Choosing rate limiting algorithms - Designing distributed rate limiting - Setting up quota management - Protecting against abuse ## Why Rate Limiting ```text Protection against: - DDoS attacks - Brute force attempts - Resource exhaustion - Cost overruns (cloud APIs) - Cascading failures Business benefits: - Fair resource allocation - Predictable performance - Cost control - SLA enforcement ``` ## Rate Limiting Algorithms ### Token Bucket **Concept:** Tokens added at fixed rate, requests consume tokens ```text Configuration: - Bucket size (max tokens): 100 - Refill rate: 10 tokens/second Behavior: ┌─────────────────────────┐ │ Bucket (capacity: 100) │ │ ████████████░░░░░░░░░░ │ 60 tokens available └─────────────────────────┘ ↑ ↓ 10 tokens/s Request takes 1 token Allows bursts up to bucket size, then rate-limited. ``` **Characteristics:** - Allows controlled bursts - Simple to implement - Memory efficient - Most common algorithm **Implementation sketch:** ```text token_bucket: tokens = min(tokens + (now - last_update) * rate, capacity) if tokens >= cost: tokens -= cost return ALLOW return DENY ``` ### Leaky Bucket **Concept:** Requests queue and process at fixed rate ```text ┌─────────────────────────┐ │ Queue (capacity: 100) │ │ ██████████████████████ │ Requests waiting └──────────┬──────────────┘ │ ▼ Process at fixed rate (10/sec) [Processing] Smooths traffic to constant rate. ``` **Characteristics:** - Smooth output rate - No bursts allowed - Requests may queue - Good for downstream protection ### Fixed Window **Concept:** Count requests in fixed time windows ```text Window: 1 minute, Limit: 100 requests |-------- Window 1 --------|-------- Window 2 --------| 95 requests ? requests [Allow] [Reset to 0] Problem: Boundary burst End of window 1: 100 requests Start of window 2: 100 requests = 200 requests in ~1 second span ``` **Characteristics:** - Simple to implement - Memory efficient - Boundary burst problem - Good for simple use cases ### Sliding Window Log **Concept:** Track timestamp of each request ```text Window: 1 minute, Limit: 100 Requests: [t-55s, t-50s, t-45s, ..., t-5s, t-2s, now] Count all requests in [now - 60s, now] No boundary burst problem, but memory intensive. ``` **Characteristics:** - Precise limiting - No boundary issues - Memory intensive (stores all timestamps) - Good for strict limits ### Sliding Window Counter **Concept:** Weighted average of current and previous windows ```text Previous window: 80 requests Current window: 30 requests (40% through window) Weighted count = 80 * 0.6 + 30 = 78 Limit: 100 Result: ALLOW (78 < 100) ``` **Characteristics:** - Approximation (usually good enough) - Memory efficient - Smooths boundary issues - Best balance for most cases ## Algorithm Selection Guide | Algorithm | Burst Handling | Memory | Precision | Use Case | | --------- | -------------- | ------ | --------- | -------- | | Token Bucket | Allows bursts | Low | Good | General API limiting | | Leaky Bucket | No bursts | Low | Good | Smooth rate enforcement | | Fixed Window | Boundary burst | Very Low | Poor | Simple limits | | Sliding Log | No bursts | High | Exact | Strict compliance | | Sliding Counter | Minimal burst | Low | Good | Best general choice | ## Distributed Rate Limiting ### Challenge ```text Single node: Simple in-memory counter Multiple nodes: Need coordination Without coordination: Node 1: 50 requests (under 100 limit) Node 2: 50 requests (under 100 limit) Node 3: 50 requests (under 100 limit) Total: 150 requests (over 100 limit!) ``` ### Pattern 1: Centralized (Redis) ```text ┌─────────┐ ┌─────────┐ ┌─────────┐ │ Node 1 │ │ Node 2 │ │ Node 3 │ └────┬────┘ └────┬────┘ └────┬────┘ │ │ │ └───────────────┼───────────────┘ │ ┌──────▼──────┐ │ Redis │ │ (counters) │ └─────────────┘ Pros: Accurate, consistent Cons: Redis dependency, latency, single point of failure ``` ### Pattern 2: Local + Sync ```text Each node gets fraction of limit: - 3 nodes, 100 limit → 33 per node Periodically sync to rebalance unused capacity. Pros: Low latency, resilient Cons: Less precise, sync complexity ``` ### Pattern 3: Sticky Sessions ```text Route same client to same node (by IP, API key, etc.) Pros: Simple, no coordination needed Cons: Uneven load, failover complexity ``` ### Redis Implementation ```text Token Bucket with Redis: EVALSHA token_bucket_script 1 {key} {capacity} {refill_rate} {tokens_requested} Script: 1. Get current tokens and timestamp 2. Calculate tokens to add since last request 3. If enough tokens, decrement and allow 4. Return tokens remaining ``` ## Rate Limit Headers Standard headers to communicate limits to clients: ```text HTTP/1.1 200 OK X-RateLimit-Limit: 100 X-RateLimit-Remaining: 45 X-RateLimit-Reset: 1640000000 Retry-After: 30 (when rate limited) Or draft standard: RateLimit-Limit: 100 RateLimit-Remaining: 45 RateLimit-Reset: 30 ``` ## Rate Limit Response ```text HTTP/1.1 429 Too Many Requests Content-Type: application/json Retry-After: 30 { "error": { "code": "RATE_LIMITED", "message": "Rate limit exceeded", "retry_after": 30, "limit": 100, "window": "1m" } } ``` ## Multi-Tier Rate Limiting ```text Apply limits at multiple levels: Level 1: Global (protect infrastructure) - 10,000 req/sec across all clients Level 2: Per-tenant (fair allocation) - 1,000 req/min per organization Level 3: Per-user (prevent abuse) - 100 req/min per user Level 4: Per-endpoint (protect expensive operations) - 10 req/min for /export endpoint ``` ## Quota Management ### Quota vs Rate Limit ```text Rate Limit: Requests per time window (burst protection) - 100 requests/minute Quota: Total allocation over period (budget) - 10,000 API calls/month ``` ### Quota Tracking ```text Track usage: - Per API key - Per endpoint - Per operation type Alert thresholds: - 80% usage: Warning notification - 100% usage: Hard block or overage charges ``` ## Best Practices ### Graceful Degradation ```text Instead of hard block: 1. Reduce quality (lower resolution, fewer results) 2. Queue requests (process later) 3. Serve cached responses 4. Allow burst with penalty (slower recovery) ``` ### Client-Side Handling ```text Implement exponential backoff: 1. Receive 429 2. Wait Retry-After (or 1s) 3. Retry 4. If 429 again, wait 2s 5. Continue doubling up to max (e.g., 60s) ``` ### Testing Rate Limits ```text Test scenarios: - Burst traffic - Sustained high traffic - Clock skew (distributed systems) - Recovery after limit - Multiple client types ``` ## Related Skills - `api-design-fundamentals` - API design patterns - `idempotency-patterns` - Safe retries - `quality-attributes-taxonomy` - Performance attributes