---
title: Rate Limiting
---

# Rate Limiting

Rate limiting protects workers from being overwhelmed by too many concurrent requests. SMG uses a token bucket algorithm with optional request queuing.

---

## Overview

### :material-bucket: Token Bucket

Smooth rate limiting with burst capacity using the token bucket algorithm.

### :material-tray-full: Request Queuing

Queue excess requests instead of rejecting them immediately.

### :material-timer-outline: Configurable Timeouts

Bound request and queue wait times to maintain system responsiveness.

### :material-chart-line: Observable

Full Prometheus metrics for queue depth, wait times, and rejection rates.

---

## Why Rate Limit?

Without rate limiting:

1. **Worker overload**: Too many concurrent requests degrade performance
2. **Memory exhaustion**: Workers run out of GPU memory
3. **Cascading timeouts**: Slow responses cause client timeouts
4. **Poor user experience**: Some users get fast responses, others wait forever

Rate limiting ensures **fair access** and **predictable performance**.

---

## How It Works

SMG uses a **token bucket** algorithm:

![Token Bucket Rate Limiting](../../assets/images/rate-limiting.svg)

### Token Bucket

- **Bucket capacity**: Maximum concurrent requests (`--max-concurrent-requests`)
- **Refill rate**: Tokens added per second (`--rate-limit-tokens-per-second`)
- **Request cost**: Each request consumes one token

### Request Queue

When no tokens are available, requests can wait in a queue:

- **Queue size**: Maximum waiting requests (`--queue-size`)
- **Queue timeout**: Maximum wait time (`--queue-timeout-secs`)

---

## Configuration

```bash
smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --max-concurrent-requests 100 \
  --rate-limit-tokens-per-second 50 \
  --queue-size 200 \
  --queue-timeout-secs 30
```

### Rate Limit Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--max-concurrent-requests` | `-1` (disabled) | Token bucket capacity. When `<= 0` the limiter is disabled entirely and requests pass through. |
| `--rate-limit-tokens-per-second` | unset (refills at `max_concurrent_requests`) | Token bucket refill rate in tokens per second. |
| `--queue-size` | `100` | Maximum queued requests |
| `--queue-timeout-secs` | `60` | Maximum queue wait time |

### Timeout Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--request-timeout-secs` | `1800` (30 min) | Maximum time for a request to complete |
| `--queue-timeout-secs` | `60` | Maximum time a request waits in queue |
| `--worker-startup-timeout-secs` | `1800` (30 min) | Timeout for worker startup/model loading |

!!! note "Concurrency vs. Rate Limiting"

    Setting `--max-concurrent-requests` alone creates a token bucket whose capacity *and* refill rate both equal `max_concurrent_requests`, so it enforces both burst capacity and a sustained rate. Set `--rate-limit-tokens-per-second` when you want the sustained rate to differ from the burst capacity (for example, capacity `100` with refill `50` allows short bursts of 100 while sustaining 50 req/s).

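
The interaction between capacity and refill rate described above can be illustrated with a small token bucket in Python. This is a minimal sketch, not SMG's implementation: the `TokenBucket` class, its `try_acquire` method, and the injectable `clock` parameter are hypothetical names chosen for illustration. It assumes the bucket starts full and that an unset refill rate falls back to the capacity, mirroring the documented default.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Illustrative token bucket (hypothetical, not SMG's code)."""

    def __init__(self, capacity: float, refill_rate: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = float(capacity)
        # Assumption: an unset refill rate falls back to the capacity.
        self.refill_rate = self.capacity if refill_rate is None else float(refill_rate)
        self.tokens = self.capacity  # assumption: start full, so a burst is allowed
        self._clock = clock
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the elapsed time, never exceeding the capacity.
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False otherwise."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Capacity 100 with refill 50, driven by a fake clock for determinism.
fake_now = [0.0]
bucket = TokenBucket(capacity=100, refill_rate=50, clock=lambda: fake_now[0])

burst = sum(bucket.try_acquire() for _ in range(150))             # initial burst
fake_now[0] += 1.0                                                # one second passes
after_one_second = sum(bucket.try_acquire() for _ in range(150))  # refilled tokens only
```

With these numbers, `burst` counts the requests admitted from the full bucket and `after_one_second` counts those admitted from one second of refill, which corresponds to the "bursts of 100 while sustaining 50 req/s" behavior in the note. What happens to the requests that fail `try_acquire` (queuing or rejection) is outside this sketch.
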

---

## Response Codes

| Code | Meaning | When |
|------|---------|------|
| **429** | Too Many Requests | Queue is full, or queuing is disabled and no token is available |
| **408** | Request Timeout | Queue wait exceeded timeout |

The local rate limiter returns a status-only response with no JSON body (clients should read the HTTP status and `X-Request-Id` to distinguish cases). SMG does not currently emit a `Retry-After` header with the response.

When the mesh global rate limit is enabled and exceeded, the 429 response carries a JSON body:

```json
{
  "error": "Rate limit exceeded",
  "current_count": 123,
  "limit": 100
}
```

---

## Sizing Guidelines

### Concurrent Requests

Base this on worker capacity:

```
max_concurrent_requests = num_workers × requests_per_worker
```

| Worker Type | Requests per Worker |
|-------------|---------------------|
| Small GPU (16GB) | 4-8 |
| Medium GPU (40GB) | 8-16 |
| Large GPU (80GB) | 16-32 |

### Queue Size

Base this on acceptable latency:

```
queue_size = max_concurrent_requests × queue_depth_factor
```

| Latency Tolerance | Queue Depth Factor |
|-------------------|-------------------|
| Low (interactive) | 0.5-1x |
| Medium (batch) | 2-4x |
| High (async) | 4-8x |

### Token Refill Rate

Base this on sustainable throughput:

```
tokens_per_second = expected_requests_per_second × 1.2
```

The 1.2 factor provides headroom for bursts.

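
As a worked example, the three sizing formulas above can be combined into one helper. This is a sketch under stated assumptions: `size_limits` is a hypothetical function (not part of SMG), and rounding each result up to a whole number is an assumption made here for convenience rather than a documented requirement of the flags.

```python
import math


def size_limits(num_workers: int, requests_per_worker: int,
                queue_depth_factor: float, expected_rps: float,
                headroom: float = 1.2) -> dict:
    """Apply the sizing formulas from this section (hypothetical helper)."""
    max_concurrent = num_workers * requests_per_worker
    # Round to 6 decimals first so float noise cannot push ceil() up by one.
    queue_size = math.ceil(round(max_concurrent * queue_depth_factor, 6))
    tokens_per_second = math.ceil(round(expected_rps * headroom, 6))
    return {
        "max_concurrent_requests": max_concurrent,
        "queue_size": queue_size,
        "tokens_per_second": tokens_per_second,
    }


# Four large-GPU workers at 16 requests each, interactive traffic
# (queue depth factor 1x), expecting about 40 requests per second.
limits = size_limits(num_workers=4, requests_per_worker=16,
                     queue_depth_factor=1.0, expected_rps=40)
```

Here `limits` holds a capacity of 4 × 16 = 64, a queue size of 64 × 1 = 64, and a refill rate of 40 × 1.2 = 48, which map onto `--max-concurrent-requests`, `--queue-size`, and `--rate-limit-tokens-per-second` respectively.
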

---

## Example Configurations

=== "Interactive API"

    Low latency, reject excess traffic:

    ```bash
    smg \
      --max-concurrent-requests 50 \
      --queue-size 25 \
      --queue-timeout-secs 5
    ```

=== "Batch Processing"

    Higher throughput, longer queues:

    ```bash
    smg \
      --max-concurrent-requests 200 \
      --queue-size 500 \
      --queue-timeout-secs 60
    ```

=== "No Rate Limiting"

    Trust upstream rate limiting:

    ```bash
    smg \
      --max-concurrent-requests -1
    ```

---

## Monitoring

### Metrics

| Metric | Description |
|--------|-------------|
| `smg_http_rate_limit_total` | Rate limit decisions by result (allowed/rejected) |
| `smg_http_request_duration_seconds` | Request duration histogram |

### Useful PromQL Queries

#### Rate Limit Decisions

```promql
# Rate limit decisions per second
rate(smg_http_rate_limit_total[5m])

# By decision type (allowed/rejected)
sum by (result) (
  rate(smg_http_rate_limit_total[5m])
)
```
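
The same counter can be turned into a rejection share outside Prometheus by comparing two samples. The sketch below assumes the samples have already been collected into dictionaries keyed by the `result` label value (`allowed`, `rejected`); the `rejection_ratio` helper is hypothetical and does not handle counter resets.

```python
def rejection_ratio(prev: dict, curr: dict) -> float:
    """Fraction of rate-limit decisions rejected between two counter samples.

    `prev` and `curr` map the `result` label value to the cumulative
    value of `smg_http_rate_limit_total` at two points in time.
    """
    deltas = {result: curr[result] - prev.get(result, 0.0) for result in curr}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no new decisions in this window
    return deltas.get("rejected", 0.0) / total


earlier = {"allowed": 1000.0, "rejected": 20.0}
later = {"allowed": 1180.0, "rejected": 40.0}
ratio = rejection_ratio(earlier, later)  # 20 rejections out of 200 decisions
```

A ratio like this is what a rejection-rate alert would compare against its warning and critical thresholds.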

#### Request Duration

```promql
# 99th percentile request duration
histogram_quantile(0.99,
  rate(smg_http_request_duration_seconds_bucket[5m]))
```

### Alert Thresholds

| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| Queue utilization | >70% | >90% | Increase queue size or capacity |
| Rejection rate | >5% | >20% | Increase limits or scale workers |
| Avg queue wait | >10s | >30s | Reduce load or increase capacity |
| Queue timeouts | >1/min | >10/min | Investigate bottlenecks |

---

## Client-Side Handling

### Retry Strategy

Clients should implement exponential backoff when receiving 429. SMG does not set a `Retry-After` header today, so clients must compute their own wait:

```python
import time
import requests

def request_with_retry(url, data, max_retries=5):
    for attempt in range(max_retries):
        response = requests.post(url, json=data)

        if response.status_code == 429:
            # SMG does not emit Retry-After; fall back to exponential backoff.
            time.sleep(2 ** attempt)
            continue

        return response

    raise Exception("Max retries exceeded")
```

### Adaptive Rate

Monitor 429 responses and adjust request rate:

```python
class AdaptiveClient:
    def __init__(self, base_rate=10):
        self.rate = base_rate

    def on_success(self):
        self.rate = min(self.rate * 1.1, 100)  # Increase slowly

    def on_rate_limit(self):
        self.rate = self.rate * 0.5  # Decrease quickly
```

---

## What's Next?

### :material-electric-switch: Circuit Breakers

Isolate failing workers to prevent cascade failures.

[Circuit Breakers →](circuit-breakers.md)

### :material-refresh: Retries

Automatic retry with exponential backoff for transient failures.

[Retries →](retries.md)

### :material-heart-pulse: Health Checks

Proactive worker monitoring and failure detection.

[Health Checks →](health-checks.md)

### :material-chart-box: Metrics Reference

Complete list of rate limiting metrics.

[Metrics Reference →](../../reference/metrics.md)