---
title: Health Checks
---

# Health Checks

Background health checks continuously monitor worker availability, removing unhealthy workers from the selection pool before they can cause request failures.

---

## Overview

### :material-heart-pulse: Proactive Monitoring

Detect worker failures before they impact requests, not after.

### :material-shield-check: Automatic Isolation

Unhealthy workers are removed from the pool without manual intervention.

### :material-refresh: Self-Healing

Workers automatically rejoin the pool when they recover.

### :material-tune: Configurable Sensitivity

Tune detection speed vs. tolerance for temporary issues.

---

## Why Health Checks?

Without proactive health checks:

- **Reactive detection**: Failures only discovered when real requests fail
- **Wasted requests**: Multiple requests may fail before worker is marked unhealthy
- **Slower recovery**: No way to know when a worker has recovered without trying it

With health checks:

- **Proactive detection**: Unhealthy workers removed before they cause failures
- **Fast recovery**: Workers rejoin the pool as soon as they're healthy
- **No wasted requests**: Real requests only go to verified healthy workers

---

## How It Works

SMG sends periodic HTTP requests to each worker's health endpoint:

![Health Check Sequence Diagram](../../assets/images/health-checks-flow.svg)
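A single probe can be sketched as follows. This is an illustrative Python sketch, not SMG's implementation: the function names `is_healthy` and `probe` are hypothetical, and the classification follows the rule stated under Worker Health Endpoint (2xx is healthy; any other status or a timeout is unhealthy).

```python
import http.client
import urllib.error
import urllib.request
from typing import Optional


def is_healthy(status: Optional[int]) -> bool:
    """2xx is healthy; any other status, or None (timeout/connection error), is not."""
    return status is not None and 200 <= status < 300


def probe(base_url: str, endpoint: str = "/health", timeout_secs: float = 5.0) -> bool:
    """Send one GET to the worker's health endpoint and classify the result.

    Hypothetical helper: the defaults mirror --health-check-endpoint and
    --health-check-timeout-secs, but this is not SMG code.
    """
    url = base_url.rstrip("/") + endpoint
    try:
        with urllib.request.urlopen(url, timeout=timeout_secs) as resp:
            status: Optional[int] = resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the status code is still available.
        status = err.code
    except (OSError, http.client.HTTPException):
        # Covers timeouts, refused connections, DNS failures and malformed replies.
        status = None
    return is_healthy(status)
```

Note that `urllib` follows redirects, so a 3xx response that leads to a 2xx page would count as healthy in this sketch; a stricter probe would disable redirect handling.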

### Worker States

| State | Meaning | Traffic |
|-------|---------|---------|
| **Pending** | Freshly registered, not yet verified | No requests |
| **Ready** | Passing health checks | Receives requests |
| **NotReady** | Consecutive probe failures reached the readiness threshold | No requests |
| **Failed** | Consecutive failures reached the liveness threshold, or `Pending` ran out of probe attempts | Terminal: receives no requests and is not probed further |

The `smg_worker_health` gauge collapses these to `1` (Ready) and `0` (anything else), so existing dashboards continue to work.

### State Transitions

**Pending → Ready**: When consecutive successful probes reach `--health-success-threshold`.

**Pending → Failed**: If the worker accumulates `10 × --health-failure-threshold` total probes without ever reaching the success threshold (prevents misconfigured URLs from lingering forever).

**Ready → NotReady**: When consecutive failed probes reach `--health-failure-threshold`.

**NotReady → Ready**: When consecutive successful probes reach `--health-success-threshold`.

**NotReady → Failed**: When consecutive failures reach `3 × --health-failure-threshold` (the liveness threshold, analogous to a Kubernetes liveness probe, which tolerates longer outages than the readiness threshold).

**Failed is terminal**: Successful probes do not recover a `Failed` worker. A failed worker is removed via `--remove-unhealthy-workers` or must be re-registered manually.
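The transitions above can be sketched as a small state machine. This is a minimal Python illustration under stated assumptions, not SMG's actual code: the class and field names are hypothetical, the consecutive-failure counter is assumed to carry over from `Ready` into `NotReady`, and `Pending` is assumed to fail only by exhausting its probe budget.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PENDING = "Pending"
    READY = "Ready"
    NOT_READY = "NotReady"
    FAILED = "Failed"


@dataclass
class WorkerHealth:
    """Hypothetical tracker for one worker; defaults mirror the CLI defaults."""
    failure_threshold: int = 3   # --health-failure-threshold
    success_threshold: int = 2   # --health-success-threshold
    state: State = State.PENDING
    successes: int = 0           # consecutive successful probes
    failures: int = 0            # consecutive failed probes
    pending_probes: int = 0      # total probes seen while Pending

    def record(self, ok: bool) -> State:
        """Apply one probe result and return the resulting state."""
        if self.state is State.FAILED:
            return self.state  # terminal: later successes do not recover it
        if ok:
            self.successes, self.failures = self.successes + 1, 0
        else:
            self.failures, self.successes = self.failures + 1, 0

        if self.state is State.PENDING:
            self.pending_probes += 1
            if self.successes >= self.success_threshold:
                self.state = State.READY
            elif self.pending_probes >= 10 * self.failure_threshold:
                self.state = State.FAILED  # probe budget exhausted
        elif self.state is State.READY:
            if self.failures >= self.failure_threshold:
                self.state = State.NOT_READY
        elif self.state is State.NOT_READY:
            if self.successes >= self.success_threshold:
                self.state = State.READY
            elif self.failures >= 3 * self.failure_threshold:
                self.state = State.FAILED  # liveness threshold reached
        return self.state

    def gauge(self) -> int:
        """Value the smg_worker_health gauge would report: 1 only for Ready."""
        return 1 if self.state is State.READY else 0
```

With the default thresholds, a `Ready` worker in this sketch becomes `NotReady` on its third consecutive failure and `Failed` on its ninth.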
---

## Configuration

```bash
smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --health-check-interval-secs 60 \
  --health-failure-threshold 3 \
  --health-success-threshold 2 \
  --health-check-timeout-secs 5 \
  --health-check-endpoint /health
```

### Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--health-check-interval-secs` | `60` | Interval between health checks |
| `--health-failure-threshold` | `3` | Consecutive failures before marking unhealthy |
| `--health-success-threshold` | `2` | Consecutive successes to mark healthy again |
| `--health-check-timeout-secs` | `5` | Timeout for each health check request |
| `--health-check-endpoint` | `/health` | Endpoint path for health checks |
| `--disable-health-check` | `false` | Disable background health checks |
| `--remove-unhealthy-workers` | `false` | Submit a removal job when a worker reaches the terminal `Failed` state |

---

## Recommended Configurations

### :material-lightning-bolt: Fast Detection

Sensitive to failures; detects issues quickly.

```bash
smg \
  --health-check-interval-secs 10 \
  --health-failure-threshold 2 \
  --health-check-timeout-secs 3
```

**Use when**: Critical availability, rapid failure response needed

### :material-shield: Conservative Detection

Tolerant of network blips.

```bash
smg \
  --health-check-interval-secs 120 \
  --health-failure-threshold 5 \
  --health-success-threshold 3
```

**Use when**: Flaky networks, workers with occasional slow responses

### :material-server-network: Production Balanced

Balanced detection for typical deployments.

```bash
smg \
  --health-check-interval-secs 30 \
  --health-failure-threshold 3 \
  --health-success-threshold 2 \
  --health-check-timeout-secs 5
```

**Use when**: Standard production environments

### :material-close-circle: No Health Checks

Disable health checks entirely.

```bash
smg --disable-health-check
```

**Use when**: External health monitoring, testing scenarios
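To compare the profiles above, it can help to estimate how long a broken worker keeps receiving traffic. The following Python sketch is a back-of-the-envelope model, not a statement of SMG's scheduling: it assumes probes start on a fixed interval, that the best case is a worker breaking just before a probe and failing fast, and that the worst case is a worker breaking just after a probe with the final failing probe running into its timeout. `Profile` and its methods are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Profile:
    """Health check settings; defaults mirror the Parameters table."""
    name: str
    interval_secs: int = 60      # --health-check-interval-secs
    failure_threshold: int = 3   # --health-failure-threshold
    success_threshold: int = 2   # --health-success-threshold
    timeout_secs: int = 5        # --health-check-timeout-secs

    def not_ready_window(self) -> Tuple[int, int]:
        """(best, worst) seconds from a worker breaking to Ready -> NotReady."""
        # Best case: only the gaps between the N failing probes elapse.
        best = (self.failure_threshold - 1) * self.interval_secs
        # Worst case: a full interval passes before the first failing probe,
        # and the last failing probe resolves only when its timeout expires.
        worst = self.failure_threshold * self.interval_secs + self.timeout_secs
        return best, worst

    def rejoin_secs(self) -> int:
        """Upper estimate for NotReady -> Ready, assuming fast successful probes."""
        return self.success_threshold * self.interval_secs


PROFILES = [
    Profile("fast", interval_secs=10, failure_threshold=2, timeout_secs=3),
    Profile("conservative", interval_secs=120, failure_threshold=5, success_threshold=3),
    Profile("balanced", interval_secs=30),
]

if __name__ == "__main__":
    for p in PROFILES:
        best, worst = p.not_ready_window()
        print(f"{p.name:<12} NotReady after {best}-{worst}s, "
              f"rejoin within about {p.rejoin_secs()}s")
```

Under these assumptions the fast profile isolates a broken worker in roughly 10 to 23 seconds, the balanced profile in 60 to 95 seconds, and the conservative profile in 480 to 605 seconds.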

---

## Worker Health Endpoint

SMG expects workers to provide a health endpoint that returns:

- **2xx status code**: Worker is healthy
- **Any other status or timeout**: Worker is unhealthy

### Example Health Endpoint (vLLM)

vLLM workers expose `/health` by default:

```bash
# vLLM automatically provides /health endpoint
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```

### Example Health Endpoint (SGLang)

SGLang workers expose `/health` by default:

```bash
# SGLang automatically provides /health endpoint
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 8000
```

### Custom Health Endpoint

If your worker uses a different health endpoint:

```bash
smg \
  --worker-urls http://worker:8000 \
  --health-check-endpoint /api/health
```

---

## Interaction with Circuit Breakers

Health checks and circuit breakers work together for comprehensive fault detection:

| Health Check | Circuit Breaker | Worker State |
|--------------|-----------------|--------------|
| Passing | Closed | Healthy, receiving traffic |
| Failing | Open | Unhealthy, no traffic |
| Passing | Open | Recovering, limited traffic (half-open) |

**Key differences**:

- **Health checks**: Proactive background monitoring (no request impact)
- **Circuit breakers**: Reactive detection based on real request failures

Both are recommended for production deployments.

---

## Monitoring

### Metrics

| Metric | Description |
|--------|-------------|
| `smg_worker_health_checks_total` | Health check results by worker type and result |
| `smg_worker_health` | Current health status per worker (1=healthy, 0=unhealthy) |

### Useful PromQL Queries

#### Health Status

```promql
# Current health status per worker
smg_worker_health

# Count of unhealthy workers
count(smg_worker_health == 0)
```

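#### Check Results

```promql
# Health check success rate across all workers.
# Both sides are aggregated with sum() because PromQL division matches series
# on identical label sets; without it, only result="success" series pair up
# and the ratio is always 1.
sum(rate(smg_worker_health_checks_total{result="success"}[5m]))
/
sum(rate(smg_worker_health_checks_total[5m]))

# Failed checks per minute
rate(smg_worker_health_checks_total{result="failure"}[1m]) * 60
```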

### Alert Thresholds

| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| Unhealthy workers | 1 worker | >50% workers | Investigate worker health |
| Health check success rate | <90% | <70% | Check network connectivity |
| Check duration | >timeout/2 | >timeout | Workers may be overloaded |

### Alerting Example

```yaml
groups:
  - name: smg-health-checks
    rules:
      - alert: WorkerUnhealthy
        expr: smg_worker_health == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Worker {{ $labels.worker }} is unhealthy"

      - alert: MajorityUnhealthy
        expr: count(smg_worker_health == 0) > count(smg_worker_health) / 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Majority of workers are unhealthy"
```

---

## Tuning Guidelines

| Symptom | Potential Adjustment |
|---------|---------------------|
| Workers marked unhealthy too quickly | Increase `--health-failure-threshold` |
| Slow failure detection | Decrease `--health-check-interval-secs` |
| Health checks timing out | Increase `--health-check-timeout-secs` |
| Workers slow to rejoin | Decrease `--health-success-threshold` |
| Too many health check requests | Increase `--health-check-interval-secs` |

---

## What's Next?

### :material-electric-switch: Circuit Breakers

Reactive failure detection based on real request failures.

[Circuit Breakers →](circuit-breakers.md)

### :material-refresh: Retries

Automatic retry with exponential backoff for transient failures.

[Retries →](retries.md)

### :material-power: Graceful Shutdown

Allow in-flight requests to complete during shutdown.

[Graceful Shutdown →](graceful-shutdown.md)