---
title: Health Checks
---
# Health Checks
Background health checks continuously monitor worker availability, removing unhealthy workers from the selection pool before they can cause request failures.
---
## Overview
### :material-heart-pulse: Proactive Monitoring
Detect worker failures before they impact requests—not after.
### :material-shield-check: Automatic Isolation
Unhealthy workers are removed from the pool without manual intervention.
### :material-refresh: Self-Healing
Workers automatically rejoin the pool when they recover.
### :material-tune: Configurable Sensitivity
Tune detection speed vs. tolerance for temporary issues.
---
## Why Health Checks?
Without proactive health checks:
- **Reactive detection**: Failures only discovered when real requests fail
- **Wasted requests**: Multiple requests may fail before worker is marked unhealthy
- **Slower recovery**: No way to know when a worker has recovered without trying it
With health checks:
- **Proactive detection**: Unhealthy workers removed before they cause failures
- **Fast recovery**: Workers rejoin the pool as soon as they're healthy
- **No wasted requests**: Real requests only go to verified healthy workers
---
## How It Works
SMG periodically sends an HTTP request to each worker's health endpoint and updates the worker's state based on the result.

### Worker States
| State | Meaning | Traffic |
|-------|---------|---------|
| **Pending** | Freshly registered, not yet verified | No requests |
| **Ready** | Passing health checks | Receives requests |
| **NotReady** | Consecutive probe failures reached the readiness threshold | No requests |
| **Failed** | Consecutive failures reached the liveness threshold, or `Pending` ran out of probe attempts | Terminal — receives no requests and is not probed further |
The `smg_worker_health` gauge collapses these to `1` (Ready) and `0` (anything else), so existing dashboards continue to work.
### State Transitions
**Pending → Ready**: When consecutive successful probes reach `--health-success-threshold`.
**Pending → Failed**: If the worker accumulates `10 × --health-failure-threshold` total probes without ever reaching the success threshold (prevents misconfigured URLs from lingering forever).
**Ready → NotReady**: When consecutive failed probes reach `--health-failure-threshold`.
**NotReady → Ready**: When consecutive successful probes reach `--health-success-threshold`.
**NotReady → Failed**: When consecutive failures reach `3 × --health-failure-threshold` (the liveness threshold — analogous to a Kubernetes liveness probe, tolerating longer outages than the readiness threshold).
**Failed is terminal**: Successful probes do not recover a `Failed` worker. A failed worker is removed via `--remove-unhealthy-workers` or requires manual re-registration.
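
The transitions above can be sketched as a small state machine. This is a minimal illustration, not SMG's implementation: the `WorkerHealth` class and its field names are hypothetical, and it assumes the consecutive-failure counter keeps running across the `Ready → NotReady` transition so that it can reach the liveness threshold.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PENDING = "Pending"
    READY = "Ready"
    NOT_READY = "NotReady"
    FAILED = "Failed"


@dataclass
class WorkerHealth:
    """Hypothetical per-worker tracker; names are illustrative only."""
    failure_threshold: int = 3   # --health-failure-threshold
    success_threshold: int = 2   # --health-success-threshold
    state: State = State.PENDING
    successes: int = 0           # consecutive successful probes
    failures: int = 0            # consecutive failed probes
    total_probes: int = 0        # all probes so far (used while Pending)

    def record(self, ok: bool) -> State:
        """Apply one probe result and return the resulting state."""
        if self.state is State.FAILED:
            return self.state    # terminal: successes do not recover it
        self.total_probes += 1
        if ok:
            self.successes += 1
            self.failures = 0
        else:
            self.failures += 1
            self.successes = 0

        if self.state is State.PENDING:
            if self.successes >= self.success_threshold:
                self.state = State.READY
            elif self.total_probes >= 10 * self.failure_threshold:
                self.state = State.FAILED
        elif self.state is State.READY:
            if self.failures >= self.failure_threshold:
                self.state = State.NOT_READY
        elif self.state is State.NOT_READY:
            if self.successes >= self.success_threshold:
                self.state = State.READY
            elif self.failures >= 3 * self.failure_threshold:
                self.state = State.FAILED
        return self.state

    def gauge(self) -> int:
        """Value reported by smg_worker_health: 1 for Ready, 0 otherwise."""
        return 1 if self.state is State.READY else 0
```

With the default thresholds, two successful probes move a new worker to `Ready`, three consecutive failures move it to `NotReady`, and six further consecutive failures (nine in total) move it to `Failed`.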
---
## Configuration
```bash
smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --health-check-interval-secs 60 \
  --health-failure-threshold 3 \
  --health-success-threshold 2 \
  --health-check-timeout-secs 5 \
  --health-check-endpoint /health
```
### Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--health-check-interval-secs` | `60` | Interval between health checks, in seconds |
| `--health-failure-threshold` | `3` | Consecutive failed probes before a `Ready` worker becomes `NotReady` |
| `--health-success-threshold` | `2` | Consecutive successful probes before a `Pending` or `NotReady` worker becomes `Ready` |
| `--health-check-timeout-secs` | `5` | Timeout for each health check request, in seconds |
| `--health-check-endpoint` | `/health` | Endpoint path for health checks |
| `--disable-health-check` | `false` | Disable background health checks |
| `--remove-unhealthy-workers` | `false` | Submit a removal job when a worker reaches the terminal `Failed` state |
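
To see how these parameters combine, the following sketch estimates rough worst-case time windows. It is an approximation under stated assumptions rather than a description of SMG's scheduler: probes are assumed to fire at a fixed interval, a fault is assumed to begin just after a successful probe, and the final failing probe is assumed to take the full timeout. The function name `health_check_windows` is hypothetical.

```python
def health_check_windows(
    interval_secs: int = 60,      # --health-check-interval-secs
    failure_threshold: int = 3,   # --health-failure-threshold
    success_threshold: int = 2,   # --health-success-threshold
    timeout_secs: int = 5,        # --health-check-timeout-secs
) -> dict[str, int]:
    """Approximate upper bounds, in seconds, for each state change."""
    return {
        # Fault onset until the worker stops receiving traffic.
        "ready_to_not_ready": failure_threshold * interval_secs + timeout_secs,
        # Fault onset until the liveness threshold (3x) is reached.
        "ready_to_failed": 3 * failure_threshold * interval_secs + timeout_secs,
        # Recovery until the worker receives traffic again.
        "not_ready_to_ready": success_threshold * interval_secs,
        # How long a never-healthy Pending worker is probed before giving up.
        "pending_give_up": 10 * failure_threshold * interval_secs,
    }


if __name__ == "__main__":
    for name, secs in health_check_windows().items():
        print(f"{name}: ~{secs}s")
```

Under these assumptions the defaults give about 185 seconds until a failing worker leaves the pool, about 545 seconds until it reaches `Failed`, and about 120 seconds for a recovered worker to rejoin.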
---
## Recommended Configurations
### :material-lightning-bolt: Fast Detection
Sensitive to failures—detect issues quickly.
```bash
smg \
  --health-check-interval-secs 10 \
  --health-failure-threshold 2 \
  --health-check-timeout-secs 3
```
**Use when**: Critical availability, rapid failure response needed
### :material-shield: Conservative Detection
Tolerant of network blips.
```bash
smg \
  --health-check-interval-secs 120 \
  --health-failure-threshold 5 \
  --health-success-threshold 3
```
**Use when**: Flaky networks, workers with occasional slow responses
### :material-server-network: Production Balanced
Balanced detection for typical deployments.
```bash
smg \
  --health-check-interval-secs 30 \
  --health-failure-threshold 3 \
  --health-success-threshold 2 \
  --health-check-timeout-secs 5
```
**Use when**: Standard production environments
### :material-close-circle: No Health Checks
Disable health checks entirely.
```bash
smg --disable-health-check
```
**Use when**: External health monitoring, testing scenarios
---
## Worker Health Endpoint
SMG expects workers to provide a health endpoint that returns:
- **2xx status code**: Worker is healthy
- **Any other status or timeout**: Worker is unhealthy
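
The rule above can be illustrated with a single probe written against Python's standard library. This is a sketch for reasoning about or testing a worker's endpoint, not SMG's probe code: the `probe` and `status_is_healthy` names are hypothetical, and `urllib` follows redirects automatically, which may differ from how SMG treats 3xx responses.

```python
import http.client
import urllib.error
import urllib.request


def status_is_healthy(status: int) -> bool:
    """2xx means healthy; every other status means unhealthy."""
    return 200 <= status < 300


def probe(base_url: str, endpoint: str = "/health", timeout_secs: float = 5.0) -> bool:
    """Send one GET to the worker's health endpoint and classify the outcome."""
    url = base_url.rstrip("/") + endpoint
    try:
        with urllib.request.urlopen(url, timeout=timeout_secs) as resp:
            return status_is_healthy(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; classify by status code anyway.
        return status_is_healthy(err.code)
    except (OSError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, or a malformed response.
        return False
```

For example, `probe("http://w1:8000")` returns `True` only if the worker answers `/health` with a 2xx status within five seconds.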
### Example Health Endpoint (vLLM)
vLLM workers expose `/health` by default:
```bash
# vLLM automatically provides /health endpoint
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```
### Example Health Endpoint (SGLang)
SGLang workers expose `/health` by default:
```bash
# SGLang automatically provides /health endpoint
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 8000
```
### Custom Health Endpoint
If your worker uses a different health endpoint:
```bash
smg \
  --worker-urls http://worker:8000 \
  --health-check-endpoint /api/health
```
---
## Interaction with Circuit Breakers
Health checks and circuit breakers work together for comprehensive fault detection:
| Health Check | Circuit Breaker | Worker State |
|--------------|-----------------|--------------|
| Passing | Closed | Healthy, receiving traffic |
| Failing | Open | Unhealthy, no traffic |
| Passing | Open | Recovering, limited traffic (half-open) |
**Key differences**:
- **Health checks**: Proactive background monitoring (no request impact)
- **Circuit breakers**: Reactive detection based on real request failures
Both are recommended for production deployments.
---
## Monitoring
### Metrics
| Metric | Description |
|--------|-------------|
| `smg_worker_health_checks_total` | Health check results by worker type and result |
| `smg_worker_health` | Current health status per worker (`1` = `Ready`, `0` = any other state) |
### Useful PromQL Queries
#### Health Status
```promql
# Current health status per worker
smg_worker_health
# Count of unhealthy workers
count(smg_worker_health == 0)
```
#### Check Results
```promql
# Health check success rate across all workers
# (both sides are aggregated so their label sets match)
sum(rate(smg_worker_health_checks_total{result="success"}[5m])) /
sum(rate(smg_worker_health_checks_total[5m]))
# Failed checks per minute
rate(smg_worker_health_checks_total{result="failure"}[1m]) * 60
```
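
For scripted checks outside of dashboards, the same gauge can be read through the Prometheus HTTP API. The sketch below parses the JSON body of an instant query for `smg_worker_health` (for example, from `/api/v1/query?query=smg_worker_health`) and lists workers reporting `0`. It assumes the gauge carries a `worker` label; the `unhealthy_workers` function is a hypothetical helper.

```python
import json


def unhealthy_workers(response_body: str) -> list[str]:
    """Return the `worker` label of every series whose gauge value is 0."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    names = []
    for sample in payload["data"]["result"]:
        # Instant-vector samples look like {"metric": {...}, "value": [ts, "1"]}.
        _timestamp, value = sample["value"]
        if float(value) == 0.0:
            # Assumption: the series is labelled with `worker`.
            names.append(sample["metric"].get("worker", "<unknown>"))
    return sorted(names)
```

The function takes the raw response text, so it can be paired with any HTTP client.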
### Alert Thresholds
| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| Unhealthy workers | 1 worker | >50% workers | Investigate worker health |
| Health check success rate | <90% | <70% | Check network connectivity |
| Check duration | >timeout/2 | >timeout | Workers may be overloaded |
### Alerting Example
```yaml
groups:
  - name: smg-health-checks
    rules:
      - alert: WorkerUnhealthy
        expr: smg_worker_health == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Worker {{ $labels.worker }} is unhealthy"
      - alert: MajorityUnhealthy
        expr: count(smg_worker_health == 0) > count(smg_worker_health) / 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Majority of workers are unhealthy"
```
---
## Tuning Guidelines
| Symptom | Potential Adjustment |
|---------|---------------------|
| Workers marked unhealthy too quickly | Increase `--health-failure-threshold` |
| Slow failure detection | Decrease `--health-check-interval-secs` |
| Health checks timing out | Increase `--health-check-timeout-secs` |
| Workers slow to rejoin | Decrease `--health-success-threshold` |
| Too many health check requests | Increase `--health-check-interval-secs` |
---
## What's Next?
### :material-electric-switch: Circuit Breakers
Reactive failure detection based on real request failures.
[Circuit Breakers →](circuit-breakers.md)
### :material-refresh: Retries
Automatic retry with exponential backoff for transient failures.
[Retries →](retries.md)
### :material-power: Graceful Shutdown
Allow in-flight requests to complete during shutdown.
[Graceful Shutdown →](graceful-shutdown.md)