---
title: Circuit Breakers
---
# Circuit Breakers
Circuit breakers prevent cascade failures by stopping traffic to unhealthy workers. They're essential for maintaining system stability when workers fail.
---
## Overview
### :material-electric-switch: Automatic Isolation
Automatically isolate unhealthy workers to prevent cascade failures across your inference fleet.
### :material-lightning-bolt: Fail Fast
Detect failing workers and stop sending traffic immediately—no wasted requests or timeout waits.
### :material-refresh: Self-Healing
Automatically test recovery and restore traffic when workers become healthy again.
### :material-chart-line: Observable
Full Prometheus metrics for state transitions, failure counts, and recovery timing.
---
## Why Circuit Breakers?
Without circuit breakers, a failing worker can cause:
1. **Wasted requests**: Requests sent to failing workers timeout
2. **Increased latency**: Clients wait for timeouts before retry
3. **Resource exhaustion**: Connections pile up to dead workers
4. **Cascade failures**: Retry storms overwhelm remaining workers
Circuit breakers **fail fast**—they detect failing workers and stop sending traffic immediately.
---
## How It Works

Each worker has its own circuit breaker with three states:
### :material-check-circle: Closed State
**Normal operation** - requests flow through.
- Failures increment counter
- Success resets counter to zero
- Opens when failures ≥ threshold
### :material-close-circle: Open State
**Circuit tripped** - requests rejected immediately.
- Worker isolated from pool
- No traffic sent to worker
- Transitions to half-open after timeout
### :material-help-circle: Half-Open State
**Testing recovery** - limited probe requests allowed.
- Success → close circuit
- Any failure → reopen circuit
- Gradual traffic restoration
---
## State Transitions
### Closed → Open
The circuit **opens** when:
```
consecutive_failures >= failure_threshold
```
A single successful request resets `consecutive_failures` to zero. `--cb-window-duration-secs` is accepted and validated but is not yet consumed by the state machine — failures are tracked via a running consecutive-failure counter rather than a sliding window.
### Open → Half-Open
After the circuit has been open for `--cb-timeout-duration-secs`, it transitions to half-open to test if the worker has recovered.
### Half-Open → Closed
If `--cb-success-threshold` consecutive requests succeed, the circuit closes and normal operation resumes.
### Half-Open → Open
If any request fails during half-open, the circuit immediately reopens.
---
## Configuration
```bash
smg \
--worker-urls http://w1:8000 http://w2:8000 \
--cb-failure-threshold 5 \
--cb-success-threshold 2 \
--cb-timeout-duration-secs 30 \
--cb-window-duration-secs 60
```
### Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--cb-failure-threshold` | `10` | Consecutive failures before circuit opens |
| `--cb-success-threshold` | `3` | Successes in half-open state to close circuit |
| `--cb-timeout-duration-secs` | `60` | Seconds before open circuit transitions to half-open |
| `--cb-window-duration-secs` | `120` | Accepted and validated (must be `> 0`) but not yet consumed by the state machine; see note under *Closed → Open* |
| `--disable-circuit-breaker` | `false` | Disable circuit breakers entirely |
### Configuration Examples
#### :material-lightning-bolt: Fast Circuit Opening
Sensitive to failures—isolate workers quickly.
```bash
smg \
--cb-failure-threshold 3 \
--cb-timeout-duration-secs 30
```
**Use when**: Critical availability, latency-sensitive applications
#### :material-shield: Tolerant Configuration
Allow occasional failures before tripping.
```bash
smg \
--cb-failure-threshold 20 \
--cb-success-threshold 5 \
--cb-timeout-duration-secs 120
```
**Use when**: Flaky workers, network instability, batch processing
### Tuning Guidelines
| Scenario | Recommendation |
|----------|---------------|
| **Flaky workers** | Higher `failure_threshold`, shorter `timeout` |
| **Critical availability** | Lower `failure_threshold`, longer `timeout` |
| **Fast recovery workers** | Lower `timeout`, lower `success_threshold` |
| **Slow recovery workers** | Higher `timeout`, higher `success_threshold` |
---
## Example Scenarios
### Normal Operation

### Worker Fails

### Recovery

---
## Monitoring
### Metrics
| Metric | Description |
|--------|-------------|
| `smg_worker_cb_state` | Current state per worker (0=closed, 1=open, 2=half-open) |
| `smg_worker_cb_transitions_total` | State transitions by worker and direction |
| `smg_worker_cb_consecutive_failures` | Current failure count per worker |
| `smg_worker_cb_consecutive_successes` | Current success count per worker |
### Useful PromQL Queries
#### Current States
```promql
# Current circuit breaker states
smg_worker_cb_state
# Workers with open circuits
count(smg_worker_cb_state == 1)
```
#### Transitions
```promql
# State transitions rate
rate(smg_worker_cb_transitions_total[5m])
# Consecutive failures per worker
smg_worker_cb_consecutive_failures
```
### Alert Thresholds
| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| Open circuits | 1 worker | All workers | Investigate worker health |
| Transition rate | >10/min | >50/min | Check for flapping |
| Consecutive failures | >5 | >threshold | Worker likely failing |
### Alerting Example
```yaml
groups:
- name: smg-circuit-breakers
rules:
- alert: CircuitBreakerOpen
expr: smg_worker_cb_state == 1
for: 1m
labels:
severity: warning
annotations:
summary: "Circuit breaker open for {{ $labels.worker }}"
- alert: AllCircuitsOpen
expr: count(smg_worker_cb_state == 1) == count(smg_worker_cb_state)
for: 30s
labels:
severity: critical
annotations:
summary: "All worker circuit breakers are open"
```
---
## Interaction with Other Features
### Retries
When a circuit is **open**:
- Requests are rejected **immediately** (no retry to that worker)
- Other workers may be tried if available
When a circuit is **half-open**:
- Only a limited number of test requests are sent
- Failures don't count against retry budget
### Health Checks
Circuit breakers and health checks work together:
| Health Check | Circuit Breaker | Worker State |
|--------------|-----------------|--------------|
| Passing | Closed | Healthy, receiving traffic |
| Failing | Open | Unhealthy, no traffic |
| Passing | Open | Recovering, limited traffic |
---
## Disabling Circuit Breakers
In some cases, you may want to disable circuit breakers:
```bash
smg --worker-urls http://w1:8000 --disable-circuit-breaker
```
!!! warning "Not Recommended"
Disabling circuit breakers removes an important safety mechanism. Only do this if you have another layer providing similar protection.
---
## What's Next?
### :material-refresh: Retries
Automatic retry with exponential backoff for transient failures.
[Retries →](retries.md)
### :material-heart-pulse: Health Checks
Proactive worker monitoring and failure detection.
[Health Checks →](health-checks.md)
### :material-traffic-light: Rate Limiting
Protect workers from overload with token bucket rate limiting.
[Rate Limiting →](rate-limiting.md)
### :material-chart-box: Metrics Reference
Complete list of circuit breaker metrics.
[Metrics Reference →](../../reference/metrics.md)