---
title: "Resilience Guide"
version: 3.8.40
lastUpdated: 2026-06-28
---

# Resilience Guide

OmniRoute has three distinct but related resilience mechanisms. Each has a different scope and purpose. Keep them separate when debugging routing behavior.

![3-layer resilience model](../diagrams/exported/resilience-3layers.svg)

> Source: [diagrams/resilience-3layers.mmd](../diagrams/resilience-3layers.mmd)

## 1. Provider Circuit Breaker

**Scope:** entire provider (e.g., `glm`, `openai`, `anthropic`).

**Purpose:** stop sending traffic to a provider that is repeatedly failing at the upstream/service level.

**Implementation:**

- Core class: `src/shared/utils/circuitBreaker.ts`
- Wiring: `src/sse/handlers/chatHelpers.ts`, `src/sse/handlers/chat.ts`
- Status API: `GET /api/monitoring/health`
- Reset API: `POST /api/resilience/reset`
- Wrappers: `open-sse/services/accountFallback.ts`
- DB table: `domain_circuit_breakers`

**States:**

- `CLOSED` — normal traffic allowed
- `DEGRADED` — traffic still allowed, but elevated provider failures are being tracked
- `OPEN` — provider temporarily blocked; combo routing skips it
- `HALF_OPEN` — reset timeout elapsed; probe request allowed

**Configurable defaults (`open-sse/config/constants.ts`, exposed in Dashboard → Settings → Resilience):**

| Class   | Degraded at | Opens at    | Reset timeout |
| ------- | ----------- | ----------- | ------------- |
| OAuth   | 5 failures  | 8 failures  | 60s           |
| API-key | 7 failures  | 12 failures | 30s           |
| Local   | derived     | 2 failures  | 15s           |

`degradationThreshold` controls when a provider enters `DEGRADED`; `failureThreshold` controls when it opens and is skipped. Local provider profiles are not exposed on the Resilience settings page yet.

**Trip codes:** only provider-level statuses `[408, 500, 502, 503, 504]`. Do NOT trip for account-level errors (most 401/403/429 — those belong to cooldown or lockout).

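
The thresholds and trip codes above can be sketched as a small classifier. This is only an illustration: `BreakerProfile`, `PROFILES`, `countsTowardBreaker`, and `stateForFailures` are hypothetical names, not the API of `src/shared/utils/circuitBreaker.ts`, and treating each threshold as inclusive (the fifth failure already degrades an OAuth provider) is an assumption.

```typescript
// Hypothetical sketch; names and inclusive thresholds are assumptions,
// not the real circuitBreaker.ts implementation.
type BreakerState = "CLOSED" | "DEGRADED" | "OPEN";

interface BreakerProfile {
  degradationThreshold: number; // failures before entering DEGRADED
  failureThreshold: number; // failures before entering OPEN
  resetTimeoutMs: number; // how long OPEN lasts before a probe is allowed
}

// Values copied from the defaults table. The Local class is left out
// because its degradation threshold is derived rather than fixed.
const PROFILES: Record<"oauth" | "apiKey", BreakerProfile> = {
  oauth: { degradationThreshold: 5, failureThreshold: 8, resetTimeoutMs: 60_000 },
  apiKey: { degradationThreshold: 7, failureThreshold: 12, resetTimeoutMs: 30_000 },
};

// Only provider-level statuses count toward the breaker.
const TRIP_CODES: ReadonlySet<number> = new Set([408, 500, 502, 503, 504]);

function countsTowardBreaker(status: number): boolean {
  return TRIP_CODES.has(status);
}

// Maps a running failure count to the state the breaker would be in.
function stateForFailures(failures: number, profile: BreakerProfile): BreakerState {
  if (failures >= profile.failureThreshold) return "OPEN";
  if (failures >= profile.degradationThreshold) return "DEGRADED";
  return "CLOSED";
}
```

With these values a 429 never moves the breaker, while a run of 503s against an API-key provider degrades it at the seventh failure and opens it at the twelfth.
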
**Lazy recovery:** when `OPEN` expires, `getStatus()`, `canExecute()`, `getRetryAfterMs()` refresh state to `HALF_OPEN`. No background timer needed.

---

## 2. Connection Cooldown

**Scope:** single provider connection/account/key.

**Purpose:** skip one bad key while other connections for the same provider keep serving.

**Implementation:**

- Mark unavailable: `src/sse/services/auth.ts::markAccountUnavailable()`
- Selection: `getProviderCredentials*` in same file
- Cooldown calc: `open-sse/services/accountFallback.ts::checkFallbackError()`
- Settings: `src/lib/resilience/settings.ts`

**Fields per connection:**

- `rateLimitedUntil` — timestamp until cooldown expires
- `testStatus: "unavailable"`
- `lastError`, `lastErrorType`, `errorCode`
- `backoffLevel` — exponential backoff counter

**Default cooldowns:**

- OAuth base: 5s
- API-key base: 3s
- API-key 429: prefers upstream `Retry-After`/reset headers/parseable reset text
- Backoff: `baseCooldownMs * 2 ** failureIndex`

**Anti-thundering-herd guard:** prevents concurrent failures from over-extending cooldown or double-incrementing `backoffLevel`.

**Terminal states (NOT cooldowns):**

- `banned`
- `expired`
- `credits_exhausted`

These persist until credentials change or an operator resets them. Do not overwrite terminal states with transient cooldown state.

**Lazy recovery:** when `rateLimitedUntil` is past, connection becomes eligible again. On successful use, `clearAccountError()` clears all error fields.

---

## 3. Model Lockout

**Scope:** provider + connection + model triple.

**Purpose:** avoid disabling a whole connection when only one model is unavailable or quota-limited.

**Examples:**

- Per-model quota providers returning 429
- Local providers returning 404 for one missing model
- Provider-specific mode/model permission failures (e.g., Grok modes)

**Implementation:** `open-sse/services/accountFallback.ts` — `lockModel()`, `clearModelLock()`, `getAllModelLockouts()`.

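
To make the provider + connection + model scope concrete, here is a minimal sketch of a lockout store keyed by that triple. The function names (`lockModelSketch`, `isModelLockedSketch`), the entry shape, the colon-joined key, and the expire-on-read behavior are all assumptions for illustration; they are not the real signatures of `lockModel()` or `clearModelLock()`.

```typescript
// Hypothetical sketch; not the real API of open-sse/services/accountFallback.ts.
interface LockoutEntrySketch {
  reason: string;
  expiresAt: number; // epoch milliseconds
}

// One entry per provider + connection + model triple.
const lockouts = new Map<string, LockoutEntrySketch>();

function lockKey(provider: string, connectionId: string, model: string): string {
  return `${provider}:${connectionId}:${model}`;
}

function lockModelSketch(
  provider: string,
  connectionId: string,
  model: string,
  reason: string,
  cooldownMs: number,
  now: number = Date.now(),
): void {
  lockouts.set(lockKey(provider, connectionId, model), {
    reason,
    expiresAt: now + cooldownMs,
  });
}

// Returns true only while the lockout for this exact triple is still active.
// Expired entries are dropped on read (assumed behavior).
function isModelLockedSketch(
  provider: string,
  connectionId: string,
  model: string,
  now: number = Date.now(),
): boolean {
  const key = lockKey(provider, connectionId, model);
  const entry = lockouts.get(key);
  if (!entry) return false;
  if (entry.expiresAt <= now) {
    lockouts.delete(key);
    return false;
  }
  return true;
}
```

Because the key includes the model, locking one model on a connection leaves every other model on that connection, and the same model on other connections, selectable.
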
### Model Cooldowns Dashboard (v3.8.0)

UI: Settings → Model Cooldowns (`src/app/(dashboard)/dashboard/settings/components/ModelCooldownsCard.tsx`)

Lists active lockouts with: provider, connection, model, reason, expiresAt. Operators can manually re-enable a model from the card.

**REST API:**

- `GET /api/resilience/model-cooldowns` — list active lockouts
- `DELETE /api/resilience/model-cooldowns` — manual re-enable. Body: `{provider, connection, model}`. Auth: management.

### Lockout settings UI + success-decay recovery (v3.8.23)

Model lockout went from always-on hardcoded behavior to a fully configurable, opt-in feature with its own settings card and a self-healing recovery path.

**Settings card:** Settings → Model Lockout (`src/app/(dashboard)/dashboard/settings/components/ModelLockoutCard.tsx`). This is **distinct** from the read-only `ModelCooldownsCard` above (which only _lists_ active lockouts) — the new card _configures the parameters_. Defaults live in `DEFAULT_MODEL_LOCKOUT_SETTINGS` (`src/lib/resilience/modelLockoutSettings.ts`):

| Setting                 | Default                          | Meaning                                                        |
| ----------------------- | -------------------------------- | -------------------------------------------------------------- |
| `enabled`               | `false`                          | Master toggle — model lockout is **off by default**.           |
| `errorCodes`            | `[403, 404, 429, 502, 503, 504]` | Upstream statuses that count as a model-scoped failure.        |
| `baseCooldownMs`        | `120_000` (120 s)                | Initial lockout duration for the first failure.                |
| `maxCooldownMs`         | `1_800_000` (30 min)             | Cap on the escalated cooldown.                                 |
| `maxBackoffSteps`       | `10`                             | Max exponential-backoff escalation steps.                      |
| `useExponentialBackoff` | `true`                           | Whether repeated failures escalate the cooldown exponentially. |

Settings persist through the normal settings store and validate via the resilience settings schema; the card clamps `baseCooldownMs`/`maxCooldownMs` (with `maxCooldownMs ≥ baseCooldownMs`) and `maxBackoffSteps`.

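
One plausible way the settings above combine into a lockout duration is sketched below. The doubling rule is an assumption borrowed from the connection-cooldown formula (`baseCooldownMs * 2 ** failureIndex`); the actual escalation in `accountFallback.ts` may differ. `ModelLockoutSettingsSketch`, `DEFAULTS_SKETCH`, and `lockoutCooldownMs` are hypothetical names.

```typescript
// Hypothetical sketch of cooldown escalation; the doubling rule is assumed.
interface ModelLockoutSettingsSketch {
  baseCooldownMs: number;
  maxCooldownMs: number;
  maxBackoffSteps: number;
  useExponentialBackoff: boolean;
}

// Values copied from the defaults table above.
const DEFAULTS_SKETCH: ModelLockoutSettingsSketch = {
  baseCooldownMs: 120_000,
  maxCooldownMs: 1_800_000,
  maxBackoffSteps: 10,
  useExponentialBackoff: true,
};

// failureCount is 1 for the first failure, 2 for the second, and so on.
function lockoutCooldownMs(
  failureCount: number,
  settings: ModelLockoutSettingsSketch = DEFAULTS_SKETCH,
): number {
  const { baseCooldownMs, maxCooldownMs, maxBackoffSteps, useExponentialBackoff } = settings;
  if (!useExponentialBackoff || failureCount <= 1) {
    return Math.min(baseCooldownMs, maxCooldownMs);
  }
  // Escalate one step per repeated failure, bounded by maxBackoffSteps.
  const steps = Math.min(failureCount - 1, maxBackoffSteps);
  // Never exceed the configured cap.
  return Math.min(baseCooldownMs * 2 ** steps, maxCooldownMs);
}
```

Under these assumptions the default durations run 120 s, 240 s, 480 s, 960 s, and the fifth consecutive failure already hits the 30-minute cap, so `maxCooldownMs` binds well before `maxBackoffSteps` does.
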
**Success-decay recovery:** recovery is **not** purely timer expiry. A healthy response walks the model's failure count back down so a model that recovered mid-window stops escalating (and clears) before its timer would. On a successful combo target, `open-sse/services/combo.ts` calls `decayModelFailureCount()` (`open-sse/services/accountFallback.ts`), which **halves** the stored `failureCount` (`Math.floor(failureCount / 2)`); when it reaches `0` the lockout entry is deleted entirely. The counterpart `recordModelLockoutFailure()` increments the count (and escalates the cooldown) on failures within the escalation window. This success-decay is in addition to plain timer expiry — either path can re-enable a model.

**State:** lockouts are held **in-memory** (per-process `Map`s of `ModelLockoutEntry` keyed by `provider:connectionId:model`), not persisted to the DB — they are lost on restart. The _settings_ are persisted; the active lockout _state_ is ephemeral.

---

## 4. Quota-Share Concurrency Control (v3.8.36)

Subscription accounts (GLM, MiniMax, etc.) often accept only ~1–3 concurrent requests; exceeding that triggers 429s and cooldowns. This is acute under **quota-share** (`qtSd/…`) combos, where several API keys share one upstream account. Three layers keep a shared account from being flooded.

### Per-connection concurrency cap (`max_concurrent`)

Each provider connection can declare a `max_concurrent` ceiling (`provider_connections.max_concurrent`, set in the connection modal / API / DB). Leave it empty for no limit. This is the single knob that drives the serialization layer below — set it to the account's real concurrency (e.g. GLM ~1, MiniMax ~2).

### Quota-share request serialization

When a quota-share dispatch targets a connection that declares a positive `max_concurrent`, concurrent requests to that **account** are serialized through a per-connection semaphore (key `qsconn:`): excess requests **wait in the queue** instead of flooding the account.
It is **fail-open** — a saturated queue or timeout proceeds without a slot rather than ever rejecting a dispatchable request. Toggle in **Settings → Resilience → Quota-share per-connection concurrency** (`resilienceSettings.quotaShareConcurrencyLimit.enabled`, default on). Without a `max_concurrent` cap the behavior is unchanged.

> The quota-share routing gate (`selectQuotaShareTarget`, DRR + P2C) is itself
> fail-open and only _deprioritizes_ an at-cap connection — with a
> single-connection pool it cannot hard-limit, so this semaphore is what actually
> contains the flood.

### Combo cooldown-aware retry

For quota-share combos only, a request that would crystallize a 429 for a SHORT transient cooldown waits it out and re-dispatches instead of returning the 429. Bounded by `comboCooldownWait` (`enabled`, `maxWaitMs` 5s, `maxAttempts` 2, `budgetMs` 8s) in **Settings → Resilience**. It never waits on `quota_exhausted` (locked until midnight) or auth/not-found reasons.

---

## Other Resilience Features

- **17 routing strategies** (priority, weighted, round-robin, context-relay, fill-first, p2c, random, least-used, cost-optimized, reset-aware, reset-window, headroom, strict-random, auto, lkgp, context-optimized, fusion) — see [AUTO-COMBO.md](../routing/AUTO-COMBO.md).
- **Reset-aware routing** (v3.8.0) — prioritizes connections by quota reset time.
- **Background mode degradation** — Responses API `background: true` degraded to sync with warning.
- **Dynamic tool limit detection** — backs off providers when tool count limits hit.
- **Emergency fallback** — controlled by `OMNIROUTE_EMERGENCY_FALLBACK`; operators can override it from the Feature Flags page without a restart.

---

## Debugging

- All keys for a provider skipped → check both circuit breaker state AND each connection's `rateLimitedUntil`/`testStatus`.
- Provider permanently excluded after reset window → code reading raw `state` instead of `getStatus()`/`canExecute()`.
- One key fails, others should work → prefer connection cooldown over circuit breaker.
- Only one model fails → prefer model lockout over connection cooldown.
- State should self-recover but doesn't → check for future timestamp + read path that refreshes expired state. Permanent statuses require manual changes.

---

## TLS Fingerprinting & Stealth

Provider-specific stealth (JA3/JA4, CCH, obfuscation) is separately documented — see [STEALTH_GUIDE.md](../security/STEALTH_GUIDE.md).

---

## Resilience testing (Phase 8 · Block C)

Beyond the unit tests for the resilience logic, three tests exercise the runtime under real stress and failure (all are integration/nightly; none blocks a PR):

| Test        | What                                                                                                                                                                               | Run                                      |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------- |
| Chaos       | A fake-upstream node injects real latency/reset/timeout/503; validates that the circuit breaker opens and recovers and that `checkFallbackError` classifies 503 as a recoverable fallback. | `RUN_CHAOS_INT=1 npm run test:chaos`     |
| Heap-growth | ~500 streams through `createSSEStream` under `--expose-gc`; fails if the heap grows beyond the ceiling (OOM guard, #3069).                                                         | `npm run test:heap`                      |
| k6 soak     | Sustained load against `/api/monitoring/health`; p95/error thresholds.                                                                                                             | `k6 run tests/load/k6-soak.js` (nightly) |

Orchestrated by `.github/workflows/nightly-resilience.yml` (cron + dispatch). In the default `test:integration` run, the chaos and heap tests skip themselves (no `RUN_CHAOS_INT`/`--expose-gc`).

---

## See Also

- [Architecture Guide](./ARCHITECTURE.md) — System architecture and internals
- [User Guide](../guides/USER_GUIDE.md) — Providers, combos, CLI integration
- [Auto-Combo Engine](../routing/AUTO-COMBO.md) — 12-factor scoring, mode packs