---
name: eng-fallback-model-cascade
description: Use when designing the model-selection and fallback logic for a legal AI product, defining which model to use for which skill tier, how to cascade to a cheaper or faster model when the primary model is unavailable or over budget, and how to handle failures gracefully without exposing errors to legal practitioners. Engineering skill with direct impact on availability SLOs and cost management.
license: MIT
metadata:
  id: eng.fallback-model-cascade
  category: eng
  jurisdictions: [__multi__]
  priority: P2
  intent: [fallback, model-selection, availability, resilience, cascade]
  related:
    - eng-latency-slo-by-skill
    - eng-cost-per-message-tracker
    - eng-context-cache-key-design
    - eng-feature-flag-rollout-skills
  source: Louis - HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"
---

# Fallback Model Cascade

## What it does

A fallback model cascade is the routing logic that selects which LLM to use for a given request and what to do when the preferred model is unavailable, slow, or over the cost budget. In a legal AI product, where lawyers depend on reliable responses for deadline-critical work, a cascade must:

1. **Serve the best model for the skill tier** (heavier skills get the capable model; lightweight skills can run on a faster, cheaper model).
2. **Degrade gracefully** when the primary model returns a 529 (overload), 500, or timeout.
3. **Stay within cost budgets** if a tenant is on a metered plan.
4. **Be transparent** to the user when a fallback occurred, because a legal practitioner may care that a "draft NDA" was produced by a less capable model.

## Model tier definitions

For a Claude-based legal AI product, define three tiers:

| Tier | Models (examples) | Use case |
|---|---|---|
| Tier 1 (Primary) | claude-opus-4, claude-sonnet-4-6 | Complex drafting, legal analysis, multi-step reasoning, P0 skills |
| Tier 2 (Secondary) | claude-haiku-3-5, claude-sonnet-3-7 | Shorter form content, classification, routing, summarization |
| Tier 3 (Minimal) | Fast inference models, local models | Intent classification, short Q&A, health checks |

Assign each skill a **minimum tier**:

- `efirm-conflict-check`, `efirm-engagement-letter-draft`: Tier 1 minimum (legal accuracy critical).
- `efirm-client-update-email-draft`, `efirm-deadline-tracker`: Tier 2 acceptable.
- Routing/classification skills: Tier 3 acceptable.

## Cascade sequence

```
For a given request:
1. Determine minimum_tier from skill configuration
2. Try primary_model for that tier
   → On success: use response
   → On 429 (rate limit): wait retry_delay; retry once; then cascade to next model
   → On 529 (overload): immediately cascade (no wait)
   → On 500: retry once after 2s; then cascade
   → On 503: immediately cascade
   → On timeout (> latency_slo_ms + 2000): cascade
3. Try secondary_model (if tier permits)
   → Same error handling
4. If all models fail: return structured error (see below)
```

## Cascade configuration (per skill)

```yaml
skills:
  efirm-conflict-check:
    min_tier: 1
    preferred_model: claude-sonnet-4-6
    fallback_models:
      - claude-opus-4       # higher capability fallback (if primary is overloaded)
      - claude-haiku-3-5    # cost fallback only if budget exceeded
    fallback_disclosure: true   # Tell user which model was used
    budget_fallback: true       # Allow Tier 2 (below min_tier) if org is over budget
  efirm-deadline-tracker:
    min_tier: 2
    preferred_model: claude-haiku-3-5
    fallback_models:
      - claude-sonnet-4-6
    fallback_disclosure: false
    budget_fallback: false
```

## Error handling

When all cascade options are exhausted:

```json
{
  "error": {
    "code": "MODEL_UNAVAILABLE",
    "message": "Legal AI is temporarily unavailable. All models in the cascade have returned errors.",
    "user_message": "Our AI service is temporarily experiencing high demand, and your request could not be completed. Please try again in 2–3 minutes, or contact support if this persists.",
    "retry_after_seconds": 120,
    "trace_id": "..."
  }
}
```

**Never** expose raw API error messages to end users (no "Error 529 Overloaded" in the UI). Always translate to a professional, calm user message appropriate for a legal professional audience.

## Fallback disclosure

When a lower-tier model is used as a fallback on a P0 skill (conflict check, engagement letter), the system should:

1. Log the fallback event with: `{skill_id, preferred_model, actual_model, reason}`.
2. Optionally display a subtle notice in the UI: "This response was generated by [Model X] (backup mode). Review carefully."
3. Alert engineering if fallback rate exceeds the threshold (see [[eng-latency-slo-by-skill]]).

For P2/P3 skills, silent fallback is acceptable because the quality difference between tiers is smaller.

## BYO key considerations

On a bring-your-own-key (BYO-key) plan, the user's API key may be rate-limited at a lower tier than the platform key. The cascade must:

- Detect 429 errors on the user's key.
- Surface a helpful message: "Your Anthropic API key has reached its rate limit for this model. Upgrade your Anthropic plan or wait [retry_after] seconds."
- Not silently fall back to a platform key (that would create unexpected cost for the platform).

## Retry policy

| Error type | Retry | Wait |
|---|---|---|
| 429 Rate limit | Yes, once | `Retry-After` header value, or 60s |
| 529 Overload | No, cascade immediately | n/a |
| 500 Server error | Yes, once | 2s |
| 503 Unavailable | No, cascade immediately | n/a |
| Network timeout | Yes, once | 1s |
| Connection error | Yes, once | 1s |

Cap total retry + cascade time at the skill's latency SLO (see [[eng-latency-slo-by-skill]]). If the cascade would exceed the SLO, fail fast rather than time out the user.

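The retry table and the SLO cap can be combined into one loop. The following is a minimal Python sketch, not a definitive implementation. `ModelError`, `RETRY_POLICY`, `run_cascade`, and the injected `call_model` callable are hypothetical names introduced for illustration; no real SDK is called. Two behaviors are assumptions rather than rules stated above: unknown error kinds cascade immediately, and a retry whose wait would not fit in the remaining SLO budget is skipped in favor of the next model.

```python
import time
from typing import Callable, Optional


class ModelError(Exception):
    """Hypothetical error raised by call_model; kind is a RETRY_POLICY key."""

    def __init__(self, kind: str, retry_after: Optional[float] = None):
        super().__init__(kind)
        self.kind = kind
        self.retry_after = retry_after  # seconds, e.g. parsed from Retry-After


# Mirrors the retry policy table: kind -> (retry once?, default wait in seconds).
RETRY_POLICY = {
    "429": (True, 60.0),
    "529": (False, 0.0),
    "500": (True, 2.0),
    "503": (False, 0.0),
    "network_timeout": (True, 1.0),
    "connection_error": (True, 1.0),
}


def run_cascade(
    models: list[str],
    call_model: Callable[[str], str],
    slo_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Try each model in order; never spend more than slo_seconds in total."""
    deadline = clock() + slo_seconds
    fallbacks = []  # one entry per model that was abandoned
    for model in models:
        retried = False
        while clock() < deadline:
            try:
                response = call_model(model)
                return {"model": model, "response": response, "fallbacks": fallbacks}
            except ModelError as err:
                # Assumption: error kinds missing from the table cascade immediately.
                may_retry, default_wait = RETRY_POLICY.get(err.kind, (False, 0.0))
                wait = err.retry_after if err.retry_after is not None else default_wait
                if may_retry and not retried and clock() + wait < deadline:
                    retried = True
                    sleep(wait)
                    continue
                fallbacks.append({"from_model": model, "reason": err.kind})
                break
    # Every model failed or the SLO budget ran out: fail fast with a structured error.
    return {
        "error": {"code": "MODEL_UNAVAILABLE", "retry_after_seconds": 120},
        "fallbacks": fallbacks,
    }
```

A caller would pass the skill's ordered model list, for example `["claude-sonnet-4-6", "claude-opus-4"]`, together with a `call_model` wrapper that maps provider responses to `ModelError` kinds. Injecting `sleep` and `clock` keeps the loop testable without real delays, and the `fallbacks` list carries the `reason` field needed for fallback logging.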
## Monitoring

Track these metrics:

- `fallback_events_total` by `{org_id, skill_id, from_model, to_model, reason}`
- `cascade_failure_rate` (all models failed): alert if > 0.1% of requests
- `fallback_rate_by_model`: alert if primary model fallback rate > 5% over 15 min

## Related skills

- [[eng-latency-slo-by-skill]]
- [[eng-cost-per-message-tracker]]
- [[eng-context-cache-key-design]]
- [[eng-feature-flag-rollout-skills]]