---
name: eng-feature-flag-rollout-skills
description: Use when managing the rollout of new or updated legal AI skills using feature flags — controlling which organizations, users, or tenant tiers have access to a skill, enabling gradual rollout, A/B testing skill versions, and instant kill-switches for problematic skills. Engineering skill critical for safe deployment of legal AI capabilities without firm-wide disruption.
license: MIT
metadata:
  id: eng.feature-flag-rollout-skills
  category: eng
  jurisdictions: [__multi__]
  priority: P2
  intent: [feature-flags, rollout, A/B-test, gradual-release, kill-switch]
  related:
    - eng-fallback-model-cascade
    - eng-langfuse-eval-runner
    - eng-audit-log-schema
    - eng-latency-slo-by-skill
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"
---

# Feature Flag Rollout — Skills

## What it does

Feature flags for skills control which skills are available to which users, at which versions, under what conditions. In a legal AI product, skills are the primary feature surface: for example, a new skill for drafting a Saudi employment contract, an updated conflict-check algorithm, or an experimental WIP report format. Feature flags allow:

1. **Safe rollout**: ship a new skill to 5% of users before all users.
2. **Org-scoped enablement**: a skill that requires firm-specific configuration (e.g., billing-system integration) can be enabled only for that firm.
3. **Version A/B testing**: run two skill versions in parallel; evaluate quality via [[eng-langfuse-eval-runner]] before full promotion.
4. **Instant kill-switch**: if a skill produces harmful or inaccurate legal output in production, disable it for all users in under 60 seconds.
5. **Tier-gated access**: reserve advanced skills for pro/enterprise tier users.
## Flag schema

Each skill flag is an entry in the flag configuration store:

```json
{
  "flag_id": "skill:efirm-conflict-check",
  "skill_id": "efirm-conflict-check",
  "skill_version": "1.1",
  "enabled_global": true,
  "rollout_percentage": 100,
  "overrides": [
    {
      "condition": "org_id == 'org_haqq'",
      "skill_version": "1.2-beta",
      "rollout_percentage": 100,
      "note": "HAQQ is piloting v1.2-beta"
    },
    {
      "condition": "user_tier == 'free'",
      "enabled": false,
      "note": "Conflict check not available on free tier"
    }
  ],
  "kill_switch": false,
  "created_at": "ISO-8601",
  "updated_at": "ISO-8601",
  "owner": "eng-team | product-team",
  "review_date": "ISO-8601"
}
```

## Evaluation logic

For every incoming request, the skill router evaluates flags in order:

```
1. If kill_switch == true: skill unavailable → return graceful error
2. If org_id matches an override: apply override (version + rollout%)
3. If user_tier matches an override: apply override
4. Else: apply global rollout_percentage
   (hash(user_id + flag_id) % 100 < rollout_percentage)
5. If skill available: load skill_version
```

Hash-based rollout keeps assignment stable: the same user always gets the same variant within a rollout, which avoids inconsistent experiences across sessions.

## Rollout stages for a new skill

| Stage | Rollout % | Duration | Promotion criteria |
|---|---|---|---|
| Internal only | 0% global; 100% for eng/product org_ids | 1 week | No errors; latency within SLO |
| Alpha | 5% of consenting opt-in users | 1 week | Quality score ≥ target per Langfuse eval |
| Beta | 20% | 1–2 weeks | Positive feedback; no critical issues flagged |
| GA | 100% | Permanent | n/a |

For P0 skills (conflict check, engagement letter), the rollout stages are mandatory and cannot be skipped. For P2 skills, a direct progression from internal-only to GA is acceptable if quality is confirmed.
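The evaluation order from the Evaluation logic section can be sketched in Python as below. This is a minimal illustration, not the router's actual implementation: the function names (`rollout_bucket`, `resolve_skill_version`), the `context` dictionary, the choice of SHA-256 as the hash, and the tiny parser for `field == 'value'` conditions are all assumptions made for the example. The `enabled_global` field is not part of the documented order, so the sketch ignores it.

```python
import hashlib
import re
from typing import Optional

# Assumption: override conditions always look like "field == 'value'".
_CONDITION = re.compile(r"^\s*(\w+)\s*==\s*'([^']*)'\s*$")


def rollout_bucket(user_id: str, flag_id: str) -> int:
    """Map (user, flag) to a stable bucket in [0, 100).

    SHA-256 is an assumed choice; any hash that is stable across
    processes works (Python's built-in hash() is salted per process).
    """
    digest = hashlib.sha256(f"{user_id}{flag_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def _first_match(flag: dict, context: dict, field: str) -> Optional[dict]:
    """Return the first override whose condition tests `field` and matches."""
    for override in flag.get("overrides", []):
        parsed = _CONDITION.match(override["condition"])
        if parsed and parsed.group(1) == field and context.get(field) == parsed.group(2):
            return override
    return None


def resolve_skill_version(flag: dict, context: dict) -> Optional[str]:
    """Return the skill version to load, or None if the skill is unavailable."""
    # Step 1: the kill switch wins over everything else.
    if flag.get("kill_switch", False):
        return None
    # Steps 2-3: org overrides are checked before tier overrides.
    override = (
        _first_match(flag, context, "org_id")
        or _first_match(flag, context, "user_tier")
        or {}
    )
    if override.get("enabled") is False:
        return None
    # Step 4: fall back to the global values where the override is silent.
    version = override.get("skill_version", flag["skill_version"])
    percentage = override.get("rollout_percentage", flag["rollout_percentage"])
    # Step 5: load the version only if the user falls inside the rollout.
    if rollout_bucket(context["user_id"], flag["flag_id"]) < percentage:
        return version
    return None


# Trimmed copy of the flag from the schema section.
EXAMPLE_FLAG = {
    "flag_id": "skill:efirm-conflict-check",
    "skill_version": "1.1",
    "rollout_percentage": 100,
    "kill_switch": False,
    "overrides": [
        {"condition": "org_id == 'org_haqq'", "skill_version": "1.2-beta",
         "rollout_percentage": 100},
        {"condition": "user_tier == 'free'", "enabled": False},
    ],
}
```

With `EXAMPLE_FLAG`, a pro-tier user in `org_haqq` resolves to `1.2-beta`, a free-tier user in another org resolves to `None`, and any other user resolves to `1.1`. Callers would turn `None` into the graceful error from step 1 or a tier-specific response.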
## Version-parallel A/B testing

To compare two skill versions:

```json
{
  "flag_id": "skill:efirm-engagement-letter-draft",
  "experiment": {
    "enabled": true,
    "variant_a": {"skill_version": "1.0", "traffic_pct": 50},
    "variant_b": {"skill_version": "1.1-test", "traffic_pct": 50},
    "eval_metric": "langfuse_score:quality",
    "min_samples": 100,
    "auto_promote_threshold": 0.05
  }
}
```

Route users stably (same variant across sessions) using `hash(user_id + experiment_id) % 100`. Evaluate using [[eng-langfuse-eval-runner]]: if variant B shows a statistically significant improvement (p < 0.05, the `auto_promote_threshold`) after at least `min_samples` (100) samples, auto-promote it to 100% or alert the product team for manual promotion.

## Kill-switch procedure

When a skill must be immediately disabled:

1. Set `kill_switch: true` in the flag store (the change propagates in under 10 s, given a cache-busting mechanism).
2. All in-flight requests complete (no mid-stream interruption).
3. New requests for that skill receive:

   ```
   "This feature is temporarily unavailable. Please try again later or contact support."
   ```

4. The incident is logged with who triggered the kill-switch, the timestamp, and the reason.
5. Engineering investigates the root cause.
6. To re-enable: set `kill_switch: false` after a fix is deployed and verified.

**Never** roll back a kill-switch without a fix: the reason it was triggered still applies.

## Tier-gated skills

Map skill IDs to minimum tiers in configuration:

```json
{
  "efirm-conflict-check": "pro",
  "efirm-engagement-letter-draft": "pro",
  "efirm-fee-quote-builder": "pro",
  "efirm-client-update-email-draft": "basic",
  "efirm-deadline-tracker": "basic"
}
```

When a free-tier user attempts to invoke a pro-tier skill, the router returns a paywall response with an upgrade CTA, not an error.
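The tier check described above might look like the following Python sketch. Several details are assumptions for illustration only: the tier ordering in `TIER_ORDER` (the document names free, basic, pro, and enterprise tiers but does not rank them), the function name `check_tier_access`, the shape of the response dictionary, and the decision to treat skills missing from the map as ungated.

```python
# Assumed ranking from lowest to highest; the source does not define it.
TIER_ORDER = ["free", "basic", "pro", "enterprise"]

# Minimum tier per skill, copied from the configuration above.
MIN_TIER = {
    "efirm-conflict-check": "pro",
    "efirm-engagement-letter-draft": "pro",
    "efirm-fee-quote-builder": "pro",
    "efirm-client-update-email-draft": "basic",
    "efirm-deadline-tracker": "basic",
}


def check_tier_access(skill_id: str, user_tier: str) -> dict:
    """Return an 'allowed' or 'paywall' response for a skill invocation.

    Raises ValueError if user_tier is not one of TIER_ORDER.
    """
    required = MIN_TIER.get(skill_id)
    if required is None:
        # Assumption: a skill with no configured minimum is open to all tiers.
        return {"status": "allowed"}
    if TIER_ORDER.index(user_tier) >= TIER_ORDER.index(required):
        return {"status": "allowed"}
    # Below the minimum tier: respond with an upgrade prompt, not an error.
    return {
        "status": "paywall",
        "required_tier": required,
        "message": f"Upgrade to the {required} tier to use this skill.",
    }
```

Returning a structured `paywall` status rather than raising an exception keeps the upgrade path distinct from genuine failures, which matches the requirement that the router respond with a CTA instead of an error.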
## Audit trail

All flag changes (create, update, kill_switch toggle) are recorded in the audit log ([[eng-audit-log-schema]]) with:

- Admin user ID who made the change
- Before and after state
- Timestamp
- Reason (free text)

Flag changes to P0 skills require a second admin approval before taking effect.

## Related skills

- [[eng-fallback-model-cascade]]
- [[eng-langfuse-eval-runner]]
- [[eng-audit-log-schema]]
- [[eng-latency-slo-by-skill]]