--- name: cascade-workflow version: 1.0.0 description: Graceful degradation through cascading fallback strategies - ensures system always completes while maintaining acceptable functionality auto_activates: - "external API" - "external service" - "timeout handling" - "high availability" - "fallback strategy" - "resilient operation" explicit_triggers: - /amplihack:cascade confirmation_required: false token_budget: 3500 --- # Cascade Workflow with Graceful Degradation Skill ## Purpose Implement graceful degradation through cascading fallback strategies. When optimal approaches fail or timeout, the system automatically falls back to simpler, more reliable alternatives while maintaining acceptable functionality. ## When to Use This Skill **USE FOR:** - External service dependencies (APIs, databases) - Time-sensitive operations with acceptable degraded modes - Operations where partial results are better than no results - High-availability requirements (system must always respond) - Scenarios where waiting for perfect solution is worse than good-enough solution **AVOID FOR:** - Operations requiring exact correctness (no acceptable degradation) - Security-critical operations (authentication, authorization) - Financial transactions (no room for "approximate") - When failures must surface to user (diagnostic operations) - Simple operations with no meaningful fallback ## Configuration ### Core Parameters **Timeout Strategy:** - `aggressive` - Fast failures, quick degradation (5s / 2s / 1s) - `balanced` - Reasonable attempts (30s / 10s / 5s) - **DEFAULT** - `patient` - Thorough attempts before fallback (120s / 30s / 10s) - `custom` - Define your own timeouts **Fallback Types:** - `service` - External API → Cached data → Static defaults - `quality` - Comprehensive → Standard → Minimal analysis - `freshness` - Real-time → Recent → Historical data - `completeness` - Full dataset → Sample → Summary - `accuracy` - Precise → Approximate → Estimate **Degradation Notification:** - `silent` - Log only, no user notification - `warning` - Inform user of degradation - `explicit` - Detailed explanation of what degraded and why ## Cascade Level Requirements **PRIMARY (Optimal):** - Best possible outcome - May depend on external services - May be slow or resource-intensive - Can fail or timeout **SECONDARY (Acceptable):** - Reduced quality but functional - More reliable than primary - Faster or fewer dependencies - Acceptable for users **TERTIARY (Guaranteed):** - Always succeeds, never fails - No external dependencies - Fast and reliable - Minimal but functional - **CRITICAL: Must be designed to never fail** ## Execution Process ### Step 1: Define Cascade Levels - **Use architect agent** to identify cascade levels - Define PRIMARY approach (optimal solution) - Define SECONDARY approach (acceptable degradation) - Define TERTIARY approach (guaranteed completion) - Set timeout for each level - Document what degrades at each level - **CRITICAL: Ensure tertiary ALWAYS succeeds** **Example Cascade Definitions:** **Code Analysis with AI:** - PRIMARY: GPT-4 comprehensive analysis (timeout: 30s) - SECONDARY: GPT-3.5 standard analysis (timeout: 10s) - TERTIARY: Static analysis with regex (timeout: 5s) **External API Data Fetch:** - PRIMARY: Live API call (timeout: 10s) - SECONDARY: Cached data (timeout: 2s) - TERTIARY: Default values (timeout: 0s) **Test Execution:** - PRIMARY: Full test suite (timeout: 120s) - SECONDARY: Critical tests only (timeout: 30s) - TERTIARY: Smoke tests (timeout: 10s) ### Step 2: Attempt Primary Approach - Execute optimal solution - Set timeout based on strategy configuration - Monitor execution progress - If completes successfully: DONE (best outcome) - If fails or times out: Continue to Step 3 - Log attempt and reason for failure ```python # Pseudocode for primary attempt try: result = execute_primary_approach(timeout=PRIMARY_TIMEOUT) log_success(level="PRIMARY", result=result) return result # DONE - best outcome achieved except TimeoutError: log_failure(level="PRIMARY", reason="timeout") # Continue to Step 3 except ExternalServiceError as e: log_failure(level="PRIMARY", reason=f"service_error: {e}") # Continue to Step 3 ``` ### Step 3: Attempt Secondary Approach - Log degradation to secondary level - Execute acceptable fallback solution - Set shorter timeout (typically 1/3 of primary) - Monitor execution progress - If completes successfully: DONE (acceptable outcome) - If fails or times out: Continue to Step 4 - Log attempt and reason for failure ```python # Pseudocode for secondary attempt log_degradation(from_level="PRIMARY", to_level="SECONDARY") try: result = execute_secondary_approach(timeout=SECONDARY_TIMEOUT) log_success(level="SECONDARY", result=result, degraded=True) return result # DONE - acceptable outcome except TimeoutError: log_failure(level="SECONDARY", reason="timeout") # Continue to Step 4 ``` ### Step 4: Attempt Tertiary Approach - Log degradation to tertiary level - Execute guaranteed completion approach - Set minimal timeout (typically 1s) - **MUST succeed - no failures allowed** - Return minimal but functional result - Log success (degraded but functional) - DONE (guaranteed completion) ```python # Pseudocode for tertiary attempt log_degradation(from_level="SECONDARY", to_level="TERTIARY") try: result = execute_tertiary_approach(timeout=TERTIARY_TIMEOUT) log_success(level="TERTIARY", result=result, heavily_degraded=True) return result # DONE - minimal but functional except Exception as e: # THIS SHOULD NEVER HAPPEN log_critical_failure("TERTIARY approach failed - this is a bug!") raise SystemError("Cascade safety violation: tertiary failed") ``` ### Step 5: Report Degradation - Determine notification level from configuration - **Silent:** Log only, no user message - **Warning:** Brief notification to user - **Explicit:** Detailed degradation explanation - Document which level succeeded - Explain impact of degradation - Log cascade path taken for analysis **Degradation Reporting Templates:** **Silent Mode:** ``` [LOG] CASCADE: PRIMARY timeout (30s) → SECONDARY success (6s) Result: standard_analysis (degraded from comprehensive) ``` **Warning Mode:** ``` ⚠️ Using cached data (less than 1 hour old) Current real-time data unavailable. ``` **Explicit Mode:** ``` ℹ️ Analysis Quality Notice We attempted to provide comprehensive code analysis using GPT-4, but encountered slow response times (>30s timeout). Fallback Applied: - Used: GPT-3.5 standard analysis (completed in 6s) - Quality: Standard (vs. Comprehensive) - Impact: Advanced semantic insights not included What You're Getting: ✓ Basic pattern detection ✓ Standard recommendations ✓ Code quality assessment What's Missing: ✗ Complex architectural insights ✗ Deep semantic analysis ✗ Advanced refactoring suggestions ``` ### Step 6: Log Cascade Metrics - Record cascade path taken - Document level reached (primary/secondary/tertiary) - Log timing for each level attempted - Track degradation frequency - Identify patterns in failures - Update cascade strategy if needed **Metrics to Track:** - Success rate by level - Average response times - Degradation frequency - User impact assessment ### Step 7: Continuous Optimization - **Use analyzer agent** to review cascade metrics - Identify optimization opportunities - Adjust timeouts based on success rates - Improve secondary approaches if frequently used - Update tertiary if inadequate - Store learnings in memory using `store_discovery()` from `amplihack.memory.discoveries` **Optimization Criteria:** - **If PRIMARY succeeds < 50%:** Timeout too aggressive → Increase timeout - **If SECONDARY used > 40%:** Secondary is really the "normal" case → Swap primary and secondary - **If TERTIARY used > 10%:** Secondary not reliable enough → Improve secondary ## Trade-Offs **Benefit:** System always completes, never fully fails **Cost:** Users may receive degraded responses **Best For:** User-facing features where responsiveness matters ## Examples ### Example 1: Weather API Integration **Configuration:** - Strategy: Balanced (30s / 10s / 5s) - Type: Service fallback - Notification: Warning **Implementation:** ```python async def get_weather(location: str) -> WeatherData: """Get weather data with cascade fallback""" # PRIMARY: Live weather API try: return await fetch_weather_api(location, timeout=30) except (TimeoutError, APIError): log.warning("PRIMARY weather API failed, trying cache") # SECONDARY: Cached weather data try: cached = await get_cached_weather(location, max_age=3600) if cached: notify_user("Using weather data from cache (< 1 hour old)") return cached except CacheError: log.warning("SECONDARY cache failed, using defaults") # TERTIARY: Default weather data return get_default_weather(location) # Never fails ``` **Outcome:** System always returns weather data, quality degrades gracefully ### Example 2: Code Review with AI **Configuration:** - Strategy: Patient (120s / 30s / 10s) - Type: Quality fallback - Notification: Explicit **Cascade Path:** 1. PRIMARY: GPT-4 comprehensive review - TIMEOUT after 120s 2. SECONDARY: GPT-3.5 standard review - SUCCESS in 18s 3. TERTIARY: Not attempted ### Example 3: Search Results Ranking **Configuration:** - Strategy: Aggressive (5s / 2s / 1s) - Type: Accuracy fallback - Notification: Silent **Implementation:** ```python def search_and_rank(query: str) -> List[Result]: """Search with ML ranking, fallback to simple ranking""" results = fetch_results(query) # PRIMARY: ML-based ranking (sophisticated) try: return ml_rank(results, timeout=5) except TimeoutError: pass # Silent fallback # SECONDARY: Heuristic ranking (good enough) try: return heuristic_rank(results, timeout=2) except TimeoutError: pass # TERTIARY: Simple text match ranking (basic) return simple_rank(results) # Always fast ``` ## Philosophy Alignment This workflow enforces: - **Resilience:** System always completes, never completely fails - **User Experience:** Better degraded service than error message - **Transparency:** Users understand what they're getting (if explicit mode) - **Progressive Enhancement:** Optimal by default, degrade when necessary - **Measurable Quality:** Clear definition of what degrades at each level - **Continuous Improvement:** Metrics drive timeout optimization - **Guaranteed Completion:** Tertiary level must never fail ## Key Principle **Better to deliver degraded service than no service**