---
name: intermittent-issue-debugging
description: Debug issues that occur sporadically and are hard to reproduce. Use monitoring and systematic investigation to identify root causes of flaky behavior.
---

# Intermittent Issue Debugging

## Overview

Intermittent issues are among the most difficult to debug because they do not occur consistently. A systematic approach and comprehensive monitoring are essential.

## When to Use

- Sporadic errors in logs
- Users report occasional issues
- Flaky tests
- Suspected race conditions
- Timing-dependent bugs
- Resource exhaustion issues

## Instructions

### 1. **Capturing Intermittent Issues**

```javascript
// Strategy 1: Comprehensive Logging
// Add detailed logging around suspected code
function processPayment(orderId) {
  const startTime = Date.now();
  console.log(`[${startTime}] Payment start: order=${orderId}`);

  try {
    const result = chargeCard(orderId);
    console.log(`[${Date.now()}] Payment success: ${orderId}`);
    return result;
  } catch (error) {
    const duration = Date.now() - startTime;
    console.error(`[${Date.now()}] Payment FAILED:`, {
      order: orderId,
      error: error.message,
      duration_ms: duration,
      error_type: error.constructor.name,
      stack: error.stack
    });
    throw error;
  }
}

// Strategy 2: Correlation IDs
// Track requests across systems
// (generateId, logger, and chargeCard are app-specific helpers)
const correlationId = generateId();
logger.info({ correlationId, action: 'payment_start', orderId: 123 });
chargeCard(orderId, { headers: { correlationId } });
logger.info({ correlationId, action: 'payment_end', status: 'success' });
// Later, grep logs by correlationId to see the full trace

// Strategy 3: Error Sampling
// Capture full error context when an error occurs
window.addEventListener('error', (event) => {
  const errorData = {
    message: event.message,
    url: event.filename,
    line: event.lineno,
    col: event.colno,
    stack: event.error?.stack,
    userAgent: navigator.userAgent,
    memory: performance.memory?.usedJSHeapSize,
    timestamp: new Date().toISOString()
  };
  sendToMonitoring(errorData); // Send to error tracking
});
```
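Once logging is in place, the fastest way to make a sporadic failure tractable is to run the suspect code many times and record every failure with context. A minimal sketch (names here are illustrative, not from an existing API): `flakyOperation` stands in for the code under suspicion, and `measureFailureRate` brute-forces it to estimate how often it fails.

```javascript
// Hypothetical stand-in for the code under suspicion.
// Fails on roughly 10% of calls to simulate intermittent behavior.
function flakyOperation() {
  if (Math.random() < 0.1) throw new Error('intermittent failure');
  return 'ok';
}

// Run the operation many times, capturing full context for each failure
// rather than just a count — the per-failure records are what you analyze.
function measureFailureRate(fn, runs = 1000) {
  const failures = [];
  for (let i = 0; i < runs; i++) {
    try {
      fn();
    } catch (error) {
      failures.push({
        run: i,
        message: error.message,
        timestamp: Date.now()
      });
    }
  }
  return { runs, failureCount: failures.length, rate: failures.length / runs, failures };
}

const stats = measureFailureRate(flakyOperation, 1000);
console.log(`${stats.failureCount}/${stats.runs} failed (${(stats.rate * 100).toFixed(1)}%)`);
```

A measured failure rate also gives you a baseline: after a candidate fix, rerun the same harness and confirm the rate drops to zero rather than merely getting rarer.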
### 2. **Common Intermittent Issues**

```yaml
Issue: Race Condition
Symptom: Inconsistent behavior depending on timing
Example:
  Thread 1: Read count (5)
  Thread 2: Read count (5), increment to 6, write
  Thread 1: Increment to 6, write (overwrites Thread 2)
  Result: Should be 7, but is 6
Debug:
  1. Add detailed timestamps
  2. Log all operations
  3. Look for overlapping operations
  4. Check if order matters
Solution:
  - Use locks/mutexes
  - Use atomic operations
  - Use message queues
  - Ensure a single writer

---
Issue: Timing-Dependent Bug
Symptom: Test passes sometimes, fails other times
Example:
  test_user_creation:
    1. Create user (sometimes slow)
    2. Check user exists
    3. Fails if create took too long
Debug:
  - Add timeout logging
  - Increase wait time
  - Add explicit waits
  - Mock slow operations
Solution:
  - Explicitly wait for the condition
  - Remove time-dependent assertions
  - Use proper test fixtures

---
Issue: Resource Exhaustion
Symptom: Works fine at first, but fails over time
Example:
  - Memory grows over time
  - Connection pool exhausted
  - Disk space fills up
  - Max open files reached
Debug:
  - Monitor resources continuously
  - Check for leaks (memory growth)
  - Monitor connection count
  - Check long-running processes
Solution:
  - Fix memory leaks
  - Increase resource limits
  - Implement cleanup
  - Add monitoring/alerts

---
Issue: Intermittent Network Failure
Symptom: API calls occasionally fail
Debug:
  - Check network logs
  - Identify timeout patterns
  - Check if time-of-day dependent
  - Check if load dependent
Solution:
  - Implement retries with exponential backoff
  - Add a circuit breaker
  - Increase timeouts
  - Add redundancy
```

### 3. **Systematic Investigation Process**

```yaml
Step 1: Understand the Pattern
  Questions:
    - How often does it occur? (1/100, 1/1000?)
    - When does it occur? (time of day, load, specific user?)
    - What are the conditions? (network, memory, load?)
    - Is it reproducible? (deterministic or random?)
    - Any recent changes?
  Analysis:
    - Review error logs
    - Check error rate trends
    - Identify patterns
    - Correlate with recent changes

Step 2: Reproduce Reliably
  Methods:
    - Increase test frequency (run 1000 times)
    - Stress test (heavy load)
    - Simulate poor conditions (network, memory)
    - Run on different machines
    - Run in a production-like environment
  Goal: Make the issue consistent enough to analyze

Step 3: Add Instrumentation
  - Add detailed logging
  - Add monitoring metrics
  - Add trace IDs
  - Capture errors fully
  - Log system state

Step 4: Capture the Issue
  - Recreate the scenario
  - Capture full context
  - Note system state
  - Document conditions
  - Get a reproduction case

Step 5: Analyze Data
  - Review logs
  - Look for patterns
  - Compare normal vs. error cases
  - Check timing correlations
  - Identify the root cause

Step 6: Implement Fix
  - Base the fix on the root cause
  - Verify with the reproduction case
  - Test extensively
  - Add a regression test
```

### 4. **Monitoring & Prevention**

```yaml
Monitoring Strategy:
  Real User Monitoring (RUM):
    - Error rates by feature
    - Latency percentiles
    - User impact
    - Trend analysis
  Application Performance Monitoring (APM):
    - Request traces
    - Database query performance
    - External service calls
    - Resource usage
  Synthetic Monitoring:
    - Regular test execution
    - Simulate user flows
    - Alert on failures
    - Trend tracking

---
Alerting:
  Set up alerts for:
    - Error rate spikes
    - Response time > threshold
    - Memory growth trend
    - Failed transactions

---
Prevention Checklist:
  [ ] Comprehensive logging in place
  [ ] Error tracking configured
  [ ] Performance monitoring active
  [ ] Resource monitoring enabled
  [ ] Correlation IDs used
  [ ] Failed requests captured
  [ ] Timeout values appropriate
  [ ] Retry logic implemented
  [ ] Circuit breakers in place
  [ ] Load testing performed
  [ ] Stress testing performed
  [ ] Race conditions reviewed
  [ ] Timing dependencies checked

---
Tools:
  Monitoring:
    - New Relic / Datadog
    - Prometheus / Grafana
    - Sentry / Rollbar
    - Custom logging
  Testing:
    - Load testing (k6, JMeter)
    - Chaos engineering (Gremlin)
    - Property-based testing (Hypothesis)
    - Fuzz testing
  Debugging:
    - Distributed tracing (Jaeger)
    - Correlation IDs
    - Detailed logging
    - Debuggers
```

## Key Points

- Comprehensive logging is essential
- Add correlation IDs for tracing
- Monitor for patterns and trends
- Stress test to reproduce
- Capture detailed error context
- Implement exponential backoff for retries
- Monitor for resource exhaustion
- Add circuit breakers for external services
- Log system state alongside errors
- Implement proper monitoring/alerting
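The exponential-backoff retry recommended above can be sketched as follows. This is a minimal illustration, not a production implementation; the attempt count, base delay, and cap are assumed defaults, and full jitter is used so that many clients retrying at once do not synchronize.

```javascript
// Sketch: retry an async operation with exponential backoff and full jitter.
// Defaults (5 attempts, 100 ms base, 5 s cap) are illustrative assumptions.
async function retryWithBackoff(fn, { attempts = 5, baseMs = 100, maxMs = 5000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Exponential delay (base * 2^attempt, capped) with full jitter.
      const delay = Math.random() * Math.min(maxMs, baseMs * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError; // All attempts exhausted; surface the last error.
}
```

Usage: `await retryWithBackoff(() => fetch(url).then(r => r.json()))`. Pair this with a circuit breaker so that persistent failures stop generating retries against an already struggling dependency.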