---
name: error-investigation
description: AWS error investigation with multi-layer verification, CloudWatch analysis, and Lambda logging patterns. Use when debugging AWS service failures, investigating production errors, or troubleshooting Lambda functions.
---

# Error Investigation Skill

**Tech Stack**: AWS CLI, CloudWatch Logs, Lambda, boto3, jq

**Source**: Extracted from CLAUDE.md error investigation principles and AWS diagnostic patterns.

---

## When to Use This Skill

Use the error-investigation skill when:

- ✓ AWS service returning errors
- ✓ Lambda function failing in production
- ✓ CloudWatch logs showing errors
- ✓ Service completed but operation failed
- ✓ Silent failures (no exception but wrong result)
- ✓ Investigating production incidents

**DO NOT use this skill for:**

- ✗ Local Python debugging (use debugger instead)
- ✗ Code refactoring (use refactor skill)
- ✗ Performance optimization (use different skill)

---

## Quick Investigation Decision Tree

```
What's failing?
├─ Lambda function?
│  ├─ Returns 200 but errors? → Check CloudWatch logs (Layer 3)
│  ├─ Timeout? → Check duration metrics + external dependencies
│  ├─ Permission denied? → Check IAM role policies
│  └─ Cold start slow? → Module-level initialization pattern
│
├─ AWS service operation?
│  ├─ DynamoDB write succeeded (200) but no data? → Read the item back, check conditional writes
│  ├─ S3 upload succeeded but file missing? → Check bucket policy
│  ├─ SQS message sent but not received? → Check DLQ
│  └─ Step Function succeeded but workflow incomplete? → Check state outputs
│
├─ External API call?
│  ├─ Timeout? → Check network path (security groups, VPC)
│  ├─ 403 Forbidden? → Check API key, rate limits
│  ├─ 500 Error? → Check API status page, retry logic
│  └─ Silent failure? → Inspect response payload
│
└─ Database query?
   ├─ INSERT affected 0 rows? → FK constraint, ENUM mismatch
   ├─ SELECT returns empty? → Check WHERE clause, data exists
   ├─ Connection timeout? → Security group, VPC routing
   └─ Query slow? → Missing index, full table scan
```

---

## Loop Pattern: Retrying Loop → Synchronize Loop

**Escalation Trigger**:

- `/trace` shows root cause
- Fix applied, `/validate` shows success
- But error recurs later (knowledge drift)

**Tools Used**:

- `/trace` - Find root cause (backward trace from error)
- `/validate` - Verify fix works (test the solution)
- `/consolidate` - Update knowledge base (documentation, runbooks)
- `/observe` - Monitor for recurring issues (drift detection)
- `/reflect` - Assess if error represents pattern vs one-off

**Why This Works**: Error investigation fits the retrying loop (find the root cause, fix the execution), but a recurring error triggers the synchronize loop (update knowledge and documentation).

See [Thinking Process Architecture - Feedback Loops](../../.claude/diagrams/thinking-process-architecture.md#11-feedback-loop-types-self-healing-properties) for structural overview.

---

## Core Investigation Principles

### Principle 1: Execution Completion ≠ Operational Success

**From CLAUDE.md:**

> "Execution completion ≠ Operational success. Verify actual outcomes across multiple layers, not just the absence of exceptions."

**Why This Matters:**

```python
import json
import time

import boto3

lambda_client = boto3.client('lambda')
logs_client = boto3.client('logs')  # filter_log_events lives on the CloudWatch Logs client

# ❌ WRONG: Assumes 200 = success
response = lambda_client.invoke(FunctionName='worker', Payload='{}')
assert response['StatusCode'] == 200  # ✗ Weak validation

# ✅ RIGHT: Multi-layer verification
started_ms = int(time.time() * 1000)
response = lambda_client.invoke(FunctionName='worker', Payload='{}')

# Layer 1: Status code
assert response['StatusCode'] == 200

# Layer 2: Response payload
payload = json.loads(response['Payload'].read())
assert 'errorMessage' not in payload

# Layer 3: CloudWatch logs (only events since this invocation)
logs = logs_client.filter_log_events(
    logGroupName='/aws/lambda/worker',
    startTime=started_ms,
    filterPattern='ERROR'
)
assert len(logs['events']) == 0
```

> **Note**: This is the AWS-specific application of **Progressive Evidence Strengthening** (CLAUDE.md Principle #2). The general pattern applies across all domains; here we show how it manifests in AWS Lambda/API debugging.

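
The three checks above can also be folded into one pure function that reports every failing layer instead of stopping at the first assertion. This is a minimal sketch, not part of CLAUDE.md: `find_layer_failures` and its message strings are hypothetical names, and it assumes the caller has already fetched the `invoke` response dict, the raw payload bytes, and the recent log messages. It also checks the `FunctionError` key, which the Lambda `Invoke` API sets when the function itself reported an error.

```python
import json
from typing import Any, Iterable


def find_layer_failures(
    invoke_response: dict[str, Any],
    payload_bytes: bytes,
    log_messages: Iterable[str],
) -> list[str]:
    """Return one entry per verification layer that shows a problem (hypothetical helper)."""
    failures = []

    # Layer 1: result of the Invoke call itself.
    if invoke_response.get("StatusCode") != 200:
        failures.append("layer1: status code is not 200")
    if "FunctionError" in invoke_response:
        failures.append("layer1: FunctionError is set")

    # Layer 2: what the function actually returned.
    try:
        payload = json.loads(payload_bytes or b"null")
    except ValueError:  # covers invalid JSON and undecodable bytes
        failures.append("layer2: payload is not valid JSON")
    else:
        if isinstance(payload, dict) and "errorMessage" in payload:
            failures.append("layer2: payload contains errorMessage")

    # Layer 3: what the function logged while running.
    if any("ERROR" in message for message in log_messages):
        failures.append("layer3: ERROR found in logs")

    return failures
```

With boto3, `invoke_response` would be the dict returned by `lambda_client.invoke(...)`, `payload_bytes` the result of `response['Payload'].read()`, and `log_messages` the `message` field of each event returned by `filter_log_events`. An empty list means all three layers look clean.
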
### Principle 2: Multi-Layer Verification (AWS Application)

**The Three Layers:**

| Layer | Signal Strength | What It Tells You | What It DOESN'T Tell You |
|-------|----------------|-------------------|--------------------------|
| **Status Code** | Weakest | Service responded | Whether it succeeded |
| **Response Payload** | Stronger | Function returned data | Whether logs show errors |
| **CloudWatch Logs** | Strongest | What actually happened | Future issues |

**Pattern:**

```bash
# Layer 1: Status code (weakest)
# AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload
aws lambda invoke --function-name worker --payload '{}' /tmp/response.json
echo "Exit code: $?"  # 0 = AWS CLI succeeded

# Layer 2: Response payload (stronger)
if grep -q "errorMessage" /tmp/response.json; then
  echo "❌ Lambda returned error"
  exit 1
fi

# Layer 3: CloudWatch logs (strongest)
ERROR_COUNT=$(aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --start-time $(($(date +%s) - 120))000 \
  --filter-pattern "ERROR" \
  --query 'length(events)' --output text)

if [ "$ERROR_COUNT" -gt 0 ]; then
  echo "❌ Found errors in CloudWatch logs"
  exit 1
fi

echo "✅ All 3 layers verified"
```

See [AWS-DIAGNOSTICS.md](AWS-DIAGNOSTICS.md) for AWS-specific diagnostic patterns.

### Principle 3: Log Level Determines Discoverability

**From CLAUDE.md:**

> "Log levels are not just severity indicators; they determine whether failures are discoverable by monitoring systems."

**Log Level Impact:**

| Log Level | Monitored? | Alerted? | Discoverable? |
|-----------|------------|----------|---------------|
| **ERROR** | ✅ Yes | ✅ Yes | ✅ Dashboards |
| **WARNING** | ✅ Yes | ❌ No | ⚠️ Manual review |
| **INFO** | ⚠️ Maybe | ❌ No | ❌ Active search |
| **DEBUG** | ❌ No | ❌ No | ❌ Hidden |

**Investigation Pattern:**

```bash
# Step 1: Check ERROR level first
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --filter-pattern "ERROR"

# Step 2: If no ERRORs but operation failed → Check WARNING
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --filter-pattern "WARNING"

# Step 3: Check both application AND service logs
# - Application logs: /aws/lambda/worker
# - Service logs: Lambda execution errors, timeouts
```

**Why This Matters:**

```python
# ❌ BAD: Error logged at WARNING (invisible to monitoring)
try:
    result = db.execute(query, params)
    if result == 0:
        logger.warning("INSERT failed")  # ⚠️ Not monitored!
except Exception as e:
    logger.warning(f"DB error: {e}")  # ⚠️ Not alerted!

# ✅ GOOD: Error logged at ERROR (visible to monitoring)
try:
    result = db.execute(query, params)
except Exception as e:
    logger.error(f"DB error: {e}")  # ✅ Alerted
    raise

# Checked outside the try block so the ValueError is not re-logged as a DB error
if result == 0:
    logger.error("INSERT failed - 0 rows affected")  # ✅ Monitored
    raise ValueError("Insert operation failed")
```

### Principle 4: Lambda Logging Configuration

**From CLAUDE.md:**

> "AWS Lambda pre-configures logging before your code runs. Never use `logging.basicConfig()` in Lambda handlers; it's a no-op."

**The Problem:**

```python
# ❌ This does NOTHING in Lambda
import logging

logging.basicConfig(level=logging.INFO)  # No-op!
logger = logging.getLogger(__name__)
logger.info("Invisible in CloudWatch")  # Filtered out
```

**Why It Fails:**

- Lambda runtime adds handlers to the root logger BEFORE your code runs
- `basicConfig()` only takes effect if the root logger has NO handlers
- Result: the root logger keeps its default WARNING level, so INFO-level logs are invisible

**The Solution:**

```python
# ✅ Works in both Lambda and local dev
import logging

root_logger = logging.getLogger()
if root_logger.handlers:
    # Lambda (already configured)
    root_logger.setLevel(logging.INFO)
else:
    # Local dev (needs configuration)
    logging.basicConfig(level=logging.INFO)

logger = logging.getLogger(__name__)
logger.info("Visible in CloudWatch")  # ✅ Works
```

See [LAMBDA-LOGGING.md](LAMBDA-LOGGING.md) for comprehensive Lambda logging patterns.

---

## Common Investigation Scenarios

### Scenario 1: Lambda Returns 200 But Has Errors

**Symptom:** Function completes, returns 200, but errors in logs.

**Investigation Steps:**

```bash
# 1. Invoke function
aws lambda invoke \
  --function-name worker \
  --payload '{"ticker": "NVDA19"}' \
  /tmp/response.json

# 2. Check response (Layer 2)
cat /tmp/response.json
# Output: {"result": {...}}  # Looks fine

# 3. Check CloudWatch logs (Layer 3)
aws logs tail /aws/lambda/worker --since 1m --filter-pattern "ERROR"
# Output:
# [ERROR] 2024-01-15 10:23:45 INSERT affected 0 rows for NVDA19
# [ERROR] 2024-01-15 10:23:46 FK constraint violation: symbol not found
```

**Root Cause:** Silent database failure. The INSERT affected 0 rows and the error was logged at ERROR, but the exception was caught and the function still reported success.

**Fix:**

```python
# Before:
def store_report(self, symbol, report):
    try:
        self.db.execute(query, params)
        return True  # ❌ Always returns True
    except Exception as e:
        logger.error(f"DB error: {e}")
        return True  # ❌ Still returns True!

# After:
def store_report(self, symbol, report):
    # No try/except: database exceptions now propagate to the caller
    rowcount = self.db.execute(query, params)
    if rowcount == 0:
        logger.error(f"INSERT affected 0 rows for {symbol}")
        return False  # ✅ Returns False on failure
    return True
```

### Scenario 2: INFO Logs Not Showing in CloudWatch

**Symptom:** `logger.info()` calls not appearing in CloudWatch.

**Investigation Steps:**

```bash
# 1. Check whether any INFO lines reach CloudWatch
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --start-time $(($(date +%s) - 300))000 \
  --filter-pattern "INFO"
# No results (but INFO logs exist in code)

# 2. Check root logger configuration
# Add to the Lambda handler (Python):
#   import logging
#   print(f"Root logger level: {logging.getLogger().level}")
#   print(f"Root logger handlers: {logging.getLogger().handlers}")
```

**Root Cause:** Root logger set to WARNING, filters out INFO.

**Fix:**

```python
# handler.py (entry point)
import logging

# Configure logging at module level
root_logger = logging.getLogger()
if root_logger.handlers:
    # Lambda environment
    root_logger.setLevel(logging.INFO)  # ✅ Set root logger level
else:
    # Local development
    logging.basicConfig(level=logging.INFO)

logger = logging.getLogger(__name__)


def lambda_handler(event, context):
    logger.info("Handler invoked")  # Now visible
    # ...
```

See [LAMBDA-LOGGING.md#troubleshooting](LAMBDA-LOGGING.md#troubleshooting) for complete debugging guide.

### Scenario 3: Lambda Timeout with Network Operations

**Symptom:** Lambda times out after long execution (600s+), logs show "PDF generation..." but no completion message.

**Investigation Steps:**

```bash
# 1. Check execution duration pattern
#    (REPORT lines give milliseconds; "Billed Duration" lines match too)
aws logs filter-log-events \
  --log-group-name /aws/lambda/pdf-worker \
  --filter-pattern "Duration:" \
  --query 'events[*].message' \
  | grep -o "Duration: [0-9]*" \
  | sort -t' ' -k2 -n

# Look for pattern:
# - First 5 requests: Duration: 2-3s
# - Last 5 requests: Duration: 600s+ (timeout)

# 2. Check for connection timeout errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/pdf-worker \
  --filter-pattern "ConnectTimeoutError" \
  --query 'events[*].message'

# Output:
# botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL:
# "https://bucket.s3.region.amazonaws.com/..."

# 3. Analyze timeline (deterministic vs random)
aws logs tail /aws/lambda/pdf-worker --since 30m | \
  grep -E "START RequestId|✅ PDF job completed|ConnectTimeoutError" | \
  awk '{print $1, $2, $NF}' | sort

# Deterministic pattern (first N succeed, last M fail) = infrastructure bottleneck
# Random pattern (scattered failures) = performance issue
```

**Root Cause Analysis:**

```bash
# 4. Check VPC configuration
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=vpc-xxx" \
            "Name=service-name,Values=com.amazonaws.region.s3"

# If empty → No S3 VPC Endpoint (traffic goes through NAT Gateway)

# 5. Verify NAT Gateway routing
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-xxx" \
  --query 'RouteTables[*].Routes[?GatewayId!=`local`]'

# If route 0.0.0.0/0 → nat-xxx, S3 traffic shares the NAT Gateway,
# which can saturate under many concurrent connections
```

**Root Cause:** NAT Gateway connection saturation.
When N concurrent Lambdas upload to S3:

- NAT Gateway has a limited connection establishment rate
- First N connections succeed (2-3s upload time)
- Remaining connections queue and time out (about 600s once boto3 connect timeouts and retries add up)
- Pattern is deterministic (always first N succeed, last M fail)

**Fix:**

```hcl
# terraform/s3_vpc_endpoint.tf
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = data.aws_vpc.default.id
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = data.aws_route_tables.vpc_route_tables.ids

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action    = "s3:*"
      Resource  = "*"
    }]
  })
}
```

**Why This Works:**

- S3 Gateway Endpoint adds routes to VPC route tables
- S3 traffic bypasses NAT Gateway (direct AWS network path)
- Uploads no longer compete for the NAT Gateway's connection establishment capacity
- FREE (Gateway endpoints have no hourly charge)
- About 200x faster in this incident (2-3s vs 600s timeout)

**Verification:**

```bash
# 1. Deploy VPC endpoint
cd terraform && terraform apply

# 2. Verify endpoint created
terraform output s3_vpc_endpoint_state
# Should be "available"

# 3. Test full workflow
aws stepfunctions start-execution \
  --state-machine-arn [STATE_MACHINE_ARN] \
  --input '{"report_date":"2026-01-05"}'

# 4. Monitor for 100% success rate
aws logs tail /aws/lambda/pdf-worker --follow
# Expected: All PDFs complete in 2-3s, no timeouts
```

**Critical Insight:** **Execution Time ≠ Hang Location**

- 600s execution time doesn't mean code hangs for 600s
- It means the ENTIRE execution (including network timeout) took 600s
- Check stack traces (Layer 3) to find WHERE the timeout occurs
- Don't assume "logs stop at line X" = "code hangs at line X" (log lines can be lost when the Lambda fails)

**Pattern Recognition:**

- **Deterministic failure** (first N succeed, last M fail) → Infrastructure bottleneck (NAT, VPC endpoint)
- **Random failure** (scattered across all attempts) → Performance issue (slow API, memory pressure)
- **All fail** → Configuration issue (missing permissions, wrong endpoint)

See [Bug Hunt Report](../../bug-hunts/2026-01-05-pdf-s3-upload-timeout.md) for complete investigation.

### Scenario 4: DynamoDB PutItem Succeeds But No Data

**Symptom:** `put_item()` returns 200, but item not in table.

**Investigation Steps:**

```python
# 1. Check response
response = table.put_item(Item={'ticker': 'NVDA19', 'data': {...}})
print(f"HTTP Status: {response['ResponseMetadata']['HTTPStatusCode']}")
# Output: 200

# 2. Verify item exists (strongly consistent read rules out replication lag)
response = table.get_item(Key={'ticker': 'NVDA19'}, ConsistentRead=True)
print(response.get('Item'))
# Output: None (no item!)

# 3. Check for conditional write
# A failed condition raises ClientError (ConditionalCheckFailedException);
# look for code paths that catch and discard it
response = table.put_item(
    Item={'ticker': 'NVDA19', 'data': {...}},
    ConditionExpression='attribute_not_exists(ticker)'  # ← Condition failed?
)
```

**Root Cause:** A conditional write failed and the resulting `ConditionalCheckFailedException` was swallowed, so the failure looked silent.

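
A failed `ConditionExpression` surfaces in boto3 as a `ClientError` whose `response['Error']['Code']` is `ConditionalCheckFailedException`, so one way to keep it from being swallowed is to classify the exception explicitly before deciding how to log it. The helper below is a hedged sketch: `classify_put_failure` and its return labels are hypothetical, and it relies only on the `.response` dict that boto3 attaches to `ClientError`, so it needs no AWS access.

```python
def classify_put_failure(exc: Exception) -> str:
    """Map a boto3-style exception to a coarse category (hypothetical helper).

    Reads exc.response['Error']['Code'], the shape botocore's ClientError exposes.
    Exceptions without that attribute fall through to 'other'.
    """
    response = getattr(exc, "response", None) or {}
    code = response.get("Error", {}).get("Code", "")

    if code == "ConditionalCheckFailedException":
        # The ConditionExpression evaluated to false; nothing was written.
        return "condition_failed"
    if code in {"ProvisionedThroughputExceededException", "ThrottlingException"}:
        # The request was rejected for capacity reasons and may be retried.
        return "throttled"
    return "other"
```

A caller could log `condition_failed` with the item key and re-raise everything else at ERROR, which keeps the conditional failure visible instead of letting it pass as a successful write.
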
**Fix:**

```python
from botocore.exceptions import ClientError

# Before:
response = table.put_item(Item=item)  # ❌ No verification

# After:
try:
    response = table.put_item(Item=item)

    # Verify write (ConsistentRead avoids a false negative from an eventually consistent read)
    verify = table.get_item(Key={'ticker': item['ticker']}, ConsistentRead=True)
    if 'Item' not in verify:
        logger.error(f"Item not found after put_item: {item['ticker']}")
        raise ValueError("DynamoDB write verification failed")

except ClientError as e:
    if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
        logger.warning(f"Conditional write failed: {item['ticker']}")
    else:
        logger.error(f"DynamoDB error: {e}")
    raise
```

---

## AWS Boundary Verification

**When to apply**: Distributed system errors (Lambda, Aurora, S3, SQS, Step Functions)

**Problem**: Code looks correct locally but fails in AWS due to unverified execution boundaries

**Common boundary-related error patterns**:

### Pattern 1: Missing Environment Variable

```bash
# Error: KeyError: 'AURORA_HOST'
# Symptom: Lambda invocation fails immediately
# Root cause: Boundary violation (code → runtime)

# Code expects: os.environ['AURORA_HOST']
# Runtime provides: No such variable

# Verification:
aws lambda get-function-configuration \
  --function-name [PROJECT_NAME]-worker-dev \
  --query 'Environment.Variables'

# Compare with: Code's os.environ accesses
grep "os.environ" src/lambda_handler.py
```

### Pattern 2: Aurora Schema Mismatch

```bash
# Error: Unknown column 'pdf_s3_key' in 'field list'
# Symptom: INSERT query fails in production
# Root cause: Boundary violation (code → database)

# Code sends: INSERT INTO precomputed_reports (symbol, pdf_s3_key)
# Aurora has: No pdf_s3_key column

# Verification:
mysql> SHOW COLUMNS FROM precomputed_reports;

# Compare with: Code's INSERT statements
grep "INSERT INTO" src/data/aurora/precompute_service.py
```

### Pattern 3: Lambda Timeout

```bash
# Error: Task timed out after 30.00 seconds
# Symptom: Lambda stops mid-execution
# Root cause: Configuration mismatch (code requirements vs entity config)

# Code requires: 60s API call + 45s processing = 105s total
# Lambda configured: 30s timeout

# Verification:
aws lambda get-function-configuration \
  --function-name [PROJECT_NAME]-worker-dev \
  --query '{Timeout:Timeout, Memory:MemorySize}'

# Analyze code execution time:
grep -r "requests.get.*timeout" src/  # External API timeouts
# Sum: timeout values + processing overhead
```

### Pattern 4: Permission Denied

```bash
# Error: AccessDeniedException: User is not authorized to perform: s3:PutObject
# Symptom: S3 upload fails
# Root cause: Permission boundary violation (principal → resource)

# Code tries: s3.put_object(Bucket='reports', Key='file.pdf')
# IAM role allows: Only s3:GetObject (read-only)

# Verification (inline policy; for managed policies use
# aws iam list-attached-role-policies --role-name ...):
aws iam get-role-policy \
  --role-name [PROJECT_NAME]-worker-role-dev \
  --policy-name S3Access

# Compare with: Code's boto3 operations
grep -r "s3.*put_object\|s3.*upload" src/
```

### Pattern 5: Intention Violation

```bash
# Error: API Gateway timeout after 30 seconds
# Symptom: Client sees timeout, Lambda still processing
# Root cause: Usage doesn't match intention (sync Lambda used for async work)

# Entity designed for: Synchronous API (< 30s response)
# Code uses it for: Long-running report generation (60s)

# Verification:
# Check Terraform comments
grep -B 5 -A 10 "api-handler" terraform/lambdas.tf

# Check Lambda timeout
aws lambda get-function-configuration \
  --function-name api-handler \
  --query 'Timeout'

# Compare: API Gateway 30s limit vs Lambda timeout
```

**Boundary verification workflow for AWS errors**:

```
1. Identify error type → Map to boundary category
   - Missing env var → Process boundary (code → runtime)
   - Schema mismatch → Data boundary (code → database)
   - Timeout → Configuration boundary (requirements → entity config)
   - Permission denied → Permission boundary (principal → resource)
   - API Gateway timeout → Intention boundary (usage → design)

2. Identify physical entities involved
   - WHICH Lambda (name, ARN)
   - WHICH Aurora cluster (endpoint, database)
   - WHICH S3 bucket (name, region)
   - WHICH IAM role (name, policies)

3. Verify contract at boundary
   - Code expectations → Infrastructure reality
   - Use aws cli to inspect actual configuration
   - Compare code requirements vs entity properties

4. Apply Progressive Evidence Strengthening
   - Layer 1 (Surface): Error message
   - Layer 2 (Content): CloudWatch logs
   - Layer 3 (Observability): AWS resource configuration
   - Layer 4 (Ground Truth): Test actual execution
```

**Integration with investigation workflow**:

- **Step 1 (Identify Error Layer)**: Check if error is boundary-related
- **Step 2 (Collect Context)**: Identify which boundary was violated
- **Step 3 (Check Changes)**: Did code or infrastructure change?
- **Step 4 (Fix)**: Repair boundary contract (update code or infrastructure)

**See**: [Execution Boundary Checklist](../../checklists/execution-boundaries.md) for systematic AWS boundary verification

**Related**:

- Principle #20 (Execution Boundary Discipline) - CLAUDE.md
- Principle #2 (Progressive Evidence Strengthening) - Multi-layer verification
- Principle #15 (Infrastructure-Application Contract) - Sync code and infra

---

## Investigation Workflow

### Step 1: Identify Error Layer (5 minutes)

```bash
# Check all three layers
aws lambda invoke --function-name worker --payload '{}' /tmp/response.json

# Layer 1: Exit code
echo "Exit code: $?"

# Layer 2: Response payload
jq . /tmp/response.json

# Layer 3: CloudWatch logs
aws logs tail /aws/lambda/worker --since 5m --filter-pattern "ERROR"
```

**Questions:**

- Which layer shows the error?
- If Layer 1 OK but Layer 3 ERROR → Silent failure
- If all layers OK but wrong result → Logic error

### Step 2: Collect Error Context (10 minutes)

```bash
# Get full error details
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --start-time $(($(date +%s) - 3600))000 \
  --filter-pattern "ERROR" \
  --query 'events[*].[timestamp,message]' \
  --output table

# Get context within the first matching event
# (±5 lines of a multi-line message, e.g. a stack trace)
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --filter-pattern "ERROR" \
  | jq -r '.events[0].message' \
  | grep -C 5 "ERROR"
```

### Step 3: Check Recent Changes (5 minutes)

```bash
# When did errors start? (timestamp is in epoch milliseconds)
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --filter-pattern "ERROR" \
  --query 'events[0].timestamp' \
  --output text

# What deployed around that time?
gh run list --limit 10

# What changed in code?
git log --since="2 hours ago" --oneline
```

### Step 4: Reproduce and Fix (variable)

See [AWS-DIAGNOSTICS.md](AWS-DIAGNOSTICS.md) for service-specific diagnostic patterns.

---

## Quick Reference

### Investigation Priority

1. **Check CloudWatch logs** (Layer 3 - strongest signal)
2. **Check response payload** (Layer 2 - structured errors)
3. **Check status code** (Layer 1 - weakest signal)
4. **Verify actual outcome** (database state, S3 files, etc.)

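
The `--start-time` arguments and the `events[0].timestamp` value used in the workflow steps above are epoch milliseconds, which is why the shell snippets append `000` to `date +%s`. When scripting the same checks in Python, two small helpers cover both directions. This is a minimal sketch; `start_time_ms` and `ms_to_utc_iso` are hypothetical names rather than boto3 functions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def start_time_ms(seconds_ago: int, now: Optional[datetime] = None) -> int:
    """Epoch milliseconds for `seconds_ago` before `now` (defaults to the current UTC time)."""
    now = now or datetime.now(timezone.utc)
    return int((now - timedelta(seconds=seconds_ago)).timestamp() * 1000)


def ms_to_utc_iso(timestamp_ms: int) -> str:
    """Render an epoch-millisecond log timestamp as an ISO 8601 UTC string."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.isoformat(timespec="seconds")
```

For example, `start_time_ms(3600)` yields a value suitable for the `startTime` parameter of `filter_log_events`, and `ms_to_utc_iso` turns the first error's timestamp into something that can be compared against deploy times from `gh run list`.
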
### Common Failure Modes

| Symptom | Likely Cause | Investigation |
|---------|--------------|---------------|
| **200 OK but errors in logs** | Silent failure | Check rowcount, verify writes |
| **INFO logs not showing** | Root logger level = WARNING | Set root logger to INFO |
| **Timeout** | Cold start, external API slow | Check duration metrics |
| **Permission denied** | IAM policy missing | Simulate permissions |
| **0 rows affected** | FK constraint, ENUM mismatch | Check constraints |

---

## File Organization

```
.claude/skills/error-investigation/
├── SKILL.md              # This file (entry point)
├── AWS-DIAGNOSTICS.md    # AWS-specific diagnostic patterns
└── LAMBDA-LOGGING.md     # Lambda logging configuration guide
```

---

## Next Steps

- **For AWS diagnostics**: See [AWS-DIAGNOSTICS.md](AWS-DIAGNOSTICS.md)
- **For Lambda logging**: See [LAMBDA-LOGGING.md](LAMBDA-LOGGING.md)
- **For general debugging**: See research skill

---

## References

- [AWS Lambda Troubleshooting](https://docs.aws.amazon.com/lambda/latest/dg/lambda-troubleshooting.html)
- [CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html)
- [Python Logging HOWTO](https://docs.python.org/3/howto/logging.html)
- [AWS SDK Error Handling](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/error-handling.html)