---
name: log-analysis
description: Analyze application and system logs to identify errors, patterns, and root causes. Use log aggregation tools and structured logging for effective debugging.
---

# Log Analysis

## Overview

Logs are critical for debugging and monitoring. Effective log analysis quickly identifies issues and enables root cause analysis.

## When to Use

- Troubleshooting errors
- Performance investigation
- Security incident analysis
- Auditing user actions
- Monitoring application health

## Instructions

### 1. **Structured Logging**

```javascript
// Good: Structured logs (machine-readable).
// Most structured loggers add the level and timestamp automatically,
// so pass only the event-specific fields.
logger.info({
  service: 'auth-service',
  user_id: '12345',
  action: 'user_login',
  status: 'success',
  duration_ms: 150,
  ip_address: '192.168.1.1'
});

// Bad: Unstructured logs (hard to parse)
console.log('User 12345 logged in successfully in 150ms from 192.168.1.1');
```

JSON format (Elasticsearch friendly):

```json
{
  "@timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "api-gateway",
  "trace_id": "abc123",
  "message": "Database connection failed",
  "error": {
    "type": "ConnectionError",
    "code": "ECONNREFUSED"
  },
  "context": {
    "database": "users",
    "operation": "SELECT"
  }
}
```
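The idea can be sketched as a minimal structured logger that emits one JSON object per line. The `createLogger` helper below is hypothetical, shown only to illustrate the shape; in production you would typically use an established library such as pino or winston.

```javascript
// Minimal structured-logger sketch (hypothetical helper, for illustration):
// every entry is a single JSON line carrying timestamp, level, and service,
// plus whatever event-specific fields the caller supplies.
function createLogger(service, write = console.log) {
  const emit = (level) => (fields) =>
    write(JSON.stringify({
      '@timestamp': new Date().toISOString(),
      level,
      service,
      ...fields,
    }));
  return {
    debug: emit('DEBUG'),
    info: emit('INFO'),
    warn: emit('WARN'),
    error: emit('ERROR'),
  };
}

// Usage: each line is machine-parseable with JSON.parse.
const logger = createLogger('auth-service');
logger.info({ action: 'user_login', user_id: '12345', duration_ms: 150 });
```

Because every entry is valid JSON, downstream tools (Logstash, CloudWatch, `jq`) can filter and aggregate on fields instead of regex-matching free text.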
### 2. **Log Levels & Patterns**

```yaml
Log Levels:
  DEBUG: Detailed diagnostic info
    - Variable values
    - Function entry/exit
    - Intermediate calculations
    - Use: Development only

  INFO: General informational messages
    - Startup/shutdown
    - User actions
    - Configuration changes
    - Use: Production (normal operations)

  WARN: Warning messages (potential issues)
    - Deprecated API usage
    - Performance degradation
    - Resource limits approaching
    - Use: Production (investigate soon)

  ERROR: Error conditions
    - Failed operations
    - Exceptions
    - Failed requests
    - Use: Production (action required)

  FATAL/CRITICAL: System unusable
    - Critical failures
    - Out of memory
    - Data corruption
    - Use: Production (immediate action)

---

Log Patterns:
  Request Logging:
    - Request ID (trace_id)
    - Method + path
    - Status code
    - Duration
    - Request size / response size

  Error Logging:
    - Error type/code
    - Error message
    - Stack trace
    - Context (user_id, session_id)
    - Timestamp

  Business Events:
    - Event type
    - User involved
    - Impact/importance
    - Timestamp
    - Relevant context
```
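The level hierarchy above is what lets a deployment tune verbosity: a logger configured with a threshold emits only entries at that severity or higher. A minimal sketch, assuming a simple numeric ranking rather than any particular library's API:

```javascript
// Numeric severity ranking: higher number = more severe.
const LEVELS = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40, FATAL: 50 };

// An entry is emitted only if its level meets the configured threshold.
function shouldLog(threshold, level) {
  return LEVELS[level] >= LEVELS[threshold];
}

// A production logger set to INFO drops DEBUG noise but keeps warnings
// and errors; a development logger set to DEBUG keeps everything.
shouldLog('INFO', 'DEBUG'); // → false
shouldLog('INFO', 'ERROR'); // → true
```

This is why DEBUG is safe to sprinkle liberally in code: it costs nothing in production as long as the threshold is INFO or above.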
### 3. **Log Analysis Tools**

```yaml
Log Aggregation:
  ELK Stack (Elasticsearch, Logstash, Kibana):
    - Logstash: Parse and process logs
    - Elasticsearch: Search and analyze
    - Kibana: Visualization and dashboards
    - Use: Large scale, complex queries

  Splunk:
    - Comprehensive log management
    - Real-time search and analysis
    - Dashboards and alerts
    - Use: Enterprise (expensive)

  CloudWatch (AWS):
    - Integrated with AWS services
    - Logs Insights for querying
    - Dashboards
    - Use: AWS-based systems

  Datadog:
    - Application performance monitoring
    - Log management
    - Real-time alerts
    - Use: SaaS monitoring

---

Log Analysis Techniques:
  Grep/Awk:
    - grep "ERROR" app.log
    - awk '{print $1, $4}' app.log

  Filtering:
    - By timestamp
    - By service
    - By error type
    - By user

  Searching:
    - Error patterns
    - User actions
    - Trace IDs
    - IP addresses

  Aggregation:
    - Count occurrences
    - Group by error type
    - Calculate duration percentiles
    - Error rate over time
```

### 4. **Common Log Analysis Queries**

```yaml
Find errors in the past hour:
  timestamp: last_1h AND level: ERROR

Track user activity:
  user_id: 12345 AND action: *

Find slow requests:
  duration_ms: >1000 AND level: INFO

Analyze error rate by service:
  level: ERROR | stats count by service

Find failed database operations:
  error.type: "DatabaseError" | stats count

Trace request flow:
  trace_id: "abc123" | sort by timestamp

---

Checklist:
  [ ] Structured logging implemented
  [ ] All errors logged with context
  [ ] Request IDs/trace IDs used
  [ ] Sensitive data not logged (passwords, tokens)
  [ ] Log levels used appropriately
  [ ] Log retention policy set
  [ ] Log sampling for high-volume events
  [ ] Alerts configured for errors
  [ ] Dashboards created
  [ ] Regular log review scheduled
  [ ] Log analysis tools accessible
  [ ] Team trained on querying logs
```

## Key Points

- Use structured JSON logging
- Include trace IDs for request tracking
- Log at appropriate levels (DEBUG/INFO/WARN/ERROR)
- Never log sensitive data (passwords, tokens)
- Aggregate logs centrally
- Create dashboards for key metrics
- Alert on error rates and critical issues
- Retain logs appropriately
- Search logs by trace ID for troubleshooting
- Review logs regularly for patterns
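When a full aggregation stack is not at hand, the query patterns from the instructions (filter, group, count, trace) can be approximated directly over JSON-line logs. A sketch with hypothetical sample entries, mirroring `level: ERROR | stats count by service` and a `trace_id` lookup:

```javascript
// Hypothetical JSON-line log entries, for illustration only.
const rawLines = [
  '{"level":"ERROR","service":"api-gateway","trace_id":"abc123"}',
  '{"level":"INFO","service":"auth-service","trace_id":"abc123"}',
  '{"level":"ERROR","service":"api-gateway","trace_id":"def456"}',
];

// Approximates: level: ERROR | stats count by service
function errorCountByService(lines) {
  const counts = {};
  for (const line of lines) {
    const entry = JSON.parse(line);
    if (entry.level !== 'ERROR') continue;            // filter
    counts[entry.service] = (counts[entry.service] || 0) + 1; // group + count
  }
  return counts;
}

// Approximates: trace_id: "abc123" — follow one request across services.
function byTraceId(lines, traceId) {
  return lines.map((l) => JSON.parse(l)).filter((e) => e.trace_id === traceId);
}

errorCountByService(rawLines); // → { 'api-gateway': 2 }
```

The same filter/group/count shape is what the query languages in section 4 express declaratively; doing it by hand like this only scales to small log samples, which is exactly why central aggregation matters.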