---
name: datadog-operations
description: "Comprehensive Datadog operations: query APM/logs/metrics/RUM/database, create monitors/dashboards/synthetics, manage incidents, trigger workflows, analyze costs and LLM usage. 73% platform coverage including service catalog, uptime monitoring, and frontend performance. Use for debugging, automation, incident response, and cost optimization."
author: ryan-maclean
version: 3.0.0
---

# Datadog Operations

Complete Datadog automation: query APIs, create infrastructure, manage incidents, and automate responses. 73% platform coverage with 17 working scripts.

## What This Skill Does

**Investigation & Analysis:**
- Query APM traces to identify performance bottlenecks
- Search logs for error patterns and anomalies
- Detect security threats and attack attempts
- Analyze Watchdog anomaly detection alerts
- Query metrics with statistical analysis
- Analyze Datadog usage and costs (FinOps)
- Monitor LLM observability for GenAI applications
- Query SLO status and error budgets
- List services from service catalog
- Analyze database query performance
- Track frontend performance with RUM

**Automation & Creation:**
- Create and manage monitors with alert thresholds
- Generate dashboards for APM, security, costs, and LLM observability
- Trigger Datadog workflows for incident response
- Create and update incidents
- Mute/unmute monitors during maintenance
- Create synthetic uptime checks and browser tests

## Prerequisites

Set environment variables:

```bash
export DD_API_KEY=your_api_key
export DD_APP_KEY=your_application_key
export DD_SITE=datadoghq.com  # or datadoghq.eu, us3.datadoghq.com, etc.
```

Get keys from Datadog: Organization Settings > API Keys and Application Keys

## Working Scripts

### 1. Query APM Performance

Find slow endpoints and performance issues:

```bash
bash scripts/query-apm.sh --service my-service --duration 1h --limit 20
```

Returns:
- Endpoints sorted by P95 latency
- Request counts per endpoint
- P50, P95, P99 latency
- Problem endpoints (P95 > 500ms)

### 2. Query Security Signals

Find security threats and attack attempts:

```bash
bash scripts/query-security-signals.sh --service my-service --duration 24h
```

Returns:
- Security signals by severity (critical, high, medium, low)
- Attack types (SQL injection, XSS, auth failures)
- Affected services and hosts
- Recent security events with details

### 3. Query Watchdog Anomalies

Automated anomaly detection from Datadog Watchdog:

```bash
bash scripts/query-watchdog.sh --service my-service --type latency --duration 7d
```

Returns:
- Anomalies by type (latency, error_rate, traffic)
- Affected services and resources
- Start timestamps and severity
- Baseline vs observed values

### 4. Search Logs

Search logs for error patterns:

```bash
bash scripts/search-logs.sh --query "status:error service:my-service" --duration 1h
```

Returns:
- Error messages grouped by frequency
- Associated trace IDs for investigation
- Service and host breakdowns
- Common error patterns

### 5. Query Metrics

Fetch metric data with statistical analysis:

```bash
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 24h
```

Returns:
- Time series data
- Statistics (min, max, avg, p50, p95, p99)
- Trend analysis (increasing, decreasing, stable)
- Anomaly detection (values > 2 std dev)

### 6. Analyze Usage and Costs

FinOps cost analysis and optimization:

```bash
bash scripts/analyze-usage-cost.sh --duration 30d --product all
```

Returns:
- APM span ingestion (indexed vs ingested)
- Log volume breakdown
- Infrastructure hosts and container hours
- Custom metrics count
- Estimated monthly costs by product
- Cost optimization recommendations

### 7. Analyze LLM Performance

For GenAI applications, analyze LLM observability data:

```bash
bash scripts/analyze-llm.sh --service my-llm-app --duration 24h
```

Returns:
- Token usage statistics (prompt + completion)
- Cost estimates based on model pricing
- Model latency (P50, P95, P99)
- Error rates by model
- Most expensive operations
- Token usage trends

### 8. Manage Monitors

Create, list, mute, and manage Datadog monitors:

```bash
# List all monitors
bash scripts/manage-monitors.sh list

# Create error rate monitor
bash scripts/manage-monitors.sh create \
  --name "High Error Rate" \
  --query "avg(last_5m):sum:trace.express.request.errors{service:my-service}.as_count() > 10" \
  --message "Error rate is high @slack-alerts"

# Mute monitor for 2 hours
bash scripts/manage-monitors.sh mute --id 12345 --duration 2

# Unmute monitor
bash scripts/manage-monitors.sh unmute --id 12345
```

Returns:
- Monitor list with states (alert, warn, OK)
- Created monitor ID and details
- Mute/unmute confirmations

### 9. Create Dashboards

Generate dashboards from templates:

```bash
# Create APM performance dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Performance" --type apm

# Create security monitoring dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Security Dashboard" --type security

# Create cost analysis dashboard
bash scripts/create-dashboard.sh --title "Infrastructure Costs" --type cost

# Create LLM observability dashboard
bash scripts/create-dashboard.sh --service my-genai-app --title "LLM Performance" --type llm
```

Dashboard types:
- apm: Latency, errors, throughput by endpoint
- logs: Log volume and error analysis
- security: Security threats and attack patterns
- cost: APM, logs, infrastructure costs
- llm: Token usage, costs, model performance

### 10. Query SLOs

Check Service Level Objectives and error budgets:

```bash
# List all SLOs
bash scripts/query-slos.sh

# List SLOs for service
bash scripts/query-slos.sh --service payment-api

# List SLOs with tag
bash scripts/query-slos.sh --tag team:backend
```

Returns:
- SLO status (breaching, warning, OK)
- Current value vs target threshold
- Error budget remaining
- Error budget status (exhausted, low, healthy)

### 11. Trigger Workflows

Execute Datadog workflow automation:

```bash
# List available workflows
bash scripts/trigger-workflow.sh list

# Trigger workflow
bash scripts/trigger-workflow.sh run --id abc123

# Trigger with input data
bash scripts/trigger-workflow.sh run --id abc123 --input '{"service": "payment-api", "severity": "high"}'
```

Returns:
- Workflow list with IDs and descriptions
- Workflow instance ID when triggered
- Execution status

### 12. Manage Incidents

Create and manage incident response:

```bash
# List active incidents
bash scripts/manage-incidents.sh list --status active

# Create critical incident
bash scripts/manage-incidents.sh create \
  --title "Payment API Down" \
  --service payment-api \
  --severity SEV-1

# Update incident status
bash scripts/manage-incidents.sh update --id abc123 --status resolved

# Get incident details
bash scripts/manage-incidents.sh get --id abc123
```

Returns:
- Incident list with status and severity
- Created incident ID and details
- Incident timeline and updates

### 13. Query Service Catalog

List services and ownership metadata:

```bash
# List all services
bash scripts/query-service-catalog.sh list

# List services for team
bash scripts/query-service-catalog.sh list --team backend

# Get service details
bash scripts/query-service-catalog.sh get --service payment-api
```

Returns:
- Service metadata (kind, tier, lifecycle)
- Team ownership and contacts
- Repository links
- Integration details

### 14. Manage Synthetic Tests

Create uptime checks and API tests:

```bash
# List all synthetic tests
bash scripts/manage-synthetics.sh list

# Create API uptime check
bash scripts/manage-synthetics.sh create-api \
  --name "Payment API Uptime" \
  --url "https://api.example.com/health" \
  --method GET

# Create browser test
bash scripts/manage-synthetics.sh create-browser \
  --name "Login Flow" \
  --url "https://app.example.com/login"

# Get test results
bash scripts/manage-synthetics.sh get --id abc-123-def
```

Returns:
- Test list with status (active, paused)
- Created test ID and configuration
- Test results and uptime status

### 15. Query Database Performance

Analyze database queries and performance:

```bash
# Query database performance
bash scripts/query-database.sh --host postgres-prod --duration 1h

# Get slow queries
bash scripts/query-database.sh --host mysql-01 --duration 24h
```

Returns:
- Slow query patterns
- P95/avg query duration
- Connection metrics
- Top queries by latency

### 16. Query RUM (Real User Monitoring)

Analyze frontend performance and user experience:

```bash
# Query RUM data for application
bash scripts/query-rum.sh --application abc-123-def --duration 1h

# Get page load performance
bash scripts/query-rum.sh --application abc-123-def --duration 24h
```

Returns:
- Page load times (avg, P95)
- Frontend errors
- Top pages by traffic
- Error rate and types

### 17. Verify Setup

Validate Datadog configuration:

```bash
bash scripts/verify-setup.sh
```

Returns:
- Environment variable validation
- Agent connectivity check
- Tracer installation detection

## Incident Investigation Workflow

When investigating production issues:

**1. Identify scope**
```bash
# Check for security threats
bash scripts/query-security-signals.sh --severity critical --duration 1h

# Check for anomalies
bash scripts/query-watchdog.sh --service affected-service --duration 24h
```

**2. Find performance issues**
```bash
# Find slow endpoints
bash scripts/query-apm.sh --service affected-service --duration 1h

# Check error patterns
bash scripts/search-logs.sh --service affected-service --status error --duration 1h
```

**3. Analyze metrics**
```bash
# Check latency trends
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service affected-service --duration 24h

# Check error rate trends
bash scripts/query-metrics.sh --metric "trace.express.request.errors" --service affected-service --duration 24h
```

**4. Get specific traces**
```bash
# Get error traces
bash scripts/query-apm.sh --service affected-service --status error --limit 10

# Search logs for trace context
bash scripts/search-logs.sh --query "trace_id:abc123def456"
```

## Security Analysis Workflow

Monitor and investigate security threats:

```bash
# Check critical security signals
bash scripts/query-security-signals.sh --severity critical --duration 7d

# Analyze specific service
bash scripts/query-security-signals.sh --service payment-api --duration 24h

# Search for attack patterns in logs
bash scripts/search-logs.sh --query "sql injection OR xss OR authentication failed" --duration 24h
```

## Cost Optimization Workflow

Analyze and optimize Datadog costs:

```bash
# Get full cost breakdown
bash scripts/analyze-usage-cost.sh --duration 30d --product all

# Focus on APM costs
bash scripts/analyze-usage-cost.sh --duration 30d --product apm

# Extract high-priority recommendations
bash scripts/analyze-usage-cost.sh --duration 30d --product all | jq '.recommendations[] | select(.priority == "high")'

# Track weekly trends
bash scripts/analyze-usage-cost.sh --duration 7d --product all | jq '.cost_summary'
```

## LLM Observability Workflow

For GenAI applications, monitor token usage and costs:

```bash
# Analyze LLM performance
bash scripts/analyze-llm.sh --service my-genai-app --duration 24h

# Filter by specific model
bash scripts/analyze-llm.sh --service my-genai-app --model gpt-4 --duration 7d

# Find most expensive operations
bash scripts/analyze-llm.sh --service my-genai-app --duration 30d | jq '.operations | sort_by(.total_cost_usd) | reverse | .[0:5]'

# Track token usage trends
bash scripts/analyze-llm.sh --service my-genai-app --duration 7d | jq '.summary.token_usage'
```

## Deployment Impact Analysis

Compare metrics before/after deployment:

```bash
# Before deployment
bash scripts/query-apm.sh --service my-service --duration 1h > before.json
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 1h >> before_metrics.json

# Deploy...

# After deployment
bash scripts/query-apm.sh --service my-service --duration 1h > after.json
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 1h >> after_metrics.json

# Compare latency
jq -s '.[0].summary.avg_p95_ms - .[1].summary.avg_p95_ms' before.json after.json

# Check for new errors
bash scripts/search-logs.sh --service my-service --status error --duration 30m
```

## Monitor Creation Workflow

Set up monitoring for new services:

```bash
# Create latency monitor
bash scripts/manage-monitors.sh create \
  --name "Payment API - High Latency" \
  --query "avg(last_5m):avg:trace.express.request.duration{service:payment-api} > 500" \
  --message "P95 latency above 500ms @slack-ops"

# Create error rate monitor
bash scripts/manage-monitors.sh create \
  --name "Payment API - Error Rate" \
  --query "avg(last_5m):sum:trace.express.request.errors{service:payment-api}.as_count() / sum:trace.express.request.hits{service:payment-api}.as_count() > 0.05" \
  --message "Error rate above 5% @pagerduty"

# Create APM dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Performance" --type apm

# Create security dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Security" --type security
```

## Incident Response Workflow

Automated incident management:

```bash
# Check for SLO breaches
bash scripts/query-slos.sh --service payment-api | jq '.slos[] | select(.status == "breaching")'

# Create incident if SLO breached
bash scripts/manage-incidents.sh create \
  --title "Payment API SLO Breach" \
  --service payment-api \
  --severity SEV-2

# Trigger remediation workflow
bash scripts/trigger-workflow.sh run --id remediation-workflow-123 --input '{"service": "payment-api"}'

# Mute non-critical monitors during incident
bash scripts/manage-monitors.sh list --service payment-api | \
  jq '.monitors[] | select(.name | contains("non-critical")) | .id' | \
  xargs -I {} bash scripts/manage-monitors.sh mute --id {} --duration 2

# Update incident when resolved
bash scripts/manage-incidents.sh update --id abc123 --status resolved
```

## SLO Monitoring Workflow

Track service level objectives:

```bash
# Check all SLOs
bash scripts/query-slos.sh

# Alert if error budget exhausted
EXHAUSTED=$(bash scripts/query-slos.sh | jq '.summary.budget_exhausted')
if [ "$EXHAUSTED" -gt 0 ]; then
  bash scripts/manage-incidents.sh create \
    --title "Error Budget Exhausted" \
    --service affected-service \
    --severity SEV-3
fi

# Weekly SLO report
bash scripts/query-slos.sh | jq '{
  total: .total_slos,
  breaching: .summary.breaching,
  low_budget: .summary.budget_low,
  at_risk: [.slos[] | select(.error_budget_remaining < 20) | {name, budget: .error_budget_remaining}]
}'
```

## Output Format

All scripts return structured JSON for programmatic parsing:

```json
{
  "status": "ok|warning|critical|error",
  "summary": {
    "...": "aggregated metrics"
  },
  "data": [...],
  "recommendations": [...]
}
```

Status messages go to stderr, JSON to stdout. This allows:

```bash
# Silent execution, capture JSON
bash scripts/query-apm.sh --service my-service --duration 1h 2>/dev/null | jq '.summary'

# Log messages only
bash scripts/query-apm.sh --service my-service --duration 1h >/dev/null

# Both
bash scripts/query-apm.sh --service my-service --duration 1h
```

## Best Practices

**Query Optimization**
- Use specific time ranges to reduce API calls
- Filter by service/environment early
- Paginate large result sets
- Cache results when appropriate

**Alert Investigation**
- Start with Watchdog anomalies (automated detection)
- Correlate security signals with application errors
- Check metrics for trend confirmation
- Search logs for detailed context

**Cost Control**
- Run analyze-usage-cost.sh monthly
- Implement high-priority recommendations first
- Monitor sampling rates for high-volume services
- Track custom metric growth

**Security Monitoring**
- Query security signals daily (automated check)
- Filter by critical severity for alerting
- Correlate with log patterns
- Track attack trends over time

## Limitations

- API rate limits apply (varies by endpoint)
- Historical data retention depends on Datadog plan
- Real-time queries have eventual consistency
- Requires live Datadog data (APM, logs, security monitoring)

## Resources

- [Datadog API Documentation](https://docs.datadoghq.com/api/)
- [APM Query Syntax](https://docs.datadoghq.com/tracing/trace_explorer/query_syntax/)
- [Log Query Syntax](https://docs.datadoghq.com/logs/explorer/search_syntax/)
- [Watchdog Alerts](https://docs.datadoghq.com/watchdog/alerts/)
- [Security Monitoring](https://docs.datadoghq.com/security/application_security/)
- [Usage Metering API](https://docs.datadoghq.com/api/latest/usage-metering/)

## Notes

This skill provides comprehensive Datadog automation: query live data to investigate issues AND create infrastructure (monitors, dashboards, incidents) for ongoing operations. It does not handle installation or initial setup - use Datadog documentation for that.

**Investigation:** Query APM, logs, metrics, security signals, SLOs, costs, and LLM usage to debug production issues.

**Automation:** Create monitors, generate dashboards, trigger workflows, manage incidents, and mute alerts during maintenance.

All scripts return structured JSON for integration with CI/CD pipelines, ChatOps workflows, and automation platforms.