---
name: ops-fires
description: Production incidents dashboard. Reads ECS health, Sentry errors, CI failures. Offers to dispatch fix agents for active fires.
argument-hint: "[project-alias|all]"
allowed-tools:
  - Bash
  - Read
  - Grep
  - Glob
  - Skill
  - Agent
  - AskUserQuestion
  - TeamCreate
  - SendMessage
  - TaskCreate
  - TaskUpdate
  - Monitor
  - WebFetch
  - WebSearch
  - mcp__sentry__search_issues
  - mcp__sentry__get_issue_details
effort: medium
maxTurns: 30
---

# OPS ► FIRES

## Runtime Context

Before executing, load available context:

1. **Daemon health**: Read `${CLAUDE_PLUGIN_DATA_DIR:-$HOME/.claude/plugins/data/ops-ops-marketplace}/daemon-health.json`
   - Check `infra-monitor` service status — if not running, pre-gathered infra data may be stale
   - If `action_needed` is not null → surface it immediately as a potential fire

2. **Secrets**: AWS credentials are required for ECS/CloudWatch queries.

### Secret Resolution

- First: check `$AWS_ACCESS_KEY_ID` / `$AWS_PROFILE` env vars
- Then: `doppler secrets get AWS_ACCESS_KEY_ID --plain` (if `doppler` configured in prefs)
- Then: use `password_manager_config.query_cmd` from preferences
- Sentry token: `$SENTRY_AUTH_TOKEN` → Doppler `SENTRY_AUTH_TOKEN` → vault

3. **Preferences**: Read `${CLAUDE_PLUGIN_DATA_DIR}/preferences.json` for `secrets_manager` config to know which vault to query.

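
The resolution order under Secret Resolution can be sketched as a small shell function. This is a minimal illustration, not the plugin's actual implementation: `resolve_secret` and the `OPS_VAULT_QUERY_CMD` variable are hypothetical names, the latter standing in for `password_manager_config.query_cmd` from preferences. The sketch also simplifies the Doppler condition to "the CLI is on PATH" instead of reading prefs.

```shell
# Hypothetical helper: print the first value found for a secret name.
# Order mirrors the list above: env var, then Doppler, then password manager.
resolve_secret() {
  name="$1"

  # 1. Exported environment variable (printenv prints nothing if unset).
  value="$(printenv "$name" 2>/dev/null || true)"
  if [ -n "$value" ]; then printf '%s\n' "$value"; return 0; fi

  # 2. Doppler, attempted only when the CLI is on PATH (a simplification
  #    of "if doppler configured in prefs").
  if command -v doppler >/dev/null 2>&1; then
    value="$(doppler secrets get "$name" --plain 2>/dev/null || true)"
    if [ -n "$value" ]; then printf '%s\n' "$value"; return 0; fi
  fi

  # 3. Password manager. OPS_VAULT_QUERY_CMD is an assumed variable that
  #    would hold password_manager_config.query_cmd; the secret name is
  #    passed to it as the final argument.
  if [ -n "${OPS_VAULT_QUERY_CMD:-}" ]; then
    value="$(sh -c "$OPS_VAULT_QUERY_CMD \"\$1\"" resolve_secret "$name" 2>/dev/null || true)"
    if [ -n "$value" ]; then printf '%s\n' "$value"; return 0; fi
  fi

  # Nothing found in any source.
  return 1
}
```

A caller might use it as `SENTRY_AUTH_TOKEN="$(resolve_secret SENTRY_AUTH_TOKEN)" || echo "no Sentry token found" >&2`, so a missing secret surfaces as a non-zero exit instead of an empty string passed silently to later commands.
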
## CLI/API Reference

### aws CLI

| Command | Usage | Output |
|---------|-------|--------|
| `aws ecs list-services --cluster --query 'serviceArns'` | ECS services | ARN list |
| `aws ecs describe-services --cluster --services --query 'services[0].{status:status,running:runningCount,desired:desiredCount}'` | Service health | JSON |
| `aws logs tail /ecs/ --since 1h --format short` | ECS logs | Log lines (use with Monitor for live) |

### gh CLI (GitHub)

| Command | Usage | Output |
|---------|-------|--------|
| `gh run list --limit 20 --json status,conclusion,name,headBranch,createdAt` | Recent CI runs | JSON array |
| `gh run view --repo --log-failed` | Failed CI logs | Log output |

### sentry-cli / Sentry API

| Command | Usage | Output |
|---------|-------|--------|
| `sentry-cli issues list --project --status unresolved` | Unresolved issues | Issue list |
| `curl -H "Authorization: Bearer $SENTRY_AUTH_TOKEN" "https://sentry.io/api/0/projects///issues/?query=is:unresolved"` | API fallback when MCP unavailable | JSON array |

---

## Agent Teams support

If `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1` is set, use **Agent Teams** when dispatching multiple fix agents simultaneously. This enables:

- Fix agents share findings (e.g., API agent discovers DB is the root cause → infra agent pivots to DB fix)
- You can prioritize: "CRITICAL ECS issue first, then CI failures"
- Real-time progress: agents report as they find root causes, you can merge fixes in optimal order

**Team setup** (only when flag is enabled, dispatch phase):

```
TeamCreate("fire-fixers")
Agent(team_name="fire-fixers", name="fix-[service]", ...)
```

If the flag is NOT set, use standard parallel subagents.

## Pre-gathered infrastructure data

```!
${CLAUDE_PLUGIN_ROOT}/bin/ops-infra 2>/dev/null || echo '{"clusters":[],"error":"infra check failed"}'
```

## CI failures (last 24h)

```!
${CLAUDE_PLUGIN_ROOT}/bin/ops-ci 2>/dev/null || echo '[]'
```

## External projects health

```!
${CLAUDE_PLUGIN_ROOT}/bin/ops-external 2>/dev/null || echo '[]'
```

## Your task

Analyze the pre-gathered data — including external projects. Then run parallel checks:

1. **ECS health** — parse infra data for unhealthy services, stopped tasks, failed deployments.
2. **Sentry** — if Sentry MCP is connected, query recent unresolved errors. Otherwise note it's unavailable.
3. **CI** — parse CI data for failing pipelines, broken main/dev branches.
4. **GitHub Actions** — `gh run list --limit 20 --json status,conclusion,name,headBranch,createdAt 2>/dev/null`
5. **External projects** — parse ops-external data. Flag `auth_expired` as HIGH (credential rotation needed), `unreachable`/`degraded` as MEDIUM, `not_configured` as LOW.

Classify each issue by severity:

| Severity | Criteria                                          |
| -------- | ------------------------------------------------- |
| CRITICAL | Service down, DB unreachable, auth broken         |
| HIGH     | Elevated error rate, deploy stuck, CI main broken |
| MEDIUM   | Non-critical service degraded, flaky tests        |
| LOW      | Warning-level, non-urgent                         |

---

## Output format

```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 OPS ► FIRES DASHBOARD — [timestamp]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

CRITICAL  [service] — [issue] — [since]
HIGH      [service] — [issue] — [since]
MEDIUM    [service] — [issue] — [since]

ECS HEALTH
 [cluster] [service] [desired/running] [status]

CI STATUS
 [repo] [branch] [workflow] [status] [last run]

SENTRY (top errors, 24h)
 [error] [count] [first seen] [project]

EXTERNAL PROJECTS
 [alias] [source] [status] [details — e.g. auth_expired, unreachable]

──────────────────────────────────────────────────────
```

Use **batched AskUserQuestion calls** (max 4 options each). Only show relevant actions (e.g., skip dispatch options if no issues found):

AskUserQuestion call 1:

```
[Dispatch fix agent for [top critical issue]]
[Dispatch fix agent for [second issue]]
[View logs for [service]]
[More...]
```

AskUserQuestion call 2 (only if "More..."):

```
[Open Sentry dashboard]
[Open GitHub Actions]
[All clear — nothing to do]
```

If no fires: show "ALL SYSTEMS OPERATIONAL" with last-checked timestamps.

---

## Dispatch fix agent

When the user chooses to fix an issue, use `AskUserQuestion` to confirm the scope before dispatching:

```
Dispatch fix agent for: [issue title]
 Severity: [CRITICAL/HIGH/MEDIUM]
 Repo: [repo]
 Error: [brief description]

 The agent will:
 - Investigate root cause in [repo]
 - Create feature branch with fix
 - Open PR for review

 [Dispatch agent] [Show me the logs first] [Skip — I'll fix manually]
```

On confirmation, spawn an Agent with:

- The error details and logs
- Access to the relevant repo
- Instructions to create a feature branch, fix the issue, and open a PR
- Instructions to report back when done or blocked

Use the `agents/infra-monitor.md` agent definition for infra issues.

If `$ARGUMENTS` contains a project alias, filter to that project's services only.

---

## Native tool usage

### Monitor — live service health

Use `Monitor` to stream ECS task logs or GitHub Actions runs when investigating fires:

```
Monitor(command: "aws logs tail /ecs/ --follow --since 5m")
```

### Tasks — incident tracking

Use `TaskCreate` for each active fire. Update with `TaskUpdate` as fires are investigated/fixed/escalated.

### WebFetch — status pages

When diagnosing fires, use `WebFetch` to check the AWS status page (`https://health.aws.amazon.com/health/status`), Vercel status, or third-party API status pages.

### WebSearch — known outage patterns

Use `WebSearch` to check whether the error pattern matches a known AWS/infrastructure issue (e.g., "ECS task stopped CannotPullContainerError" → known ECR throttling).
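
One way to build such a query from an ECS stopped reason is sketched below. It assumes the reason string often starts with an error code followed by a colon (as in the `CannotPullContainerError` example); `build_outage_query` is a hypothetical helper name, not part of this plugin.

```shell
# Hypothetical helper: turn an ECS stopped-reason string into a search query.
build_outage_query() {
  reason="$1"
  # Everything before the first colon, e.g. "CannotPullContainerError".
  code="${reason%%:*}"
  case "$code" in
    # Prefix looks like an error code: search on the code alone.
    *Error) printf 'ECS task stopped %s\n' "$code" ;;
    # Otherwise keep the full reason so no detail is lost.
    *)      printf 'ECS task stopped %s\n' "$reason" ;;
  esac
}
```

Searching on the bare error code tends to match known-issue reports better than the full message, which usually embeds account-specific image names or ARNs.
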