---
name: bmad-observability-readiness
description: Establishes instrumentation, monitoring, and alerting foundations.
allowed-tools: ["Read", "Write", "Grep", "Bash"]
metadata:
  auto-invoke: true
  triggers:
    patterns:
      - "add logging"
      - "monitoring setup"
      - "no telemetry"
      - "instrument this"
      - "observability gaps"
      - "alert fatigue"
      - "SLO dashboard"
    keywords:
      - observability
      - logging
      - monitoring
      - tracing
      - metrics
      - alerting
      - telemetry
  capabilities:
    - instrumentation-design
    - metrics-cataloging
    - logging-standards
    - alert-tuning
    - slo-definition
  prerequisites:
    - bmad-architecture-design
    - bmad-test-strategy
  outputs:
    - observability-plan
    - instrumentation-backlog
    - slo-dashboard-spec
---

# BMAD Observability Readiness Skill

## When to Invoke

Use this skill when the user:

- Mentions missing or low-quality logging, metrics, or tracing.
- Requests monitoring/alerting setup before a launch or major release.
- Needs SLOs, dashboards, or on-call runbooks.
- Reports alert fatigue or noise that needs rationalization.
- Wants to ensure performance and reliability work has data coverage.

If instrumentation already exists and only specific bug fixes are required, hand over to `bmad-development-execution` with the backlog produced here.

## Mission

Deliver a comprehensive observability plan that enables diagnosis, alerting, and measurement across the system. Ensure downstream performance, reliability, and security work has trustworthy telemetry.

## Inputs Required

- Architecture diagrams and component inventory.
- Existing logging/monitoring/tracing configuration (if any).
- Current incidents, outages, or blind spots experienced by the team.
- SLAs/SLOs, business KPIs, or compliance reporting requirements.

## Outputs

- **Observability plan** detailing metrics, logs, traces, dashboards, and retention policies.
- **Instrumentation backlog** with implementation tasks, owners, and acceptance criteria.
- **SLO dashboard specification** covering golden signals, alert thresholds, and runbook links.
- Updated runbook or escalation paths if gaps were discovered.

## Process

1. Audit current telemetry coverage, tooling, and data retention. Document gaps.
2. Define observability objectives aligned with user journeys and business KPIs.
3. Design instrumentation strategy: metrics taxonomy, structured logging, trace spans, event schemas.
4. Establish SLOs, SLIs, and alerting strategy with on-call expectations and noise controls.
5. Produce dashboards/reporting requirements and data governance notes.
6. Create backlog with prioritized instrumentation tasks and verification approach.

## Quality Gates

- Every critical user journey has metrics and alerts defined (latency, errors, saturation, traffic).
- Logging standards specify structure, PII handling, and retention.
- Alert runbooks documented or flagged for creation.
- Observability plan references integration with performance, security, and incident workflows.

## Error Handling

- If telemetry tooling is undecided, present comparative options with trade-offs.
- Highlight dependencies on platform teams or infrastructure before finalizing timeline.
- Escalate when observability requirements conflict with compliance or privacy constraints.
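
## Example: Burn-Rate Alerting Sketch

The Process section asks for SLOs, SLIs, and an alerting strategy with noise controls, but does not prescribe a method. One common way to limit alert noise is multi-window error-budget burn-rate alerting: page only when both a short and a long window are consuming the error budget faster than a chosen threshold. The sketch below is a minimal illustration under stated assumptions, not part of the skill itself. The names `SloWindow`, `burn_rate`, and `should_page` are hypothetical, and the 99.9% target and 14.4 threshold in the usage note are example values, not recommendations from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts observed over one evaluation window (hypothetical type)."""

    total_requests: int
    failed_requests: int


def error_budget(slo_target: float) -> float:
    """Return the fraction of requests allowed to fail, e.g. 0.001 for 99.9%."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """Return observed error rate divided by the error budget.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO allows; values above 1.0 mean it will run out early.
    """
    if window.total_requests == 0:
        return 0.0  # No traffic in the window, so no budget was consumed.
    error_rate = window.failed_requests / window.total_requests
    return error_rate / error_budget(slo_target)


def should_page(
    short: SloWindow,
    long: SloWindow,
    slo_target: float,
    threshold: float,
) -> bool:
    """Page only if BOTH windows exceed the threshold.

    The long window shows the problem is sustained; the short window shows
    it is still happening, which lets the alert clear soon after recovery.
    """
    return (
        burn_rate(short, slo_target) >= threshold
        and burn_rate(long, slo_target) >= threshold
    )
```

For example, with an assumed 99.9% SLO and an assumed threshold of 14.4, calling `should_page(SloWindow(1000, 20), SloWindow(12000, 180), 0.999, 14.4)` compares burn rates of roughly 20 and 15 against the threshold. The window lengths, targets, and thresholds would come from the SLO dashboard specification this skill produces.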