---
name: design-monitoring
description: >
  Implement monitoring for a project by consuming augur's monitoring-spec.yaml.
  Produces Grafana dashboard JSON, Prometheus alert rules, and validates that
  the running service emits the expected metrics. Aligns with the infra-atlas
  new_workload_contract for observability.
argument-hint: "<project> [--scope full|dashboards|alerts|validate] [--dry-run]"
---

Implement a monitoring system for a project from augur's monitoring spec. Augur designs the spec (metrics, alerts, dashboards); sauron implements it as concrete Grafana JSON and Prometheus rules, then validates live metric emission.

## Arguments

`$ARGUMENTS` — Required: `<project>`. Optional:

- `--scope full|dashboards|alerts|validate` — focus on a specific area (default: full)
- `--dry-run` — generate configs but do not push to Grafana or apply alert rules

## Input: Augur's monitoring-spec.yaml

This skill consumes the monitoring spec that augur produces during `/design --approve`. The spec follows this schema:

```yaml
version: "1"
project: <project>
generated_from: design-atlas.json
metrics:
  - name: <metric_name>
    type: counter|gauge|histogram
    labels: []
    source_pattern: <pattern>
    description: <text>
alerts:
  - name: <alert_name>
    condition: <expression>
    severity: critical|warning
    source_pattern: <pattern>
dashboards:
  - name: <dashboard_name>
    panels: []
```

### How to locate the spec

The daemon injects artifact paths into the job prompt. Look for:

```
[Artifacts] monitoring-spec: <path>
```

Resolution order:

1. **Artifact path** — if the prompt contains `[Artifacts] monitoring-spec: <path>`, read the file at that path. This is the primary mechanism when augur delegates to sauron after `/design --approve`.
2. **Augur project memory** — if no artifact path, read from `/kord/agents/augur/memory/projects/<project>/monitoring-spec.yaml`.
3. **Fail** — if neither exists, report that no monitoring spec is available and ask the user to run `augur /design --approve` first.

## Dependencies

1. **Augur** — provides monitoring-spec.yaml (input) and atlas.json (architectural context).
   - Atlas at `/kord/agents/augur/memory/projects/<project>/atlas.json` provides: components, flows, failure modes, external dependencies.
   - If the atlas exists, use it to enrich dashboard panels and alert annotations.
2. **Charon/Alfred** — cluster access for live validation:

   ```
   /kord alfred get config
   ```

   Provides: Tailscale IPs, namespaces, service ports, kubeconfig context.
3. **Infra-atlas contract** — read from `$AGENT_PROJECT_DIR/memory/global/infra-atlas.json` (if available). The `new_workload_contract` section defines observability requirements all workloads must satisfy:
   - `health`: readiness and liveness endpoints (`GET /health`)
   - `metrics`: Prometheus endpoint (`/metrics`, Prometheus format)
   - `logging`: stdout, JSON format
   - `labels`: `app: <project>` on all pods

   All generated configs must align with these contract requirements.
4. **Sauron monitoring model** — follow the two-layer model from `memory/monitoring.md`:
   - **Alloy layer**: pod-level collection (infra metrics via cAdvisor, app metrics via `/metrics` scrape, logs via stdout)
   - **Vitals layer**: app-level health evaluation (health gauges: 0=FAIL/1=WARNING/2=OK, derived metrics)

   Generated dashboards and alerts must target the correct layer.

## Procedure

### Step 1 — Load the monitoring spec

Parse the monitoring-spec.yaml using the resolution order above, as in the sketch below.
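A minimal sketch of the resolution order, assuming PyYAML is available and the job prompt is accessible as a string; `load_monitoring_spec` and `SPEC_RE` are illustrative names, not existing helpers. It also applies the Step 1 validation checks enumerated next.

```python
# Sketch: resolve monitoring-spec.yaml per the three-step order above.
# Assumptions: the job prompt text is available as a string and PyYAML
# is installed; load_monitoring_spec and SPEC_RE are illustrative names.
import re
from pathlib import Path

import yaml

SPEC_RE = re.compile(r"\[Artifacts\] monitoring-spec:\s*(\S+)")

def load_monitoring_spec(prompt: str, project: str) -> dict:
    # 1. Artifact path injected by the daemon takes precedence.
    match = SPEC_RE.search(prompt)
    if match and Path(match.group(1)).exists():
        path = Path(match.group(1))
    else:
        # 2. Fall back to augur's project memory.
        path = Path(f"/kord/agents/augur/memory/projects/{project}/monitoring-spec.yaml")
        if not path.exists():
            # 3. Fail: no spec anywhere.
            raise FileNotFoundError(
                "No monitoring spec found. Run `augur /design --approve` first."
            )
    spec = yaml.safe_load(path.read_text())
    # Step 1 validation checks (see the list below).
    assert spec.get("version") == "1", "unsupported spec version"
    assert spec.get("metrics"), "metrics array must be non-empty"
    for m in spec["metrics"]:
        assert {"name", "type", "description"} <= m.keys(), f"incomplete metric: {m}"
    for a in spec.get("alerts", []):
        assert {"name", "condition", "severity"} <= a.keys(), f"incomplete alert: {a}"
    return spec
```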
Validate:

- `version` is `"1"`
- `metrics` array is non-empty
- Each metric has `name`, `type`, and `description`
- Each alert has `name`, `condition`, and `severity`

Also load:

- Atlas (`/kord/agents/augur/memory/projects/<project>/atlas.json`) for component context
- Infra-atlas (`$AGENT_PROJECT_DIR/memory/global/infra-atlas.json`) for contract requirements
- Existing observability catalog (`$MEM/observability-catalog.yaml`) from a previous `/monitor` scan, if available

### Step 2 — Generate Grafana dashboard JSON

For each dashboard entry in the spec, produce a complete Grafana dashboard JSON file.

**Overview dashboard** (always generated):

- Title: `<project> Overview`
- Rows: one per component group (from atlas groups, or one row per spec dashboard)
- Panels per row:
  - Request rate (counter metrics with `rate()`)
  - Error rate (counter metrics filtered by error status)
  - Latency (histogram metrics with `histogram_quantile()`)
  - Saturation (gauge metrics for resource utilization)
- Variables: `$namespace`, `$app` (pre-filled from the project name)
- Datasource: `Prometheus` (uid: use the cluster default)

**Vitals dashboard** (generated if vitals metrics are in the spec):

- Title: `<project> Vitals`
- Health gauge panels: stat panels showing the 0/1/2 state with value mappings (FAIL/WARNING/OK)
- Derived metric panels: time series for throughput, latency, lag

**Dashboard JSON structure**:

```json
{
  "title": "<project> Overview",
  "uid": "<project>-overview",
  "tags": ["<project>", "generated"],
  "templating": { "list": [/* $namespace, $app */] },
  "panels": [/* generated from spec metrics */],
  "time": { "from": "now-1h", "to": "now" },
  "refresh": "30s"
}
```

Each panel must reference specific metrics from the spec by name and include proper PromQL queries. Use the `app="<project>"` label selector to align with the infra-atlas contract's `app` label requirement.

Write dashboard files to: `$MEM/dashboards/<project>-overview.json`, `$MEM/dashboards/<project>-vitals.json`

### Step 3 — Generate Prometheus alert rules

For each alert in the spec, produce a Prometheus alerting rule in the standard format:

```yaml
groups:
  - name: <project>
    rules:
      - alert: <AlertName>
        expr: <expression>
        for: 5m
        labels:
          severity: <critical|warning>
          app: <project>
        annotations:
          summary: "<description>"
          source_pattern: "<pattern>"
          runbook_url: "<url>"
```

**Required meta-alert** (always generated):

```yaml
- alert: VitalsMissing
  expr: absent(vitals_process{app="<project>"})
  for: 5m
  labels:
    severity: critical
    app: <project>
  annotations:
    summary: "Vitals pod for <project> is not reporting. Health visibility lost."
```

**Severity routing** (document in annotations):

- `critical` — pages on-call (PagerDuty)
- `warning` — Slack notification

Write alert rules to: `$MEM/alerts/<project>-rules.yaml`

### Step 4 — Validate live metric emission

If the service is running (cluster access available), verify that the expected metrics are actually being emitted:

1. **Get pod endpoint** — use the cluster config to find the service's metrics endpoint:

   ```bash
   kubectl get pods -n <namespace> -l app=<project> -o jsonpath='{.items[0].status.podIP}'
   ```

2. **Scrape metrics** — hit the `/metrics` endpoint:

   ```bash
   curl -s http://<pod_ip>:<port>/metrics
   ```

3. **Check each spec metric** — for every metric in monitoring-spec.yaml:
   - Does a metric with this name appear in the scrape output?
   - Does it have the expected type (counter/gauge/histogram)?
   - Are the expected labels present?

4. **Check contract compliance** — verify infra-atlas requirements:
   - Pod has the `app: <project>` label
   - Pod has the `prometheus.io/scrape: "true"` annotation
   - Health endpoint responds at `/health`
   - Logs are JSON on stdout (check recent logs via `kubectl logs`)

5. **Classify results** — assign each spec metric one of the following statuses:
   - `PASS` — metric exists with correct type and labels
   - `MISSING` — metric not found in scrape output (not yet instrumented)
   - `TYPE_MISMATCH` — metric exists but wrong type
   - `LABELS_MISSING` — metric exists but missing expected labels
   - `CONTRACT_VIOLATION` — infra-atlas requirement not met

A sketch of the scrape-and-classify flow (steps 2, 3, and 5) follows.
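This is a minimal sketch, assuming the `prometheus_client` library for parsing the text exposition format; `classify_metrics` and the endpoint variables are illustrative, not existing helpers.

```python
# Sketch: scrape /metrics and classify each spec metric (steps 2, 3, 5).
# Assumptions: requests and prometheus_client are installed; pod_ip/port
# come from the kubectl lookup in step 1; classify_metrics is illustrative.
import requests
from prometheus_client.parser import text_string_to_metric_families

def classify_metrics(spec: dict, pod_ip: str, port: int) -> dict[str, str]:
    body = requests.get(f"http://{pod_ip}:{port}/metrics", timeout=10).text
    families = {f.name: f for f in text_string_to_metric_families(body)}
    results = {}
    for m in spec["metrics"]:
        name = m["name"]
        # Counters may be exposed as <name>_total, and parser versions differ
        # on whether the family name keeps the suffix, so try all forms.
        fam = (families.get(name)
               or families.get(name.removesuffix("_total"))
               or families.get(name + "_total"))
        if fam is None:
            results[name] = "MISSING"
        elif fam.type != m["type"]:
            results[name] = "TYPE_MISMATCH"
        else:
            seen = {key for sample in fam.samples for key in sample.labels}
            missing = set(m.get("labels", [])) - seen
            results[name] = "LABELS_MISSING" if missing else "PASS"
    return results
```

Contract compliance (step 4) is checked separately against pod metadata and logs via `kubectl`, and failures are recorded as `CONTRACT_VIOLATION`.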
If cluster access is unavailable, skip this step and note it in the report.

### Step 5 — Write to sauron project memory

Write all generated configs to sauron's project memory:

```
$MEM/
  dashboards/
    <project>-overview.json      # Grafana overview dashboard
    <project>-vitals.json        # Grafana vitals dashboard (if applicable)
  alerts/
    <project>-rules.yaml         # Prometheus alert rules
  validation-report.yaml         # Metric emission validation results
  implementation-status.yaml     # What was created, what needs work
```

The `$MEM` path is the project memory directory injected by the daemon: `/kord/agents/sauron/memory/projects/<project>/`

`implementation-status.yaml` tracks:

```yaml
project: <project>
generated: <timestamp>
source_spec: <path>
dashboards:
  - file: <project>-overview.json
    status: generated|pushed
    panels: <n>
  - file: <project>-vitals.json
    status: generated|pushed
    panels: <n>
alerts:
  - file: <project>-rules.yaml
    status: generated|applied
    rules: <n>
validation:
  total_metrics: <n>
  pass: <n>
  missing: <n>
  type_mismatch: <n>
  labels_missing: <n>
  contract_violations: []
```

### Step 6 — Deploy (unless --dry-run)

If not `--dry-run`:

1. **Push dashboards to Grafana** — use the Grafana API (see `grafana_api.py`; a sketch follows the report template below):

   ```python
   push_dashboard("<project>-overview.json", folder_uid="<project>")
   ```

2. **Apply alert rules** — deploy as a ConfigMap for Prometheus to pick up:

   ```bash
   kubectl create configmap <project>-alerts -n monitor \
     --from-file=alerts/ --dry-run=client -o yaml | kubectl apply --server-side -f -
   ```

3. **Provision dashboards** — deploy as a ConfigMap for Grafana:

   ```bash
   kubectl create configmap <project>-dashboards -n monitor \
     --from-file=dashboards/ --dry-run=client -o yaml | kubectl apply --server-side -f -
   ```

If `--dry-run`, write the files but do not push or apply. Note this in the report.

## Report

```
## Monitoring Implementation: <project>

**Source spec**: <path>
**Generated from**: <generated_from>

### Dashboards

| Dashboard | Panels | Status |
|-----------|--------|--------|
| <project>-overview | N | pushed / generated (dry-run) |
| <project>-vitals | N | pushed / generated (dry-run) |

Written to: $MEM/dashboards/

### Alert Rules

| Alert | Severity | Source Pattern | Status |
|-------|----------|----------------|--------|
| <alert> | critical/warning | <pattern> | applied / generated (dry-run) |
| VitalsMissing | critical | meta-alert | applied / generated (dry-run) |

Written to: $MEM/alerts/<project>-rules.yaml
Rules: N (N critical, N warning)

### Metric Validation

| Metric | Type | Status |
|--------|------|--------|
| <metric> | counter | PASS / MISSING / TYPE_MISMATCH |

Summary: N/N metrics validated, N missing, N contract violations
(or: "Skipped — cluster access unavailable")

### Contract Compliance (infra-atlas)

- [x] app label present
- [x] prometheus.io/scrape annotation
- [x] /health endpoint responds
- [x] JSON logs on stdout

(or [ ] with an explanation for failures)

### Files written

- $MEM/dashboards/<project>-overview.json
- $MEM/dashboards/<project>-vitals.json
- $MEM/alerts/<project>-rules.yaml
- $MEM/validation-report.yaml
- $MEM/implementation-status.yaml
```
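For reference, a minimal sketch of what a `push_dashboard` helper could look like, assuming Grafana's standard dashboard HTTP API (`POST /api/dashboards/db`) and a service-account token; the `GRAFANA_URL`/`GRAFANA_TOKEN` environment variables are assumptions, and this is not the actual `grafana_api.py`.

```python
# Sketch only: a plausible push_dashboard, not the real grafana_api.py.
# Assumptions: requests is installed; GRAFANA_URL and GRAFANA_TOKEN env
# vars hold the instance URL and a service-account token.
import json
import os
from pathlib import Path

import requests

def push_dashboard(dashboard_file: str, folder_uid: str) -> None:
    dashboard = json.loads(Path(dashboard_file).read_text())
    dashboard["id"] = None  # let Grafana match/overwrite by uid instead
    resp = requests.post(
        f"{os.environ['GRAFANA_URL']}/api/dashboards/db",
        headers={"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"},
        json={"dashboard": dashboard, "folderUid": folder_uid, "overwrite": True},
        timeout=30,
    )
    resp.raise_for_status()
```

Setting `"overwrite": True` keeps repeated pushes idempotent, which matches the `generated|pushed` status tracking in `implementation-status.yaml`.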