--- name: diagnose-clickhouse-clusters description: Diagnose ClickHouse cluster health and provide concrete remediation. --- # Tool Usage Rules - Call `collect_cluster_status` before health conclusions about current cluster health. - For RCA questions, call `collect_rca_evidence` directly when the symptom and target are already clear. Use `collect_cluster_status` first only when you need current health context, severity/outliers, or help choosing the RCA symptom/scope. - Use only supported Phase 1 RCA symptoms: `high_part_count` and `unknown`. - For bounded-time questions, use `status_analysis_mode="windowed"` and reuse the same time window in follow-up calls. - If user asks for a chart, use the `visualization` skill. Do not emit chart specs directly from this skill. - Do not invent custom health-check SQL. Use tool outputs as the source of truth. # Workflow (MANDATORY) 1. Determine whether the user asks for status only, or root cause ("why", "root cause", "reason", "caused by", "explain"). 2. For RCA questions, pick one supported canonical symptom key based on user wording, explicit target details, and, when needed, status findings. 3. Explain from tool output only: top candidates, support score, evidence lists, gaps, and prioritized actions. # Severity Thresholds (Guidance) - CRITICAL: replication lag > 300s, disk usage > 90% - WARNING: replication lag > 60s, disk usage > 80% - OK: metrics within normal ranges Do not hardcode parts thresholds in responses. Use the thresholds and severities returned by `collect_cluster_status`. # Output Format (MANDATORY) Use one of these two formats: ## A) Status-only question 1. Summary table: Always print a table title line exactly before the table: `### Summary`. | Status | Nodes with Issues | Checks Run | Timestamp | |--------|-------------------|------------|-----------| | 🟢 OK / 🟠 WARNING / 🔴 CRITICAL | N | categories | ISO8601 | 2. Findings by category: Always print a table title line exactly before the table: `### Findings by Category`. Use a markdown table (not bullets) with one row per category. Required columns: | Category | Status | Key Metrics | Top Outlier / Scope | Notes | |----------|--------|-------------|----------------------|-------| | parts / errors / replication / ... | 🟢 OK / 🟠 WARNING / 🔴 CRITICAL | concise metric values with thresholds | node/table if present, else `-` | one short phrase | Table rules: - Include all categories returned by `collect_cluster_status` in stable order. - Status must include both emoji and text (for example `🟠 WARNING`), never emoji-only. - Markdown table cells do not reliably support line breaks in this UI. Do not try to render multi-line bullets in a cell. - In `Key Metrics`, put the 1-2 most important metrics only (single-line, semicolon-separated if needed). - Put additional metrics in `Notes` as compact key/value items (single-line). - Put numeric values first (for example `max_parts_per_table=533 (>500)`), avoid prose-heavy sentences. - Always wrap database/table identifiers in backticks (for example `` `db.table` `` or `` `db` ``) in all table cells. - If category has sub-findings (for example top errors), keep them in `Notes` as compact comma-separated items. - If no outlier exists, set `Top Outlier / Scope` to `-`. 3. Recommendations (max 3 items; each item = title + why + concrete SQL/command if needed). ## B) RCA question ("why", "cause", "reason", "explain") Use compact structure only: 1. **RCA Verdict**: one sentence, max 30 words. 2. **Top Candidates**: markdown table with max 3 rows: `cause | support_score | evidence`. In `evidence`, render up to 3 `evidence_for` items prefixed with `✓` and up to 2 `evidence_against` items prefixed with `✗`, separated by `
`. When `excluded_candidates` is non-empty, include at least one excluded reason as a `✗` item for the most relevant row. Evidence fidelity rules: - Use only `candidate.evidence_for` and `candidate.evidence_against` from `collect_rca_evidence` for that row. - Do not pull extra lines from top-level `observations`, other candidates, or status output into the evidence cell. - Do not restate raw metrics unless they already appear inside `candidate.evidence_for` or `candidate.evidence_against`. - Preserve the candidate/tool counts: if helpful, you may mention `indicators_matched/indicators_checked`, but never imply more matched checks than the tool returned. 3. **Possible Actions**: max 3 numbered items, sorted by impact. Formatting rule: print the line `3. **Possible Actions**`, then a blank line, then an indented nested numbered list using exactly ` 1.`, ` 2.`, ` 3.`. Do not continue the outer top-level numbering for action items. 4. **Gaps / Next Checks**: max 2 bullets. Formatting rule: print the line `4. **Gaps / Next Checks**`, then a blank line, then indented bullets using exactly ` -`. RCA brevity limits: - Keep total RCA response under 220 words (excluding SQL command blocks). - Do not add long background/theory paragraphs. - Use direct statements and numeric evidence. # Critical Rules - ALWAYS call `collect_cluster_status` before giving any opinion on current health. - Use `status_analysis_mode="windowed"` when user asks for a bounded time window or historical context. - For RCA questions, MUST call `collect_rca_evidence`. `collect_cluster_status` is optional unless current health context is needed. - Do NOT state root causes without RCA evidence output. - If `gaps[]` is non-empty, explicitly state what evidence is missing. - If all candidates have `support_score < 0.3`, state that the RCA is inconclusive and use candidate `next_checks` plus `gaps` to explain what to inspect next. - If best candidate is weak (`0.30-0.39`), present it as a possibility with caveats and emphasize candidate `next_checks`. - Never fabricate or merge evidence lines across candidates. Candidate rows must be traceable directly to that candidate's `evidence_for` and `evidence_against`. - If `collect_rca_evidence.related_symptoms` is non-empty, include a line `Related symptoms:` and list them. - When follow-up questions omit time range, reuse the most recent explicit time window/range from prior turns. - Never assume schema or table names; use only what tools return. - Do not invent custom health-check SQL; use tool outputs as source of truth. - Be concise and focus on remediation, not theory.