--- name: alibabacloud-elasticsearch-instance-diagnose description: | Alibaba Cloud Elasticsearch instance diagnosis skill. Use for cluster health checks, troubleshooting, and performance analysis on Elasticsearch instances. Triggers (English): Elasticsearch diagnosis, ES instance issues, slow search, write failures, cluster Red/Yellow, unassigned shards, node disconnected, load imbalance, thread pool 429, JVM/OOM/circuit breaker, disk watermark / read-only index, instance activating / change stuck, service avalanche / all shards failed. 触发词(中文): ES诊断、阿里云ES、Elasticsearch诊断、ES集群/实例故障排查、ES健康检查、集群红灯/变红/黄灯/变黄、集群异常、分片未分配、主分片未分配、节点掉线/离线、负载不均衡、搜索/查询变慢、慢查询、写入失败/变慢/拒绝、线程池打满、HTTP 429、内存过高、OOM、断路器、磁盘满/水位、索引只读、实例激活中/activating、变更卡住/未完成、雪崩、服务不可用、all shards failed。 --- # Alibaba Cloud Elasticsearch Instance Diagnosis Collect signals from **Alibaba Cloud OpenAPI (control plane)** and the **Elasticsearch REST API (data plane)**, combine them with the SOP knowledge base under `references/`, and produce **root-cause analysis**, an **evidence chain**, **prioritized remediation guidance**, and—when multiple dimensions fire—a **recency-ordered incident timeline** (severity vs time in window; see **Timeline and recency (MUST)** in §5 Step 4). **Architecture**: Alibaba Cloud Elasticsearch OpenAPI + Alibaba CloudMonitor (CMS) + Elasticsearch REST API + diagnostic SOPs **Closure**: If MUST applies and `ES_*` is set, finish authenticated ES API evidence before the final report (see **Feasibility order** in §5). --- ## 1. Prerequisites ### 1.1 Aliyun CLI > **Pre-check: Aliyun CLI >= 3.3.1 required** (for RAM permission checks and OpenAPI CLI fallback) > Run `aliyun version` to verify the version is >= 3.3.1. If the CLI is missing or too old, see `references/cli-installation-guide.md`. > After installation, run `aliyun configure set --auto-plugin-install true` to enable automatic plugin installation (**do not** pass plaintext AccessKey pairs on this command line; see §1.2). ### 1.2 Alibaba Cloud account authentication and security (MUST) > **Security rules (mandatory):** > - **NEVER** read, echo, or print AccessKey ID or AccessKey Secret values. > - **NEVER** prompt or ask the user to paste plaintext AccessKeys in the conversation. > - **NEVER** embed AccessKeys in scripts, CLI arguments, or `curl` URLs. > - **NEVER** use `aliyun configure set` (or similar) to pass **literal** AccessKey ID/Secret on the command line. > - **NEVER** accept AccessKeys that the user pastes into the chat, even if offered voluntarily. > - **ONLY** use configured CLI profiles (`aliyun configure`) or environment variables such as `ALIBABA_CLOUD_ACCESS_KEY_ID` / `ALIBABA_CLOUD_ACCESS_KEY_SECRET` that the user has set **in their local shell** (the agent **must not** echo those values in the session). > **⚠️ If the user provides AccessKeys in the chat (e.g. “my AK is xxx”)** > > 1. **Stop immediately**: do not run any Alibaba Cloud command that requires credentials. > 2. **Decline politely** and give **only** the names of approved configuration methods (**do not** repeat any secret the user may have leaked): > - Recommended: run `aliyun configure` in a local terminal and enter credentials when prompted; credentials are stored in the local profile file. > - Alternatively: set `ALIBABA_CLOUD_ACCESS_KEY_ID` / `ALIBABA_CLOUD_ACCESS_KEY_SECRET` in the local shell (the user types values **only in the terminal**, not in chat). > 3. Resume the diagnosis request only after credentials are configured correctly. > **Verify credentials without exposing secrets:** > > ```bash > aliyun configure list > aliyun --profile sts get-caller-identity > ``` > > **Credential policy:** > 1. Prefer an `aliyun configure` profile (default or `--profile`). > 2. If there is no valid identity (`configure list` / `get-caller-identity` fails), **STOP** and guide the user to configure locally; **do not** guess or fabricate credentials. > 3. Never pass plaintext AccessKeys through the conversation. ### 1.3 Elasticsearch direct-connect credential boundary > - **NEVER** ask the user to paste `ES_PASSWORD` in chat; **NEVER** echo, print, or log the password; **NEVER** copy a password from chat into commands, hooks, or repo files. > - Shell expansion for `curl -u "$ES_USERNAME:$ES_PASSWORD"` (or equivalent) is **allowed** when vars are **pre-exported in the user’s local shell**; **NEVER** put the secret as a literal in chat, scripts checked into repos, or command output. > - If the user tries to send a password in chat: **STOP** as well and ask them to set `ES_PASSWORD` only locally via `export` (see §2.2). --- ## 2. Environment setup ### 2.1 Control plane OpenAPI (via Aliyun CLI) All control-plane and CMS data collection for this skill uses the **Aliyun CLI**. > **[MUST] `elasticsearch` / `cms` — plugin-mode shell only (avoid legacy CLI)** > Whenever the agent emits **executable** `aliyun` lines (chat, reproducibility exports, or copy-paste steps), use **plugin subcommands** (lowercase-hyphenated) and **kebab-case** flags — the same shape as `scripts/openapi_cli_collect.py` and [references/verification-method.md](references/verification-method.md). > - **Do not** use legacy POP-style invocations: a **PascalCase verb** immediately after `elasticsearch` or `cms` on the same `aliyun` line (the old “action name = subcommand” style), or CamelCase flags like `--InstanceId`, `--Namespace`, `--StartTime` in **new** commands. Use plugin verbs only (`describe-instance`, `describe-metric-list`, …). > - **Naming split:** `DescribeInstance`, `ListSearchLog`, `DescribeMetricList`, etc. are **OpenAPI action names** (PascalCase — docs, RAM, console). The token **after** `aliyun elasticsearch` or `aliyun cms` in a shell must be the **CLI plugin** name (`describe-instance`, `list-search-log`, `describe-metric-list`, …). > - **Prefer** `python3 scripts/check_es_instance_health.py` for the standard control-plane + CMS bundle so subprocess calls stay aligned with this repo. > - **CLI references:** [Elasticsearch CLI 中心](https://api.aliyun.com/api-tools/cli/elasticsearch/2017-06-13), [云监控 CLI 中心](https://api.aliyun.com/api-tools/cli/Cms/2019-01-01). **AI-Mode and plugin baseline (required)** — wrap every diagnosis session that runs `aliyun` OpenAPI/CMS commands: ```bash aliyun configure ai-mode enable aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose" aliyun plugin update # … diagnosis: aliyun / python3 scripts/check_es_instance_health.py … aliyun configure ai-mode disable ``` > **`configure ai-mode` missing or failing:** Skip the wrapper above; use **`ALIBABA_CLOUD_USER_AGENT`** (next block). Log the CLI failure (e.g. subcommand unavailable). Whether the profile is **valid** is determined only by **`aliyun configure list`** and **`sts get-caller-identity`** — write **valid** / **validity**, not *vaild*. **User-Agent (required)**: set a User-Agent for Alibaba Cloud API calls: ```bash export ALIBABA_CLOUD_USER_AGENT="AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose" ``` **CLI hardening (recommended)**: when authoring raw `aliyun` commands, use **§2.1 MUST plugin shape** first, then add **`--connect-timeout 3 --read-timeout 10`** (increase `read-timeout` for large responses or CMS), consistent with the instance-management skill examples, to avoid indefinite hangs on network faults. If the global User-Agent is not set, add **`--user-agent AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose`** per invocation. For **optional Elasticsearch probes** inside `check_es_instance_health.py` (when `ES_*` is set), the same knobs exist as **`--connect-timeout`** / **`--read-timeout`** on that script — they map to `curl` for engine calls only, not to the Aliyun OpenAPI client. Run before diagnosis: ```bash aliyun version aliyun configure list aliyun --profile sts get-caller-identity ``` ### 2.2 Elasticsearch API direct access (`curl`) Have the user set connection variables in a **local terminal** after you confirm the Elasticsearch endpoint (VPC or public) and admin credentials—**do not** hardcode user-specific values in chat: ```bash export ES_ENDPOINT="http://:9200" export ES_USERNAME="elastic" export ES_PASSWORD="" ``` > **Public access and `http` vs `https`:** From **`DescribeInstance`**, use **`publicDomain`** / **`domain`** and the reported **`protocol`**. When **`protocol` is `HTTP`** (typical public listener), set **`ES_ENDPOINT` to `http://:9200`**. Using **`https://`** against an **HTTP-only** endpoint causes **TLS** errors (e.g. **`WRONG_VERSION_NUMBER`**). Use **`https://`** only when **`protocol` is `HTTPS`** (or TLS is actually enabled on the port you use), and supply CA / fingerprint options as in **HTTPS options** below. > > **If `http://` “does not work” — when to try `https://`:** Treat **`DescribeInstance` `protocol`** as the **source of truth** for the **REST listener**. **`000`**, **timeouts**, or **connection refused** on **`http://`** usually mean **network path / allowlist / security group / wrong host or port** — **not** “try HTTPS next” when **`protocol` is still `HTTP`**. **Do** switch to **`https://`** when **`protocol` is `HTTPS`** (or the console / product doc states TLS on that endpoint) and the failure on `http://` is a **TLS or scheme** symptom (e.g. **`WRONG_VERSION_NUMBER`**, **`error:0A00010B`**, immediate SSL alert while probing with the wrong scheme). If **`protocol` is `HTTP`** and only **plain TCP** is advertised, **HTTPS is not a fallback** for reachability. > **Credential safety** > - **NEVER** echo, print, or log `ES_PASSWORD`; **NEVER** copy credentials from chat into shell history or saved files. > - **NEVER** ask the user to paste the password in plaintext in chat. > - **ONLY** use the following checks to verify that variables are set: > ```bash > [[ -n "$ES_ENDPOINT" ]] && echo "ES_ENDPOINT: $ES_ENDPOINT" || echo "ES_ENDPOINT: NOT SET" > [[ -n "$ES_PASSWORD" ]] && echo "ES_PASSWORD: SET" || echo "ES_PASSWORD: NOT SET" > ``` > **Network connectivity and access control** > > | Issue | How to check | Mitigation | > |------|--------------|------------| > | Public network access disabled | Elasticsearch console → **Network** | Enable public access or use the VPC endpoint | > | Public access allowlist | Console → **Security** → **Public access allowlist** | Add the agent host’s public IP | > | VPC isolation | e.g. `telnet 9200` | VPC peering, Express Connect, or equivalent | > | Security group | Inbound rules on the ECS/security group hosting Elasticsearch | Allow TCP **9200** (or the configured port) | > **Connectivity probe**: `curl -sS -o /dev/null -w "%{http_code}" --connect-timeout 5 "${ES_ENDPOINT}"` — HTTP code `000` usually means the path is unreachable. **`401` without `-u` is normal** (auth required); if `ES_PASSWORD` is SET, proceed to **authenticated** `GET /_cluster/health` (§7). **`401` with `-u`** → wrong credentials. **`000` / refused / timeout** → network, allowlist, or TLS/scheme mismatch. > **HTTPS — prerequisites (what must be true)** > 1. **Listener:** The Elasticsearch HTTP port you call (**9200** unless changed) must actually speak **TLS** — align with **`DescribeInstance` `protocol`** (**`HTTPS`**) or console/network documentation. > 2. **URL:** **`https://:`** with the same **host** (e.g. **`publicDomain`**) you would use for HTTP. > 3. **Client trust of the server certificate:** Your client must trust the cluster’s certificate chain (cluster / cloud **CA PEM**, or corporate proxy CA if TLS is intercepted). **`curl`**: prefer **`curl --cacert /path/to/ca.crt ...`**; **`-k` / `--insecure`** only for **short, non-production** diagnosis. > 4. **Auth:** Same **`ES_USERNAME` / `ES_PASSWORD`** as for HTTP (Basic auth over TLS). > > **HTTPS — how this skill documents it** > - **Manual `curl` (§7 and [es-api-call-failures.md](references/es-api-call-failures.md)):** Add **`--cacert`** (or **`-k`** for testing) to every **`curl`** when using **`https://`** if the default trust store does not include your cluster CA. > - **`check_es_instance_health.py` optional ES probes:** They invoke **`curl`** with **`-u`** only; they **do not** read **`ES_CA_CERTS`** / **`ES_SSL_FINGERPRINT`** / **`ES_VERIFY_CERTS`** (those names are common for **Python Elasticsearch** clients). For **HTTPS** instances, use **§7 `curl`** with **`--cacert`** for deep checks, or extend the script later to pass **`--cacert`** from an env var. > - **Python-style env vars (reference for other tooling):** `ES_CA_CERTS`, `ES_SSL_FINGERPRINT`, `ES_VERIFY_CERTS=false` (testing only) — **not** wired into this repo’s optional **`curl`** path today. --- ## 3. RAM permission check > **[MUST] RAM permission pre-check** > > Before running this skill, verify the principal has the required RAM permissions. > See `references/ram-policies.md` for the full list. > If the user reports insufficient permissions, direct them to attach the corresponding policies in the RAM console. --- ## 4. Parameter confirmation > **IMPORTANT: Parameter confirmation** > Confirm the following with the user **before** any command or API call. > Do not assume undeclared defaults or hardcode user-specific parameters. > **Boundary controls (MUST)** > - **Region and `instance-id` must not be guessed** or taken from unverified defaults; if they disagree with `DescribeInstance` or the user’s explicit statement, reconfirm. > - **Do not** apply metrics, logs, or `DescribeInstance` conclusions from **instance A** to **instance B**; `ES_ENDPOINT` must match the instance under diagnosis (see **Pre-flight validation for Elasticsearch API** below). > - This skill is **read-only diagnosis**: **do not** invoke mutating control-plane APIs (create, resize, restart, delete instance, etc.). If the user requests a change, provide recommendations only; execution belongs in the console or an approved change workflow. | Parameter | Required | Description | Default | |-----------|----------|-------------|---------| | `instance-id` | Yes | Elasticsearch instance ID, e.g. `es-cn-xxxxx`. **`aliyun` flag is `--instance-id`** (not `--InstanceId`). | - | | `region` | Yes | Region ID (e.g. `cn-hangzhou`). **`aliyun` flag is `--region`** (not `--region-id`). | - | | `profile` | No | Aliyun CLI profile (explicit `--profile` recommended) | `default` | | `ES_ENDPOINT` | No | Elasticsearch endpoint (direct API access only) | - | | `ES_PASSWORD` | No | Elasticsearch admin password (direct API access only) | - | | `--window` | No | `check_es_instance_health.py`: analysis window in minutes (default **60**) | 60 | | `--connect-timeout`, `--read-timeout` | No | `check_es_instance_health.py`: `curl` timeouts for optional **ES engine** probes when `ES_*` is set (`--connect-timeout` → `curl --connect-timeout`; **`--read-timeout`** contributes to **`curl -m`** together with connect). Defaults **5** / **10** seconds. | 5 / 10 | --- ## 5. End-to-end diagnostic workflow ### Agent hard rules (non-negotiable) > **Aliyun CLI shape:** For **`aliyun elasticsearch`** and **`aliyun cms`**, follow **§2.1 MUST (plugin mode only)** in every new executable command — do not resurrect legacy `DescribeInstance` / `ListSearchLog`-as-subcommand lines or `--InstanceId`-style flags in session exports or user-facing step lists (they drift from `openapi_cli_collect.py` and fail static checks). > **OpenAPI/CMS cannot replace MUST engine APIs.** For any **§5 MUST** table row or **`check_es_instance_health.py` rule-engine MUST**, Alibaba Cloud OpenAPI and CloudMonitor do **not** replace the listed Elasticsearch REST calls for engine-level root cause—when **feasibility** holds, run those `curl` endpoints (see §7); they are complementary layers, not interchangeable. > > **Feasibility is decided only by checks, not by assumption.** Whether the agent may call Elasticsearch **must** be determined by actually running the **Feasibility order** (§5): at minimum verify `ES_ENDPOINT` / `ES_PASSWORD` per §2.2, align `ES_ENDPOINT` with `DescribeInstance`, then authenticated `GET /_cluster/health`. **Do not** assume `ES_*` is unset or the path is unreachable without performing these steps in the session. For Elasticsearch incidents, follow these **four steps**; each has a distinct role. ### Execution strategy (root-cause driven) > Full policy: [es-api-diagnosis-strategy.md](references/es-api-diagnosis-strategy.md) Data-plane `curl` collection requires **both**: 1. **Feasibility**: `ES_ENDPOINT` and `ES_PASSWORD` are set and the network path works. 2. **Necessity**: root-cause analysis needs data-plane evidence that the control plane or CMS cannot establish alone. > For endpoints **listed** under a fired **MUST** table row **or** **rule-engine MUST**, **necessity** for those calls is **already satisfied** by the trigger—still require **feasibility** (**Feasibility order**). For **optional** engine `curl` **not** in those lists, apply **feasibility** and **necessity** per [es-api-diagnosis-strategy.md](references/es-api-diagnosis-strategy.md). **MUST triggers** (if **any** CMS condition below holds, collect the listed Elasticsearch evidence): | Trigger | Scenario | Required Elasticsearch evidence | |---------|----------|----------------------------------| | `ClusterStatus` max ≥ Yellow / Red | Cluster health | `allocation/explain`, `_cat/shards` | | `NodeCPUUtilization` max > 80% | CPU overload | `_nodes/hot_threads`, `_tasks` | | `NodeHeapMemoryUtilization` max > 85% | Memory pressure | `_nodes/stats/breaker`, `GET /_cluster/settings?include_defaults=true` ( **`indices.breaker.*`** in transient / persistent ) | | Thread pool `rejected` > 0 | Performance | `_nodes/hot_threads`, `_nodes/stats/thread_pool` | | Inter-node resource CV > 0.3 | Load imbalance | `_cat/shards`, `_cat/allocation` | | Write failures or index read-only | Disk / watermark / blocks | `_cluster/settings`, `_all/_settings?filter_path=*.settings.index.blocks`, `_cat/allocation` | | Intermittent Elasticsearch API timeouts + CMS CPU > 80% | Possible cascading failure | `_nodes/hot_threads`, `_nodes/stats/thread_pool`, `_tasks` | > **Thread-pool row:** interpret **search** vs **write** / **bulk** using [sop-query-thread-pool.md](references/sop-query-thread-pool.md) vs [sop-write-performance.md](references/sop-write-performance.md) (see also **Write-path / bulk saturation** below). > **Rule-engine MUST:** If `check_es_instance_health.py` prints a **§5 MUST / §5–§7** callout for this run, treat it like a row above—collect that listed ES evidence when feasibility holds. > **Binding rule (MUST triggers):** If **any** MUST-trigger row **or** the **rule-engine MUST** line above applies, **necessity is satisfied** for that evidence set—OpenAPI/CMS cannot replace those calls for engine-level root cause (cluster-health: `allocation/explain` + `_cat/shards` for Yellow/Red). Confirm **feasibility** per **Feasibility order** below. If reachable with auth, **run the MUST-listed endpoints in Step 2** in parallel with control-plane collection. If still blocked **after authenticated** `GET /_cluster/health`, lead with **blocking reason**: unset `ES_*`; transport failure (`000`, refused, timeout); **401 with `-u`**; scheme/TLS mismatch—not **401 on an unauthenticated probe** when `ES_PASSWORD` is SET. ### Write-path / bulk saturation > If **`ThreadPool.WriteRejected`** or **`write`** pool stress matches **high-QPS bulk** indexing, read and follow **`references/sop-write-performance.md` — §2**, subsection **“Evidence interpretation: bulk QPS → write pool”** for the evidence chain, **`rejected`** semantics (cumulative since node start), **report ordering vs Old GC / heap** (causal chain or dual P0 — write path before JVM-only headline), **per-node `rejected`/`completed` numbers** (reject share), per-node asymmetry, and write-only vs search. **Do not** lead with a JVM-only narrative when that subsection applies. For **write-queue–style** acceptance prompts, the **opening conclusion** should read as **write-capacity** (data-plane counters + optional CMS rule names), not **only** a GC/heap headline. ### Search-primary vs write (both pools show cumulative `rejected`) > When **`_nodes/stats/thread_pool`** shows **`search.rejected` ≫ `write.rejected`** on the same node(s) and **`ThreadPool.SearchRejected`** / **query-driven** overload applies, **lead** the **executive summary** and **P0 ordering** with **`search`** (high concurrent query / terms / slow query; hot index when verified) — **not** **`write`** first. **`write.rejected`** may remain **P0/P1** as **parallel** or **secondary** (bulk, catch-up); **Old GC / CPU / node disconnect** stay **co-stress or cascade**. Checker **listing order** is not proof of narrative order — see [acceptance-criteria.md](references/acceptance-criteria.md) **§6.5** and [sop-query-thread-pool.md](references/sop-query-thread-pool.md) *Report narrative*. > > **Recency overrides this magnitude default** when **time-resolved** evidence exists: **do not** rank the opening story by **`search.rejected` vs `write.rejected` alone** — cumulative counters lack timestamps. Full rubric: [acceptance-criteria.md](references/acceptance-criteria.md) **§6.5** (*P0 / executive order vs `search` ≫ `write`*: **unless** write dominated **by time**) and **§6.6** (*Executive order*, *No false recency from counters*). **Binding:** **Timeline and recency (MUST)** below (same skill). ### `activating` / change workflow stuck (cross-layer root cause) > When an instance stays in **`activating`**, a change is unfinished, and **Red** or unassigned shards coexist, follow **`references/sop-activating-change-stuck.md`** end-to-end (**MUST** includes `ListActionRecords`, `DescribeInstance` before/after remediation, collection order **section 3.1**, reporting **section 4**). ### Pre-flight validation for Elasticsearch API > **[IMPORTANT] `ES_ENDPOINT` must match the diagnosed instance** > > Compare `publicDomain` / `domain` and **`protocol`** from `DescribeInstance` with `ES_ENDPOINT`. > If they differ, warn: `⚠️ ES_ENDPOINT does not match the current instance; run export ES_ENDPOINT="http://{publicDomain}:9200"` when **`protocol` is `HTTP`**, or `https://…` only when **`protocol` is `HTTPS`** (adjust host/port to match the deployment). ### When Elasticsearch credentials are missing or connections fail > **[CRITICAL]** Guide the user to fix connectivity explicitly; **classify** failure modes (do not default persistent timeouts to “allowlist only”). **Do not imply the agent “forgot” Elasticsearch** — if the first answer is CMS/OpenAPI-heavy, give the **blocking reason** per **Feasibility order** below: unset `ES_*`; transport errors; **401 with valid `-u`**; TLS/scheme—not **401** on a probe **without** `-u` when `ES_PASSWORD` is SET (use authenticated `curl` first). **Progressive playbook (read in order):** [references/es-api-call-failures.md](references/es-api-call-failures.md) (sections **1 → 4**). **MUST / strategy context:** [references/es-api-diagnosis-strategy.md](references/es-api-diagnosis-strategy.md) (sections 1–3 and **3.5** summary table). ### Mandatory warning when MUST applies but Elasticsearch is not configured > **[CRITICAL] If a MUST trigger fires but data-plane evidence is missing, put a warning at the top of the report:** follow **section 4** of [references/es-api-call-failures.md](references/es-api-call-failures.md) (blocking reason first, then MUST list, missing evidence; if `ES_*` unset, pointer to **section 2.2** of this SKILL; if vars are set, use es-api-call-failures **sections 1–2** for auth vs transport). ### Step 1: Quick health scan (initial signals) Run the lightweight rules engine (17 metric rules) to list P0 / P1 / P2 findings and steer deeper collection: ```bash python3 scripts/check_es_instance_health.py -i -r [--window ] [--profile ] ``` ### Feasibility order (agent) 1. **Run** §2.2 `ES_*` checks (password = SET only)—**do not skip**; never infer feasibility without this step. 2. `ES_ENDPOINT` matches `DescribeInstance` `domain` / `publicDomain` (scheme/port). 3. **Authenticated** `GET /_cluster/health`—do not stop at **401** on an unauthenticated probe if `ES_PASSWORD` is SET. 4. MUST scope: **table rows** and/or **rule-engine MUST** line in §5. ### Step 2: Collect evidence in parallel Based on Step 1, run collection in parallel (prioritize dimensions with signals). If a **MUST-trigger** row **or rule-engine MUST** applies: run **Feasibility order**, then **run that Required Elasticsearch evidence** via `curl` in the **same** round (see §7). If **no** MUST applies, add optional data-plane `curl` only when **feasibility** and **necessity** both hold per the strategy doc. Re-run **`check_es_instance_health.py`** with the same invocation pattern as Step 1; for this parallel round, **`--window 120`** and explicit **`--profile `** are common. To backfill control-plane evidence (`DescribeInstance`, `ListSearchLog`, CMS-style calls), use **`aliyun`** patterns in [references/verification-method.md](references/verification-method.md) (epoch times, profiles, namespaces). > Note: data-plane access still requires `ES_ENDPOINT` / `ES_PASSWORD`; the Aliyun CLI **cannot** replace `curl` to the cluster. > > For **MUST-trigger** rows, necessity for the **listed** endpoints is **already** established—do **not** skip them when feasibility including reachability holds. **Outside** those rows, avoid unrelated bulk `curl` solely because `ES_*` is set; use the strategy doc’s feasibility + necessity test instead. ### Step 3: Read SOPs by signal Map signals to SOPs and read for deeper reasoning. With multiple signals, process **P0 → P1 → P2** for **severity**, then apply **Timeline and recency (MUST)** in Step 4 so the **narrative order** matches **when** signals mattered in the window—not only static rule-engine print order. | Observed signal | Read | |-----------------|------| | Cluster Red/Yellow, node loss, pending tasks | `references/sop-cluster-health.md` | | Long `activating`, unfinished change records, Red / unassigned shards | `references/sop-cluster-health.md` + `references/sop-activating-change-stuck.md` | | High CPU, load, imbalance | `references/sop-cpu-load.md` | | Per-node load imbalance (CPU/memory/disk/shard count) | `references/sop-node-load-imbalance.md` | | JVM pressure, GC, circuit breaker, OOM | `references/sop-memory-gc.md` | | Disk watermark, IO, write failures (read-only) | `references/sop-disk-storage.md` | | Watermark misconfiguration, index blocks, “normal” disk % but write failures | `references/sop-disk-storage.md` (Section 3 — watermark misconfiguration) | | Write timeouts / rejections / latency / QPS drop | `references/sop-write-performance.md` | | Query timeouts / rejections / slow queries | `references/sop-query-thread-pool.md` | | Nodes look down but CPU still reported; `all shards failed` | `references/sop-service-avalanche.md` | | Intermittent Elasticsearch timeouts + CMS CPU > 80% | `references/sop-service-avalanche.md` | | Risky settings, Ngram issues, API anomalies | `references/sop-configuration.md` | | Event code definitions | `references/health-events-catalog.md` | ### Step 4: Synthesize and write the structured report > **Acceptance-style optional checklists:** [references/acceptance-criteria.md](references/acceptance-criteria.md) **§6.1**–**§6.6** — Red/Yellow; read-heavy CPU + `search` pool (+ CMS alignment); JVM / breakers / fielddata; write-queue vs GC + **`rejected`/`completed`**; read-heavy **search pool vs GC-only** headline (expand in [sop-query-thread-pool.md](references/sop-query-thread-pool.md) *Report narrative: search pool vs GC / CPU headlines*); timeline/recency. **Bulk/write:** [references/sop-write-performance.md](references/sop-write-performance.md) §2. **Shard `reroute`:** [references/sop-node-load-imbalance.md](references/sop-node-load-imbalance.md) §1.3 (allocator / change control only). > **[CRITICAL] Remediation must match the diagnosed root cause** — avoid generic templates. Wrong breaker or concurrency fixes (e.g. `in_flight_requests` vs `request`, “split query” when concurrency is the issue) → see **`sop-memory-gc.md`** and the fired signal’s SOP. > **`activating` + data-plane anomaly**: include the **one-line cross-layer root cause**; see `references/sop-activating-change-stuck.md` **section 4**. **Report skeleton (copy/fill):** [references/report-template.md](references/report-template.md). ### Timeline and recency (MUST for synthesized reports) > **Problem:** `check_es_instance_health.py` and P0/P1/P2 bands express **severity**, not **when** a signal mattered most within the analysis window. **Cumulative** engine counters (`search.rejected`, `write.rejected`) do **not** encode recency—**write** and **search** issues can both be “real” while **only one path dominated the recent past** (e.g. search pressure **closer to window end** than write pressure). **Binding rules for the agent:** 1. **Two axes** — Treat **severity** (P0/P1/P2) and **temporal relevance** (proximity to window end / “now”) as **orthogonal**. Do **not** infer recency from priority alone (e.g. “write is P0 so it must be the current headline”) when time-resolved evidence says otherwise. 2. **Mandatory human-facing section** — When more than one major finding fires (e.g. **write pool + search pool + GC/CPU**), the synthesized report **must** include an **`### Incident timeline (recency-ordered)`** (or equivalent) block **before or immediately after** the executive summary, unless the user explicitly asks for a minimal report. In that block: - Order **bullets or rows by time** (earlier → later), or state **which signal cluster peaked / persisted in the **latter** portion of `{begin} ~ {end}`**. - Call out **divergence**: e.g. “write-path stress **earlier** in window; search-path / CPU **more recent**” when CMS or logs support it. 3. **Evidence for recency** (use what exists; do not invent timestamps): - **CloudMonitor**: per-metric time series — note **peak timestamp** or **sustained-high interval** for `NodeCPUUtilization`, `NodeHeapMemoryUtilization`, GC-related metrics, `ThreadPool.*` **if** exposed as rates or non-cumulative series in the collected JSON. - **Slow logs / `ListSearchLog`**: correlate **query vs index** slow entries to minutes. - **Engine (optional):** two `_nodes/stats/thread_pool` samples at **known times** to show **delta** on `rejected` / `completed`; or **`_tasks`** / **`hot_threads`** for **current** skew vs historical cumulative counters. 4. **Executive summary ordering** — The **opening 2–4 sentences** should reflect **recency-weighted** user impact: if **search** pressure is **closer to current** than **write** pressure, **lead** with search/query concurrency and co-stress (GC/CPU) **as appropriate**, and place **historical write saturation** as **context** or **second wave**—**without** dropping P0 write findings if they remain valid for remediation backlog. 5. **Explicit uncertainty** — If only cumulative counters exist and **no** time series differentiates paths, state **one line**: recency is **undifferentiated**; recommend **narrower window**, **slow logs**, or **delta sampling** for the next run. --- ## 6. Data collection details (CLI OpenAPI + injected input) ### One-shot entry Use the same **`check_es_instance_health.py`** command as **§5 Step 1** (optional **`--window`** / **`--profile`**; default window **60** minutes if omitted). ### Injected input mode (paired with CLI) `check_es_instance_health.py` accepts external JSON to avoid duplicate calls: ```bash python3 scripts/check_es_instance_health.py \ -i -r \ --data-source input \ --input-json-file /path/to/diag-input.json ``` Input JSON shape: ```json { "status_info": {}, "metrics": {}, "events": [], "logs": [] } ``` `--data-source` modes: - `auto`: prefer injected fields; backfill gaps via Aliyun CLI. - `cli`: ignore injection; fetch everything via CLI. - `input`: injection only; no OpenAPI calls. ### Manual control-plane CLI backfill For additional OpenAPI examples, see `references/verification-method.md`. --- ## 7. Elasticsearch direct API access (data-plane deep dive) When **feasibility** holds (including reachability), execute the REST calls **required by any MUST-trigger row** (§5). For endpoints **not** listed in a fired MUST row, call them only when **feasibility** and **necessity** both hold per the strategy doc. > `ES_ENDPOINT` may be `host:port` or a full URL. For the samples below, normalize to `http://${ES_ENDPOINT#http://}` (use `https://` consistently when the cluster serves TLS). > > **Timeouts**: every `curl` must use `--connect-timeout 10 --max-time 30`. ### Red / Yellow (MUST) — recommended set **Scope:** The cluster-health MUST row uses `ClusterStatus` max ≥ **Yellow** (includes **Red**). Use this set for **unassigned / misallocated shard** root cause on the engine. ```bash curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \ "http://${ES_ENDPOINT#http://}/_cluster/health?pretty" curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \ -H "Content-Type: application/json" \ -X POST "http://${ES_ENDPOINT#http://}/_cluster/allocation/explain?pretty" \ -d '{}' curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \ "http://${ES_ENDPOINT#http://}/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason&s=state" curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \ "http://${ES_ENDPOINT#http://}/_cluster/pending_tasks?pretty" curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \ "http://${ES_ENDPOINT#http://}/_nodes/stats/thread_pool?pretty" ``` ### Query / write performance (MUST) — recommended set > Include **`_cluster/settings`** when **heap / GC / breaker** rules fired in Step 1 or **`_nodes/stats/breaker`** shows concern — read **transient** and **persistent** `indices.breaker.*` / `network.breaker.*`. ```bash curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \ "http://${ES_ENDPOINT#http://}/_nodes/hot_threads?threads=3" curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \ "http://${ES_ENDPOINT#http://}/_nodes/stats/breaker?pretty" curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \ "http://${ES_ENDPOINT#http://}/_cluster/settings?include_defaults=true&pretty" ``` > **`/_cluster/pending_tasks`** and **`GET /_nodes/stats/thread_pool`** are also listed under **Red / Yellow (MUST)** above—one call each per session when both sections apply. If you run **only** this performance block, add those two `curl` lines from that block. ### Resource anomalies without a closed loop (SHOULD) — recommended set ```bash curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \ "http://${ES_ENDPOINT#http://}/_cat/nodes?v&s=cpu:desc&h=name,ip,cpu,heap.percent,ram.percent,load_1m" curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \ "http://${ES_ENDPOINT#http://}/_nodes/stats/jvm?pretty" curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \ "http://${ES_ENDPOINT#http://}/_cat/allocation?v&bytes=gb" ``` > **`GET /_cluster/settings?include_defaults=true`** also appears under **Query / write performance (MUST)** above—reuse one response when both blocks apply. If you run **only** this SHOULD block, add the same `curl` line from that block. **Protocol sanity** (avoid `WRONG_VERSION_NUMBER`): usually **http/https scheme mismatch** on `ES_ENDPOINT` — fix scheme/port and retry. **Scenario → endpoint** index: [references/es-api-catalog.md](references/es-api-catalog.md). --- ## 8. Diagnostic coverage The knowledge base covers **48+** health-event-style rules and chained scenarios (e.g. disk pressure → allocation → Red). **Per-category counts, P0/P1/P2 mix, and event codes:** [references/health-events-catalog.md](references/health-events-catalog.md) — scenario runbooks: `references/sop-*.md` (index: [references/README.md](references/README.md)). --- ## 9. Best practices **Read-only:** no mutating control-plane APIs; no teardown. 1. **Layered + evidence-bound**: scan → SOP depth; every conclusion cites metrics/logs/events; if ES is unreachable, state limits ([es-api-call-failures.md](references/es-api-call-failures.md)). 2. **Priority vs narrative**: **P0→P2** for urgency; **Incident timeline** when multiple dimensions differ in time (Step 4). **Credentials / TLS / parameters:** §1–2 and §4. 3. **Green is not “all clear”** — watermarks, blocks, mis-set limits still matter; **MUST + reachable ES:** do not skip §5/§7 evidence because the cluster is Green or OpenAPI “explains” symptoms. 4. **Thread-pool `rejected`:** **cumulative** unless you show a delta — [sop-query-thread-pool.md](references/sop-query-thread-pool.md) §1–2; write/bulk: [sop-write-performance.md](references/sop-write-performance.md) §2. --- ## 10. Reference links - `references/verification-method.md` — Verification (how to validate diagnosis; metrics, APIs, workflows) - `references/report-template.md` — Structured diagnosis report skeleton - `references/README.md` — **Language map** (reference assets and `sop-*.md` runbooks; English in this repo) - `references/ram-policies.md` — RAM policy list - `references/acceptance-criteria.md` — Correct/incorrect patterns and acceptance (includes credential and safety anti-patterns) - `references/cli-installation-guide.md` — Aliyun CLI installation - `references/es-api-catalog.md` — Elasticsearch REST API catalog - `references/health-events-catalog.md` — Health event catalog - `references/sop-*.md` — Scenario SOPs (e.g. `sop-activating-change-stuck.md` for `activating` / change stuck, cross-layer root cause) - `references/es-api-diagnosis-strategy.md` — Elasticsearch API diagnosis strategy