--- name: alibabacloud-emr-starrocks-assistant description: > Alibaba Cloud EMR Serverless StarRocks development & operations assistant. Covers five scenarios: cluster connection, schema design, data ingestion, SQL development & tuning, and cluster health diagnostics. Use this Skill when users ask about StarRocks table design, writing SQL, choosing an ingestion method, query execution plans, materialized views, cluster health checks, FE/BE/CN node status, tablet health, or compaction. Typical scenarios: table design, Stream Load / Routine Load / Broker Load selection, SQL optimization, window functions, CTEs, JOIN tuning, materialized view design, cluster health inspection, node-down diagnosis. Not applicable for: StarRocks instance lifecycle management (create / scale / restart / config change / version upgrade — these are control-plane operations, please use the EMR Serverless console or the corresponding OpenAPI), or other Alibaba Cloud products (EMR Cluster, Spark, Milvus, ClickHouse, Doris, RDS, ECS). license: MIT allowed-tools: Bash Read compatibility: > Python 3.10+ with uv; reachable StarRocks FE endpoint (default port 9030). On first use, run `sr-login` to register a cluster credential locally. Privileges follow the user's own account — capabilities (which databases / diagnostic commands are accessible) are introspected at login via SHOW GRANTS FOR CURRENT_USER(). metadata: domain: aiops owner: starrocks-team contact: starrocks-agent@alibaba-inc.com required_starrocks_privileges: - "Whatever the user's account already has. The skill does not create or elevate accounts." - "For full feature coverage: SELECT on the databases of interest + OPERATE ON SYSTEM for cluster diagnostics. Missing privileges degrade gracefully (specific suggestions are skipped)." --- # Alibaba Cloud EMR Serverless StarRocks Development & Operations Assistant Help users perform day-to-day table design, data ingestion, SQL writing & tuning, and health diagnostics on Alibaba Cloud EMR Serverless StarRocks. All cluster access goes through the bundled `srsql` CLI (pymysql-based, uses the user's own account); no MySQL client required. Non-READ SQL is classified by sqlglot and requires `--yes` confirmation before execution. > **Scope statement**: This Skill focuses on *using* StarRocks — development, diagnostics, and day-to-day data operations. Cluster-internal data and schema operations (DDL, DML, materialized view refresh, GRANT, etc.) are supported and execute under the user's own account, gated by sqlglot classification + `--yes` confirmation. Instance-lifecycle operations (create, scale, restart, configuration change, version upgrade) are control-plane operations and are **not** in this Skill's scope; please use the EMR Serverless console or the corresponding OpenAPI. ## When to use / When not to use **When to use**: - Schema design (table model, partitioning, bucketing, sort key, indexes, storage parameters) - Data ingestion selection (Stream/Broker/Routine Load, INSERT, Pipe, Flink/Kafka Connector, CDC) - SQL writing, rewriting, and tuning (JOIN strategy, window functions, CTE, aggregation optimization, statistics) - Materialized view design and operations - Cluster health diagnostics (FE/BE/CN nodes, tablet health, compaction, warehouse, recent failed loads) **When NOT to use**: - Instance lifecycle control: create / scale / restart / config change / upgrade StarRocks instances — these are control-plane operations; use the EMR Serverless console or the corresponding OpenAPI instead - Operating non-StarRocks products: EMR Cluster, Spark, Milvus, ClickHouse, Doris, RDS, ECS, etc. ## First-time setup: install & log in This Skill ships with the `sr-connect` Python CLI. See [references/connect.md](references/connect.md) for details. ### Assistant bootstrap protocol (instructions for Claude) When this Skill is invoked and you anticipate running any cluster query, ensure `srsql` is available *before* asking the user for anything: 1. Run `which srsql`. If it returns a path, skip to step 4. 2. If missing, install it yourself: `uv tool install ` where `` is the directory containing this `SKILL.md` and `pyproject.toml` (the Skill's base directory shown at invocation time; commonly `~/.claude/skills/alibabacloud-emr-starrocks-assistant/`, which may be a symlink). **Do not ask the user to run this** — the bundled CLI is part of the Skill's capability surface, not user infrastructure. 3. If `uv` itself is missing (`which uv` fails), surface that to the user — `uv` is a system tool and not auto-installed. 4. Check `~/.starrocks/{profile}.cnf` (default profile name: `default`; respect `SR_PROFILE` env var if set). If it exists, skip to step 5. If missing: - **First try `sr-login --from-env`.** Safe to call unconditionally — it exits 2 with a clear "missing" message when the environment doesn't have the credentials it needs, and does nothing else. You do not need to inspect environment variables yourself. - **If `sr-login --from-env` exits non-zero**, the user hasn't logged in yet. Give them the `sr-login --host ... --user ...` command and ask them to run it themselves. **Do not run interactive `sr-login` yourself** — it would block on a password prompt you cannot answer. 5. After both `srsql` is on PATH and the profile file exists, run queries via `srsql -e "..."` yourself. If `srsql` was just installed in this session and PATH hasn't been refreshed in the user's shell, fall back to the absolute path printed by `uv tool install` (typically `~/.local/bin/srsql`). **Chat-style rule after bootstrap succeeds**: Do **not** echo `sr-whoami` / `srsql -e "..."` invocation syntax to the user as a "you can now run …" hint. You are the one calling these CLIs on the user's behalf — the user drives the Skill, not the binaries. Skip the post-success "next step" narration entirely and just ask what they want to do, or proceed if their intent is already clear. ### Login command (give this to the user when their profile is missing) ```bash # EMR Serverless StarRocks — both internal and public endpoints use the MySQL # wire protocol over plain TCP; no SSL/TLS. Use the same form for either. sr-login --host --port 9030 --user # Verify sr-whoami srsql -e "SELECT CURRENT_VERSION()" ``` Re-running `sr-login` with the same `--profile` silently overwrites the stored credential (same semantics as `docker login`). Use `--profile` for multi-cluster: ```bash sr-login --profile prod --host fe-prod.xxx --user app_user SR_PROFILE=prod srsql -e "..." ``` ## Security model This Skill has **two layers**: 1. **FE is the authoritative permission boundary.** The user supplies their own StarRocks account; whatever they're allowed to do, they're allowed to do. The Skill does not create, elevate, or rotate any accounts. 2. **`srsql` is a UX gate, not a security boundary.** Every statement is parsed by sqlglot (dialect `starrocks`): - `READ` (SELECT / SHOW / DESC / EXPLAIN / WITH / …) executes directly. - **Any non-READ** (INSERT / UPDATE / DELETE / DDL / GRANT / SET / USE / …) **is refused unless `--yes` is passed**. - SQL sqlglot cannot parse falls back to a leading-keyword check; if still ambiguous → `UNKNOWN`, treated as non-READ, executable with `--yes` plus a soft warning. When the user asks for a write operation: 1. Show them the SQL you intend to run. 2. Optionally preview classification via `srsql --dry-run -e "..."`. 3. Get explicit confirmation in chat. 4. Then run with `srsql --yes -e "..."`. For DDL on production tables, or operations that change global cluster state (CREATE/DROP USER, ADMIN SET CONFIG, etc.), prefer to print the SQL and let the user run it themselves — even though the gate would let them run it via `--yes`. The gate is a safety net, not a license. ## Input validation & command-injection protection SQL passed into `srsql -e "..."` is assembled by the LLM and must follow these rules: 1. Identifiers (table / column / database names) are validated before interpolation: only `[A-Za-z0-9_]` plus backtick-quoted forms. 2. User-provided string values (search terms, label names, etc.) are **not** spliced into SQL directly; use parameter binding or pre-escape. 3. Never execute raw user-provided strings as SQL fragments. ## Sensitive data masking | Scenario | Handling | |----------|----------| | Profile file content (incl. user password) | Never echoed; mode 600 under a 700 directory | | Password in error messages | Truncate / replace with `******` | | Query results contain obvious key / token columns | Warn the user without displaying full content | | `aliyun configure list` output containing AK | Show only the first 4 chars; replace the rest with `****` | ## Intent routing > **Disambiguation rule**: When the user input is ambiguous (e.g. "ingestion is slow", "queries are slow") and context is unclear, ask one clarifying question before acting. | User intent | Route | Reference | |-------------|-------|-----------| | First-time cluster connection / register or switch credentials / multi-cluster setup | sr-login / sr-whoami / sr-logout | [references/connect.md](references/connect.md) | | New table / change schema / table model selection / partition+bucket design | Schema design | [references/schema.md](references/schema.md) | | Choose ingestion method / configure Stream/Broker/Routine Load / Flink/Kafka Connector | Import selection | [references/data-import.md](references/data-import.md) | | Write SQL / optimize SQL / materialized views / function selection / read execution plans | SQL development & tuning | [references/sql.md](references/sql.md) | | Cluster health check / FE/BE/CN status / unhealthy tablets / compaction lag | Cluster diagnostics | [references/diagnostics.md](references/diagnostics.md) | | "Ingestion used to be fine, suddenly slow" | Cluster diagnostics (distinct from import selection) | [references/diagnostics.md](references/diagnostics.md) | | "How should I design a new ingestion pipeline" | Import selection | [references/data-import.md](references/data-import.md) | ## Five scenarios at a glance ### 1. Schema design Four table models and their typical use cases: | Use case | Model | |----------|-------| | Logs / events / detail records | Duplicate Key | | Pre-aggregated metrics | Aggregate | | Real-time upsert / CDC | Primary Key | | Simple deduplication | Unique Key (for new use cases, prefer Primary Key) | **⚠ Anti-patterns — do not produce these in DDL:** - **Shared-data PK table without `persistent_index_type=CLOUD_NATIVE` + `datacache.partition_duration`** — LOCAL index doesn't survive CN rebalance; no hot-data caching window. See [schema/storage-properties.md](references/schema/storage-properties.md). - **Setting `datacache.partition_duration` to an arbitrary "hot window" (e.g. `30 DAY`) instead of the user's stated query window** — the value MUST be **≥ the query window**. If the user says "查询近 N 天" / "queries the last N days", set `datacache.partition_duration = "N DAY"` (or larger). A value smaller than the query window guarantees cache misses on in-window queries. Do not default to 7/30/60 days when the user has given you a number. - **`storage_cooldown_time`/`storage_cooldown_ttl`/`storage_medium`/`replicated_storage` on shared-data** — silently stripped or rejected by `PropertyAnalyzer`; use `datacache.partition_duration` for the cooldown effect. - **FLOAT / DOUBLE columns inside `PRIMARY KEY`** — not supported; use BIGINT or DECIMAL. - **Treating "CN" as a shared-nothing signal** — CN = Compute Node, which is the shared-data terminology. BE = Backend = shared-nothing. See [references/schema.md](references/schema.md). ### 2. Data ingestion | Data source | Recommended method | |-------------|--------------------| | Local files < 10 GB | Stream Load | | Object storage / HDFS bulk | Broker Load or `INSERT INTO ... FROM FILES()` | | Object storage with continuous file arrivals | Pipe + AUTO_INGEST | | Kafka / Pulsar | Routine Load or Kafka/Flink Connector | | MySQL CDC | Flink CDC + Flink Connector | **⚠ Anti-patterns — do not produce these in load configs:** - **PK-table DELETE without `__op` integer column (`0`=UPSERT, `1`=DELETE) in COLUMNS list + `$.__op` in `jsonpaths`** — all events are silently treated as UPSERT. **The `__op` contract is a pair and must be taught as a pair**: the literal column name is `__op`, and the integer values are `__op=0` for UPSERT **and** `__op=1` for DELETE. Even when the user only asks about DELETE, your response MUST state **both** mappings (`__op=0` → UPSERT, `__op=1` → DELETE) — never one without the other. This applies on every ingestion path including Flink Connector and Kafka Connector, where the connector populates `__op` for the user but they still need both values to debug "DELETE not applied" / "UPSERT not applied" symptoms. - **Treating `partial_update=true` as a DELETE enabler** — it controls partial-column UPSERT and has **nothing to do** with DELETE. If a user enables it while asking why DELETE doesn't work, **flag it as misconfigured-for-intent** and tell them to remove it unless they actually have a partial-column UPSERT use case. Do not validate the existing setting just because it parses. - **`COLUMNS FROM PATH AS (...)` in Routine Load** — that's Broker Load's Hive-partition path syntax; not valid in Routine Load. - **`__op` values as strings (`"upsert"`/`"delete"`)** — must be the integers `0` / `1`. - **High-throughput CDC (≥ ~10K events/sec) without flagging TOO_MANY_VERSION risk** — applies to Routine Load, Flink Connector, Kafka Connector, not just `INSERT INTO VALUES`. Whenever the user's scenario implies high event rate, the recommendation MUST cover: (a) the method-appropriate concurrency cap (`desired_concurrent_number` ≤ Kafka partitions for Routine Load; `sink.parallelism` ≤ Kafka partitions for Flink/Kafka Connector), AND (b) an explicit TOO_MANY_VERSION / compaction-pressure warning with the relevant flush-interval guidance. See [references/data-import.md](references/data-import.md). ### 3. SQL development | Use case | Pattern | |----------|---------| | Period-over-period / cumulative / Top-N | Window functions | | Large fact table JOIN small dimension (right side ≤ `broadcast_row_limit`, default 15M rows) | Broadcast / Colocate | | Complex layered logic | CTE | | Billion-scale deduplication | `APPROX_COUNT_DISTINCT` / BITMAP / HLL | | High-frequency repeated query acceleration | Asynchronous materialized view | | Cross-source query | External Catalog | **⚠ Anti-patterns — do not produce these in query rewrites or tuning advice:** - **Wrapping the partition column with `date_format()` / `date_trunc()` / `cast()` in WHERE** — breaks partition pruning; rewrite as a range predicate (`col >= '...' AND col < '...'`). - **Tuning advice without `EXPLAIN VERBOSE` + checking `partitions=N/M` and `tabletRatio=N/M`** — pruning failures (numerator == denominator) go undetected; never use plain `EXPLAIN` for this. - **Reading `cardinality` in EXPLAIN as the result row count** — it's the **CBO's row estimate**. Always quantify the staleness gap using the **direct comparison `cardinality` vs the user-stated total table size** (e.g. "estimate 5M vs total 500M ≈ 100×"); a ratio > 10× means stats are stale → run `ANALYZE TABLE`. - **Estimating "real filtered rows" by guessing predicate selectivity, then comparing cardinality to that guess** — you don't have runtime row counts, and guessing selectivity from a predicate like `WHERE create_time > '...'` introduces large errors (you don't know the data distribution). When the user gives you a total row count, compare `cardinality` to that **directly**; do not divide the total by an assumed time window or selectivity factor. - **Conflating `partitions`/`tabletRatio` pruning failures with `cardinality` deviation** — these are **two independent diagnostic signals**. When both look bad in the same OlapScanNode (e.g. `partitions=N/N` AND `cardinality` off from total table size by 10×–100×), report them as **separate findings with separate fixes** (predicate/type fix vs `ANALYZE TABLE`). Do not use cardinality deviation to "explain" pruning failure, and do not let pruning failure absorb the stale-stats finding. - **Recommending BE/CN scale-out before plan/stats analysis** — SQL/stats fixes precede capacity changes. See [references/sql.md](references/sql.md). ### 4. Cluster diagnostics Diagnostic order: 1. Identify architecture (shared-nothing / shared-data) → `SHOW WAREHOUSES` 2. FE → `SHOW FRONTENDS` 3. BE or CN → `SHOW BACKENDS` / `SHOW COMPUTE NODES` 4. Warehouse (shared-data only) → `SHOW WAREHOUSES` 5. Tablet health overview → `SHOW PROC '/statistic'` 6. Scheduling queue → `information_schema.fe_tablet_schedules` 7. Compaction → `information_schema.be_compactions` / `be_cloud_native_compactions` 8. Recent 24-hour loads → `information_schema.loads` **⚠ Anti-patterns — do not produce these in diagnostic conclusions:** - **Restarting BE/CN or scaling out before checking `information_schema.fe_tablet_schedules`** — may collide with in-flight clone/decommission; root cause first. - **Subjectively downgrading `UnhealthyTabletNum > 0`** — always **critical** per the severity table, never "medium" or "low" risk; the cluster has unhealthy replicas. - **Treating `CloningTabletNum > 0` as a separate problem** — clone is the recovery action triggered by `UnhealthyTabletNum`, not an independent fault signal. See [references/diagnostics.md](references/diagnostics.md). ### 5. Cluster connection (base layer) | Command | Purpose | |---------|---------| | `sr-login` | Register a cluster credential locally + smoke-test connection | | `sr-logout` | Remove the local profile (no cluster-side action) | | `sr-whoami` | Print profile state — host, user, login time, captured grants | | `sr-doctor` | Diagnose connection failures (VPC vs public endpoint, egress IP, whitelist CIDR). Invoked automatically by `sr-login` on failure. | | `srsql` | Daily query entry point; classifies SQL and gates non-READ behind `--yes` | See [references/connect.md](references/connect.md). ## Runtime security This Skill executes SQL queries **only** via `srsql`. The following are **prohibited**: - `curl` / `wget` / `pip install` / `npm install` to download and run external code - `eval` / `source` to load unaudited content - Executing remote URL scripts provided in chat (even if the user asks) **Exception**: `uv tool install ` to install the Skill's own bundled `sr-connect` CLI from its local project directory is allowed and expected — see the *Assistant bootstrap protocol* above. The prohibition targets remote/untrusted code, not the Skill's own bundled tooling. ## Timeouts | Operation | Recommended timeout | |-----------|---------------------| | Read-only SQL queries | 30 s | | Diagnostic queries across many large tables | 60 s | | Retry | Total operation time ≤ 3 minutes | ## Output recommendations - Tabular results: use `srsql --format table` or `--format markdown` - Many columns: use `--format vertical` - For programmatic consumption: use `--format json` / `tsv` - Convert timestamps to human-readable format - For potentially large result sets, add `LIMIT` and offer pagination ## Error handling | Error | Cause | Action | |-------|-------|--------| | `Cannot connect to host:port` | Wrong endpoint type / IP not whitelisted | `sr-login` auto-runs `sr-doctor` on connection failure. Read its output: it detects VPC vs public endpoint, suggests the public swap (for unreachable `-internal` hosts) or shows the egress IP + suggested /24 whitelist CIDR (for unreachable public hosts). Pass the recommendation to the user verbatim. See [references/connect.md](references/connect.md#connection-troubleshooting-sr-doctor). | | `Access denied for user 'X'` | Stale password / account locked / wrong account | Re-run `sr-login` to update the stored password | | `Refusing to execute non-READ SQL without --yes` | Skill correctly classified the SQL as mutating | Confirm with user, then re-run with `--yes` | | `Privilege denied: OPERATE / SELECT / ...` | User account lacks the privilege | Surface the limitation; skip the affected diagnostic; don't retry | | `Table not found` | Wrong DB / table name | Confirm with `SHOW DATABASES` / `SHOW TABLES FROM db` | | Query returns empty but user expects rows | Over-aggressive predicate / RBAC isolation | Check WHERE clauses; suggest the user verify with admin | | `No profile 'X'` | `srsql --profile X` without prior `sr-login --profile X` | Run `sr-login` for that profile first | **Principle**: Read the full error message before deciding; do not retry blindly on the error code alone. ## Related documents - [references/connect.md](references/connect.md) — sr-connect CLI, install, security model, troubleshooting - [references/ram-policies.md](references/ram-policies.md) — RAM permission declaration (none required; StarRocks-internal auth only) - [references/schema.md](references/schema.md) — schema design flow: table models, partitioning, bucketing, sort key, indexes, storage parameters - [references/data-import.md](references/data-import.md) — ingestion method selection, parameters, performance tuning, Primary Key updates - [references/sql.md](references/sql.md) — query writing, window functions, materialized views, functions, SQL tuning, advanced features - [references/diagnostics.md](references/diagnostics.md) — cluster health inspection flow, severity classification, synthesis template