--- name: alibabacloud-pai-rec-diagnosis description: | Alibaba Cloud PAI-Rec Engine Diagnostic and Configuration Validation Skill. Use for diagnosing PAI-Rec engine interface issues and validating engine configurations. Triggers: "PAI-Rec", "engine diagnosis", "engine config validation", "pairec", "recommendation engine". --- # PAI-Rec Engine Diagnosis and Configuration Validation This skill provides comprehensive diagnostic and validation capabilities for Alibaba Cloud PAI-Rec (Programmable Recommendation System) engines, including interface troubleshooting and configuration analysis. ## Scenario Description PAI-Rec is Alibaba Cloud's programmable recommendation system that provides intelligent recommendation capabilities. This skill helps users: 1. **Diagnose PAI-Rec Engine Interface Issues**: When engine API returns errors or unexpected results, trace the request through EAS service logs and engine configurations to identify root causes. 2. **Validate Engine Configurations**: Analyze engine configuration files for potential issues, inconsistencies, or misconfigurations before deployment. **Architecture**: PAI-EAS Service + PAI-Rec Engine + Engine Configuration Management ### Key Components - **PAI-EAS Service**: Elastic Algorithm Service hosting the recommendation engine - **PAI-Rec Engine**: The recommendation engine processing requests - **Engine Configuration**: Configuration files defining engine behavior - **Service Logs**: EAS service logs containing request traces --- ## Installation **Pre-check: Aliyun CLI >= 3.3.3 required** > Run `aliyun version` to verify >= 3.3.3. If not installed or version too low, > run `curl -fsSL https://aliyuncli.alicdn.com/setup.sh | bash` to install/update, > or see `references/cli-installation-guide.md` for installation instructions. **Pre-check: Aliyun CLI plugin update required** > [MUST] run `aliyun configure set --auto-plugin-install true` to enable automatic plugin installation. > [MUST] run `aliyun plugin update` to ensure that any existing plugins are always up-to-date. --- ## Authentication > **Pre-check: Alibaba Cloud Credentials Required** > > **Security Rules:** > - **NEVER** read, echo, or print AK/SK values (e.g., `echo $ALIBABA_CLOUD_ACCESS_KEY_ID` is FORBIDDEN) > - **NEVER** ask the user to input AK/SK directly in the conversation or command line > - **NEVER** use `aliyun configure set` with literal credential values > - **ONLY** use `aliyun configure list` to check credential status > > ```bash > aliyun configure list > ``` > Check the output for a valid profile (AK, STS, or OAuth identity). > > **If no valid profile exists, STOP here.** > 1. Obtain credentials from [Alibaba Cloud Console](https://ram.console.aliyun.com/manage/ak) > 2. Configure credentials **outside of this session** (via `aliyun configure` in terminal or environment variables in shell profile) > 3. Return and re-run after `aliyun configure list` shows a valid profile --- ## RAM Policy > **[MUST] Permission Failure Handling:** When any command or API call fails due to permission errors at any point during execution, follow this process: > 1. Read `references/ram-policies.md` to get the full list of permissions required by this SKILL > 2. Use `ram-permission-diagnose` skill to guide the user through requesting the necessary permissions > 3. Pause and wait until the user confirms that the required permissions have been granted For detailed RAM policies required by this skill, see [references/ram-policies.md](references/ram-policies.md). --- ## Parameter Confirmation > **IMPORTANT: Parameter Confirmation** — Before executing any command or API call, > ALL user-customizable parameters (e.g., RegionId, service names, instance IDs, > environment names, config names, etc.) MUST be confirmed with the user. > Do NOT assume or use default values without explicit user approval. ### Required Parameters | Parameter Name | Required/Optional | Description | Default Value | |----------------|-------------------|-------------|---------------| | `service-name` | Required | PAI-EAS service name | N/A | | `cluster-id` | Required | EAS cluster/region ID | N/A | | `instance-id` | Required | PAI-Rec instance ID | N/A | | `request-id` | Required (for diagnosis) | Request ID from engine API response | N/A | | `config-name` | Optional | Engine configuration name | N/A | | `environment` | Optional | Engine environment (Prod/Pre) | N/A | | `engine-config-id` | Optional | Specific engine config version ID | N/A | | `region` | Optional | Alibaba Cloud region | N/A | --- ## Core Workflow > At the **start** of the Core Workflow (before any CLI invocation): > **[MUST] Enable AI-Mode** — AI-mode is required for Agent Skill execution. > Run the following commands before any CLI invocation: > ```bash > aliyun configure ai-mode enable > aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-pai-rec-diagnosis" > ``` > **[MUST] Disable AI-Mode at EVERY exit point** — Before delivering the final response for ANY reason, always disable AI-mode first. This applies to ALL exit paths: workflow success, workflow failure, error/exception, user cancellation, session end, or any other scenario where no further CLI commands will be executed. > AI-mode is only used for Agent Skill invocation scenarios and MUST NOT remain enabled after the skill stops running. > ```bash > aliyun configure ai-mode disable > ``` > **[MUST] Allocate a per-session work directory** — All transient artifacts > (raw config dumps, log dumps, sanitized output) MUST be written under a unique > per-session directory to avoid concurrent overwrite between parallel skill > invocations. Run the following at the start of the workflow, before any > artifact-producing CLI invocation, and reuse `$WORKDIR` for the whole session: > ```bash > # Create an isolated working directory for this session, pinned under /tmp. > # Pass the full template as a positional arg (works on both BSD/macOS and > # GNU/Linux mktemp); do NOT use `-t prefix`, which falls back to $TMPDIR > # (e.g. /var/folders on macOS) and may even fail under sandboxed shells. > export WORKDIR=$(mktemp -d /tmp/pairec-diag-XXXXXX) > ``` > All file paths shown in the steps below (`$WORKDIR/engine_configs_list.json`, > `$WORKDIR/raw_engine_config.json`, etc.) live inside this directory and MUST NOT > be replaced with hard-coded `/tmp/...` paths. ### Workflow 1: PAI-Rec Engine Interface Diagnosis This workflow helps diagnose issues when a PAI-Rec engine API returns errors or unexpected results. **Input Example:** ``` Service Name: embedding_recall API Response: { "code": 299, "msg": "items size not enough", "request_id": "941b4e14-d1c5-489f-a184-b2b17f8b4fdb", "size": 0, "experiment_id": "", "items": [] } ``` #### Step 1: Retrieve EAS Service Information Get the service details to find the EAS service ID and configuration: ```bash aliyun eas describe-service \ --cluster-id \ --service-name ``` **What to extract:** - `Resource`: EAS service resource ID (e.g., `eas-r-1v4qb1yan3qmnjwxqe`) - `ServiceConfig.envs`: Environment variables containing: - `REGION`: The region - `INSTANCE_ID`: PAI-Rec instance ID - `CONFIG_NAME`: Engine configuration name - `PAIREC_ENVIRONMENT`: Environment (product/prepub) #### Step 2: Extract Request ID from API Response Parse the API response JSON to get the `request_id` field. This will be used to search service logs. #### Step 3: Query EAS Service Logs Use the request ID as the **sole** filter to search service logs. Do NOT pass `--start-time` / `--end-time` when searching PAI-Rec business logs: ```bash aliyun eas describe-service-log \ --cluster-id \ --service-name \ --keyword \ --page-size 500 ``` **[CRITICAL] `--keyword ` is MANDATORY — local post-processing is FORBIDDEN:** - You MUST pass `--keyword ` to the CLI command. This is a server-side filter; the API only returns log lines matching the keyword. - You MUST NOT omit `--keyword` and then filter locally (e.g., piping through `head`, `grep`, `python3`, `jq`, or any script to search for the request_id in the output). - You MUST NOT make multiple `describe-service-log` calls without `--keyword` hoping to find relevant lines by scanning the full log stream. - If a call with `--keyword` returns empty results, report that no matching logs were found — do NOT fall back to fetching unfiltered logs. > **❌ WRONG** (fetches ALL logs, filters locally — FORBIDDEN): > ```bash > aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall | head -300 > aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall | python3 filter.py > aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall | grep "request_id" > ``` > > **✅ RIGHT** (server-side keyword filter — REQUIRED): > ```bash > aliyun eas describe-service-log --cluster-id cn-beijing --service-name embedding_recall --keyword 0c6cbd91-5618-4705-8e08-9126bf4600f7 --page-size 500 > ``` **[CRITICAL] Known CLI pitfall — keyword-only lookup is required for business logs:** - When only `--keyword` is supplied (no time range), the CLI returns the full PAI-Rec application trace (`controller.go` / `feed.go` / `recall.go` / `rank_service.go` etc.) matching the request_id. - As soon as `--start-time` / `--end-time` are added — even if the window covers the real log timestamp — the CLI silently drops business logs and only returns infrastructure noise (`/bin/sh` wrapper heartbeats, `502 Bad Gateway` retries, `postgres.go dbstat`). - Therefore: for request-level diagnosis, always omit the time range and rely on `--keyword ` alone. **Notes:** - `--keyword`: Use the full `request_id` extracted from the API response (case-sensitive exact match). - `--page-size`: Raise to 500 to capture the entire trace in a single page; total matched entries for one request is usually < 30. - `--start-time` / `--end-time`: Only use these for broad time-window scans **without** `--keyword` (e.g., when investigating non-request-specific issues). Required format is `yyyy-MM-dd HH:mm:ss` in UTC (space separator, no `T` / no `Z`). ISO-8601 forms like `2025-04-28T00:00:00Z` will be rejected with `InvalidParameter`. #### Step 4: List Engine Configurations Map the environment and list matching configurations: **Environment Mapping:** - `product` → `Prod` - `prepub` → `Pre` ```bash aliyun pairecservice list-engine-configs \ --instance-id \ --environment \ --status Released \ --name > "$WORKDIR/engine_configs_list.json" 2>&1 ``` **[MUST] Always pass `--name ` for server-side filtering:** - `` is already known from Step 1 (`ServiceConfig.envs.CONFIG_NAME`); it MUST be forwarded to this call as `--name`. - Omitting `--name` returns the entire instance's config inventory (often hundreds of unrelated entries), forces client-side filtering, wastes tokens, and risks hitting CLI default pagination so the target row is silently dropped. - `--name` is an exact-match filter on the server; do NOT substitute with `grep` / `jq select` post-processing. - The same rule applies to every `list-engine-configs` invocation in this skill (Workflow 2 Step 1 included). **What to extract:** - Find the configuration with `Status: Released` - Get `EngineConfigId` and `Version` #### Step 5: Get Engine Configuration Details ```bash aliyun pairecservice get-engine-config \ --instance-id \ --engine-config-id > "$WORKDIR/raw_engine_config.json" 2>&1 ``` **[MUST] Sanitize before display** — Config may contain plaintext passwords or access keys. Always pipe through the sanitizer before printing to terminal: ```bash python3 scripts/sanitize_config.py "$WORKDIR/raw_engine_config.json" ``` Only the sanitized output (with credentials replaced by `***REDACTED***`) should appear in the terminal. The raw file at `$WORKDIR/raw_engine_config.json` can be passed directly to `scripts/validate.py` for validation (validate.py does not print credential values). **What to extract:** - `ConfigValue`: The actual engine configuration (JSON/YAML) #### Step 5.5 (Optional): Static Config Sanity Check Optionally run `scripts/validate.py` against the retrieved `ConfigValue` to quickly rule out structural / reference / naming errors in the engine configuration before diving into the log trace. See Workflow 2 § Step 3 and [references/config-validation.md](references/config-validation.md) for usage, exit codes, and the full rule list. ```bash printf '%s' "$CONFIG_VALUE" | python3 scripts/validate.py --stdin ``` **When to run:** when the log trace points at a specific configuration element (e.g. a `RecallConfs` / `FilterConfs` / `SceneConfs` entry), or when the configuration is being diagnosed for the first time in this skill session. **When to skip:** when the log trace already shows a decisive non-config root cause (e.g. a `scene_id` not present in `SceneConfs`, a 5xx from an upstream EAS dependency, a missing feature table). `validate.py` is a static checker and cannot detect request-time mismatches between client input and configuration. **[MUST] Scoping rule for the final report:** - `validate.py` findings may enter the final diagnosis ONLY when they are directly tied to the log evidence for the current `request_id` (e.g. the log blames a `RecallConf` name that `validate.py` flags as duplicated or dangling). - Findings unrelated to the current `request_id` trace MUST NOT be added to the final conclusion. They remain an internal sanity-check signal only. This preserves the evidence-only reporting rule in Step 6. #### Step 6: Comprehensive Analysis Analyze the following components together: 1. **API Response**: Error code, message, and returned data 2. **Service Logs**: Trace logs for the request_id showing processing flow 3. **Engine Configuration**: Settings that may affect the behavior **Common Issues to Check:** - Configuration mismatches (e.g., recall settings, filtering rules) - Resource limitations (e.g., insufficient items, timeout settings) - Data source issues (e.g., table access, feature availability) - Environment inconsistencies (e.g., prod config in prepub environment) **[MUST] Evidence-only reporting rule:** The final diagnosis delivered to the user MUST be grounded strictly in what the EAS service logs and the engine configuration directly show. Apply the following constraints: - **Report only what is observed.** Quote the exact log line (file:line, level, message) and the exact config fragment that proves each claim. - **State the direct causal chain** from log evidence to the API response, and stop there. - **Do NOT add** any of the following unless the user explicitly asks: - Speculative root causes not visible in logs/config (e.g., "client probably sent wrong X") - Fix recommendations or remediation steps - Conditional "if X then Y" scenarios - Tangential best-practice advice (security, fallback design, naming, etc.) - Guesses about upstream systems, client code, or data sources not covered by the logs/config - **If the evidence is insufficient** to reach a conclusion, state explicitly what additional data (specific log lines, other config versions, other environments) is needed, instead of guessing. - **Recommendations are opt-in only.** Provide fixes/suggestions only when the user explicitly requests them in a follow-up. --- ### Workflow 2: PAI-Rec Engine Configuration Validation This workflow validates engine configurations for potential issues. **Input:** Configuration name and environment (Prod/Pre) #### Step 1: List Configuration Versions If user doesn't provide `engine-config-id`, list available versions: ```bash aliyun pairecservice list-engine-configs \ --instance-id \ --environment \ --name ``` **Display to user:** - `Version`: Version number - `Status`: Configuration status (Released/Draft/Archived) - `GmtCreateTime`: Creation timestamp - `EngineConfigId`: Version ID Ask user to select a version or provide the `engine-config-id`. #### Step 2: Retrieve Configuration Details ```bash aliyun pairecservice get-engine-config \ --instance-id \ --engine-config-id > "$WORKDIR/raw_engine_config.json" 2>&1 ``` **[MUST] Sanitize before display** — Always sanitize before printing to terminal: ```bash python3 scripts/sanitize_config.py "$WORKDIR/raw_engine_config.json" ``` #### Step 3: Run Schema + Rule Validation **[MUST]** Feed the extracted `ConfigValue` JSON into `scripts/validate.py`. The script enforces JSON Schema (`references/schema.json`) + reference-consistency rules and exits with status 0 on pass, 1 on failure. ```bash # From stdin (recommended when ConfigValue is already in memory) printf '%s' "$CONFIG_VALUE" | python3 scripts/validate.py --stdin # From a saved JSON file python3 scripts/validate.py "$WORKDIR/raw_engine_config.json" # From an inline JSON string python3 scripts/validate.py '{"RunMode":"product","RecallConfs":[...]}' ``` Requires `jsonschema` (`pip install jsonschema`); if missing the script falls back to rule-only validation without Schema checks. **What the script checks (summary):** 1. **Structure** — JSON well-formedness, required fields, types (`RunMode`, `RecallConfs`, `FilterConfs`, `SortConfs`, `AlgoConfs`, `SceneConfs`, `RankConf`, `FeatureConfs`, `UserFeatureConfs`, `DebugConfs`, `FeatureLogConfs`, `CallBackConfs`, `PipelineConfs`, etc.) 2. **Enum values** — `RecallType` / `FilterType` / `SortType` / `RunMode` / `DebugConfs.OutputType` / `GeneralRankConfs.ActionConfs[].ActionType` 3. **Reference consistency** — `SceneConfs.RecallNames` → `RecallConfs`; `FilterNames` → `FilterConfs`; `SortNames` → `SortConfs`; `RankConf.RankAlgoList` → `AlgoConfs`; any `DaoConf.AdapterType` + `*Name` → the corresponding `*Confs` (Hologres / Redis / MySQL / TableStore / FeatureStore / …) 4. **Business rules** - `User2ItemExposureFilter` with `WriteLog=true` + FeatureStore adapter: must set `TimeInterval > 0` - `PriorityAdjustCountFilter` in `accumulator` mode: `Count` must be strictly increasing (use `Type="fix"` for independent per-recall caps) - `PipelineConfs.*.Name` must be globally unique - `DebugConfs.Rate` must be an integer in `[0, 100]` 5. **Duplicate name detection** within `RecallConfs`, `FilterConfs`, `SortConfs`, `AlgoConfs` Detailed usage, exit codes, example outputs and the full rule list live in [references/config-validation.md](references/config-validation.md). #### Step 4: Evidence-Grounded Report Report to the user based strictly on the script's output plus any additional inspection of `ConfigValue`: - ✅ Checks passed (Schema clean, references resolved, no rule violations) - ⚠️ Warnings reported by the script (severity=warning) or inconsistencies observed in `ConfigValue` (e.g. naming collisions between `RankScore` variables and model output fields, env/region mismatches) - ❌ Errors reported by the script (severity=error) or missing required fields - Missing-evidence notes — what extra data (other config versions, model signatures, etc.) would be needed to turn a warning into a confirmed error Do not add speculative fixes or best-practice tangents; suggestions are provided only when the user explicitly asks for them. --- ## Success Verification Method For detailed verification steps, see [references/verification-method.md](references/verification-method.md). **Quick Verification:** 1. **For Diagnosis Workflow:** - Service information retrieved successfully - Logs found containing the request_id - Configuration loaded correctly - Root cause identified 2. **For Validation Workflow:** - Configuration retrieved successfully - All validation checks executed - Issues clearly reported - Recommendations provided (if applicable) --- ## Cleanup This skill performs read-only Alibaba Cloud API calls (no remote resources are created). Transient artifacts are written to a per-session local working directory `$WORKDIR` under `/tmp` (see Core Workflow preamble). The skill does NOT delete `$WORKDIR` automatically — the OS-level temporary file policy is relied on for eventual reclamation (macOS reaps `/tmp` periodically; most Linux distros reap on reboot or via `systemd-tmpfiles`). If an operator wants to free disk space sooner, they may manually run `rm -rf /tmp/pairec-diag-*` outside the workflow. --- ## Best Practices 1. **Always capture request_id**: When reporting API issues, include the full response with request_id for accurate log correlation. 2. **Log queries — keyword only, no time range, no local filtering**: For request-level diagnosis, pass `--keyword ` to `aliyun eas describe-service-log` and leave `--start-time` / `--end-time` unset. NEVER omit `--keyword` and post-process locally (e.g., `| head`, `| grep`, `| python3`) — this defeats server-side filtering, wastes tokens, and may miss logs beyond the first page. Combining keyword with a time range filters out business logs due to a CLI quirk (see Workflow 1, Step 3). Only use time ranges for broad non-request scans, and only with the `yyyy-MM-dd HH:mm:ss` UTC format (no `T` / no `Z`). 3. **Environment awareness**: Always verify that configurations match the target environment (Prod vs Pre). 4. **Version control**: When validating configurations, check multiple versions if issues persist across deployments. 5. **Log retention**: EAS service logs are retained for limited periods; diagnose issues promptly after occurrence. 6. **Configuration backup**: Before applying changes based on validation results, ensure current configurations are backed up. 7. **Cross-reference**: Compare working configurations with problematic ones to identify differences. 8. **Service status**: Check EAS service status before diagnosing; service-level issues may mask configuration problems. 9. **Evidence-only conclusions**: Ground every statement in the diagnosis on a specific log line or config fragment. Do not speculate, do not propose fixes, and do not volunteer best-practice advice unless the user explicitly asks. If the evidence is insufficient, say what is missing rather than inferring. 10. **Structured analysis**: Follow the systematic workflow rather than jumping to conclusions based on error messages alone. 11. **Document findings**: Keep track of recurring issues and their resolutions for faster future diagnosis. --- ## Reference Links | Reference Document | Description | |--------------------|-------------| | [RAM Policies](references/ram-policies.md) | Required RAM permissions for PAI-Rec and EAS APIs | | [Related Commands](references/related-commands.md) | Complete CLI command reference | | [Verification Method](references/verification-method.md) | Detailed verification procedures | | [CLI Installation Guide](references/cli-installation-guide.md) | Alibaba Cloud CLI installation instructions | | [Configuration Examples](references/configuration-examples.md) | Sample engine configurations and common patterns | | [Config Validation](references/config-validation.md) | `scripts/validate.py` usage, exit codes, rule catalogue | | [Troubleshooting Guide](references/troubleshooting-guide.md) | Common issues and solutions | | [Config Sanitization](scripts/sanitize_config.py) | Credential redaction before LLM analysis |