---
name: ops-inspector
description: AIOps-style one-click inspection skill for CloudBase resources. Use this skill when users need to diagnose errors, check resource health, inspect logs, or run a comprehensive health check across cloud functions, CloudRun services, databases, and other CloudBase resources.
version: 2.23.6
alwaysApply: false
---

## Standalone Install Note

If this environment only installed the current skill, start from the CloudBase main entry and use the published `cloudbase/references/...` paths for sibling skills.

- CloudBase main entry: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/SKILL.md`
- Current skill raw source: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/ops-inspector/SKILL.md`

Keep local `references/...` paths for files that ship with the current skill directory.

When this file points to a sibling skill such as `cloud-functions` or `cloudrun-development`, use the standalone fallback URL shown next to that reference.

## Activation Contract

### Use this first when

- The user wants to check the health or status of CloudBase resources (cloud functions, CloudRun, databases, storage, etc.).
- The user reports errors, failures, or abnormal behavior and wants a quick diagnosis.
- The user asks for an "inspection", "health check", "巡检", "诊断", or "troubleshooting" of their CloudBase environment.
- The user wants to review recent error logs across services.

### Read before writing code if

- The inspection reveals code-level issues in cloud functions or CloudRun services; in that case, read the relevant implementation skill before suggesting fixes.
- The user wants to fix a problem found during inspection rather than just diagnose it.
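
The sibling-skill lookups described in this file follow one pattern: prefer the local `../<skill>/SKILL.md` file, and otherwise use the published raw URL. A minimal sketch of that fallback, assuming the URL layout seen in the links in this file; the helper names `standalone_fallback_url` and `resolve_skill` are hypothetical and not part of any CloudBase tooling:

```python
from pathlib import Path

# Base of the published raw paths, copied from the URLs quoted in this file.
RAW_BASE = (
    "https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills"
    "/-/git/raw/main/skills/cloudbase"
)


def standalone_fallback_url(skill_name: str) -> str:
    """Build the published raw URL for a sibling skill (hypothetical helper)."""
    return f"{RAW_BASE}/references/{skill_name}/SKILL.md"


def resolve_skill(skill_name: str, skill_dir: Path) -> str:
    """Prefer the local sibling file; otherwise fall back to the published URL.

    `skill_dir` is assumed to be the directory of the current skill, so a
    sibling lives at ../<skill_name>/SKILL.md relative to it.
    """
    local = skill_dir.parent / skill_name / "SKILL.md"
    if local.is_file():
        return str(local)
    return standalone_fallback_url(skill_name)
```

For example, `resolve_skill("cloud-functions", Path("skills/cloudbase/references/ops-inspector"))` returns the local path when the sibling skill is installed alongside this one, and the standalone fallback URL otherwise.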
### Then also read

- Cloud function issues -> `../cloud-functions/SKILL.md` (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloud-functions/SKILL.md`)
- CloudRun issues -> `../cloudrun-development/SKILL.md` (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloudrun-development/SKILL.md`)
- Database issues, depending on database type:
  - CloudBase PG / PostgreSQL -> `../postgresql-development/SKILL.md` (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/postgresql-development/SKILL.md`)
  - MySQL -> `../relational-database-tool/SKILL.md` (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/relational-database-tool/SKILL.md`)
  - NoSQL -> `../no-sql-web-sdk/SKILL.md` (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/no-sql-web-sdk/SKILL.md`)
- Platform overview -> `../cloudbase-platform/SKILL.md` (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloudbase-platform/SKILL.md`)

### Do NOT use for

- Deploying new resources or writing application code. This skill is read-only and diagnostic.
- Replacing proper monitoring/alerting infrastructure. It provides point-in-time inspection, not continuous monitoring.
- Directly fixing problems: it diagnoses and recommends; actual fixes should use the appropriate implementation skill.

### Common mistakes / gotchas

- Running a full inspection without first confirming the environment is bound (`auth` tool must show logged-in and env-bound state).
- Ignoring CLS log service status: if CLS is not enabled, `queryLogs` will fail; always check first with `queryLogs(action="checkLogService")`.
- Searching logs without a time range: this can return excessive or irrelevant results. Always scope searches to a relevant time window.
- Treating a single error log as the root cause without correlating across resources. A function error may stem from a database or config issue.

### Minimal checklist

- [ ] Environment is bound and accessible (`envQuery(action="info")`)
- [ ] CLS log service is enabled (`queryLogs(action="checkLogService")`)
- [ ] All target resources are listed before diving into details
- [ ] Time range is specified for any log searches
- [ ] Findings are summarized with severity levels and actionable recommendations

---

## How to use this skill (for a coding agent)

### Inspection Modes

The skill supports two modes based on user intent:

| Mode | When to use | Scope |
|------|-------------|-------|
| **Full inspection** | User asks for a general health check / 巡检 / 全面检查 | All resource types in the environment |
| **Targeted inspection** | User reports a specific error or asks about a specific resource | One resource type or a specific resource |

### Full Inspection Workflow

Follow these steps in order for a comprehensive environment health check:

**Step 1: Environment Check**

```
envQuery(action="info")
```

Confirm the environment is accessible. Record the `envId` for console link generation.

**Step 2: Log Service Status**

```
queryLogs(action="checkLogService")
```

If CLS is not enabled, note this as a **warning**: log-based diagnosis will be unavailable. Recommend enabling CLS in the console: `https://tcb.cloud.tencent.com/dev?envId=${envId}#/devops/log`

**Step 3: Cloud Functions Inspection**

```
queryFunctions(action="listFunctions")
```

For each function, check:

- **Status**: Is the function in an active/deployed state?
- **Recent errors**: `queryFunctions(action="listFunctionLogs", functionName="", startTime="")`
- **Common issues**:
  - Timeout errors (execution exceeded limit)
  - Memory limit exceeded
  - Runtime errors (unhandled exceptions)
  - Cold start frequency

**Step 4: CloudRun Services Inspection**

```
queryCloudRun(action="list")
```

For each service, check:

- **Status**: Is the service running?
- **Detail**: `queryCloudRun(action="detail", detailServerName="")`
- **Common issues**:
  - Service not running (scaled to zero or crashed)
  - Image pull failures
  - OOMKilled events
  - Health check failures

**Step 5: Error Log Aggregation** (if CLS is enabled)

```
queryLogs(action="searchLogs", queryString="ERROR", service="tcb", startTime="<24h-ago>", limit=50)
queryLogs(action="searchLogs", queryString="ERROR", service="tcbr", startTime="<24h-ago>", limit=50)
```

Look for patterns:

- Repeated error messages (same error many times)
- Cascading failures (errors in multiple services around the same time)
- Timeout patterns

**Step 6: Summary Report**

Generate a structured report:

```markdown
# CloudBase Resource Inspection Report

**Environment**: ${envId}
**Inspection Time**: ${timestamp}

## Overall Health: ✅ Healthy / ⚠️ Warnings Found / ❌ Issues Found

### Cloud Functions

| Function | Status | Recent Errors | Severity |
|----------|--------|---------------|----------|
| ... | ... | ... | ... |

### CloudRun Services

| Service | Status | Issues | Severity |
|---------|--------|--------|----------|
| ... | ... | ... | ... |

### Error Log Summary

- Total errors in last 24h: N
- Top error patterns: ...

## Recommendations

1. ...
2. ...

## Console Links

- Cloud Functions: https://tcb.cloud.tencent.com/dev?envId=${envId}#/scf
- CloudRun: https://tcb.cloud.tencent.com/dev?envId=${envId}#/platform-run
- Logs: https://tcb.cloud.tencent.com/dev?envId=${envId}#/devops/log
```

### Targeted Inspection Workflow

When the user specifies a resource type or a specific resource:

1. **Cloud function errors**: `queryFunctions(action="listFunctionLogs", functionName="")` then `queryLogs(action="searchLogs", queryString="* AND functionName: AND level:ERROR", ...)`
2. **CloudRun errors**: `queryCloudRun(action="detail", detailServerName="")` then `queryLogs(action="searchLogs", queryString="ERROR", service="tcbr", ...)`
3. **Database issues**: Depending on the database type, check `queryPgDatabase(action="context"|"metadata"|"objects")` for CloudBase PG, `queryMysqlDatabase` for MySQL, or `readNoSqlDatabaseStructure` for NoSQL
4. **General error search**: `queryLogs(action="searchLogs", queryString="", ...)`

### AIOps Methodology

This skill follows AIOps principles for intelligent inspection:

1. **Data Collection**: Gather logs and resource states via MCP tools
2. **Pattern Recognition**: Identify recurring errors, anomaly patterns, and correlations across services
3. **Root Cause Hypothesis**: Based on error patterns, suggest likely root causes (e.g., a function timeout may be caused by a database query bottleneck)
4. **Actionable Recommendations**: Provide specific, prioritized remediation steps with links to relevant skills and console pages

### Severity Levels

| Level | Icon | Meaning |
|-------|------|---------|
| Critical | ❌ | Service is down or data is at risk; requires immediate action |
| Warning | ⚠️ | Errors detected but service is still partially functional; investigate soon |
| Info | ℹ️ | No errors found; informational status only |
| Healthy | ✅ | Resource is operating normally |

### Preferred Tool Map

| Operation | MCP Tool Call |
|-----------|---------------|
| Check environment | `envQuery(action="info")` |
| Check CLS status | `queryLogs(action="checkLogService")` |
| List cloud functions | `queryFunctions(action="listFunctions")` |
| Get function detail | `queryFunctions(action="getFunctionDetail", functionName="")` |
| Get function logs | `queryFunctions(action="listFunctionLogs", functionName="", startTime="