---
name: ops-inspector
description: AIOps-style one-click inspection skill for CloudBase resources. Use this skill when users need to diagnose errors, check resource health, inspect logs, or run a comprehensive health check across cloud functions, CloudRun services, databases, and other CloudBase resources.
version: 2.23.6
alwaysApply: false
---

## Standalone Install Note

If only the current skill is installed in this environment, start from the CloudBase main entry and use the published `cloudbase/references/...` paths for sibling skills.

- CloudBase main entry: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/SKILL.md`
- Current skill raw source: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/ops-inspector/SKILL.md`

Keep local `references/...` paths for files that ship with the current skill directory. When this file points to a sibling skill such as `cloud-functions` or `cloudrun-development`, use the standalone fallback URL shown next to that reference.

## Activation Contract

### Use this first when

- The user wants to check the health or status of CloudBase resources (cloud functions, CloudRun, databases, storage, etc.).
- The user reports errors, failures, or abnormal behavior and wants a quick diagnosis.
- The user asks for an "inspection", "health check", "巡检", "诊断", or "troubleshooting" of their CloudBase environment.
- The user wants to review recent error logs across services.

### Read before writing code if

- The inspection reveals code-level issues in cloud functions or CloudRun services — then read the relevant implementation skill before suggesting fixes.
- The user wants to fix a problem found during inspection rather than just diagnose it.

### Then also read

- Cloud function issues -> `../cloud-functions/SKILL.md` (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloud-functions/SKILL.md`)
- CloudRun issues -> `../cloudrun-development/SKILL.md` (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloudrun-development/SKILL.md`)
- Database issues -> `../postgresql-development/SKILL.md` for CloudBase PG / PostgreSQL (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/postgresql-development/SKILL.md`), `../relational-database-tool/SKILL.md` for MySQL (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/relational-database-tool/SKILL.md`), or `../no-sql-web-sdk/SKILL.md` for NoSQL (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/no-sql-web-sdk/SKILL.md`)
- Platform overview -> `../cloudbase-platform/SKILL.md` (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloudbase-platform/SKILL.md`)

### Do NOT use for

- Deploying new resources or writing application code. This skill is read-only and diagnostic.
- Replacing proper monitoring/alerting infrastructure. It provides point-in-time inspection, not continuous monitoring.
- Directly fixing problems — it diagnoses and recommends; actual fixes should use the appropriate implementation skill.

### Common mistakes / gotchas

- Running a full inspection without first confirming the environment is bound (`auth` tool must show logged-in and env-bound state).
- Ignoring CLS log service status — if CLS is not enabled, `queryLogs` will fail; always check first with `queryLogs(action="checkLogService")`.
- Searching logs without a time range — this can return excessive or irrelevant results. Always scope searches to a relevant time window.
- Treating a single error log as the root cause without correlating across resources. A function error may stem from a database or config issue.

### Minimal checklist

- [ ] Environment is bound and accessible (`envQuery(action="info")`)
- [ ] CLS log service is enabled (`queryLogs(action="checkLogService")`)
- [ ] All target resources are listed before diving into details
- [ ] Time range is specified for any log searches
- [ ] Findings are summarized with severity levels and actionable recommendations

---

## How to use this skill (for a coding agent)

### Inspection Modes

The skill supports two modes based on user intent:

| Mode | When to use | Scope |
|------|-------------|-------|
| **Full inspection** | User asks for a general health check / 巡检 / 全面检查 | All resource types in the environment |
| **Targeted inspection** | User reports a specific error or asks about a specific resource | One resource type or a specific resource |

### Full Inspection Workflow

Follow these steps in order for a comprehensive environment health check:

**Step 1 — Environment Check**

```
envQuery(action="info")
```

Confirm the environment is accessible. Record the `envId` for console link generation.

**Step 2 — Log Service Status**

```
queryLogs(action="checkLogService")
```

If CLS is not enabled, note this as a **warning** — log-based diagnosis will be unavailable. Recommend enabling CLS in the console: `https://tcb.cloud.tencent.com/dev?envId=${envId}#/devops/log`

**Step 3 — Cloud Functions Inspection**

```
queryFunctions(action="listFunctions")
```

For each function, check:
- **Status**: Is the function in an active/deployed state?
- **Recent errors**: `queryFunctions(action="listFunctionLogs", functionName="<name>", startTime="<recent>")`
- **Common issues**:
  - Timeout errors (execution exceeded limit)
  - Memory limit exceeded
  - Runtime errors (unhandled exceptions)
  - Cold start frequency
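Once function logs are collected, each message can be sorted into the common issue types listed above. The following is a minimal sketch, not part of any CloudBase tool: the `classify` function and the `ISSUE_PATTERNS` table are hypothetical, the keywords are borrowed from this skill's CLS query patterns, and real log wording may differ, so anything unmatched falls back to `other`.

```python
import re

# Keywords mirror this skill's CLS query patterns. Matching real log text
# this way is an assumption and will miss errors that are worded differently.
ISSUE_PATTERNS = {
    "timeout": r"\btimeout\b|超时",
    "memory": r"\boom\b|out of memory|内存超限",
    "cold_start": r"\bcoldstart\b|冷启动",
}


def classify(message: str) -> str:
    """Return the first issue type whose pattern appears in the message."""
    for issue, pattern in ISSUE_PATTERNS.items():
        if re.search(pattern, message, re.IGNORECASE):
            return issue
    # Unhandled exceptions and everything else land here for manual review.
    return "other"


for line in ["Invoke timeout after 3s", "Process OOM killed", "TypeError: x is undefined"]:
    print(classify(line), "<-", line)
```

The word boundaries keep short keywords such as `OOM` from matching inside unrelated words.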

**Step 4 — CloudRun Services Inspection**

```
queryCloudRun(action="list")
```

For each service, check:
- **Status**: Is the service running?
- **Detail**: `queryCloudRun(action="detail", detailServerName="<name>")`
- **Common issues**:
  - Service not running (scaled to zero or crashed)
  - Image pull failures
  - OOMKilled events
  - Health check failures

**Step 5 — Error Log Aggregation** (if CLS is enabled)

```
queryLogs(action="searchLogs", queryString="ERROR", service="tcb", startTime="<24h-ago>", limit=50)
queryLogs(action="searchLogs", queryString="ERROR", service="tcbr", startTime="<24h-ago>", limit=50)
```

Look for patterns:
- Repeated error messages (same error many times)
- Cascading failures (errors in multiple services around the same time)
- Timeout patterns
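The pattern checks above can be approximated with a small amount of grouping logic. This is a sketch under stated assumptions: the record shape (`service`, `time`, `message`) is hypothetical and the actual `searchLogs` response fields may differ, and the five-minute bucket for spotting cascading failures is an arbitrary illustrative choice.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical records: the real searchLogs response shape may differ.
logs = [
    {"service": "tcb", "time": "2025-01-15 10:00:05",
     "message": "Task timed out after 3 seconds requestId=ab12"},
    {"service": "tcb", "time": "2025-01-15 10:01:40",
     "message": "Task timed out after 3 seconds requestId=cd34"},
    {"service": "tcbr", "time": "2025-01-15 10:02:10",
     "message": "ERROR upstream connect failed"},
    {"service": "tcb", "time": "2025-01-15 11:30:00",
     "message": "ERROR unhandled exception"},
]


def normalize(message: str) -> str:
    """Collapse variable parts so repeats of the same error group together."""
    message = re.sub(r"requestId=\S+", "requestId=<id>", message)
    return re.sub(r"\d+", "<n>", message)


def top_patterns(records, n=3):
    """Most frequent normalized messages as (pattern, count) pairs."""
    return Counter(normalize(r["message"]) for r in records).most_common(n)


def cascading_windows(records, bucket_minutes=5):
    """Time buckets in which more than one service logged errors."""
    buckets = defaultdict(set)
    for r in records:
        dt = datetime.strptime(r["time"], "%Y-%m-%d %H:%M:%S")
        start = dt.replace(minute=dt.minute - dt.minute % bucket_minutes, second=0)
        buckets[start.strftime("%Y-%m-%d %H:%M:%S")].add(r["service"])
    return [(t, sorted(s)) for t, s in sorted(buckets.items()) if len(s) > 1]


print(top_patterns(logs, 1))
print(cascading_windows(logs))
```

A bucket that contains both `tcb` and `tcbr` errors is only a hint of a cascade; it still needs to be checked against the resource details before naming a root cause.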

**Step 6 — Summary Report**

Generate a structured report:

```markdown
# CloudBase Resource Inspection Report

**Environment**: ${envId}
**Inspection Time**: ${timestamp}

## Overall Health: ✅ Healthy / ⚠️ Warnings Found / ❌ Issues Found

### Cloud Functions
| Function | Status | Recent Errors | Severity |
|----------|--------|---------------|----------|
| ... | ... | ... | ... |

### CloudRun Services
| Service | Status | Issues | Severity |
|---------|--------|--------|----------|
| ... | ... | ... | ... |

### Error Log Summary
- Total errors in last 24h: N
- Top error patterns: ...

## Recommendations
1. ...
2. ...

## Console Links
- Cloud Functions: https://tcb.cloud.tencent.com/dev?envId=${envId}#/scf
- CloudRun: https://tcb.cloud.tencent.com/dev?envId=${envId}#/platform-run
- Logs: https://tcb.cloud.tencent.com/dev?envId=${envId}#/devops/log
```
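The tables and console links in the report template can be generated from collected findings. The sketch below is illustrative only: `console_links` and `render_table` are hypothetical helper names, the URL fragments are copied from the template above, and the `envId` is assumed to need no URL escaping.

```python
CONSOLE_BASE = "https://tcb.cloud.tencent.com/dev"


def console_links(env_id: str) -> dict:
    """Build the console links used in the report for one environment."""
    base = f"{CONSOLE_BASE}?envId={env_id}"
    return {
        "Cloud Functions": f"{base}#/scf",
        "CloudRun": f"{base}#/platform-run",
        "Logs": f"{base}#/devops/log",
    }


def render_table(headers, rows) -> str:
    """Render a Markdown table in the same layout as the report template."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join("-" * (len(h) + 2) for h in headers) + "|",
    ]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)


# "env-123" and the row values are placeholder data, not real findings.
print(render_table(["Function", "Status"], [["hello", "Active"]]))
for label, url in console_links("env-123").items():
    print(f"- {label}: {url}")
```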

### Targeted Inspection Workflow

When the user specifies a resource type or a specific resource:

1. **Cloud function errors**: `queryFunctions(action="listFunctionLogs", functionName="<name>", startTime="<time>")` then `queryLogs(action="searchLogs", queryString="functionName:<name> AND level:ERROR", ...)`
2. **CloudRun errors**: `queryCloudRun(action="detail", detailServerName="<name>")` then `queryLogs(action="searchLogs", queryString="ERROR", service="tcbr", ...)`
3. **Database issues**: Depending on the database type, call `queryPgDatabase` with `action` set to `"context"`, `"metadata"`, or `"objects"` for CloudBase PG, `queryMysqlDatabase` for MySQL, or `readNoSqlDatabaseStructure` for NoSQL
4. **General error search**: `queryLogs(action="searchLogs", queryString="<error-keyword>", ...)`
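
The targeted flows above can be written down as data before any tool is called, which makes the call order easy to review. This is a rough sketch: `targeted_plan` is a hypothetical helper, it only describes the calls rather than invoking MCP tools, and the per-function query uses the form from the Common CLS Query Patterns table.

```python
def targeted_plan(kind: str, name: str, window: dict) -> list:
    """Return ordered (tool, params) pairs for one targeted inspection."""
    if kind == "function":
        return [
            ("queryFunctions",
             {"action": "listFunctionLogs", "functionName": name, **window}),
            ("queryLogs",
             {"action": "searchLogs",
              "queryString": f"functionName:{name} AND level:ERROR",
              **window}),
        ]
    if kind == "cloudrun":
        return [
            ("queryCloudRun", {"action": "detail", "detailServerName": name}),
            ("queryLogs",
             {"action": "searchLogs", "queryString": "ERROR",
              "service": "tcbr", **window}),
        ]
    raise ValueError(f"unsupported resource kind: {kind!r}")


# Placeholder time window and service name for illustration.
window = {"startTime": "2025-01-15 00:00:00", "endTime": "2025-01-15 01:00:00"}
for tool, params in targeted_plan("cloudrun", "api", window):
    print(tool, params)
```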

### AIOps Methodology

This skill follows AIOps principles for intelligent inspection:

1. **Data Collection**: Gather logs and resource states via MCP tools
2. **Pattern Recognition**: Identify recurring errors, anomaly patterns, and correlations across services
3. **Root Cause Hypothesis**: Based on error patterns, suggest likely root causes (e.g., a function timeout may be caused by a database query bottleneck)
4. **Actionable Recommendations**: Provide specific, prioritized remediation steps with links to relevant skills and console pages

### Severity Levels

| Level | Icon | Meaning |
|-------|------|---------|
| Critical | ❌ | Service is down or data is at risk; requires immediate action |
| Warning | ⚠️ | Errors detected but service is still partially functional; investigate soon |
| Info | ℹ️ | No errors found; informational status only |
| Healthy | ✅ | Resource is operating normally |
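
Per-resource severities can be rolled up into the report's overall health line by taking the worst level found. This is a sketch under assumptions: the ranking of Info above Healthy and the mapping onto the three overall labels are interpretations of the tables in this skill, and the function names are hypothetical.

```python
# Assumed ordering from least to most severe.
SEVERITY_RANK = {"healthy": 0, "info": 1, "warning": 2, "critical": 3}

# Assumed mapping onto the overall health labels used in the summary report.
OVERALL_LABEL = {"critical": "❌ Issues Found", "warning": "⚠️ Warnings Found"}


def worst_severity(levels) -> str:
    """Return the most severe level, or 'healthy' when nothing was reported."""
    return max(levels, key=SEVERITY_RANK.__getitem__, default="healthy")


def overall_health(levels) -> str:
    """Map the worst per-resource severity onto an overall health label."""
    return OVERALL_LABEL.get(worst_severity(levels), "✅ Healthy")


print(overall_health(["healthy", "warning", "info"]))
print(overall_health(["healthy", "critical"]))
print(overall_health([]))
```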

### Preferred Tool Map

| Operation | MCP Tool Call |
|-----------|---------------|
| Check environment | `envQuery(action="info")` |
| Check CLS status | `queryLogs(action="checkLogService")` |
| List cloud functions | `queryFunctions(action="listFunctions")` |
| Get function detail | `queryFunctions(action="getFunctionDetail", functionName="<name>")` |
| Get function logs | `queryFunctions(action="listFunctionLogs", functionName="<name>", startTime="<time>", endTime="<time>")` |
| Get function log detail | `queryFunctions(action="getFunctionLogDetail", requestId="<id>")` |
| List CloudRun services | `queryCloudRun(action="list")` |
| Get CloudRun detail | `queryCloudRun(action="detail", detailServerName="<name>")` |
| Search CLS logs | `queryLogs(action="searchLogs", queryString="<query>", service="tcb\|tcbr", startTime="<time>", endTime="<time>")` |
| Check NoSQL structure | `readNoSqlDatabaseStructure(action="listCollections")` |
| Check PostgreSQL context | `queryPgDatabase(action="context")` |
| Check PostgreSQL metadata | `queryPgDatabase(action="metadata", limit=20)` |
| Check MySQL status | `queryMysqlDatabase(action="getContext")` |

### Common CLS Query Patterns

| Scenario | queryString |
|----------|-------------|
| All errors | `ERROR` |
| Function timeout | `timeout OR 超时` |
| Function OOM | `OOM OR "out of memory" OR 内存超限` |
| CloudRun crash | `crash OR OOMKilled OR Error` |
| Specific function errors | `functionName:<name> AND level:ERROR` |
| 5xx HTTP errors | `statusCode:>499` |
| Cold start issues | `coldStart OR 冷启动` |
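
Query strings like these can be assembled from keyword lists instead of being typed by hand. A minimal sketch follows: `any_of` and `function_errors` are hypothetical helpers, and wrapping multi-word terms in double quotes assumes CLS treats a quoted string as a phrase match, which should be confirmed against the CLS search syntax documentation.

```python
def any_of(*terms: str) -> str:
    """Join keywords with OR, quoting any term that contains a space."""
    # Assumption: a double-quoted term is matched as a phrase by CLS.
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return " OR ".join(quoted)


def function_errors(function_name: str) -> str:
    """Build the per-function error query shown in the table above."""
    return f"functionName:{function_name} AND level:ERROR"


print(any_of("timeout", "超时"))
print(any_of("OOM", "out of memory", "内存超限"))
print(function_errors("hello"))
```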

### Time Range Guidance

- **Quick check**: Last 1 hour (`startTime` = 1 hour ago)
- **Standard inspection**: Last 24 hours
- **Trend analysis**: Last 7 days
- **Specific incident**: Narrow to the reported time window

Always format `startTime`/`endTime` as `YYYY-MM-DD HH:mm:ss` (an ISO 8601 date and time separated by a space), e.g., `"2025-01-15 00:00:00"`.
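
The look-back windows above can be computed rather than written by hand. The sketch below assumes the tools accept the same string format as the example value. The `time_window` helper and the preset names are hypothetical, and time zone handling is not specified in this skill, so the caller supplies `now` in whatever zone the environment expects.

```python
from datetime import datetime, timedelta

# Look-back windows from the guidance above. The preset names are
# illustrative and not part of any CloudBase API.
PRESETS = {
    "quick": timedelta(hours=1),
    "standard": timedelta(hours=24),
    "trend": timedelta(days=7),
}

# Assumption: same format as the example value "2025-01-15 00:00:00".
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def time_window(preset: str, now: datetime) -> dict:
    """Return startTime/endTime strings for a named look-back preset."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    start = now - PRESETS[preset]
    return {
        "startTime": start.strftime(TIME_FORMAT),
        "endTime": now.strftime(TIME_FORMAT),
    }


print(time_window("standard", datetime(2025, 1, 16, 0, 0, 0)))
```

For a specific incident, pass the reported start and end times directly instead of using a preset.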

## Related Skills

- `cloud-functions` — Cloud function development, deployment, and debugging
- `cloudrun-development` — CloudRun backend deployment and management
- `cloudbase-platform` — General platform knowledge and console navigation
- `postgresql-development` — CloudBase PostgreSQL / PG diagnostics and schema/RLS checks
- `relational-database-tool` — MySQL database management and diagnostics