---
name: vtex-io-observability-and-ops
description: "Apply when making VTEX IO services easier to observe, troubleshoot, and operate in production. Covers metrics, structured logging, failure visibility, rate-limit awareness, and production readiness checks for backend apps. Use for integration monitoring, error diagnosis, or improving the operational quality of VTEX IO services before or after release."
---

# Observability & Operational Readiness

## When this skill applies

Use this skill when a VTEX IO service needs better production visibility, troubleshooting behavior, or operational safety.

- Adding metrics to important client calls or flows
- Improving logs for routes, workers, or integrations
- Surfacing failures clearly for operations and support
- Reviewing whether a service is ready for production
- Monitoring rate-limit-sensitive integrations

Do not use this skill for:
- app policy declaration
- trust-boundary modeling
- frontend analytics or browser monitoring
- route contract design by itself

## Decision rules

- Log enough structured context to debug failures, but do not log secrets or sensitive payloads.
- Use `ctx.vtex.logger` with appropriate log levels such as `info`, `warn`, and `error` instead of `console.log`, so logs are properly collected and searchable in the VTEX logging stack.
- Treat `ctx.vtex.logger` as the native platform logging mechanism. If a partner needs to forward logs to its own logging system, prefer doing that through a dedicated integration app or client instead of replacing the VTEX logger pattern inside every service.
- Use client-level metrics on important downstream calls so integration behavior is visible below the handler layer.
- Choose metric names that reflect the integration and operation, such as `partner-get-order` or `partner-sync-catalog`, so counts, latency, and error rates can be tracked over time.
- Make failures observable at the point where they happen. Do not swallow errors silently in routes, events, or workers.
- For rate-limit-sensitive APIs, combine short timeouts, backoff-aware retries, and caching of frequent reads to reduce burst pressure and avoid hitting hard limits.
- Review whether expensive or fragile flows expose enough operational signals before releasing them.
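
The rate-limit rule above can be sketched as a small retry wrapper. This is a minimal illustration, not an API from `@vtex/api`: `withBackoff`, `backoffDelayMs`, `isRetryable`, and the `RetryOptions` shape are hypothetical names, and the error shape assumes axios-style errors that expose `response.status` and `response.headers`.

```typescript
// Hypothetical helper names; none of these come from @vtex/api.
interface RetryOptions {
  maxAttempts: number
  baseDelayMs: number
  maxDelayMs: number
  // Optional hook so each retry can be logged or counted.
  onRetry?: (info: { attempt: number; delayMs: number; status?: number }) => void
}

// Assumed error shape: axios-style errors carrying `response.status` and headers.
interface HttpLikeError {
  response?: { status?: number; headers?: Record<string, string | undefined> }
}

// Delay before the next attempt. Honors a Retry-After value given in seconds
// (the HTTP-date form is not handled here), otherwise doubles per attempt.
function backoffDelayMs(attempt: number, retryAfter: string | undefined, opts: RetryOptions): number {
  const seconds = retryAfter === undefined || retryAfter.trim() === '' ? NaN : Number(retryAfter)
  if (Number.isFinite(seconds) && seconds >= 0) {
    return Math.min(seconds * 1000, opts.maxDelayMs)
  }
  return Math.min(opts.baseDelayMs * 2 ** (attempt - 1), opts.maxDelayMs)
}

// Retry only rate limits (429) and server errors (5xx).
function isRetryable(error: unknown): boolean {
  const status = (error as HttpLikeError | null | undefined)?.response?.status
  return status === 429 || (status !== undefined && status >= 500)
}

async function withBackoff<T>(call: () => Promise<T>, opts: RetryOptions): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call()
    } catch (error) {
      // Give up on the last attempt or on errors that a retry cannot fix.
      if (attempt >= opts.maxAttempts || !isRetryable(error)) throw error
      const response = (error as HttpLikeError).response
      const delayMs = backoffDelayMs(attempt, response?.headers?.['retry-after'], opts)
      opts.onRetry?.({ attempt, delayMs, status: response?.status })
      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }
}
```

A caller could wrap a client call as `withBackoff(() => ctx.clients.partnerApi.sendOrder(orderId), { maxAttempts: 3, baseDelayMs: 200, maxDelayMs: 2000, onRetry: (info) => ctx.vtex.logger.warn({ message: 'Retrying partner call', ...info }) })`, which keeps each retry visible as an operational signal. The sketch omits jitter to stay deterministic; adding a random component to the delay is a common way to avoid synchronized retries.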

## Hard constraints

### Constraint: Important failures must be visible in logs, metrics, or durable state

Routes, event handlers, and workers MUST NOT hide important failures from operators.

**Why this matters**

If failures disappear silently, the service becomes impossible to diagnose under real traffic and retries.

**Detection**

If an error is caught and ignored without logging, metric emission, or explicit failure state, STOP and surface the failure.

**Correct**

```typescript
try {
  await ctx.clients.partnerApi.sendOrder(orderId)
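} catch (error) {
  // Log structured context plus the error message, then rethrow so the failure stays visible upstream.
  ctx.vtex.logger.error({
    message: 'Failed to send order to partner',
    orderId,
    account: ctx.vtex.account,
    routeId: ctx.vtex.route?.id,
    error: error instanceof Error ? error.message : String(error),
  })
  throw error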
}
```

**Wrong**

```typescript
try {
  await ctx.clients.partnerApi.sendOrder(orderId)
} catch (_) {
  return
}
```

### Constraint: Metrics should be attached to important integration calls

Client calls that are operationally important SHOULD set the `metric` request option so request behavior can be tracked consistently.

**Why this matters**

Without metrics, integration failures and latency patterns are much harder to isolate from generic route behavior.

**Detection**

If a key downstream integration call has no `metric` and operations depend on it, STOP and add a meaningful metric name.

**Correct**

```typescript
return this.http.get(`/orders/${id}`, {
  metric: 'partner-get-order',
})
```

**Wrong**

```typescript
return this.http.get(`/orders/${id}`)
```
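
To show where the `metric` option sits inside a client, here is a self-contained sketch. It does not import `@vtex/api`: `HttpLike` and `RequestConfig` are simplified stand-ins for the HTTP client a VTEX IO client exposes as `this.http`, and `PartnerOrdersClient` and `PARTNER_METRICS` are hypothetical names used only for illustration.

```typescript
// Stand-in for the HTTP client that a VTEX IO client exposes as `this.http`.
// Assumption: only `get` and the `metric` option are modeled here.
interface RequestConfig {
  metric?: string
}

interface HttpLike {
  get<T>(url: string, config?: RequestConfig): Promise<T>
}

// One metric name per integration operation, kept in a single place
// so names stay consistent across the client.
const PARTNER_METRICS = {
  getOrder: 'partner-get-order',
  syncCatalog: 'partner-sync-catalog',
} as const

class PartnerOrdersClient {
  private readonly http: HttpLike

  constructor(http: HttpLike) {
    this.http = http
  }

  // Every call to the partner API carries its metric name.
  getOrder(id: string): Promise<{ id: string }> {
    return this.http.get<{ id: string }>(`/orders/${encodeURIComponent(id)}`, {
      metric: PARTNER_METRICS.getOrder,
    })
  }
}
```

Keeping the names in one constant makes it easier to spot an operation that has no metric yet and avoids near-duplicate names for the same call.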

### Constraint: Logs must stay useful without leaking sensitive data

Logs MUST contain enough context to debug production behavior, but MUST NOT include secrets, tokens, or unnecessarily sensitive payloads.

**Why this matters**

Operational logs are only valuable if they are safe to retain and inspect. Logging sensitive data creates security risk and rarely makes diagnosis easier, because the useful signal is usually in identifiers and status, not in raw payloads.

**Detection**

If a log line includes tokens, auth headers, raw personal payloads, or entire downstream responses, STOP and sanitize the log.

**Correct**

```typescript
ctx.vtex.logger.info({
  message: 'Partner sync started',
  orderId,
  account: ctx.vtex.account,
})
```

**Wrong**

```typescript
ctx.vtex.logger.info({
  message: 'Partner sync started',
  body: ctx.request.body,
  auth: ctx.request.header.authorization,
})
```
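
When some request-derived context does need to be logged, a redaction step can act as a backstop. The helper below is a hypothetical sketch: `redact` and `SENSITIVE_KEY_PATTERN` are not part of any VTEX library, and the key pattern is an assumption that should be adjusted to the payloads a service actually handles.

```typescript
// Assumed list of key fragments that mark a field as sensitive.
const SENSITIVE_KEY_PATTERN = /authorization|cookie|password|secret|token|apikey|appkey/i

// Returns a copy of `value` with sensitive fields replaced by a placeholder.
// Nesting is capped so very deep payloads are not logged in full.
function redact(value: unknown, depth: number = 0): unknown {
  if (depth > 5) return '[truncated]'
  if (Array.isArray(value)) {
    return value.map((item) => redact(item, depth + 1))
  }
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {}
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEY_PATTERN.test(key) ? '[redacted]' : redact(inner, depth + 1)
    }
    return out
  }
  // Primitives are returned unchanged.
  return value
}
```

This only hides fields by key name; it does not detect secrets embedded in free-text values, so selecting specific safe fields such as `orderId` remains the preferred approach.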

## Preferred pattern

Operationally healthy VTEX IO services should:

- emit metrics for important client calls so counts, latency, and error rates are visible
- log failures with enough structured context such as domain IDs, account, and `routeId`
- avoid silent error swallowing
- sanitize sensitive data before logging
- review retries, caching, and throughput with rate-limit behavior in mind

Use observability to shorten diagnosis time, not just to create more logs.
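
As one way to combine these points, the sketch below wraps an operation so that any failure is logged with structured context and then rethrown. `LoggerLike`, `describeError`, and `observed` are hypothetical names; `LoggerLike` models only the `error` method of a logger that accepts an object, as in the `ctx.vtex.logger.error` calls shown earlier.

```typescript
// Minimal logger shape (assumption): an `error` method that takes a structured object.
interface LoggerLike {
  error(data: Record<string, unknown>): void
}

// Extracts a small, log-safe summary from an unknown thrown value.
function describeError(error: unknown): Record<string, unknown> {
  const e = error as { name?: string; message?: string; response?: { status?: number } } | null | undefined
  return {
    errorName: e?.name ?? 'UnknownError',
    errorMessage: e?.message ?? String(error),
    status: e?.response?.status,
  }
}

// Runs `run`, logging any failure with the given context before rethrowing it.
async function observed<T>(
  logger: LoggerLike,
  message: string,
  context: Record<string, unknown>,
  run: () => Promise<T>
): Promise<T> {
  try {
    return await run()
  } catch (error) {
    logger.error({ message, ...context, ...describeError(error) })
    throw error
  }
}
```

A handler might call `observed(ctx.vtex.logger, 'Failed to send order to partner', { orderId, account: ctx.vtex.account }, () => ctx.clients.partnerApi.sendOrder(orderId))`. Because the error is rethrown, callers and platform retries still see the failure.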

## Common failure modes

- Catching and ignoring errors in async flows.
- Logging too little context to diagnose production incidents.
- Logging too much sensitive data.
- Omitting metrics from important integration calls.
- Treating rate-limit failures as isolated bugs instead of operational signals.

## Review checklist

- [ ] Are important failures visible to operators?
- [ ] Do key integrations emit useful metrics?
- [ ] Are logs structured and safe?
- [ ] Are retries, caching, and rate-limit behavior considered together?
- [ ] Would someone on call be able to diagnose this flow from the available signals?

## Reference

- [Using Node Clients](https://developers.vtex.com/docs/guides/using-node-clients) - Client usage patterns relevant to metrics and retries
- [Best practices for avoiding rate-limit errors](https://developers.vtex.com/docs/guides/best-practices-for-avoiding-rate-limit-errors) - Operational guidance for stable integrations