---
name: logging-observability
description: 'Structured logging, distributed tracing, and metrics for production applications. [What: OpenTelemetry setup, log level strategy, correlation IDs, SLI/SLO alerting thresholds, Grafana dashboard
design, PagerDuty integration] [When: setting up production logging, adding observability to a service, debugging distributed systems, designing alerting, implementing traces/metrics/logs] [Keywords:
logging, observability, OpenTelemetry, OTel, structured logs, distributed tracing, correlation ID, metrics, Grafana, Prometheus, PagerDuty, Winston, Pino, structlog, log levels, SLI, SLO, alerting] NOT
for application performance profiling (use a profiler), load testing, or database query optimization.'
allowed-tools: Read,Write,Edit,Bash(npm:*,npx:*,pip:*,docker:*)
argument-hint: '[service description] [stack: node|python|go|java] [current problem: no-logging|no-tracing|alert-fatigue|pii-leak]'
metadata:
category: Code Quality & Testing
pairs-with:
- skill: api-architect
reason: API request tracing and correlation IDs
- skill: devops-automator
reason: Deploying collectors and dashboards
- skill: background-job-orchestrator
reason: Distributed job observability
tags:
- observability
- logging
- tracing
- metrics
- opentelemetry
- monitoring
---
# Logging & Observability
Structured logging, distributed tracing, and metrics for production systems. Covers the full observability stack from log formatting to alert routing.
## When to Use
**Activate on:** "structured logging", "distributed tracing", "OpenTelemetry", "OTel", "correlation ID", "log levels", "Grafana dashboard", "alerting thresholds", "SLI SLO", "Prometheus metrics", "PagerDuty integration", "observability stack", "Winston setup", "Pino logger", "log aggregation", "Datadog", "Honeycomb"
**NOT for:** Performance profiling (CPU/memory flamegraphs) | Load testing | Database query optimization | Security auditing
## Decision Tree: What to Log at Each Level
```mermaid
flowchart TD
E[Event Occurs] --> Q1{Does it represent\na system failure?}
Q1 -->|Yes| Q2{Is it recoverable\nwithout human?}
Q2 -->|No| FATAL[FATAL: Service cannot\ncontinue — trigger pager]
Q2 -->|Yes| ERROR[ERROR: Operation failed,\nwill retry or degrade]
Q1 -->|No| Q3{Is it unexpected\nbut not failing?}
Q3 -->|Yes| WARN[WARN: Unusual condition,\ncircuit breaker open,\ndeprecation used]
Q3 -->|No| Q4{Is it a meaningful\nbusiness event?}
Q4 -->|Yes| INFO[INFO: User action,\npayment processed,\nservice started]
Q4 -->|No| Q5{Needed to debug\na specific issue?}
Q5 -->|Yes| DEBUG[DEBUG: DB queries,\ncache hits/misses,\nfunction inputs]
Q5 -->|No| TRACE[TRACE: Fine-grained\nloop iterations,\nOTel spans]
```
**Rule of thumb**: Production runs INFO and above. DEBUG only enabled per-service via dynamic config, never always-on in prod.
## Core Patterns
### Structured Log Format (JSON)
Every log line must be parseable. String concatenation is not a log.
**Node.js with Pino:**
```typescript
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL ?? 'info',
base: {
service: 'payment-service',
version: process.env.SERVICE_VERSION,
env: process.env.NODE_ENV,
},
redact: {
paths: ['req.headers.authorization', 'body.password', 'body.cardNumber', '*.ssn'],
censor: '[REDACTED]',
},
});
// Good: structured fields
logger.info({ orderId, userId, amountCents }, 'Payment processed');
// Bad: string interpolation
logger.info(`Payment processed for user ${userId} order ${orderId}`);
```
**Python with structlog:**
```python
import structlog
log = structlog.get_logger()
structlog.configure(
processors=[
structlog.contextvars.merge_contextvars,
structlog.stdlib.add_log_level,
structlog.stdlib.add_logger_name,
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.JSONRenderer(),
]
)
# Good: key-value pairs
log.info("payment_processed", order_id=order_id, user_id=user_id, amount_cents=amount)
```
### Correlation IDs
Every request needs a trace ID that flows through all downstream calls. This is the minimum viable distributed tracing without OTel.
```typescript
// Express middleware
import { randomUUID } from 'crypto';
import { AsyncLocalStorage } from 'async_hooks';
const requestContext = new AsyncLocalStorage<{ traceId: string; spanId: string }>();
export function correlationMiddleware(req, res, next) {
const traceId = req.headers['x-trace-id'] ?? randomUUID();
const spanId = randomUUID().slice(0, 8);
requestContext.run({ traceId, spanId }, () => {
res.setHeader('x-trace-id', traceId);
next();
});
}
// Logger that auto-includes context
export function getLogger(name: string) {
return {
info: (msg: string, fields?: object) => {
const ctx = requestContext.getStore();
logger.info({ ...ctx, ...fields, logger: name }, msg);
},
// ... error, warn, debug
};
}
```
### OpenTelemetry Setup
See `references/opentelemetry-setup.md` for complete OTel collector config, SDK initialization per language, and span attribute conventions.
**Minimal Node.js bootstrap:**
```typescript
// Must be first import in entrypoint
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
const sdk = new NodeSDK({
serviceName: 'payment-service',
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```
### Distributed Trace Propagation
```mermaid
sequenceDiagram
participant C as Client
participant GW as API Gateway
participant SVC as Payment Service
participant DB as Database
C->>GW: POST /checkout
(no trace header)
Note over GW: Generate trace-id: abc123
span-id: 0001
GW->>SVC: POST /payment
traceparent: 00-abc123-0001-01
Note over SVC: Inherit trace-id: abc123
New span-id: 0002
SVC->>DB: INSERT payment
traceparent: 00-abc123-0002-01
Note over DB: Inherit trace-id: abc123
New span-id: 0003
DB-->>SVC: OK (span 0003 ends)
SVC-->>GW: 200 OK (span 0002 ends)
GW-->>C: 200 OK (span 0001 ends)
x-trace-id: abc123
```
The W3C `traceparent` header format: `00-{traceId}-{spanId}-{flags}`. Always propagate this header on every downstream HTTP call.
## Reference Files
| File | Contents |
|------|----------|
| `references/opentelemetry-setup.md` | OTel SDK init per language, collector YAML config, span attributes, context propagation |
| `references/alerting-patterns.md` | SLI/SLO definitions, alert routing, PagerDuty severity mapping, alert fatigue prevention |
## Anti-Patterns (Shibboleths)
### Anti-Pattern 1: Logging PII or Secrets in Production
**Novice thinking**: "I'll just log the full request body to debug this auth issue."
**Why wrong**: GDPR/CCPA violations carry fines up to 4% of global revenue. Secrets in logs propagate to log aggregators, S3 exports, audit trails — all places with different access controls. A single `console.log(req.body)` can expose thousands of user passwords in your Datadog dashboard.
**Detection signature**: Search your logs for `password`, `ssn`, `cardNumber`, `authorization` as field values (not keys). If any appear, you have a PII leak.
**Fix — Allowlist approach:**
```typescript
// Never log what you don't explicitly approve
const SAFE_BODY_FIELDS = ['orderId', 'productId', 'quantity', 'currency'];
logger.info({
body: pick(req.body, SAFE_BODY_FIELDS), // only known-safe fields
path: req.path,
method: req.method,
}, 'Request received');
```
**Fix — Redaction in logger config:**
```typescript
// Pino's redact runs before any transport
const logger = pino({
redact: {
paths: [
'req.headers.authorization',
'req.headers.cookie',
'body.password',
'body.*.password', // nested objects too
'body.cardNumber',
'body.ssn',
'*.token',
'*.secret',
],
censor: '[REDACTED]',
},
});
```
**Shibboleth**: An expert sets up redaction at logger initialization, not as a reminder comment. Redaction must be structural, not ad-hoc.
---
### Anti-Pattern 2: Unstructured String Logs Instead of Structured JSON
**Novice thinking**: `logger.info('User ' + userId + ' purchased ' + productId + ' for $' + amount)`
**Why wrong**: You cannot filter, aggregate, or alert on string-interpolated data in any log aggregator. A Grafana query for `amount > 1000` requires `amount` to be a numeric field, not embedded in a sentence. String logs are write-only — you can read them but not query them at scale.
**Impact**: Your 10 million daily log lines become unsearchable. MTTR (mean time to recovery) during incidents doubles because engineers grep through strings instead of filtering structured fields.
**Fix:**
```typescript
// Bad: string log — amount is buried in text
logger.info(`User ${userId} purchased ${productId} for $${amount}`);
// Good: structured — every field is queryable
logger.info({ userId, productId, amountDollars: amount / 100 }, 'purchase_completed');
```
**Consistent event naming**: Use `snake_case` verb-noun event names (`payment_processed`, `user_signed_up`, `order_failed`) as the message string. This creates a stable vocabulary for dashboards and alerts.
**Shibboleth**: The log message string is for humans scanning log tails. All queryable data lives in structured fields. A logger that produces `{}` as its output shape is better than one that produces readable strings.
---
### Anti-Pattern 3: Log-and-Throw (Duplicate Log Entries)
**Novice thinking**: Log the error, then re-throw so the caller also knows about it.
```typescript
// BAD: log-and-throw
async function processPayment(orderId: string) {
try {
return await chargeCard(orderId);
} catch (err) {
logger.error({ err, orderId }, 'Payment failed'); // Logged here
throw err; // And the caller logs it again
}
}
async function handleCheckout(req, res) {
try {
await processPayment(req.body.orderId);
} catch (err) {
logger.error({ err }, 'Checkout failed'); // Same error logged twice
res.status(500).json({ error: 'Checkout failed' });
}
}
```
**Why wrong**: Every error appears 2-5 times in your logs depending on call depth. Alerting on error count becomes unreliable. Incident review is confusing — engineers think there were multiple failures. Log volume costs money (Datadog charges per ingested GB).
**Fix — Log only at the boundary where you handle the error:**
```typescript
// Good: log only where you decide what to do with the error
async function processPayment(orderId: string) {
// No try-catch: let errors propagate naturally
return await chargeCard(orderId);
}
async function handleCheckout(req, res) {
try {
await processPayment(req.body.orderId);
res.json({ success: true });
} catch (err) {
// One log, at the boundary where we're deciding to return 500
logger.error({ err, orderId: req.body.orderId }, 'checkout_failed');
res.status(500).json({ error: 'Checkout failed' });
}
}
```
**Rule**: Log where you handle. Don't log where you propagate. The call stack in the error object already tells you where it originated.
**Shibboleth**: If you see the same `traceId` appear in more than two error log lines for a single request, you have a log-and-throw chain somewhere.
## Quality Checklist
```
[ ] All log lines are JSON (no string concatenation)
[ ] Log level strategy documented: what goes at each level
[ ] PII/secrets redacted at logger config level, not call site
[ ] Correlation IDs propagated on all outbound HTTP calls
[ ] OTel SDK initialized before any other imports
[ ] Error logs include the error object (not just message)
[ ] No log-and-throw patterns in error handling
[ ] DEBUG logs use conditional guards or sampling
[ ] SLI/SLO defined for each critical user journey
[ ] Alert routing: notify vs page threshold documented
[ ] Runbook linked from every paging alert
[ ] Log retention policy set (cost vs compliance)
```
## Output Artifacts
1. **Logger configuration** — Pino/Winston/structlog setup with redaction rules
2. **OTel bootstrap file** — SDK init with auto-instrumentation
3. **Correlation middleware** — AsyncLocalStorage request context
4. **Prometheus metrics module** — Counter/histogram/gauge definitions
5. **Grafana dashboard JSON** — Four golden signals panels
6. **Alertmanager rules YAML** — SLO-based alert definitions