# Observability

Your workflows are running in production. Something is slow, but you can't tell whether it's the payment activity, the shipping call, or the sleep between them. You need traces, spans, and metrics, without instrumenting every workflow by hand. Weft's observability module is a pre-built [interceptor](./interceptors.md) that gives you all of this out of the box.

> [!NOTE]
> The observability interceptor shape is available for trials, but [OpenTelemetry](https://opentelemetry.io/) metric names are experimental before 1.0. Treat metric names and label sets as release-note-sensitive until the stability contract graduates them.

## Quick setup

Import the factory, pass the engine as the `eventTarget`, and register the interceptor.

```typescript partial
import { createObservabilityInterceptors } from '@lostgradient/weft/observability';

const { interceptor, dispose } = createObservabilityInterceptors({
  eventTarget: engine,
});

engine.addInterceptor(interceptor);
```

That's it. Every workflow start, activity call, sleep, and signal wait now produces spans with trace context propagation.

Wiring the engine as the `eventTarget` lets the factory subscribe to workflow lifecycle events (`workflow:completed`, `workflow:failed`, `workflow:cancelled`, `workflow:timed-out`) and automatically end the root workflow span with the right status. Without it, root spans would stay "in progress" forever and the internal span map would grow unbounded. When tearing down the engine, call `dispose()` to unsubscribe those listeners and end any spans that are still open.

If you're using [remote workers](./remote-workers.md), pass the same interceptor to them too.

## Configuration

The `createObservabilityInterceptors()` factory accepts options for controlling what gets recorded:

```typescript partial
interface ObservabilityOptions {
  tracerName?: string; // Name passed to trace.getTracer(). Default: 'weft'.
  tracerVersion?: string; // Version passed to trace.getTracer().
  recordPayloads?: boolean; // Record activity/workflow inputs as span attributes. Default: false.
  maxPayloadSize?: number; // Maximum serialized payload size in bytes. Default: 1024.
  attributeExtractor?: (
    interception: InterceptionContext,
  ) => Record;
  metrics?: MetricsCollectorClass; // Metrics collector for counters, histograms, gauges.
  openTelemetryApi?: OpenTelemetryApi; // Override OpenTelemetry API instance (primarily for testing).
  eventTarget?: EventTarget; // Engine instance; enables auto-close of root spans on lifecycle events.
}
```

Use `recordPayloads` and `maxPayloadSize` to control payload attributes, `attributeExtractor` to add domain-specific span attributes, and `eventTarget` to let the interceptor factory close root spans from engine lifecycle events.

```typescript partial
const { interceptor } = createObservabilityInterceptors({
  eventTarget: engine,
  recordPayloads: true,
  maxPayloadSize: 2048,
  attributeExtractor: () => ({ 'service.name': 'checkout' }),
});

engine.addInterceptor(interceptor);
```

## Span hierarchy

Each workflow execution creates a root span. Context operations create child spans:

```
workflow:order (root span)
  activity:charge (child span)
  sleep (child span)
  activity:ship (child span)
```

Attributes include `workflow.id`, `workflow.type`, `activity.name`, `activity.attempt`, and optionally the serialized `input` payload.

## W3C Trace Context propagation

The observability interceptor uses the [interceptor headers mechanism](./interceptors.md) to propagate W3C Trace Context across thread and network boundaries. Its workflow-side hooks inject a `traceparent` header before each activity call. Its activity-side `execute` hook extracts that header and creates a child span.

```
Workflow Worker                     Activity Worker
------------------                  ------------------
creates span A                      extracts traceparent from headers
injects traceparent into headers    creates span B (child of A)
yields ctx.run(...)                 executes activity function
  ---- postMessage/WebSocket ---->
  (includes headers map)
  <--- result ----
span A ends                         span B ends
```

The propagation helpers are exported individually if you need them:

```typescript
import {
  generateTraceId,
  generateSpanId,
  formatTraceParent,
  parseTraceParent,
} from '@lostgradient/weft/observability';
```

`generateTraceId()` produces a 32-hex-character (16-byte) random trace ID. `generateSpanId()` produces a 16-hex-character (8-byte) random span ID. Both use `crypto.randomBytes()` under the hood.

The `traceparent` header uses the standard `{version}-{traceId}-{spanId}-{flags}` format. `formatTraceParent()` serializes that shape, and `parseTraceParent()` goes the other direction, returning `null` for invalid inputs or all-zero IDs.

## Metrics

The `METRICS` object defines the metric catalogue:

```typescript
import { METRICS } from '@lostgradient/weft/observability';

// METRICS.workflowDuration
//   name: 'weft.workflow.duration', type: 'histogram', unit: 'ms'
// METRICS.activityDuration
//   name: 'weft.activity.duration', type: 'histogram', unit: 'ms'
// METRICS.activityAttempts
//   name: 'weft.activity.attempts', type: 'counter', unit: 'attempts'
// METRICS.workflowActive
//   name: 'weft.workflow.active', type: 'gauge', unit: 'workflows'
// METRICS.workflowStarted
//   name: 'weft.workflow.started', type: 'counter', unit: 'workflows'
// METRICS.workflowCompleted
//   name: 'weft.workflow.completed', type: 'counter', unit: 'workflows'
// METRICS.workflowFailed
//   name: 'weft.workflow.failed', type: 'counter', unit: 'workflows'
// METRICS.taskBacklog
//   name: 'weft.task.backlog', type: 'gauge', unit: 'tasks'
// METRICS.taskQueueLatency
//   name: 'weft.task.queue_latency', type: 'histogram', unit: 'ms'
// METRICS.taskExecutionLatency
//   name: 'weft.task.execution_latency', type: 'histogram', unit: 'ms'
// METRICS.workerCapacitySaturation
//   name: 'weft.worker.capacity_saturation', type: 'gauge', unit: 'ratio'
```

Each metric has a `name`, `description`, `unit`, and `type` (counter, gauge, or histogram). The [server](./server.md) exposes these at `GET /v1/metrics` in Prometheus-compatible text format.

Task and worker metrics are intentionally low-cardinality. They tell you that backlog, queue latency, execution latency, retries, stale heartbeats, or capacity saturation exist. When you need the concrete evidence behind those aggregates, use `GET /api/v1/tasks/diagnostics` to retrieve bounded diagnostic items for stuck queued tasks, stale in-flight tasks, retry storms, all-workers-at-capacity conditions, and task-result dead letters.

## Composing with other interceptors

Observability is just a regular interceptor. Compose it with your own by controlling registration order. The first registered interceptor is the outermost wrapper.

```typescript partial
engine.addInterceptor(authInterceptor); // 1. Check auth
engine.addInterceptor(validationInterceptor); // 2. Validate inputs
engine.addInterceptor(observabilityInterceptor); // 3. Trace the validated, authorized call
engine.addInterceptor(encryptionInterceptor); // 4. Encrypt before sending to worker
```

In this arrangement, the observability span captures the call _after_ auth and validation have passed but _before_ encryption. Because the first registered interceptor is the outermost wrapper, the span timings exclude the auth and validation overhead but include everything registered after the observability interceptor, here the encryption step. Adjust the order to match what you want to measure.

Activity-side hooks follow the same pattern through `addInterceptor()`:

```typescript partial
engine.addInterceptor(observabilityInterceptor);
engine.addInterceptor(decryptionInterceptor);
```

If you're building a custom interceptor that also needs trace context, let the observability interceptor handle extraction. It reads the propagated `traceparent` header internally and parses it with `parseTraceParent()`. Traces stitch together automatically as long as the headers propagate through the interceptor chain.
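
When debugging whether headers survive a custom interceptor, it can help to know what a valid `traceparent` value looks like. The following is a minimal, standalone sketch of formatting and parsing the version-00 layout from the W3C Trace Context specification. It is not Weft's implementation: the `TraceParentSketch` type and the `formatTraceParentSketch`, `parseTraceParentSketch`, and `childOf` names are hypothetical, and the shape returned by the real `parseTraceParent()` may differ. In application code, prefer the exported helpers.

```typescript
// Hypothetical shape for illustration; Weft's exported types may differ.
interface TraceParentSketch {
  version: string; // 2 hex characters, '00' in the current W3C layout
  traceId: string; // 32 lowercase hex characters (16 bytes)
  spanId: string; // 16 lowercase hex characters (8 bytes)
  flags: string; // 2 hex characters, '01' means sampled
}

// Strict match for the version-00 layout: {version}-{traceId}-{spanId}-{flags}.
const TRACEPARENT_PATTERN =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceParentSketch(parent: TraceParentSketch): string {
  return `${parent.version}-${parent.traceId}-${parent.spanId}-${parent.flags}`;
}

function parseTraceParentSketch(header: string): TraceParentSketch | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, spanId, flags] = match;
  // The W3C specification treats version 'ff' and all-zero IDs as invalid.
  if (version === 'ff') return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { version, traceId, spanId, flags };
}

// A child span keeps the parent's trace ID and flags and takes a new span ID.
function childOf(parent: TraceParentSketch, newSpanId: string): TraceParentSketch {
  return { ...parent, spanId: newSpanId };
}

// Usage: round-trip a header and reject a malformed one.
const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const parsedParent = parseTraceParentSketch(incoming);
if (parsedParent !== null) {
  const child = childOf(parsedParent, 'b7ad6b7169203331');
  console.log(formatTraceParentSketch(child));
}
console.log(parseTraceParentSketch('not-a-header')); // null
```

If a header that reaches the activity side fails a check like this one, the most likely cause is an interceptor in the chain that rebuilt the headers map without copying `traceparent` through.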