---
name: langchain-observability
description: >-
  Wire LangSmith tracing and custom metric callbacks into a LangChain 1.0 chain
  or LangGraph 1.0 agent correctly — env-var spelling, subgraph propagation,
  per-tenant dimensions, cost and latency counters. Use when setting up
  observability on a new service, debugging blank traces in LangSmith, or adding
  per-tenant cost breakdowns. Trigger with "langchain observability",
  "langsmith tracing", "langchain callbacks", "langchain metrics".
allowed-tools: Read, Write, Edit, Bash(python:*)
version: 2.0.0
license: MIT
author: Jeremy Longshore
tags:
  - saas
  - langchain
  - langgraph
  - python
  - langchain-1.0
  - observability
  - langsmith
  - callbacks
compatibility: Designed for Claude Code, also compatible with Codex
---

# LangChain Observability (Python)

## Overview

An engineer sets `LANGCHAIN_TRACING_V2=true` and `LANGCHAIN_API_KEY=...` from the 0.2 docs, restarts the service, and sees zero traces in LangSmith — no errors, no warnings. That is P26: in LangChain 1.0 the canonical env vars are `LANGSMITH_TRACING` and `LANGSMITH_API_KEY`. The `LANGCHAIN_*` names are soft-deprecated and fail silently on any chain that goes through 1.0 middleware or `create_react_agent`. The fix:

```bash
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY=lsv2_...
export LANGSMITH_PROJECT=my-service-prod
```

Next failure mode: a custom `BaseCallbackHandler` attached via `chain.with_config(callbacks=[meter])` fires on the parent but is silent on LangGraph subgraphs and `create_react_agent` tool calls — token counts under-report by 30-70% vs the provider dashboard. That is P28: LangGraph creates a child runtime per subgraph, and bound callbacks do not propagate. Pass callbacks at invocation time instead:

```python
await chain.ainvoke(inputs, config={"callbacks": [meter], "configurable": {"tenant_id": t}})
```

This skill walks through canonical LangSmith setup, a metric-callback template with tenant dimensions, invocation-time propagation, `RunnableConfig` trace tagging, and a decision tree for LangSmith-only vs OTEL-native (defer to `langchain-otel-observability` / L33 for OTEL-heavy stacks). Pin: `langchain-core 1.0.x`, `langgraph 1.0.x`, `langsmith` current. LangSmith tracing adds <5ms per-span overhead; metric callbacks add <1ms per fire. Pain-catalog anchors: P26, P28, P04 (cache-token aggregation), P25 (retry double-counting).

## Prerequisites

- Python 3.10+
- `langchain-core >= 1.0, < 2.0`, `langgraph >= 1.0, < 2.0`
- `langsmith` (bundled with `langchain`; upgrade to current for 1.0 env-var support)
- A LangSmith API key (`lsv2_...`) — free tier at https://smith.langchain.com
- Optional metric sinks: `prometheus_client`, `statsd`, or `datadog` Python packages

## Instructions

### Step 1 — Enable LangSmith with the canonical 1.0 env vars

`LANGSMITH_TRACING=true` is the switch. `LANGSMITH_API_KEY` authenticates. `LANGSMITH_PROJECT` groups traces by environment — use one project per `service-env` pair (`myapp-prod`, `myapp-staging`), not one per service.

```bash
# .env (loaded via python-dotenv or secret manager)
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=lsv2_pt_...
LANGSMITH_PROJECT=my-service-prod

# Legacy fallback names (still work, soft-deprecated — do not use in new code):
# LANGCHAIN_TRACING_V2=true
# LANGCHAIN_API_KEY=lsv2_pt_...
# LANGCHAIN_PROJECT=my-service-prod
```

Verify in a REPL that the client sees the key before relying on it in production:

```python
from langsmith import Client

c = Client()  # reads LANGSMITH_API_KEY and LANGSMITH_ENDPOINT
# list() forces the lazy iterator so a bad key raises LangSmithAuthError here
print(list(c.list_projects(limit=1)))
```

Do NOT set both `LANGCHAIN_TRACING_V2` and `LANGSMITH_TRACING` — mixed settings have caused stale project routing in 1.0.x. See P26.
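If a service has been bitten by the spelling mix-up before, a startup guard makes it loud. A minimal sketch, stdlib only; the helper name and warning text are ours, not anything langchain ships:

```python
import os
import warnings

_CANONICAL = ("LANGSMITH_TRACING", "LANGSMITH_API_KEY")
_LEGACY = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY", "LANGCHAIN_PROJECT")


def check_tracing_env() -> None:
    """Fail loudly on the P26 spellings instead of losing traces silently."""
    legacy = [v for v in _LEGACY if os.environ.get(v)]
    canonical = [v for v in _CANONICAL if os.environ.get(v)]
    if legacy and not canonical:
        warnings.warn(
            f"Only legacy tracing vars set ({legacy}); 1.0 middleware paths "
            "ignore them. Set LANGSMITH_TRACING and LANGSMITH_API_KEY.")
    elif legacy and canonical:
        warnings.warn(
            f"Both spellings set ({legacy} + {canonical}); unset the "
            "LANGCHAIN_* names to avoid stale project routing.")
```

Call it once at process start, before any chain or agent is imported.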
For selective sampling in high-traffic services, set `LANGSMITH_SAMPLING_RATE=0.1` (10% of runs). Full detail in [LangSmith Setup](references/langsmith-setup.md).

### Step 2 — Write a metric callback for per-request observability

Subclass `BaseCallbackHandler`. Record `token_in`, `token_out`, `latency_ms`, `tool_calls`, and `error`, tagged with a `tenant_id` dimension for downstream grouping.

```python
import time

from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.outputs import LLMResult


class MetricCallback(BaseCallbackHandler):
    """Per-LLM-call metrics tagged with tenant_id. Overhead <1ms per event."""

    def __init__(self, tenant_id: str, sink) -> None:
        self.tenant_id = tenant_id
        self.sink = sink
        self._starts: dict[str, float] = {}

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs) -> None:
        self._starts[str(run_id)] = time.perf_counter()

    def on_llm_end(self, response: LLMResult, *, run_id, **kwargs) -> None:
        t0 = self._starts.pop(str(run_id), time.perf_counter())
        elapsed_ms = (time.perf_counter() - t0) * 1000  # wall-clock latency
        tags = {"tenant_id": self.tenant_id}
        for gen in response.generations:
            for g in gen:
                message = getattr(g, "message", None)  # chat generations only
                meta = getattr(message, "usage_metadata", None) or {}
                self.sink.incr("llm.token_in", meta.get("input_tokens", 0), tags)
                self.sink.incr("llm.token_out", meta.get("output_tokens", 0), tags)
                # P04 — aggregate Anthropic cache reads across calls
                cache = (meta.get("input_token_details") or {}).get("cache_read", 0)
                self.sink.incr("llm.cache_read", cache, tags)
        self.sink.hist("llm.latency_ms", elapsed_ms, tags)

    def on_llm_error(self, error, *, run_id, **kwargs) -> None:
        self._starts.pop(str(run_id), None)
        self.sink.incr("llm.error", 1,
                       {"tenant_id": self.tenant_id,
                        "error_type": type(error).__name__})

    def on_tool_end(self, output, *, run_id, **kwargs) -> None:
        self.sink.incr("llm.tool_calls", 1, {"tenant_id": self.tenant_id})
```

A thin `sink` protocol (`incr`, `hist`) swaps between Prometheus, StatsD, or Datadog (sketched below). Alternative sinks (LangSmith-only, OTEL) do not need this callback at all — see Step 5. Full sink adapters and P25 retry dedupe in [Custom Metrics Callback](references/custom-metrics-callback.md).
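The callback above assumes exactly two methods on `sink`. A minimal sketch of that contract plus a dev-only logging sink; `MetricSink` and `LogSink` are our names, not langchain or langsmith APIs:

```python
import logging
from typing import Protocol


class MetricSink(Protocol):
    """What MetricCallback expects from a sink — our naming, not a library API."""
    def incr(self, name: str, value: int, tags: dict[str, str]) -> None: ...
    def hist(self, name: str, value: float, tags: dict[str, str]) -> None: ...


class LogSink:
    """Dev-only sink: writes metrics to a logger instead of a metrics backend."""

    def __init__(self) -> None:
        self._log = logging.getLogger("metrics")

    def incr(self, name: str, value: int, tags: dict[str, str]) -> None:
        self._log.info("incr %s +%d %s", name, value, tags)

    def hist(self, name: str, value: float, tags: dict[str, str]) -> None:
        self._log.info("hist %s %.2f %s", name, value, tags)
```

`MetricCallback(tenant_id, LogSink())` then behaves identically once `LogSink` is swapped for the Prometheus or Datadog adapters in the reference, since all share the same two methods.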
# WRONG — fires on parent runnable only; silent on subgraphs (P28)">
### Step 3 — Pass callbacks via `config["callbacks"]` at invocation (P28)

This is the single most common observability bug in LangGraph 1.0 services. Binding callbacks at definition time does not propagate into subgraphs or `create_react_agent` tool nodes — those create child runtimes with their own callback scope.

```python
# WRONG — fires on parent runnable only; silent on subgraphs (P28)
agent_bound = agent.with_config(callbacks=[MetricCallback(tenant_id, sink)])
result = await agent_bound.ainvoke(inputs)

# RIGHT — propagates to every runnable, subgraph, and tool call
meter = MetricCallback(tenant_id, sink)
result = await agent.ainvoke(
    inputs,
    config={
        "callbacks": [meter],
        "configurable": {"thread_id": session_id, "tenant_id": tenant_id},
        "tags": ["prod", f"tenant:{tenant_id}"],
        "metadata": {"request_id": req_id, "tier": "enterprise"},
    },
)
```

Construct the callback *inside* the request handler so it captures a fresh `tenant_id` per request — and in that pattern, invocation-time config is the only way callbacks reach subgraphs. The handler sketch below shows the pattern end to end. See [Trace Metadata and Tagging](references/trace-metadata-and-tagging.md) for the full `RunnableConfig` shape.
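# WRONG — fires on parent runnable only">
Putting the two rules together, a sketch of the per-request pattern. FastAPI is an assumption here (any framework with request scope works), and `agent`, `sink`, and `MetricCallback` are carried over from the earlier steps:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ChatRequest(BaseModel):
    tenant_id: str
    session_id: str
    message: str


@app.post("/chat")
async def chat(req: ChatRequest):
    # Fresh handler per request: captures this request's tenant_id and keeps
    # per-run timings from leaking across concurrent requests.
    meter = MetricCallback(req.tenant_id, sink)  # sink wired per Step 5
    result = await agent.ainvoke(  # agent: the LangGraph agent built elsewhere
        {"messages": [("user", req.message)]},
        config={
            "callbacks": [meter],  # invocation-time: reaches subgraphs (P28)
            "configurable": {"thread_id": req.session_id,
                             "tenant_id": req.tenant_id},
            "tags": ["env:prod", f"tenant:{req.tenant_id}"],
        },
    )
    # Assumes a message-state graph (e.g. the create_react_agent output shape).
    return {"output": result["messages"][-1].content}
```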
from langsmith import Client">
### Step 4 — Tag and annotate traces via `RunnableConfig`

LangSmith indexes two per-request fields: `tags` (flat list, filterable) and `metadata` (key-value, searchable). Fix conventions early — LangSmith has no rename tool.

```python
config = {
    "callbacks": [meter],
    "tags": [
        "env:prod",                 # environment
        f"tenant:{tenant_id}",      # tenant
        f"tier:{tenant_tier}",      # plan tier
        f"feature:{feature_flag}",  # A/B experiment arm
    ],
    "metadata": {
        "request_id": req_id,
        "user_id": user_id,
        "session_id": session_id,
        "app_version": os.environ["APP_VERSION"],
    },
    "run_name": "agent_main",  # LangSmith UI label; overrides chain class name
}
```

Hierarchical tag conventions (`env:prod`, `tenant:acme`, `tier:enterprise`) make LangSmith filters work. Free-form tags (`"important"`, `"check-me"`) do not. See [Trace Metadata and Tagging](references/trace-metadata-and-tagging.md).

### Step 5 — Pick a sink and the stack shape

The callback handler is the integration point. Options, in decreasing order of fit:

- **LangSmith only** — zero additional overhead; tracing already covers latency and token accounting. Fine for solo dev, small teams, and LLM-native ops.
- **Prometheus (pull)** — best fit for Kubernetes + existing Prom stack. Export via the `prometheus_client` HTTP endpoint. Watch tenant label cardinality (adapter sketch at the end of this step).
- **StatsD / Datadog (push)** — UDP fire-and-forget; sub-1ms overhead. Safe on high-throughput async services. Use `datadog.dogstatsd` for tag support.
- **OTEL native** — multi-service distributed tracing. Defer to `langchain-otel-observability` (L33); do not reimplement here.

Decision tree:

```
Existing OTEL stack (Collector, Tempo, Jaeger)?
├── YES → OTEL-native (L33). LangSmith optional for prompt inspection.
└── NO  → LLM-specific features (prompt inspection, evals, queues) enough?
          ├── YES → LangSmith only. Add MetricCallback only for tenant cost.
          └── NO  → Hybrid: LangSmith for prompts + Prometheus/Datadog for SLOs.
                    See references/hybrid-langsmith-otel.md for split-point rules.
```

Mixing paths without a plan creates double-emission and conflicting trace IDs. See [Custom Metrics Callback](references/custom-metrics-callback.md) for Prometheus / StatsD / Datadog sink implementations, plus dedupe for P25 retry double-counts; see [Hybrid LangSmith + OTEL](references/hybrid-langsmith-otel.md) for the split-point contract.
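from langsmith import Client">
For the Prometheus branch, a sketch of an adapter for the same `incr`/`hist` contract. `PromSink` and the tier mapping are our assumptions; `prometheus_client.Counter` and `Histogram` are the real APIs. Tenants are bucketed into tiers up front to avoid the cardinality explosion flagged in Error Handling:

```python
from prometheus_client import Counter, Histogram

_COUNTERS: dict[str, Counter] = {}
_HISTS: dict[str, Histogram] = {}


class PromSink:
    """Prometheus adapter for the incr/hist sink contract (our naming)."""

    def __init__(self, tier_of: dict[str, str]) -> None:
        # tenant_id -> plan tier, e.g. {"acme": "enterprise"}; an assumption,
        # load it from your tenant store.
        self._tier_of = tier_of

    def _labels(self, tags: dict[str, str]) -> dict[str, str]:
        # Low-cardinality label: tier, never raw tenant_id.
        tenant = tags.get("tenant_id", "unknown")
        return {"tier": self._tier_of.get(tenant, "unknown")}

    def incr(self, name: str, value: int, tags: dict[str, str]) -> None:
        metric = name.replace(".", "_")  # llm.token_in -> llm_token_in
        if metric not in _COUNTERS:  # register once; re-registering raises
            _COUNTERS[metric] = Counter(metric, f"counter for {name}", ["tier"])
        _COUNTERS[metric].labels(**self._labels(tags)).inc(value)

    def hist(self, name: str, value: float, tags: dict[str, str]) -> None:
        metric = name.replace(".", "_")
        if metric not in _HISTS:
            _HISTS[metric] = Histogram(metric, f"histogram for {name}", ["tier"])
        _HISTS[metric].labels(**self._labels(tags)).observe(value)
```

Expose the scrape endpoint once at process start with `prometheus_client.start_http_server(9100)`; counters named `llm_token_out` surface as `llm_token_out_total` in PromQL.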
### Step 6 — Feed runs back into evals

Real traffic is the best eval set. Route a sampled subset of production runs into a LangSmith annotation queue for human review; the queue feeds `Dataset` objects replayable against candidate models.

```python
from langsmith import Client

Client().create_annotation_queue(
    name="prod-regressions",
    description="1% sample, weekly review",
)
# Add metadata={"eval_candidate": "true"} on 1% of runs — LangSmith UI has
# a rule to route into the queue by metadata filter.
```

Keep annotation queues under 500 runs/week (reviewers saturate past that). See [LangSmith Setup](references/langsmith-setup.md) for the queue and dataset flow.

## Output

- LangSmith tracing on via `LANGSMITH_TRACING` / `LANGSMITH_API_KEY` / `LANGSMITH_PROJECT` with a `langsmith.Client()` smoke-check
- `MetricCallback(BaseCallbackHandler)` emitting `token_in`, `token_out`, `cache_read`, `latency_ms`, `tool_calls`, `error` tagged with `tenant_id`
- All chain invocations pass `config={"callbacks": [...], ...}` at invoke time so metrics propagate to subgraphs and agent tools
- `RunnableConfig` carries hierarchical tags (`env:*`, `tenant:*`, `tier:*`) and structured `metadata` (`request_id`, `user_id`, `session_id`)
- One metric sink wired (Prometheus, StatsD, Datadog, or LangSmith-only)
- Explicit choice recorded for LangSmith / OTEL / hybrid / custom

## Error Handling

| Error | Cause | Fix |
|-------|-------|-----|
| No traces in LangSmith, no errors | Used `LANGCHAIN_TRACING_V2` spelling on 1.0 middleware path (P26) | Switch to `LANGSMITH_TRACING=true` and `LANGSMITH_API_KEY` |
| `langsmith.utils.LangSmithAuthError: Unauthorized` | Key points to a deleted workspace, or was copied with trailing whitespace | Regenerate at smith.langchain.com; check `repr(os.environ['LANGSMITH_API_KEY'])` for `\n` |
| Callback fires on parent only, silent on subgraphs | Bound via `.with_config(callbacks=[...])` — does not propagate (P28) | Pass via `config["callbacks"]` at `invoke()` / `ainvoke()` |
| Token counts under by 30-70% vs provider dashboard | Combination of P28 (subgraph silence) and P25 (retry double-count not deduped) | Fix P28 first; for P25 add a `request_id` dedupe key in the sink |
| Trace duration shows 0ms on streamed calls | `on_llm_end` fires after the stream closes but the handler records before — timing race | Use `time.perf_counter()` captured in `on_llm_start`, not `on_chat_model_start` |
| Prometheus cardinality explosion | `tenant_id` label has high cardinality (>10k tenants) | Bucket tenants into tiers for metrics; keep full `tenant_id` in LangSmith metadata only |
| LangSmith UI shows runs under `default` project, not the configured one | `LANGSMITH_PROJECT` env var not set at process start | Set before import; `LANGSMITH_PROJECT` is read once at `Client()` init |
| `AttributeError: 'NoneType' object has no attribute 'get'` in `on_llm_end` | `usage_metadata` is `None` on intermediate streaming chunks | Guard with `if meta := getattr(g.message, 'usage_metadata', None):` |

## Examples

### Multi-tenant SaaS: per-tenant cost dashboard

A production SaaS has 200 tenants on a shared LangGraph agent. Finance wants weekly cost reports per tenant. The `MetricCallback` records `token_in`, `token_out`, and `cache_read` tagged with `tenant_id`; Prometheus scrapes the `/metrics` endpoint; Grafana aggregates `sum by (tenant_id) (rate(llm_token_out_total[1w])) * 0.0000015` for Sonnet output cost. The invocation-time `config["callbacks"]` propagation is load-bearing here — without it, subgraph tool calls (the bulk of token spend) go uncounted. See [Custom Metrics Callback](references/custom-metrics-callback.md) for the full Prometheus integration.

### Debugging missing traces in staging

A team deploys a new LangGraph service to staging. No traces show up in LangSmith. Checking: (1) `LANGSMITH_TRACING` spelled correctly — yes; (2) API key valid — `langsmith.Client().list_projects(limit=1)` returns ok; (3) project name matches — `LANGSMITH_PROJECT=myservice-staging`. Traces appear in the `default` project, not `myservice-staging`. Root cause: the env var lived in the runtime env-file, but the process had started before the file was sourced, so `Client()` read `LANGSMITH_PROJECT` once at init and never saw it. Fix: restart the process cleanly. See [LangSmith Setup](references/langsmith-setup.md) for the process-order checklist.

### Feeding prod traffic to an eval dataset

A team wants to validate a Claude 4.6 → Claude 4.7 upgrade against recent prod runs. They add `metadata={"eval_candidate": "pre-upgrade"}` to 1% of runs for one week, create a LangSmith dataset from the tagged runs, then replay against the new model and diff outputs. The sampling rule lives in the LangSmith UI, filtered by `metadata.eval_candidate`. See [LangSmith Setup](references/langsmith-setup.md) for the annotation-queue and dataset-creation flow.

## Resources

- [LangSmith Observability concepts](https://docs.smith.langchain.com/observability/concepts)
- [LangSmith env variable reference](https://docs.smith.langchain.com/how_to_guides/setup/configure_project)
- [LangChain callbacks (1.0)](https://python.langchain.com/docs/concepts/callbacks/)
- [`BaseCallbackHandler` API](https://python.langchain.com/api_reference/core/callbacks/langchain_core.callbacks.base.BaseCallbackHandler.html)
- [`RunnableConfig` API](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.config.RunnableConfig.html)
- For OTEL-native instrumentation: `langchain-otel-observability` (L33) in this pack
- Pack pain catalog: `docs/pain-catalog.md` (entries P04, P25, P26, P28)