---
name: langchain-performance-tuning
description: >-
  Tune LangChain 1.0 / LangGraph 1.0 Python chains and agents for throughput,
  latency, and cost — streaming modes, explicit batch concurrency, semantic
  plus exact caches, persistent message history, and async-safe retriever
  patterns. Use when p95 latency exceeds target, batching "does not work",
  cost grows linearly with traffic, or a process restart wipes chat history.
  Trigger with "langchain performance", "langchain slow batch",
  "langchain throughput", "langchain p95 latency", "semantic cache hit rate".
allowed-tools: Read, Write, Edit, Bash(python:*), Bash(redis-cli:*)
version: 2.0.0
license: MIT
author: Jeremy Longshore
tags:
  - saas
  - langchain
  - langgraph
  - python
  - langchain-1.0
  - performance
  - caching
  - async
compatibility: Designed for Claude Code, also compatible with Codex
---

# LangChain Performance Tuning

## Overview

An engineer calls `chain.batch(inputs_1000)` expecting 1000 parallel LLM calls. Actual behavior: `Runnable.batch` and `Runnable.abatch` in LangChain 1.0 default to `max_concurrency=1`, so the 1000 inputs run **sequentially with bookkeeping overhead** — sometimes slower than a plain `for` loop. This is pain-catalog entry P08. The fix is one line:

```python
# Before: serial, ~1000 * per_call_latency
await chain.abatch(inputs)

# After: ~10x throughput with 10 concurrent calls to the provider
await chain.abatch(inputs, config={"max_concurrency": 10})
```

Other silent regressions in the same pain catalog: P48 (`invoke` inside `async def` blocks the FastAPI event loop), P22 (`InMemoryChatMessageHistory` loses every user's chat on restart), P62 (`RedisSemanticCache` at the default `score_threshold=0.95` returns under 5% hit rate), P59 (async retrievers leak connections on cancellation), P60 (`BackgroundTasks` fires *after* the response — wrong for per-token SSE), P01 (streaming token counts are only reliable on the `on_chat_model_end` event).

This skill wires a production performance baseline: explicit batch concurrency, async-only code paths, Redis-backed caches tuned on a golden set, persistent chat history with TTL, and TTFT instrumentation from `astream_events(version="v2")`.

## Prerequisites

- Python 3.11+ with `langchain>=1.0,<2`, `langgraph>=1.0,<2`, `langchain-openai` or `langchain-anthropic`, `langchain-community`, `langchain-redis` or `redis>=5`.
- A working LangChain 1.0 chain or LangGraph 1.0 graph that already passes functional tests.
- Redis 7+ reachable from the app for cache and history (local Docker is fine for dev).
- A FastAPI / Starlette async endpoint, or an equivalent async entrypoint.
- Observability: a place to emit metrics (Prometheus, OpenTelemetry, or LangSmith) — needed to measure TTFT, p95, and cache hit rate.

## Instructions

1. **Establish a latency budget and baseline.** Pick explicit targets before changing code: TTFT under 1s, p95 total under 5s, throughput over 20 req/s per worker, cost under $X per 1k interactions. Run a 5-minute load test with `locust` or `wrk` against the current chain and record p50 / p95 / p99 / TTFT / total cost. Without these numbers every downstream change is theater.

2. **Convert every hot path to async (P48).** Inside `async def` handlers, replace `invoke`, `stream`, `batch`, `get_relevant_documents`, and `tool.run` with `ainvoke`, `astream` / `astream_events(version="v2")`, `abatch`, `aget_relevant_documents`, and `tool.arun`. See `references/async-safety-checklist.md` for a grep pattern and a CI linter; a minimal sketch of such a guard follows. Target: zero sync LangChain calls inside any async function.

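   A minimal AST-based version of that guard (a sketch; `app/` as the source root is an assumption, and the check will also flag unrelated `.run()` / `.batch()` methods, so expect to maintain an allowlist):

   ```python
   # tests/test_no_sync_in_async.py: sketch of the CI guard; adjust SRC_DIR
   # and BANNED to your codebase.
   import ast
   from pathlib import Path

   SRC_DIR = Path("app")  # assumption: application code lives under app/
   BANNED = {"invoke", "stream", "batch", "get_relevant_documents", "run"}


   def sync_calls_in_async(tree: ast.AST) -> list[str]:
       hits = []
       for node in ast.walk(tree):
           if isinstance(node, ast.AsyncFunctionDef):
               for call in ast.walk(node):
                   if (
                       isinstance(call, ast.Call)
                       and isinstance(call.func, ast.Attribute)
                       and call.func.attr in BANNED
                   ):
                       hits.append(f"{node.name}:{call.lineno} .{call.func.attr}(...)")
       return hits


   def test_no_sync_langchain_calls_in_async_defs():
       offenders = []
       for path in SRC_DIR.rglob("*.py"):
           tree = ast.parse(path.read_text(), filename=str(path))
           offenders += [f"{path}:{hit}" for hit in sync_calls_in_async(tree)]
       assert not offenders, "Sync LangChain calls inside async defs:\n" + "\n".join(offenders)
   ```
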
3. **Fix `.abatch()` concurrency (P08).** Every `.abatch` / `.batch` call must pass `config={"max_concurrency": N}` where N is chosen from the provider table in `references/batch-concurrency-per-provider.md` (Anthropic 10-20, OpenAI 20-50, local vLLM 100+). For multi-worker deploys, cap account-wide calls with a LiteLLM / Portkey proxy or a Redis semaphore — `max_concurrency` only governs one process.

4. **Instrument TTFT with `astream_events(version="v2")` (P01).** Measure time to first token separately from total latency — user-perceived performance hinges on TTFT. Read usage metadata only on the `on_chat_model_end` event; per-chunk usage fields lag and are not reliable mid-stream.

   ```python
   from time import perf_counter

   async def run(chain, query: str):
       t0 = perf_counter()
       ttft = None
       tokens = 0
       async for ev in chain.astream_events({"input": query}, version="v2"):
           if ev["event"] == "on_chat_model_stream" and ttft is None:
               ttft = perf_counter() - t0
           if ev["event"] == "on_chat_model_end":
               tokens = ev["data"]["output"].usage_metadata["total_tokens"]
       return {"ttft_s": ttft, "total_s": perf_counter() - t0, "tokens": tokens}
   ```

5. **Enable an exact LLM cache.** For deterministic (temperature=0) prompts, set `RedisCache` or `SQLiteCache` globally. LangChain 1.0 keys include the bound tools signature (P61 fix), which prevents cache poisoning when an agent's tool list changes. Always set an explicit TTL on Redis keys — default Redis keys are immortal.

   ```python
   from langchain_core.globals import set_llm_cache
   from langchain_community.cache import RedisCache
   import redis

   set_llm_cache(RedisCache(redis.Redis.from_url("redis://cache:6379/0")))
   ```

6. **Add a semantic cache with a tuned threshold (P62).** The `RedisSemanticCache` default `score_threshold=0.95` produces < 5% hit rate on real traffic. Collect a 200-500 prompt golden set with labeled near-duplicates, measure cosine similarity with your embedding model, and pick the F1-maximizing threshold — typically **0.85-0.90** for `text-embedding-3-small`. Full procedure in `references/cache-tuning.md`; Example 4 below sketches the threshold sweep. Do not run semantic cache behind `temperature > 0`; users will see prior random draws.

7. **Replace `InMemoryChatMessageHistory` (P22).** Every production chat path must use `RedisChatMessageHistory` (with `ttl`) or a LangGraph checkpointer (`AsyncPostgresSaver` / `AsyncSqliteSaver`). Add a restart test: mid-conversation, kill and restart the worker, assert the next user turn still sees prior messages (a sketch of this test appears after the checklist below). See `references/persistent-history.md` for migration steps and trim policies.

8. **Close retriever connection pools in FastAPI `lifespan` (P59).** Build the vector store once at startup, expose it via `app.state`, close it in the `finally` block. Never construct a retriever per request — cancellations leak pg connections.

9. **Stream tokens with SSE, not `BackgroundTasks` (P60).** `BackgroundTasks` runs after the response body is flushed; per-token dispatch via it delivers tokens the client will never read. Use `EventSourceResponse` (sse-starlette) or a WebSocket and pipe events from `astream_events`.

10. **Re-run the load test and diff the four metrics.** TTFT, p95, throughput, cost per 1k. If any regressed, revert that step and investigate — do not stack changes without verification.

Execute in this order to isolate effects:

1. Run the baseline load test and save results.
2. Set `max_concurrency` on every `.abatch` call and re-run.
3. Add exact cache, re-run, check cache hit rate.
4. Configure semantic cache with tuned threshold, re-run, check hit rate again.
5. Verify persistent history survives a worker restart.

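A quick approximation of item 5 that does not require orchestrating a real worker kill: rebuild the history object against the same Redis and assert the earlier turns are still visible. A minimal sketch, assuming the `RedisChatMessageHistory` settings from instruction 7 (the Redis URL and session id are placeholders):

```python
# tests/test_history_survives_restart.py: sketch; the Redis URL and session id
# are placeholders matching the instruction-7 configuration.
from langchain_community.chat_message_histories import RedisChatMessageHistory

REDIS_URL = "redis://history:6379/2"


def test_history_survives_process_restart():
    session_id = "restart-test-session"

    # First "process": the user talks to the bot.
    h1 = RedisChatMessageHistory(session_id=session_id, url=REDIS_URL, ttl=3600)
    h1.clear()
    h1.add_user_message("What is our refund policy?")
    h1.add_ai_message("Refunds are accepted within 30 days.")

    # Simulated restart: a brand-new object with no shared in-process state.
    h2 = RedisChatMessageHistory(session_id=session_id, url=REDIS_URL, ttl=3600)
    contents = [m.content for m in h2.messages]
    assert "What is our refund policy?" in contents
    assert "Refunds are accepted within 30 days." in contents
```
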
### Throughput Tuning Table (starting values)

| Provider | Safe `max_concurrency` | Ceiling signal |
|----------|------------------------|----------------|
| Anthropic (sonnet-4.5/4.6) | 10-20 | 429 `rate_limit_error` |
| OpenAI (gpt-4o / 4o-mini) | 20-50 | 429 + TPM exhaustion header |
| OpenAI o1 / reasoning | 2-5 | Cost + latency, not rate |
| Google Gemini 1.5/2.5 | 10-30 | 429 |
| Cohere | 20-40 | 429 |
| Local vLLM / TGI | 100-500 (batch N≈32-64) | GPU KV-cache OOM |
| Ollama on consumer GPU | 1-4 | Process queue backpressure |

### Latency Breakdown Template

Record these for every change, not just total:

| Metric | Target | Source |
|--------|--------|--------|
| TTFT p50 / p95 | 500ms / 1s | first `on_chat_model_stream` event |
| Total p50 / p95 | 2s / 5s | end-to-end handler |
| Tool-call p95 | < 1s per tool | `on_tool_end` - `on_tool_start` |
| Retriever p95 | < 300ms | `on_retriever_end` - `on_retriever_start` |
| Provider p95 | measure per model | split by LLM node |

### Batch Sweet-Spot Numbers

- Anthropic tier 2 chat: `max_concurrency=10` saturates at roughly 8 req/s, p95 doubles past 20.
- OpenAI `gpt-4o-mini` tier 3: knee of the curve around `max_concurrency=30-40`; ~40 req/s throughput.
- Local vLLM A100: server-side batch sweet spot `N=32-64`, client `max_concurrency=100+`.

Verify on your own account — these are starting points, not promises.

## Output

Deliverables from running this skill end-to-end:

- A `perf/` directory with `baseline.json` and `tuned.json` load-test results.
- All async handlers use `ainvoke` / `astream_events` / `abatch` with explicit `max_concurrency`.
- `set_llm_cache` wired to `RedisCache` (exact) and optionally `RedisSemanticCache` (tuned threshold).
- `RunnableWithMessageHistory` or LangGraph checkpointer backed by Redis or Postgres, with TTL.
- FastAPI `lifespan` closing vector store pools on shutdown.
- SSE endpoint streaming from `astream_events(version="v2")`.
- A `tests/test_no_sync_in_async.py` CI guard (see async-safety reference).
- Metrics exported: `ttft_seconds`, `total_latency_seconds`, `cache_hit_total`, `cache_miss_total`, `batch_concurrency_current`.
- Runbook entry with the tuned `max_concurrency` per provider and the semantic-cache threshold, versioned in git.

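The metric names in that last deliverable map one-to-one onto counters, gauges, and histograms. A minimal registration sketch, assuming `prometheus_client` as the exporter (swap in OpenTelemetry or LangSmith if that is where your metrics already go):

```python
# perf/metrics.py: sketch assuming prometheus_client; names follow the
# deliverables list above.
from prometheus_client import Counter, Gauge, Histogram

TTFT_SECONDS = Histogram(
    "ttft_seconds", "Time to first streamed token",
    buckets=(0.1, 0.25, 0.5, 1.0, 2.0, 5.0),
)
TOTAL_LATENCY_SECONDS = Histogram(
    "total_latency_seconds", "End-to-end handler latency",
    buckets=(0.5, 1.0, 2.0, 5.0, 10.0, 30.0),
)
CACHE_HIT_TOTAL = Counter("cache_hit_total", "LLM cache hits", ["cache"])
CACHE_MISS_TOTAL = Counter("cache_miss_total", "LLM cache misses", ["cache"])
BATCH_CONCURRENCY_CURRENT = Gauge(
    "batch_concurrency_current", "max_concurrency currently configured"
)

# Usage, e.g. inside the TTFT loop from instruction 4:
#   TTFT_SECONDS.observe(ttft)
#   TOTAL_LATENCY_SECONDS.observe(total)
#   CACHE_HIT_TOTAL.labels(cache="semantic").inc()
```
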
## Error Handling

| Symptom | Root cause | Fix |
|---------|------------|-----|
| `.abatch(inputs)` no faster than a `for` loop | `max_concurrency=1` default (P08) | Pass `config={"max_concurrency": N}` |
| FastAPI TTFT collapses under load | Sync `invoke` inside `async def` (P48) | Switch to `ainvoke` / `astream_events` |
| Chat forgets prior turns after deploy | `InMemoryChatMessageHistory` (P22) | Move to `RedisChatMessageHistory` with TTL |
| Semantic cache hit rate < 5% | `score_threshold=0.95` default (P62) | Tune on golden set to 0.85-0.90 |
| pg pool exhausted hours into load test | Retriever not closed on cancel (P59) | Close vector store in FastAPI `lifespan` |
| SSE client sees zero tokens | Dispatching via `BackgroundTasks` (P60) | Use `EventSourceResponse` and `astream_events` |
| Per-chunk token counts fluctuate | Usage metadata lags during stream (P01) | Read only on `on_chat_model_end` |
| 429 storm after tuning concurrency | Per-worker limit * N workers > account RPM | Add LiteLLM/Portkey proxy or Redis semaphore |
| Semantic cache returns off-brand output | Cache hit on `temperature > 0` route | Disable semantic cache or force temperature=0 |
| Cache poisoning after tool change | Missing tools in cache key | Upgrade LangChain to 1.0.x post-P61 fix |

## Examples

**Example 1 — Fix a sequential batch job.**

```python
# Before — 1000 items, 18 minutes end-to-end
results = await chain.abatch(inputs)

# After — 1000 items, ~2 minutes; Anthropic tier-2 account, N=10
results = await chain.abatch(inputs, config={"max_concurrency": 10})
```

**Example 2 — Wire persistent history and an exact cache on a FastAPI app.**

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI
from langchain_core.globals import set_llm_cache
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.cache import RedisCache
from langchain_community.chat_message_histories import RedisChatMessageHistory
import redis


@asynccontextmanager
async def lifespan(app: FastAPI):
    r = redis.Redis.from_url("redis://cache:6379/0")
    set_llm_cache(RedisCache(r))
    app.state.r = r
    yield
    r.close()


app = FastAPI(lifespan=lifespan)


def history_for(session_id: str) -> RedisChatMessageHistory:
    return RedisChatMessageHistory(
        session_id=session_id,
        url="redis://history:6379/2",
        ttl=60 * 60 * 24 * 14,
    )


# base_chain: the existing prompt | model runnable this app already uses
chain_with_history = RunnableWithMessageHistory(
    base_chain,
    history_for,
    input_messages_key="input",
    history_messages_key="history",
)
```

**Example 3 — Stream tokens with measured TTFT.**

```python
from time import perf_counter

from sse_starlette.sse import EventSourceResponse


# ChatReq: pydantic request model with `text` and `session_id` fields
@app.post("/chat")
async def chat(req: ChatReq):
    async def gen():
        t0 = perf_counter()
        ttft_recorded = False
        async for ev in chain_with_history.astream_events(
            {"input": req.text},
            config={"configurable": {"session_id": req.session_id}},
            version="v2",
        ):
            if ev["event"] == "on_chat_model_stream":
                if not ttft_recorded:
                    # Record TTFT on the first streamed token, not after the loop.
                    app.state.r.incrbyfloat("ttft_sum_s", perf_counter() - t0)
                    ttft_recorded = True
                yield {"data": ev["data"]["chunk"].content}

    return EventSourceResponse(gen())
```

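**Example 4 — Pick the semantic-cache threshold from a golden set.** A sketch of the sweep described in instruction 6 and `references/cache-tuning.md`. The `golden_set.csv` layout (columns `prompt_a`, `prompt_b`, `label`, where label 1 means "should hit the cache") and the embedding model are assumptions; adjust to your data.

```python
# Sweep cosine-similarity thresholds over a labeled golden set and report the
# F1-maximizing value to use as the semantic-cache score_threshold.
import csv
from math import sqrt

from langchain_openai import OpenAIEmbeddings

emb = OpenAIEmbeddings(model="text-embedding-3-small")


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


with open("golden_set.csv") as f:  # columns: prompt_a, prompt_b, label
    rows = list(csv.DictReader(f))

vecs_a = emb.embed_documents([r["prompt_a"] for r in rows])
vecs_b = emb.embed_documents([r["prompt_b"] for r in rows])
sims = [cosine(a, b) for a, b in zip(vecs_a, vecs_b)]
labels = [int(r["label"]) for r in rows]

best_t, best_f1 = 0.0, 0.0
for t in [x / 100 for x in range(70, 100)]:
    preds = [int(s >= t) for s in sims]
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    if f1 > best_f1:
        best_t, best_f1 = t, f1

print(f"score_threshold={best_t:.2f}  f1={best_f1:.3f}")
```

Feed the winning threshold into `RedisSemanticCache` and re-measure the hit rate on live traffic before trusting it.
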
## Resources

- [One-pager](references/one-pager.md) — problem / solution / key features snapshot.
- [batch-concurrency-per-provider](references/batch-concurrency-per-provider.md) — per-provider `max_concurrency` table, sweep procedure, semaphore patterns.
- [cache-tuning](references/cache-tuning.md) — exact vs semantic, Redis key design, golden-set threshold procedure, TTL strategy.
- [persistent-history](references/persistent-history.md) — Redis / Postgres / LangGraph checkpointer migration off `InMemoryChatMessageHistory`.
- [async-safety-checklist](references/async-safety-checklist.md) — sync-in-async grep + linter, lifespan pool cleanup, SSE vs `BackgroundTasks`.
- [LangChain streaming / batching](https://python.langchain.com/docs/how_to/streaming/#batching) — official docs for `Runnable.batch` and streaming modes.
- [LangChain caching](https://python.langchain.com/docs/how_to/llm_caching/) — `set_llm_cache`, Redis and SQLite backends.
- [LangGraph checkpointers](https://langchain-ai.github.io/langgraph/how-tos/persistence/) — persistence for graph state.
- Companion skills in `langchain-py-pack`: `langchain-model-inference` (token accounting), `langchain-embeddings-search` (retrieval tuning), `langchain-middleware-patterns` (tool-signature cache keying, P61).