--- name: langchain-multi-env-setup description: "Build reliable dev / staging / prod isolation for LangChain 1.0 services\ \ \u2014\nPydantic `Settings` + `SecretStr`, cloud Secret Manager in prod, per-env\n\ prompt and model version pinning, env-specific checkpointer and observability.\n\ Use when graduating from `.env`-in-dev to real prod infra, or debugging a\nconfig\ \ that loaded the wrong values in the wrong env.\nTrigger with \"langchain multi-env\"\ , \"langchain pydantic settings\",\n\"langchain secret manager\", \"langchain env\ \ config\", \"langchain prod setup\".\n" allowed-tools: Read, Write, Edit, Bash(python:*), Bash(gcloud:*) version: 2.0.0 license: MIT author: Jeremy Longshore tags: - saas - langchain - langgraph - python - langchain-1.0 - config - pydantic - multi-env - secrets compatibility: Designed for Claude Code, also compatible with Codex --- # LangChain Multi-Env Setup (Python) ## Overview A team ships a LangChain 1.0 service to staging with `python-dotenv` loading `.env.staging` into `os.environ`. Security audits — `docker exec STAGING-POD env` prints `ANTHROPIC_API_KEY=sk-ant-api03-...` in plain text. Anyone with `kubectl exec`, any sidecar, any core dump, any error tracker that auto-captures process env sees the key. This is pain **P37**: secrets loaded from `.env` in production containers leak via `env`. A second failure chains. A developer runs the staging deploy from a shell where `LANGCHAIN_ENV=production` was set hours earlier. The loader picks the prod `.env`, staging answers with a prompt commit tuned only for the prod model tier, latency doubles. Two root causes: no type-safe env gate, no startup validation that would have caught the mismatched model id. Both are one refactor: ```python # BAD — dotenv populates os.environ; any process with container access sees it from dotenv import load_dotenv load_dotenv(".env.production") api_key = os.environ["ANTHROPIC_API_KEY"] # P37: leaks via `docker exec env` # GOOD — SecretStr in a validated Settings object, pulled from Secret Manager from pydantic import SecretStr from pydantic_settings import BaseSettings class Settings(BaseSettings): env: Literal["dev", "staging", "prod"] anthropic_api_key: SecretStr settings = build_settings() # pulls from GCP Secret Manager in prod api_key = settings.anthropic_api_key.get_secret_value() # repr(settings) prints `SecretStr('**********')` — safe to log ``` This skill owns the per-env **config plumbing** — `Settings` skeleton, Secret Manager integration, per-env pinning, startup smoke test. It does **not** own the full secrets lifecycle (rotation, revocation, scope) — that belongs to `langchain-security-basics`. Pin: `langchain-core 1.0.x`, `langchain-anthropic 1.0.x`, `pydantic >= 2.5`, `pydantic-settings >= 2.1`. Pain anchors: **P37** (primary), **P20** (checkpointer schema — cross-ref `langchain-langgraph-checkpointing`). Two numbers: **smoke test < 10 seconds**; **env-var count ~15-30** (more than 30 means `Settings` is absorbing feature flags and should split). ## Prerequisites - Python 3.10+ (3.11+ recommended for `Literal` and `StrEnum` ergonomics) - `langchain-core >= 1.0, < 2.0` - `pydantic >= 2.5`, `pydantic-settings >= 2.1` - One secret backend: GCP Secret Manager (`google-cloud-secret-manager`), AWS Secrets Manager (`boto3`), or HashiCorp Vault (`hvac`) - Completed `langchain-sdk-patterns` — the `Settings` object is injected into the chain factories from that skill ## Instructions Run these six steps in order — each adds one invariant the next step depends on: 1. Define a `Settings` class with `SecretStr` keys, `Literal` env, and fail-fast validation. 2. Add a per-env loader — file in dev, env vars in staging, Secret Manager in prod. 3. Use the cloud Secret Manager client to pull keys into memory only. 4. Pin `model_id`, `prompt_commit_hash`, and `vector_index_name` per env. 5. Configure the checkpointer per env — memory in dev, Postgres elsewhere. 6. Run a startup smoke test under 10 seconds before the HTTP server binds. ### Step 1 — Create a Settings class with SecretStr and fail-fast validation ```python from typing import Literal from pydantic import SecretStr, HttpUrl, Field, ValidationError from pydantic_settings import BaseSettings, SettingsConfigDict class Settings(BaseSettings): model_config = SettingsConfigDict( env_file=None, # see Step 2 — loader picks the file env_file_encoding="utf-8", case_sensitive=False, extra="forbid", # reject unknown env vars — typo detection ) # --- env switch (drives everything else) --- env: Literal["dev", "staging", "prod"] = Field(..., alias="LANGCHAIN_ENV") # --- secrets (always SecretStr — never str) --- anthropic_api_key: SecretStr = Field(..., alias="ANTHROPIC_API_KEY") openai_api_key: SecretStr = Field(..., alias="OPENAI_API_KEY") langsmith_api_key: SecretStr = Field(..., alias="LANGSMITH_API_KEY") # --- per-env pinning (see Step 4) --- model_id: str = Field(..., alias="LANGCHAIN_MODEL_ID") prompt_commit_hash: str = Field(..., alias="LANGCHAIN_PROMPT_COMMIT") vector_index_name: str = Field(..., alias="LANGCHAIN_VECTOR_INDEX") # --- endpoints (validated URLs — typo caught at startup) --- checkpointer_url: HttpUrl | None = Field(None, alias="LANGCHAIN_CHECKPOINTER_URL") otel_endpoint: HttpUrl = Field(..., alias="OTEL_EXPORTER_OTLP_ENDPOINT") # --- budget guards (per-env) --- max_cost_usd_per_day: float = Field(10.0, alias="LANGCHAIN_DAILY_BUDGET_USD") max_rpm: int = Field(60, alias="LANGCHAIN_MAX_RPM") ``` `SecretStr` masks `repr(settings)` to `SecretStr('**********')` — a routine `logger.info(settings)` cannot leak the key. The only way to read plaintext is `.get_secret_value()`, which greps like a sore thumb in review. `extra="forbid"` catches typos (`LANGCHIN_MODEL_ID`) at import time. `HttpUrl` rejects `http:/otel:4318` before the exporter wastes 60s on DNS. See [Settings Skeleton](references/settings-skeleton.md) for the full class. ### Step 2 — Per-env config loading (file OR Secret Manager, never both) ```python import os from pathlib import Path def build_settings() -> Settings: env = os.environ.get("LANGCHAIN_ENV", "dev") if env == "dev": # Local dev: .env.dev file, values checked into 1Password not git return Settings(_env_file=Path(".env.dev")) if env == "staging": # CI / staging: env vars injected by the orchestrator # (GitHub Actions secrets, k8s envFrom: secretRef, etc.) return Settings() # reads os.environ directly if env == "prod": # Prod: pull from Secret Manager into memory ONLY values = pull_from_secret_manager() return Settings(**values) raise ValueError(f"unknown LANGCHAIN_ENV: {env!r}") ``` Three loaders, one class. Dev touches a file on disk. Staging inherits env vars from the orchestrator — `envFrom: secretRef` is readable via `docker exec env`, but the blast radius is bounded and rotation is weekly. Prod is the P37 fix: `pull_from_secret_manager()` builds a dict and passes kwargs to `Settings(...)`. Values land in the instance attribute and **never touch `os.environ`**. A subprocess will not inherit them. ### Step 3 — Secret Manager pull (GCP example; AWS / Vault in reference) ```python from google.cloud import secretmanager def pull_from_secret_manager() -> dict[str, str]: client = secretmanager.SecretManagerServiceClient() project = os.environ["GCP_PROJECT_ID"] secret_names = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "LANGSMITH_API_KEY"] out: dict[str, str] = {} for name in secret_names: resource = f"projects/{project}/secrets/{name}/versions/latest" response = client.access_secret_version(request={"name": resource}) out[name] = response.payload.data.decode("utf-8") # Non-secret passthrough (model id, prompt hash, endpoints) for key in ["LANGCHAIN_ENV", "LANGCHAIN_MODEL_ID", "LANGCHAIN_PROMPT_COMMIT", "LANGCHAIN_VECTOR_INDEX", "LANGCHAIN_CHECKPOINTER_URL", "OTEL_EXPORTER_OTLP_ENDPOINT"]: if key in os.environ: out[key] = os.environ[key] return out ``` No `os.environ[k] = v` line. The dict goes straight into `Settings(**values)`. Workload-identity IAM handles auth; no static key on disk. For AWS / Vault see [Secret Manager Integration](references/secret-manager-integration.md). ### Step 4 — Per-env model and prompt pinning Dev, staging, and prod run **different** model ids and **different** prompt commit hashes. Pinning happens at env-var level so app code is env-agnostic (see the Env Matrix below for values). One function reads `settings.prompt_commit_hash` and pulls from LangSmith (cross-ref `langchain-prompt-engineering`): ```python from langsmith import Client ls = Client(api_key=settings.langsmith_api_key.get_secret_value()) def get_prompt(settings: Settings) -> ChatPromptTemplate: return ls.pull_prompt(f"triage-prompt:{settings.prompt_commit_hash}") ``` **Prevents:** staging loading a prod prompt commit. Pinning per env makes promotion explicit — dev → staging → prod moves one hash at a time. See [Per-Env Pinning](references/per-env-pinning.md). ### Step 5 — Per-env checkpointer selection Checkpointer choice is per-env too: ```python from langgraph.checkpoint.memory import MemorySaver from langgraph.checkpoint.postgres import PostgresSaver def build_checkpointer(settings: Settings): if settings.env == "dev": return MemorySaver() # ephemeral, resets on restart # staging + prod: Postgres with env-isolated schema # cross-ref langchain-langgraph-checkpointing (P20) for schema migration return PostgresSaver.from_conn_string( str(settings.checkpointer_url) ) ``` Dev uses `MemorySaver` — no infra dependency, no state between runs. Staging and prod use `PostgresSaver` against **separate** databases (or separate schemas). Never share a checkpointer DB between envs; P20 explains — schema migrations on a version bump corrupt cross-env threads. ### Step 6 — Startup smoke test (< 10 seconds budget) ```python import time from anthropic import Anthropic def validate_integrations(settings: Settings) -> None: t0 = time.monotonic() # 1. Model reachable (1-token ping ~ $0.00001) anthropic = Anthropic(api_key=settings.anthropic_api_key.get_secret_value()) anthropic.messages.create( model=settings.model_id, max_tokens=1, messages=[{"role": "user", "content": "hi"}], ) # 2. Checkpointer reachable if settings.env != "dev": checkpointer = build_checkpointer(settings) checkpointer.setup() # runs SELECT 1 + schema check # 3. Vector store reachable (see langchain-embeddings-search) # ... describe_index call here ... # 4. Observability endpoint reachable (OTLP HTTP health) # ... requests.get(f"{settings.otel_endpoint}/health", timeout=2) ... elapsed = time.monotonic() - t0 if elapsed > 10.0: raise RuntimeError( f"startup smoke test took {elapsed:.1f}s (budget 10s)" ) ``` Call `validate_integrations(settings)` **before** the HTTP server binds. Failure aborts the deploy — the readiness probe never goes green, the rollout halts, the bad version takes no traffic. Budget: **10 seconds**. Past 10s an integration is degraded — fail loudly rather than ship a 30s cold start. See [Startup Smoke Test](references/startup-smoke-test.md). ## Output - `Settings` class on `pydantic-settings` with `SecretStr` for keys, `Literal` env, `HttpUrl` endpoints, `extra="forbid"` - Env-specific loader (file → dev; env vars → staging; Secret Manager → prod); values land in `Settings` only, never `os.environ` - Cloud Secret Manager integration (GCP / AWS / Vault) with IAM-bound auth; no static keys on disk - Per-env pinning for `model_id`, `prompt_commit_hash`, `vector_index_name`, `checkpointer_url` - Per-env checkpointer (`MemorySaver` dev, `PostgresSaver` on isolated DBs staging/prod) - Startup smoke test — model / vector / checkpointer / observability under 10-second budget ## Env Matrix | Dimension | dev | staging | prod | |---|---|---|---| | Secret backend | `.env.dev` file (git-ignored) | orchestrator env vars | cloud Secret Manager, memory only | | `os.environ` holds keys | yes (local) | yes (sidecar visible) | **no** (P37 fix) | | `model_id` | `claude-haiku-4-6` | `claude-sonnet-4-6` | `claude-sonnet-4-6` | | `prompt_commit_hash` | WIP | canary | stable (1 week old) | | `temperature` | 0.7 | 0.2 | 0.2 | | Checkpointer | `MemorySaver` | `PostgresSaver` (staging DB) | `PostgresSaver` (prod DB) | | Vector index | `dev-index` | `staging-index` | `prod-index` | | OTEL sample rate | 1.0 | 1.0 | 0.1 | | RPM limit | 10 | 60 | provider tier | | Daily budget | $1 | $10 | $500-$5000 | | Smoke probes | model | model + checkpointer + OTEL | all four | ## Error Handling | Error | Cause | Fix | |---|---|---| | `docker exec POD env` shows `ANTHROPIC_API_KEY=...` in prod (P37) | `dotenv` / plain env injection in prod | Pull from Secret Manager into `Settings(**values)`; never write to `os.environ` | | Staging answers with prod prompts / wrong model | Loader defaulted or picked stale `LANGCHAIN_ENV` | `Literal["dev","staging","prod"]` on env; raise on unknown; no default | | `ValidationError: extra fields forbidden` at startup | Typo (`LANGCHIN_MODEL_ID`) | Fix the typo — `extra="forbid"` working as intended | | Startup takes 30s before first request | Serialized probes or degraded integration | Enforce 10s budget; parallelize probes; fail the deploy | | `repr(settings)` in a log leaks the API key | Plain `str` used, not `SecretStr` | Change field to `SecretStr`; repr masks to `'**********'` | | Prod silently using `MemorySaver` | `build_checkpointer` defaulted when `checkpointer_url` was None | Require `checkpointer_url` in staging/prod via a model validator | | Secret Manager auth fails in CI | SA not bound; `google.auth` fell back to ADC | Bind SA with `roles/secretmanager.secretAccessor` | | Prompt hash rolled forward in staging without dev validation | Promotion skipped the dev gate | Enforce dev → staging → prod order in CI (see per-env pinning ref) | ## Examples ### Graduating a `.env`-in-dev service to prod Start: a single `.env` committed (or leaked via `docker exec env`). End: `Settings` class, three loaders, Secret Manager in prod, smoke test under 10s. Three PRs — (1) introduce `Settings` without changing loader behavior, (2) add `SecretStr` and migrate call sites to `.get_secret_value()`, (3) swap prod to Secret Manager and remove the prod `.env` from the image. See [Settings Skeleton](references/settings-skeleton.md) and [Secret Manager Integration](references/secret-manager-integration.md). ### Wrong-env prompt loaded in staging — postmortem Staging inherited `LANGCHAIN_ENV=production` from a stale shell. The `Literal["dev","staging","prod"]` field rejects `production`; CI promotion sets `LANGCHAIN_ENV` explicitly; `direnv` pins it per-project. See [Per-Env Pinning](references/per-env-pinning.md). ### Smoke test blocked a bad model id A prod deploy went out with `LANGCHAIN_MODEL_ID=claude-sonnet-4-7` (not yet rolled out). The 1-token ping failed with `model not found`, `validate_integrations` raised, the container crash-looped, the rollout halted, the previous version kept taking traffic. Zero user impact; failure budget stayed under 3s. See [Startup Smoke Test](references/startup-smoke-test.md). ## Resources - [Pydantic Settings docs](https://docs.pydantic.dev/latest/concepts/pydantic_settings/) - [Pydantic `SecretStr`](https://docs.pydantic.dev/latest/api/types/#pydantic.types.SecretStr) - [GCP Secret Manager client](https://cloud.google.com/secret-manager/docs/reference/libraries) - [AWS Secrets Manager `boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/secretsmanager.html) - [HashiCorp Vault `hvac`](https://hvac.readthedocs.io/) - [LangChain 1.0 release notes](https://blog.langchain.com/langchain-langgraph-1dot0/) - Related skills in pack: `langchain-security-basics` (secrets lifecycle, owns rotation and revocation — not duplicated here); `langchain-langgraph-checkpointing` (P20 schema migration); `langchain-prompt-engineering` (prompt pin / LangSmith pull workflow); `langchain-reference-architecture` (where `Settings` fits in the DI layer) - Pack pain catalog: `docs/pain-catalog.md` (entries P37 primary, P20 cross-ref)