--- name: langchain-ci-integration description: "Wire LangChain 1.0 / LangGraph 1.0 tests into a GitHub Actions pipeline\ \ \u2014\nunit tests with FakeListChatModel, VCR-gated integration tests, warning-filter\n\ policy, and eval-regression merge gates. Complements langchain-local-dev-loop\n\ (F23) which covers the inner loop; THIS covers the CI wire-up. Use when setting\n\ up GHA for a new LLM service, after a VCR cassette leak incident, or hardening\n\ an existing pipeline.\nTrigger with \"langchain ci\", \"langchain github actions\"\ , \"langchain test pipeline\",\n\"vcr ci\", \"langchain eval gate\", \"pytest -W\ \ error langchain\".\n" allowed-tools: Read, Write, Edit, Bash(python:*), Bash(pytest:*) version: 2.0.0 license: MIT author: Jeremy Longshore tags: - saas - langchain - langgraph - python - langchain-1.0 - ci - github-actions - testing compatibility: Designed for Claude Code, also compatible with Codex --- # LangChain CI Integration (Python) ## Overview A PR passes every test on your laptop. You push. GHA runs `pytest` and aborts during collection — before a single test executes — with: ``` PytestUnraisableExceptionWarning: Exception ignored in: ... DeprecationWarning: langchain_community.llms ... ``` The org runs `pytest -W error` and a provider SDK emitted a `DeprecationWarning` at *import* time, which the warning filter promoted to an exception while pytest was still walking the test tree. This is **P45** and it blocks every PR for the team until someone pins a `filterwarnings` config. Meanwhile the integration suite has its own failure mode: a VCR cassette recorded three months ago at `temperature=0` against Anthropic is now flaking against a snapshot. `temperature=0` is not deterministic on Claude — it still nucleus-samples (**P05**) — so the cassette captured *one* valid completion, not *the* valid completion. And yesterday a reviewer caught `Authorization: Bearer sk-ant-...` in a cassette file that had been committed six weeks ago (**P44**) because `vcrpy` records all request headers by default. This skill covers the outer loop: the GitHub Actions workflow, the unit / integration / eval gate separation, VCR cassette hygiene, pytest warning policy, and a merge-blocking eval regression gate. The **inner** loop — fake model fixtures, VCR recording workflow, local determinism tricks — lives in `langchain-local-dev-loop` (F23); cross-reference it, do not duplicate it. Pin: `langchain-core 1.0.x`, `langgraph 1.0.x`, `actions/checkout@v4`, `actions/setup-python@v5`, `vcrpy 6.x`. Pain-catalog anchors: **P05, P43, P44, P45**. ## Prerequisites - Python 3.10, 3.11, or 3.12 (matrix) - `langchain-core >= 1.0, < 2.0`, `langgraph >= 1.0, < 2.0` - `pytest >= 8`, `pytest-asyncio`, `vcrpy >= 6` (integration) - `langchain-local-dev-loop` (F23) applied locally — fixtures and recording workflow - GitHub repo with Actions enabled; secrets set for any live-API nightly job ## Instructions ### Step 1 — GHA workflow skeleton with four jobs Single workflow at `.github/workflows/tests.yml`. Matrix on unit only; keep integration and eval single-version to control cost. ```yaml name: tests on: pull_request: push: branches: [main] schedule: - cron: "0 6 * * *" # nightly live-API re-record check (06:00 UTC) concurrency: group: ${{ github.workflow }}-${{ github.ref }} cancel-in-progress: true jobs: unit: runs-on: ubuntu-latest strategy: fail-fast: false matrix: python: ["3.10", "3.11", "3.12"] steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v5 with: python-version: ${{ matrix.python }} cache: pip cache-dependency-path: | pyproject.toml requirements*.txt - run: pip install -e ".[test]" - run: pytest tests/unit/ -W error --timeout=30 -q integration: needs: unit if: github.event_name == 'schedule' || contains(github.event.pull_request.labels.*.name, 'run-integration') runs-on: ubuntu-latest env: RUN_INTEGRATION: "1" VCR_MODE: "none" # replay-only; nightly cron flips to "once" steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v5 with: { python-version: "3.12", cache: pip } - run: pip install -e ".[test,integration]" - run: pytest tests/integration/ -W error --timeout=60 -q eval: needs: unit if: github.event_name == 'pull_request' runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 with: { fetch-depth: 0 } # need base ref for delta comparison - uses: actions/setup-python@v5 with: { python-version: "3.12", cache: pip } - run: pip install -e ".[test,eval]" - run: python scripts/run_eval.py --baseline origin/${{ github.base_ref }} --head HEAD --n 100 # run_eval.py posts a PR comment and exits nonzero on regression > threshold lint: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v5 with: { python-version: "3.12", cache: pip } - run: pip install -e ".[dev]" - run: ruff check . - run: python scripts/dryrun_load_chains.py # catches ImportError migration regressions ``` See [GHA Workflow Reference](references/github-actions-workflow.md) for the full job definitions including the secret-injection pattern, the matrix caching nuance, and the `softprops/action-gh-release`-style PR comment action used by the eval job. ### Step 2 — Unit job: `-W error` + `filterwarnings` to neutralize P45 Root cause of the collection abort: pytest collects tests by importing them. Some provider SDKs emit `DeprecationWarning` on import. With `-W error` those become exceptions during collection. Fix at the *filter* level, not by dropping `-W error` (which would mask real warnings). In `pyproject.toml`: ```toml [tool.pytest.ini_options] filterwarnings = [ "error", # P45 — neutralize known import-time noise; scoped per module so new # warnings from YOUR code still fail the build. "ignore::DeprecationWarning:langchain_community.*", "ignore::DeprecationWarning:pydantic.*", "ignore:Pydantic serializer warnings:UserWarning", ] asyncio_mode = "auto" testpaths = ["tests"] ``` The ordering matters — `"error"` first, specific `"ignore"` entries after, so the filters override the global promote-to-error. Keep the list **narrow**: a blanket `ignore::DeprecationWarning` hides regressions you need to see. Unit tests use `FakeListChatModel` fixtures from F23 (do not redefine them here). One CI-specific gotcha (**P43**): `FakeListChatModel` does not emit `response_metadata["token_usage"]`, so any callback that asserts on token counts will break. Either subclass the fake and inject `generation_info`, or gate the assertion: ```python def test_chain_uses_tokens(patched_chat_model): result = chain.invoke({"input": "hi"}) if patched_chat_model.__class__.__name__ == "FakeListChatModel": pytest.skip("fake model doesn't emit token_usage (P43)") assert result.response_metadata["token_usage"]["total_tokens"] > 0 ``` Budget: unit job should finish in **<2 minutes** across the 3-version matrix. If it doesn't, something is calling out to a real provider — check with `pytest --collect-only -q | wc -l` and audit which tests lack fake-model fixtures. ### Step 3 — Integration job: VCR replay + `filter_headers` (P44) Integration tests replay pre-recorded VCR cassettes. Three rules: 1. Gate the job. `if: contains(github.event.pull_request.labels.*.name, 'run-integration')` or `env.RUN_INTEGRATION == "1"`, plus a nightly cron that flips to `VCR_MODE=once` and re-records against live APIs. PRs default to pure replay. 2. Enforce `filter_headers` at the fixture level — not per-test. A single `conftest.py` prevents any contributor from recording a cassette with raw credentials. 3. Pre-commit + CI both scan cassettes for leaked keys. Belt and suspenders. Fixture (lives in `tests/integration/conftest.py`, owned by this skill's pipeline concern — F23 owns the *recording* workflow): ```python import vcr import pytest @pytest.fixture(scope="module") def vcr_config(): return { "filter_headers": [ "authorization", "x-api-key", "anthropic-version", ("openai-organization", "REDACTED"), ], "filter_post_data_parameters": ["api_key"], "record_mode": "none", # CI default: replay only "match_on": ["method", "scheme", "host", "port", "path", "query"], } ``` Integration suite must finish in **<5 minutes** wall-clock on the runner, or you will start getting cancellation flakes from the `concurrency` block. If you exceed 5 minutes, split into a nightly-only long tier. See [Integration Gating](references/integration-gating.md) for the full live-vs-replay decision tree, cost-per-run budget worksheet, and the `VCR_MODE` flip pattern. ### Step 4 — Eval-regression gate: merge-blocking PR comment The eval job runs the `langchain-eval-harness` harness (see that skill for the harness itself — this skill only covers the CI wire-up) against both the PR branch and the merge base. Post a comment; block merge on regression. `scripts/run_eval.py` is a thin CI wrapper: check out baseline and head via `git worktree`, run the harness at each ref, diff the results, post a PR comment, exit nonzero on regression. Full implementation in [Eval Regression Gate](references/eval-regression-gate.md). Thresholds: | Gate | Threshold | Rationale | |------|-----------|-----------| | Aggregate score | drop >2% | One-sigma noise on n=100 with well-behaved evals | | Per-example score | drop >5% on any single case | Catches quiet regressions masked by aggregate averaging | | Sample size floor | n ≥ 100 | Below this, aggregate delta is dominated by noise | The PR comment is a Markdown table with `before / after / Δ` per metric plus a **bold red** line if the gate failed. Required-status-check on the `eval` job completes the enforcement. See [Eval Regression Gate](references/eval-regression-gate.md) for the comment template and the noise-budget calculation. ### Step 5 — Pre-commit hooks: secret scan + prompt lint Two layers: local (pre-commit) and CI (re-runs the same hooks as a final catch). Local alone is not sufficient — contributors can skip with `-n`. CI alone is slow. Run both. `.pre-commit-config.yaml`: ```yaml repos: - repo: local hooks: - id: vcr-secret-scan name: VCR cassette secret scan (P44) entry: python scripts/scan_cassettes.py language: system files: "tests/integration/cassettes/.*\\.ya?ml$" pass_filenames: true - id: prompt-convention-lint name: prompt-convention lint entry: python scripts/lint_prompts.py language: system files: "prompts/.*\\.j2$|src/.*prompts?\\.py$" - repo: https://github.com/astral-sh/ruff-pre-commit rev: v0.6.9 hooks: - id: ruff - id: ruff-format - repo: https://github.com/Yelp/detect-secrets rev: v1.5.0 hooks: - id: detect-secrets args: ["--baseline", ".secrets.baseline"] ``` `scan_cassettes.py` greps for `sk-[A-Za-z0-9]{20,}`, `sk-ant-[A-Za-z0-9_-]{20,}`, `AIza[A-Za-z0-9_-]{35}` (Google), `xoxb-`, and `Bearer [A-Za-z0-9._-]{20,}`. Fail on any match. This is your last line of defense before **P44** ships to `main`. See [Pre-Commit Hooks](references/pre-commit-hooks.md) for the full pattern list, the prompt-convention lint rules (aligned with `claude-prompt-conventions`), and the `detect-secrets` baseline-rotation policy. ### Step 6 — Dry-run chain loader: catch ImportError migration breaks LangChain 0.x → 1.0 moved integrations into provider packages. A chain that imports `from langchain.chat_models import ChatOpenAI` works in local dev if you still have the old compat shim installed, and explodes in CI. Dry-run-load every chain module at lint time: ```python # scripts/dryrun_load_chains.py import importlib, pathlib, sys, traceback failures = [] for py in pathlib.Path("src/chains").rglob("*.py"): mod = str(py.with_suffix("")).replace("/", ".") try: importlib.import_module(mod) except Exception: failures.append((mod, traceback.format_exc())) if failures: for mod, tb in failures: print(f"::error::chain {mod} failed to import\n{tb}") sys.exit(1) ``` Runs in the `lint` job. Costs ~5 seconds. Catches every `ImportError` and every top-level `NameError` from a bad rename before a single unit test fires. ## Output - GHA workflow with four isolated jobs (unit / integration / eval / lint) - `pyproject.toml` `filterwarnings` config that survives `-W error` (P45) - VCR `conftest.py` fixture with enforced `filter_headers` (P44) - `run_eval.py` CI wrapper that posts PR comments and blocks merge on regression - `.pre-commit-config.yaml` with cassette secret scan + prompt lint + ruff - Dry-run chain loader that catches migration `ImportError`s ### Gate policy | Gate | Required? | Target speed | On failure | |------|-----------|--------------|------------| | unit (3 Python versions) | yes, every PR | <2 min | block PR | | lint + dryrun-load | yes, every PR | <30 s | block PR | | integration (VCR replay) | on `run-integration` label or nightly | <5 min | block merge when run | | integration (live, nightly cron) | no | <15 min | open issue on fail | | eval regression (n≥100) | yes, every PR | <10 min | block merge if agg >2% or per-example >5% | | pre-commit (local) | yes | <10 s | reject commit | ## Error Handling | Error | Cause | Fix | |-------|-------|-----| | `PytestUnraisableExceptionWarning` during collection | `-W error` + SDK import-time `DeprecationWarning` (P45) | Add scoped `filterwarnings = ["ignore::DeprecationWarning:langchain_community.*"]` to `pyproject.toml` | | VCR replay mismatch after weeks of passing | Cassette recorded at `temp=0` on Anthropic (P05); model drift | Re-record on nightly cron with `VCR_MODE=once`; treat replay mismatches as eval-gate concerns, not unit failures | | `sk-ant-...` in cassette flagged by reviewer | `vcrpy` records all headers by default (P44) | Enforce `filter_headers` in `conftest.py`; add `scan_cassettes.py` to pre-commit AND CI | | Callback `AssertionError: 'token_usage' not in response_metadata` | `FakeListChatModel` doesn't emit metadata (P43) | Subclass the fake to inject `generation_info`, or `pytest.skip` on fake-model detection | | `ImportError: cannot import name 'ChatOpenAI' from 'langchain.chat_models'` in CI only | Legacy compat shim installed locally, not in CI | Add `dryrun_load_chains.py` to lint job; fail at lint, not at test | | Eval job times out at 10 min | n too large or harness not using `asyncio` concurrency | Cap at n=100 for PRs; run n=500 nightly; see F23 for async harness pattern | | Concurrency block cancels integration run | Long job + rapid pushes | Do not disable; keep integration <5 min or split long tier to nightly | ## Examples ### Wiring a new repo from scratch Copy the Step 1 workflow, the Step 2 `pyproject.toml` block, and the Step 5 pre-commit config. Create `tests/unit/`, `tests/integration/cassettes/`, `scripts/run_eval.py`, `scripts/dryrun_load_chains.py`, `scripts/scan_cassettes.py`. Apply `langchain-local-dev-loop` (F23) first so fake-model fixtures exist before the unit job runs. Enable required status checks: `unit (3.10)`, `unit (3.11)`, `unit (3.12)`, `lint`, `eval`. Integration stays optional (label-gated). See [GHA Workflow Reference](references/github-actions-workflow.md) for the complete copy-pasteable workflow. ### Hardening after a P44 cassette-leak incident Rotate every leaked key **first** (not a CI concern — incident response). Then: add `scan_cassettes.py` to pre-commit, re-scan the full history with `git log -p -- tests/integration/cassettes/`, rewrite history with `git-filter-repo` if keys hit `main`, enforce the `filter_headers` fixture going forward. See [Pre-Commit Hooks](references/pre-commit-hooks.md) for the full pattern list and the `detect-secrets` baseline-rotation playbook. ### Wiring the eval harness into an existing repo The harness itself lives in `langchain-eval-harness`. THIS skill only supplies `run_eval.py` (the CI wrapper that reads the harness output, computes deltas, and posts PR comments) plus the gate thresholds. Drop in the Step 4 script, add the `eval` job to `.github/workflows/tests.yml`, make `eval` a required status check. See [Eval Regression Gate](references/eval-regression-gate.md) for the PR-comment Markdown template and the n≥100 noise-budget derivation. ## Resources - [LangChain Python: Testing](https://python.langchain.com/docs/how_to/testing/) - [`FakeListChatModel` API](https://python.langchain.com/api_reference/core/language_models/langchain_core.language_models.fake_chat_models.FakeListChatModel.html) - [vcrpy docs — filtering sensitive data](https://vcrpy.readthedocs.io/en/latest/advanced.html#filter-sensitive-data-from-the-request) - [GitHub Actions docs](https://docs.github.com/en/actions) - [pytest `filterwarnings`](https://docs.pytest.org/en/stable/how-to/capture-warnings.html) - Pair skill: `langchain-local-dev-loop` (F23) — fake fixtures, local recording - Pair skill: `langchain-eval-harness` — eval suite the gate runs against - Pack pain catalog: `docs/pain-catalog.md` (entries P05, P43, P44, P45)