---
name: langgraph-testing-evaluation
description: "Use this skill when you need to test or evaluate LangGraph/LangChain agents: writing unit or integration tests, generating test scaffolds, mocking LLM/tool behavior, running trajectory evaluation (match or LLM-as-judge), running LangSmith dataset evaluations, and comparing two agent versions with A/B-style offline analysis. Use it for Python and JavaScript/TypeScript workflows, evaluator design, experiment setup, regression gates, and debugging flaky/incorrect evaluation results."
---

# LangGraph Testing & Evaluation

Practical workflows for validating agent quality with:

- Unit/integration tests
- Trajectory evaluation
- LangSmith dataset evaluations
- A/B-style comparisons between versions

Use this file for the high-level flow. Load `references/*` for detailed implementation.

## Start Here

Choose the smallest approach that answers your question:

| Goal | Primary method | Load first |
| --- | --- | --- |
| Validate node logic quickly | Unit tests with mocks | `references/unit-testing-patterns.md` |
| Validate multi-step agent behavior | Trajectory evaluation | `references/trajectory-evaluation.md` |
| Track quality across datasets over time | LangSmith evaluation | `references/langsmith-evaluation.md` |
| Compare old vs new agent versions | A/B comparison | `references/ab-testing.md` |

Recommended order:

1. Unit tests
2. Integration/trajectory checks
3. Dataset evaluation in LangSmith
4. A/B comparison before deployment

## Quick Commands

Run from the repo root.

### Generate test scaffolding

```bash
# Python (preferred)
uv run skills/langgraph-testing-evaluation/scripts/generate_test_cases.py my_agent:graph --output tests/ --framework pytest

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/generate_test_cases.js ./my-agent.ts:graph --output tests/ --framework vitest
```

### Run trajectory evaluation

```bash
# Python: LLM-as-judge
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent my_dataset --method llm-judge --model openai:o3-mini

# Python: trajectory match
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent dataset.json --method match --trajectory-match-mode strict --reference-trajectory reference.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.js ./agent.ts:runAgent my_dataset --method llm-judge --model openai:o3-mini --max-concurrency 4
```

### Run LangSmith dataset evaluation

```bash
# Python
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy,latency --max-concurrency 4

# Python (do not upload experiment results)
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy --no-upload

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.js ./agent.ts:runAgent my_dataset --evaluators accuracy,latency --max-concurrency 4
```

### Compare two agent versions

```bash
# Python
uv run skills/langgraph-testing-evaluation/scripts/compare_agents.py my_agent:v1 my_agent:v2 dataset.json --output comparison_report.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --output comparison_report.json

# JavaScript/TypeScript (force local dataset file only)
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --no-langsmith
```
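The comparison and trajectory commands above accept a local `dataset.json`. As a minimal sketch, assuming ad-hoc local datasets follow the same `examples: [{ inputs, outputs, metadata }]` shape documented for `assets/datasets/sample_dataset.json` (the questions and answers below are purely illustrative):

```python
# Hypothetical seed script that writes a dataset.json in the assumed shape.
# Each example carries `inputs` and `outputs` objects plus optional `metadata`.
import json

dataset = {
    "examples": [
        {
            "inputs": {"question": "What is the capital of France?"},
            "outputs": {"answer": "Paris"},
            "metadata": {"category": "factual"},
        },
        {
            "inputs": {"question": "Add 2 and 2."},
            "outputs": {"answer": "4"},
            "metadata": {"category": "arithmetic"},
        },
    ]
}

with open("dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```

The scripts also accept singular `input`/`output` keys for legacy datasets (see Core Workflow below), but new datasets should use the plural form.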
### Create mock response configs

```bash
# Python
uv run skills/langgraph-testing-evaluation/scripts/mock_llm_responses.py create --type sequence --output mock_config.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/mock_llm_responses.js create --type sequence --output mock_config.json
```

## Core Workflow

1. Define test scope.
   - Unit: deterministic logic in one node/function.
   - Integration: node interactions and routing.
   - End-to-end: complete response quality on realistic inputs.
2. Start from deterministic checks.
   - Mock LLM/tool IO for speed and repeatability.
   - Keep real-model tests as a smaller, explicit suite.
3. Build/curate dataset examples.
   - Use stable inputs and expected outputs.
   - Keep schema simple: `inputs` and `outputs` objects (optional `metadata`).
   - Compatibility note: scripts also accept singular keys (`input`, `output`) for legacy datasets.
4. Run evaluation with explicit gates.
   - Use evaluator keys that map to deployment decisions.
   - Set thresholds in CI for regression prevention.
5. Compare versions before rollout.
   - Run the same dataset on both versions.
   - Check both quality and latency.
6. Diagnose failures from traces/experiments.
   - Inspect low-scoring examples.
   - Split failures by pattern (routing, tool usage, hallucination, latency spikes).

## Current References (Load On Demand)

### `references/unit-testing-patterns.md`

Load when:

- You need node-level and routing test patterns.
- You need pytest/vitest/Jest integration patterns.
- You need robust mocking and flaky-test reduction.

### `references/trajectory-evaluation.md`

Load when:

- You need trajectory match evaluation (`strict`, `unordered`, `subset`, `superset`).
- You need LLM-as-judge trajectory scoring.
- You need LangSmith experiment comparison for trajectory results.

### `references/langsmith-evaluation.md`

Load when:

- You need dataset creation/management in LangSmith.
- You need evaluator signatures and experiment runs in Python/TS.
- You need CI-friendly workflows with quality thresholds.

### `references/ab-testing.md`

Load when:

- You need offline A/B comparison methodology.
- You need significance testing and interpretation.
- You need production traffic split strategy and guardrails.

## Assets

### `assets/templates/test_template.py`

- Runnable Python pytest template aligned with current LangGraph testing patterns.
- Includes:
  - Compiled-graph invocation with `thread_id`
  - Single-node testing via `compiled_graph.nodes[...]`
  - Integration-test placeholder

### `assets/datasets/sample_dataset.json`

- Deterministic seed dataset for LangSmith ingestion.
- Uses the `examples: [{ inputs, outputs, metadata }]` format.

### `assets/examples/README.md`

- Documentation-only index for current asset usage.
- Notes where runnable assets live today.

## Script Interface Summary

### `scripts/generate_test_cases.py` / `.js`

Use for fast test scaffolding.

Inputs:

- Graph module path
  - Python: `my_module:graph` or `my_module.graph`
  - JS/TS: `./file.ts:graph`

Outputs:

- Framework-specific starter tests in the target directory.
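The generated scaffolds follow the same conventions as `assets/templates/test_template.py`. For orientation, here is a self-contained sketch of the core deterministic-test pattern with the LLM mocked out; the graph, state schema, and node name are illustrative placeholders, not part of this skill:

```python
# Sketch: unit-test a single LangGraph node with a mocked LLM (no network calls).
from typing import TypedDict
from unittest.mock import MagicMock

from langgraph.graph import END, START, StateGraph


class State(TypedDict):
    question: str
    answer: str


def test_respond_node_uses_mocked_llm():
    # Stand-in for a chat model: .invoke returns an object with a .content attribute.
    llm = MagicMock()
    llm.invoke.return_value.content = "Paris"

    def respond(state: State) -> dict:
        return {"answer": llm.invoke(state["question"]).content}

    builder = StateGraph(State)
    builder.add_node("respond", respond)
    builder.add_edge(START, "respond")
    builder.add_edge("respond", END)
    graph = builder.compile()

    result = graph.invoke({"question": "What is the capital of France?"})

    assert result["answer"] == "Paris"
    llm.invoke.assert_called_once_with("What is the capital of France?")
```

For single-node checks that skip the full graph invocation, the pytest template also demonstrates the `compiled_graph.nodes[...]` access pattern.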
### `scripts/run_trajectory_eval.py` / `.js`

Use for trajectory scoring with either:

- `--method match`
- `--method llm-judge`

Supports:

- Local dataset files (`.json`)
- LangSmith dataset names
- An optional reference trajectory file via `--reference-trajectory`
- Match modes: `strict`, `unordered`, `subset`, `superset`

Local-only mode:

- `--no-langsmith` in both the Python and JavaScript scripts (requires a local JSON dataset file)

### `scripts/evaluate_with_langsmith.py` / `.js`

Use for dataset-based evaluation runs and experiment tracking.

Supports:

- Existing dataset by name
- Dataset creation from a JSON examples file
- Multiple evaluators (`--evaluators accuracy,latency,...`)
- Concurrency control (`--max-concurrency`)

Python-only:

- `--no-upload` to run without uploading experiment results

### `scripts/compare_agents.py` / `.js`

Use for offline version comparisons:

- Shared dataset input
- Success/latency summaries
- JSON report output for CI artifacts
- Local JSON datasets or LangSmith datasets (the JS script supports `--no-langsmith` to disable remote loading)

### `scripts/mock_llm_responses.py` / `.js`

Use for deterministic test doubles:

- single
- sequence
- conditional

## Decision Rules

If behavior is deterministic and local:

- Use unit tests first.

If behavior depends on tool sequence/routing:

- Add trajectory evaluation.

If behavior depends on quality over a realistic input distribution:

- Run LangSmith dataset evaluation.

If approving a replacement model/prompt/graph:

- Run an A/B comparison and check both quality and latency.

## Common Failure Patterns

### Flaky tests

- Cause: real-model nondeterminism in unit scope.
- Fix: mock LLM/tool calls in unit tests; reserve real-model tests for separate integration marks.

### High trajectory variance

- Cause: overly strict matching for workflows with equivalent paths.
- Fix: switch the match mode (`unordered`, `subset`, or `superset`) where appropriate.

### Regressions hidden by averages

- Cause: only the aggregate score is monitored.
- Fix: inspect per-example failures and segment by category metadata.

### Latency regressions with the same quality

- Cause: no explicit latency gate.
- Fix: include a latency evaluator and a CI threshold.

## Minimal Best Practices

1. Keep fast deterministic tests as the largest share.
2. Version datasets and keep them stable.
3. Track both correctness and latency.
4. Add explicit go/no-go thresholds in CI (a gate sketch follows this list).
5. Compare candidate vs baseline before production rollout.
6. Investigate failures with trace-level evidence, not only aggregate scores.
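As one way to wire practices 4 and 5 into CI, here is a hypothetical go/no-go gate over the `comparison_report.json` emitted by the compare scripts. The report fields read below (`candidate`, `baseline`, `success_rate`, `avg_latency_ms`) are assumptions for illustration, not a documented schema; align them with the actual report before use:

```python
# Hypothetical CI regression gate over a compare_agents comparison report.
# All report field names are assumed for illustration; adjust to the real schema.
import json
import sys

MIN_SUCCESS_RATE = 0.90   # absolute quality floor for the candidate version
MAX_LATENCY_RATIO = 1.10  # candidate may be at most 10% slower than baseline

with open("comparison_report.json") as f:
    report = json.load(f)

candidate, baseline = report["candidate"], report["baseline"]

failures = []
if candidate["success_rate"] < MIN_SUCCESS_RATE:
    failures.append(f"success_rate {candidate['success_rate']:.2f} < {MIN_SUCCESS_RATE}")
if candidate["avg_latency_ms"] > baseline["avg_latency_ms"] * MAX_LATENCY_RATIO:
    failures.append("average latency regressed beyond the allowed ratio")

if failures:
    print("NO-GO:", "; ".join(failures))
    sys.exit(1)

print("GO: candidate passes the quality and latency gates")
```

Exiting nonzero makes the gate enforceable in any CI system without extra tooling.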