---
name: langsmith-evaluator
description: Use this skill for ANY question about CREATING evaluators. Covers creating custom metrics, LLM as Judge evaluators, code-based evaluators, and uploading evaluation logic to LangSmith. Does NOT cover RUNNING evaluations.
---

# LangSmith Evaluator

Create evaluators to measure agent performance on your datasets. LangSmith supports two types: **LLM as Judge** (uses an LLM to grade outputs) and **Custom Code** (deterministic logic).

## Setup

### Environment Variables

```bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here  # Required
LANGSMITH_WORKSPACE_ID=your-workspace-id     # Optional: for org-scoped keys
OPENAI_API_KEY=your_openai_key               # For LLM as Judge
```

### Dependencies

```bash
pip install langsmith langchain-openai python-dotenv
```

## Evaluator Format

Evaluators support two function signatures:

**Method 1: Dict parameters (for running evaluations locally):**

```python
def evaluator_name(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """Evaluate a single prediction."""
    user_query = inputs.get("query", "")
    agent_response = outputs.get("expected_response", "")
    expected = reference_outputs.get("expected_response", "") if reference_outputs else None

    return {
        "key": "metric_name",    # Metric identifier
        "score": 0.85,           # Number or boolean
        "comment": "Reason..."   # Optional explanation
    }
```

**Method 2: Run/Example parameters (for uploading to LangSmith):**

```python
def evaluator_name(run, example):
    """Evaluate using run/example dicts.

    Args:
        run: Dict with run["outputs"] containing agent outputs
        example: Dict with example["outputs"] containing expected outputs
    """
    agent_response = run["outputs"].get("expected_response", "")
    expected = example["outputs"].get("expected_response", "")

    return {
        "metric_name": 0.85,     # Metric name as key directly
        "comment": "Reason..."   # Optional explanation
    }
```
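Because Method 1 evaluators are plain functions over dicts, you can sanity-check one locally before wiring it into an evaluation. A minimal sketch (the `length_evaluator` name and the 500-character limit are illustrative, not part of LangSmith):

```python
def length_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """Score 1 if the response stays under 500 characters."""
    response = outputs.get("expected_response", "")
    within_limit = len(response) <= 500
    return {
        "key": "within_length_limit",
        "score": 1 if within_limit else 0,
        "comment": f"Response length: {len(response)} chars",
    }

# Call the evaluator directly on hand-written dicts to verify its behavior
result = length_evaluator(
    inputs={"query": "What is LangSmith?"},
    outputs={"expected_response": "LangSmith is a platform for tracing and evaluating LLM apps."},
)
assert result["score"] == 1
```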
## LLM as Judge Evaluators

Use structured output for reliable grading:

```python
from typing import TypedDict, Annotated
from langchain_openai import ChatOpenAI

class AccuracyGrade(TypedDict):
    """Structured evaluation output."""
    reasoning: Annotated[str, ..., "Explain your reasoning"]
    is_accurate: Annotated[bool, ..., "True if response is accurate"]
    confidence: Annotated[float, ..., "Confidence 0.0-1.0"]

# Configure model with structured output
judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(
    AccuracyGrade, method="json_schema", strict=True
)

async def accuracy_evaluator(run, example):
    """Evaluate factual accuracy for LangSmith upload."""
    expected = example["outputs"].get("expected_response", "")
    agent_output = run["outputs"].get("expected_response", "")

    prompt = f"""Expected: {expected}

Agent Output: {agent_output}

Evaluate accuracy:"""

    grade = await judge.ainvoke([{"role": "user", "content": prompt}])
    return {
        "accuracy": 1 if grade["is_accurate"] else 0,
        "comment": f"{grade['reasoning']} (confidence: {grade['confidence']})"
    }
```

**Common Metrics:** Completeness, correctness, helpfulness, professionalism

## Custom Code Evaluators

### Exact Match

```python
def exact_match_evaluator(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()

    match = output == expected
    return {
        "exact_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
```

### Trajectory Validation

```python
def trajectory_evaluator(run, example):
    """Evaluate tool call sequence."""
    trajectory = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])

    # Exact sequence match
    exact = trajectory == expected

    # All required tools used (order-agnostic)
    all_tools = set(expected).issubset(set(trajectory))

    # Efficiency: count extra steps
    extra_steps = len(trajectory) - len(expected)

    return {
        "trajectory_match": 1 if exact else 0,
        "comment": f"Exact: {exact}, All tools: {all_tools}, Extra: {extra_steps}"
    }
```

### Single Step Validation

```python
def single_step_evaluator(run, example):
    """Evaluate single node output."""
    output = run["outputs"].get("output", {})
    expected = example["outputs"].get("expected_output", {})
    node_name = run["outputs"].get("node_name", "")

    # For classification nodes
    if "classification" in node_name:
        classification = output.get("classification", "")
        expected_class = expected.get("classification", "")
        match = classification.lower() == expected_class.lower()
        return {
            "classification_correct": 1 if match else 0,
            "comment": f"Output: {classification}, Expected: {expected_class}"
        }

    # For other nodes
    match = output == expected
    return {
        "output_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
```

## Running Evaluations

```python
from langsmith import Client

client = Client()

# Define your agent function
def run_agent(inputs: dict) -> dict:
    """Your agent invocation logic."""
    result = your_agent.invoke(inputs)
    return {"expected_response": result}

# Run evaluation
results = await client.aevaluate(
    run_agent,
    data="Skills: Final Response",  # Dataset name
    evaluators=[
        exact_match_evaluator,
        accuracy_evaluator,
        trajectory_evaluator
    ],
    experiment_prefix="skills-eval-v1",
    max_concurrency=4
)
```
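Note that `aevaluate` is a coroutine, so the bare `await` above assumes an async context such as a notebook. A minimal sketch of driving it from a plain script, reusing `run_agent` and `exact_match_evaluator` from the examples above:

```python
import asyncio
from langsmith import Client

async def main():
    client = Client()
    # Same call as above, wrapped in a coroutine so await is legal
    results = await client.aevaluate(
        run_agent,
        data="Skills: Final Response",
        evaluators=[exact_match_evaluator],
        experiment_prefix="skills-eval-v1",
        max_concurrency=4,
    )
    return results

# asyncio.run provides the event loop the coroutine needs
asyncio.run(main())
```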
## Upload Evaluators to LangSmith

The upload script is a utility for deploying your custom evaluators to LangSmith. Write evaluators specific to your use case, then upload them. Navigate to `skills/langsmith-evaluator/scripts/` to upload evaluators.

**Important:** The LangSmith API requires evaluators to use the `(run, example)` signature, where:

- `run`: dict with `run["outputs"]` containing agent outputs
- `example`: dict with `example["outputs"]` containing expected outputs

### Create Evaluator File

```python
# my_project/evaluators/custom_evals.py

def my_custom_evaluator(run, example):
    """Your custom evaluation logic.

    Args:
        run: Dict with run["outputs"] - agent outputs
        example: Dict with example["outputs"] - expected outputs

    Returns:
        Dict with metric_name as key, score as value, optional comment
    """
    # Extract relevant data
    agent_output = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])

    # Your custom logic here
    match = agent_output == expected

    return {
        "my_metric": 1 if match else 0,
        "comment": "Custom reasoning here"
    }
```

### Upload

```bash
# List existing evaluators
python upload_evaluators.py list

# Upload evaluator
python upload_evaluators.py upload my_evaluators.py \
  --name "Trajectory Match" \
  --function trajectory_match \
  --dataset "Skills: Trajectory" \
  --replace

# Delete evaluator (will prompt for confirmation)
python upload_evaluators.py delete "Trajectory Match"

# Skip confirmation prompts (use with caution)
python upload_evaluators.py delete "Trajectory Match" --yes
python upload_evaluators.py upload my_evaluators.py \
  --name "Trajectory Match" \
  --function trajectory_match \
  --replace --yes
```

**Options:**

- `--name` - Display name in LangSmith
- `--function` - Function name to extract
- `--dataset` - Target dataset name
- `--project` - Target project name
- `--sample-rate` - Sampling rate (0.0-1.0)
- `--replace` - Replace if exists (will prompt for confirmation)
- `--yes` - Skip confirmation prompts for replace/delete operations

**IMPORTANT - Safety Prompts:**

- The script prompts for confirmation before any destructive operation (delete, replace)
- **ALWAYS respect these prompts** - wait for user input before proceeding
- **NEVER use the `--yes` flag unless the user explicitly requests it**
- The `--yes` flag skips all safety prompts and should only be used in automated workflows when explicitly authorized by the user

## Best Practices

1. **Use structured output for LLM judges** - More reliable than parsing free text
2. **Match evaluator to dataset type**
   - Final Response → LLM as Judge for quality, Custom Code for format
   - Single Step → Custom Code for exact match
   - Trajectory → Custom Code for sequence/efficiency
3. **Combine multiple evaluators** - Run both subjective (LLM) and objective (code) checks
4. **Use async for LLM judges** - Enables parallel evaluation, much faster
5. **Test evaluators independently** - Validate on known good/bad examples first (see the sketch after the example workflow below)
6. **Upload to LangSmith** - Automatic evaluation on new runs

## Example Workflow

```bash
# 1. Create evaluators file
cat > evaluators.py <<'EOF'
def exact_match(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()

    match = output == expected
    return {
        "exact_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
EOF

# 2. Upload to LangSmith
python upload_evaluators.py upload evaluators.py \
  --name "Exact Match" \
  --function exact_match \
  --dataset "Skills: Final Response" \
  --replace

# 3. Evaluator runs automatically on new dataset runs
```
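As a minimal sketch of the "test evaluators independently" practice, you can exercise an evaluator on hand-built run/example dicts before uploading it. Here `exact_match` is the evaluator from the workflow above; the test function name and sample values are illustrative:

```python
def test_exact_match():
    # The dicts mimic the run/example shapes LangSmith passes at runtime
    example = {"outputs": {"expected_response": "Paris"}}

    # Known-good case: matches after strip/lower normalization
    run_good = {"outputs": {"expected_response": "  paris  "}}
    assert exact_match(run_good, example)["exact_match"] == 1

    # Known-bad case: wrong answer scores 0
    run_bad = {"outputs": {"expected_response": "London"}}
    assert exact_match(run_bad, example)["exact_match"] == 0

test_exact_match()
print("evaluator behaves as expected on known good/bad examples")
```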
## Resources

- [LangSmith Evaluation Concepts](https://docs.langchain.com/langsmith/evaluation-concepts)
- [Custom Code Evaluators](https://changelog.langchain.com/announcements/custom-code-evaluators-in-langsmith)
- [OpenEvals - Ready-made Evaluators](https://github.com/langchain-ai/openevals)

## Related Skills

- Use the **langsmith-trace** skill to query and export traces
- Use the **langsmith-dataset** skill to generate evaluation datasets from traces