---
name: LLM Judge Patterns
description: Comprehensive guide to using LLMs as judges for automated evaluation including prompt patterns, calibration, bias reduction, and multi-judge ensembles
---

# LLM Judge Patterns

## What is LLM-as-Judge?

**Definition:** Using LLMs (e.g., GPT-4 or Claude) to evaluate other LLM outputs automatically.

### Model

```
Input: Question + Answer (to evaluate)
Judge LLM: GPT-4 or Claude
Output: Score + Reasoning

Example:
Question: "What is the capital of France?"
Answer: "Paris is the capital of France."
Judge: "Score: 5/5 - Correct, concise, directly answers the question"
```

---

## Why LLM-as-Judge?

### Human Eval is Slow and Expensive

**Comparison:**

```
Human evaluation:
- 100 answers × 5 min each = 500 min = 8.3 hours
- Cost: $20/hour × 8.3 hours = $166

LLM-as-judge:
- 100 answers × 2 sec each = 200 sec = 3.3 min
- Cost: 100 × $0.01 = $1
```

### Need to Evaluate Thousands of Outputs

**Scale:**

```
Development: Test 1000+ variations
Production: Evaluate millions of responses

Human eval: Impossible at this scale
LLM-judge: Feasible
```

### Research Shows High Correlation with Human Judgment

**Studies:**
- GPT-4 as judge correlates 0.8+ with human ratings
- Works well for subjective quality (fluency, helpfulness)
- Less reliable for factual correctness

### Enables Continuous Evaluation

**Workflow:**

```
Every response → LLM judge → Score logged → Dashboard
Detect regressions in real-time
```

---

## When to Use LLM-as-Judge

### Subjective Quality (Fluency, Relevance, Helpfulness)

**Good Use Cases:**

```
- Is this answer helpful?
- Is this text fluent and natural?
- Is this response relevant to the question?
- Is this summary coherent?
```

### Complex Rubrics (Multi-Criteria)

**Example:**

```
Evaluate on:
1. Accuracy (1-5)
2. Completeness (1-5)
3. Clarity (1-5)
4. Tone (1-5)

LLM can handle multi-dimensional evaluation
```

### Large-Scale Evaluation

**When:**

```
Need to evaluate 1000+ examples
Human eval too slow/expensive
```

### Rapid Iteration

**Development:**

```
Test 10 prompt variations
Evaluate each on 100 examples

LLM-judge: Minutes
Human eval: Days
```

---

## When NOT to Use LLM-as-Judge

### Objective Correctness (Factual Answers)

**Problem:**

```
Question: "What is 2+2?"
Answer: "5"

LLM judge might say: "The answer is clear and confident" (wrong!)

Better: Exact match or computation
```

### Mathematical Reasoning (Verify with Computation)

**Better Approach:**

```
Execute code to verify answer
Not: Ask LLM if math is correct
```

### Code Correctness (Run Tests)

**Better Approach:**

```
Run unit tests
Check if code compiles
Not: Ask LLM if code is correct
```

### Safety-Critical (Use Human Evaluation)

**Examples:**

```
Medical advice
Legal guidance
Financial recommendations

→ Always use human experts
```

---

## Judge Model Selection

### GPT-4 (Most Commonly Used)

**Pros:**
- High-quality judgments
- Good correlation with humans
- Widely tested

**Cons:**
- Expensive ($0.03/1K tokens)
- Can be slow

### Claude Sonnet 4 (Excellent Reasoning)

**Pros:**
- Excellent reasoning
- Good for complex evaluations
- Fast

**Cons:**
- Expensive
- Less tested than GPT-4

### GPT-3.5 (Cheaper, Less Accurate)

**Pros:**
- Cheap ($0.001/1K tokens)
- Fast

**Cons:**
- Less accurate
- More biased

### Open-Source (Llama, Mixtral)

**Pros:**
- Free (if self-hosted)
- Privacy (on-prem)

**Cons:**
- Lower quality
- Requires infrastructure

---

## Judge Prompt Patterns

### Single-Answer Grading

**Pattern:**

```
You are evaluating an AI assistant's response.

Question: {question}
Answer: {answer}

Rate the answer on a scale of 1-5:
1 = Poor
5 = Excellent

Consider:
- Accuracy
- Relevance
- Completeness

Score:
```

**Example:**

```python
def single_answer_grading(question, answer):
    prompt = f"""
You are evaluating an AI assistant's response.

Question: {question}
Answer: {answer}

Rate the answer on a scale of 1-5:
1 = Poor (incorrect, irrelevant, or incomplete)
5 = Excellent (correct, relevant, and complete)

Provide:
- Score (1-5)
- Brief reasoning

Format:
Score: [number]
Reasoning: [explanation]
"""
    response = llm.generate(prompt)
    score = extract_score(response)
    return score
```
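The example above calls an `extract_score` helper that is not defined anywhere in this guide. A minimal sketch of what it might look like, assuming the judge followed the `Score: [number]` format requested in the prompt (the clamping to the 1-5 range is an extra safeguard, not part of the original example):

```python
import re

def extract_score(response):
    """Parse 'Score: <number>' out of a judge response.

    Returns None if no score line is found, so the caller can retry
    or flag the response for manual review.
    """
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", response, re.IGNORECASE)
    if match is None:
        return None
    score = float(match.group(1))
    # Clamp to the 1-5 rubric in case the judge drifts off-scale
    return min(max(score, 1.0), 5.0)
```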
### Pairwise Comparison (A vs B)

**Pattern:**

```
Which answer is better?

Question: {question}

Answer A: {answer_a}
Answer B: {answer_b}

Which is better? A or B? Explain why.
```

**More Reliable:**

```
Pairwise comparison reduces absolute scoring bias
Humans also find comparisons easier than absolute ratings
```

**Example:**

```python
def pairwise_comparison(question, answer_a, answer_b):
    prompt = f"""
Question: {question}

Answer A: {answer_a}
Answer B: {answer_b}

Which answer is better? A or B?

Consider:
- Accuracy
- Relevance
- Clarity

Respond with:
- Winner: A or B
- Reasoning: Why is it better?

Format:
Winner: [A or B]
Reasoning: [explanation]
"""
    response = llm.generate(prompt)
    winner = extract_winner(response)
    return winner
```

**Aggregate via Elo Ratings:**

```python
# After many pairwise comparisons
# Calculate Elo rating for each model
# Higher Elo = better model
```
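One way to turn many pairwise verdicts into per-model ratings is a standard Elo update. This is a rough sketch rather than the only option: the K-factor of 32 and the starting rating of 1000 are conventional defaults assumed here, and `comparisons` is a hypothetical list of (winner, loser) model-name pairs produced by the judge.

```python
def update_elo(rating_winner, rating_loser, k=32.0):
    """Standard Elo update after one pairwise comparison."""
    expected_win = 1.0 / (1.0 + 10 ** ((rating_loser - rating_winner) / 400))
    rating_winner += k * (1.0 - expected_win)
    rating_loser -= k * (1.0 - expected_win)
    return rating_winner, rating_loser

def elo_from_comparisons(comparisons, start=1000.0):
    """comparisons: list of (winner_model, loser_model) name pairs."""
    ratings = {}
    for winner, loser in comparisons:
        r_winner = ratings.get(winner, start)
        r_loser = ratings.get(loser, start)
        ratings[winner], ratings[loser] = update_elo(r_winner, r_loser)
    return ratings  # higher Elo = better model
```

Ties are simply ignored in this sketch; handling them would require the 0.5-score variant of the update.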
### Multi-Aspect Evaluation (Rubric)

**Pattern:**

```
Evaluate on multiple criteria:
1. Accuracy (1-5)
2. Relevance (1-5)
3. Completeness (1-5)
4. Clarity (1-5)

Score each separately
```

**Example:**

```python
def multi_aspect_evaluation(question, answer):
    prompt = f"""
Question: {question}
Answer: {answer}

Evaluate on these criteria (1-5 scale):

1. Accuracy: Is the information correct?
   1 = Incorrect, 5 = Perfectly accurate

2. Relevance: Does it answer the question?
   1 = Irrelevant, 5 = Highly relevant

3. Completeness: Does it cover all aspects?
   1 = Incomplete, 5 = Comprehensive

4. Clarity: Is it clear and well-written?
   1 = Confusing, 5 = Very clear

Provide scores and brief reasoning for each.

Format:
Accuracy: [score] - [reasoning]
Relevance: [score] - [reasoning]
Completeness: [score] - [reasoning]
Clarity: [score] - [reasoning]
Overall: [average score]
"""
    response = llm.generate(prompt)
    scores = extract_scores(response)
    return scores
```

### Chain-of-Thought Judging

**Pattern:**

```
First, explain your reasoning
Then, provide score

This increases reliability
```

**Example:**

```python
def cot_judging(question, answer):
    prompt = f"""
Question: {question}
Answer: {answer}

Evaluate this answer step by step:

Step 1: Is the answer factually correct?
Step 2: Does it fully address the question?
Step 3: Is it clear and well-written?

Based on your analysis, rate the answer (1-5).

Format:
Step 1: [analysis]
Step 2: [analysis]
Step 3: [analysis]
Final Score: [number]
"""
    response = llm.generate(prompt)
    return response
```

---

## Judge Prompt Template

**Comprehensive Template:**

```
You are an expert evaluator assessing AI assistant responses.

Question: {question}
Answer: {answer}
{optional: Ground Truth: {ground_truth}}
{optional: Context: {context}}

Evaluate the answer on these criteria:

1. **Accuracy** (1-5): Is the information factually correct?
   - 1 = Completely incorrect
   - 3 = Partially correct
   - 5 = Fully correct

2. **Relevance** (1-5): Does it address the question?
   - 1 = Completely irrelevant
   - 3 = Partially relevant
   - 5 = Directly addresses question

3. **Completeness** (1-5): Does it cover all aspects?
   - 1 = Missing most information
   - 3 = Covers some aspects
   - 5 = Comprehensive

4. **Clarity** (1-5): Is it clear and well-written?
   - 1 = Confusing or poorly written
   - 3 = Acceptable clarity
   - 5 = Very clear and well-written

Provide:
- Score for each criterion (1-5)
- Brief reasoning for each score
- Overall score (average of all criteria)

Format:
Accuracy: [score] - [reasoning]
Relevance: [score] - [reasoning]
Completeness: [score] - [reasoning]
Clarity: [score] - [reasoning]
Overall: [average score]
```

---

## Judge Calibration

### Compare Judge Scores to Human Scores

**Process:**

```
1. Get 100 examples
2. Human annotators rate each (1-5)
3. LLM judge rates each (1-5)
4. Calculate correlation
```

**Correlation:**

```python
from scipy.stats import pearsonr

human_scores = [4, 5, 3, 4, 2, ...]
judge_scores = [4.2, 4.8, 3.1, 4.5, 2.3, ...]

correlation, p_value = pearsonr(human_scores, judge_scores)
print(f"Correlation: {correlation:.2f}")

# Target: >0.7 (good correlation)
# If <0.7: Adjust prompt or use different judge
```

### Calculate Correlation

See above.

### Adjust Prompt if Low Correlation

**If correlation <0.7:**

```
1. Analyze disagreements (where judge differs from human)
2. Update prompt to address issues
3. Re-test correlation
4. Iterate until >0.7
```

### Test on Multiple Examples

**Validation Set:**

```
Use 100-500 examples with human ratings
Ensure diversity (easy, hard, edge cases)
```

---

## Reducing Judge Bias

### Position Bias (Favors First Option in A/B)

**Problem:**

```
Judge tends to prefer Answer A over Answer B
Even when B is better
```

**Mitigation:**

```python
# Randomize order so neither answer is always presented first
import random

if random.random() < 0.5:
    winner = compare(question, answer_a, answer_b)
else:
    # Present in swapped order, then map the verdict back
    winner = compare(question, answer_b, answer_a)
    winner = "A" if winner == "B" else "B"  # Flip
```
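A stricter variant of order randomization is to run the comparison in both orders and only accept a verdict that survives the swap. This sketch reuses the `pairwise_comparison` function from earlier; treating disagreement as a tie is an assumption of this sketch, not something prescribed above.

```python
def debiased_comparison(question, answer_a, answer_b):
    """Run the judge in both presentation orders and keep only consistent verdicts."""
    forward = pairwise_comparison(question, answer_a, answer_b)
    swapped = pairwise_comparison(question, answer_b, answer_a)
    swapped_mapped = "A" if swapped == "B" else "B"  # map back to the original labels

    if forward == swapped_mapped:
        return forward
    return "tie"  # verdict flipped with the order: likely position bias
```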
### Length Bias (Favors Longer Answers)

**Problem:**

```
Judge tends to prefer longer answers
Even if shorter answer is better
```

**Mitigation:**

```
Prompt: "Do not favor longer answers. Concise answers can be better."

Or: Normalize scores by length
```

### Self-Preference Bias (Favors Own Outputs)

**Problem:**

```
GPT-4 as judge tends to prefer GPT-4 outputs
Over Claude outputs
```

**Mitigation:**

```
Use external judge (Claude to judge GPT-4)
Or: Blind evaluation (don't reveal which model)
```

---

## Multi-Judge Ensemble

### Use Multiple Judges (GPT-4 + Claude)

**Approach:**

```python
def multi_judge_ensemble(question, answer):
    # Judge 1: GPT-4
    score_gpt4 = gpt4_judge(question, answer)

    # Judge 2: Claude
    score_claude = claude_judge(question, answer)

    # Judge 3: GPT-3.5 (cheaper, as tiebreaker)
    score_gpt35 = gpt35_judge(question, answer)

    return {
        "gpt4": score_gpt4,
        "claude": score_claude,
        "gpt35": score_gpt35
    }
```

### Aggregate Scores (Majority Vote, Average)

**Majority Vote:**

```python
scores = [4, 5, 4]  # Three judges
majority = max(set(scores), key=scores.count)  # 4
```

**Average:**

```python
scores = [4.2, 4.8, 4.5]
average = sum(scores) / len(scores)  # 4.5
```

**Weighted Average:**

```python
scores = {"gpt4": 4.8, "claude": 4.5, "gpt35": 4.0}
weights = {"gpt4": 0.5, "claude": 0.4, "gpt35": 0.1}

weighted_avg = sum(scores[j] * weights[j] for j in scores)  # 4.6
```

### Increases Reliability

**Why:**

```
Single judge can be wrong
Multiple judges reduce variance
Ensemble is more robust
```

---

## Cost Optimization

### Use Cheaper Judge for Initial Filtering

**Two-Stage:**

```
Stage 1: GPT-3.5 judge (cheap, fast)
- Filter out clearly bad answers (score <3)

Stage 2: GPT-4 judge (expensive, accurate)
- Evaluate borderline cases (score 3-4)
```

### Use Expensive Judge for Borderline Cases

See above; a sketch of the two-stage flow follows.
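A minimal sketch of the two-stage flow described above, assuming the `gpt35_judge` and `gpt4_judge` functions from the ensemble example return 1-5 scores; the cutoff of 3 mirrors the description, but where exactly to draw it is a tuning decision.

```python
def two_stage_judge(question, answer):
    """Stage 1: cheap judge filters out clearly bad answers.
    Stage 2: expensive judge re-scores the remaining cases."""
    cheap_score = gpt35_judge(question, answer)

    if cheap_score < 3:
        # Clearly bad: no need to spend GPT-4 tokens confirming it
        return {"score": cheap_score, "judge": "gpt-3.5"}

    # Borderline or good: get a more accurate score from the stronger judge
    expensive_score = gpt4_judge(question, answer)
    return {"score": expensive_score, "judge": "gpt-4"}
```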
### Cache Judge Results

**Caching:**

```python
import hashlib

cache = {}

def cached_judge(question, answer):
    # Create cache key
    key = hashlib.md5(f"{question}{answer}".encode()).hexdigest()

    # Check cache
    if key in cache:
        return cache[key]

    # Call judge
    score = llm_judge(question, answer)

    # Cache result
    cache[key] = score
    return score
```

---

## Judge Evaluation Frameworks

### G-Eval (Using GPT-4)

**Paper:** "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"

**Approach:**

```
Use GPT-4 to generate evaluation criteria
Then use GPT-4 to evaluate based on those criteria
```

### Prometheus (Using Llama)

**Open-Source Judge:**

```
Fine-tuned Llama model for evaluation
Free to use
Lower quality than GPT-4 but no API costs
```

### Custom Implementation

See examples throughout this document.

---

## Metrics to Track

### Judge-Human Correlation

**Target:** >0.7

**Calculation:**

```python
from scipy.stats import pearsonr

correlation, p_value = pearsonr(human_scores, judge_scores)
```

### Inter-Judge Agreement (If Multiple Judges)

**Kappa Score:**

```python
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(judge1_scores, judge2_scores)
# >0.7 = good agreement
```

### Judge Consistency (Same Input → Same Output)

**Test:**

```python
import numpy as np

# Evaluate same example 10 times
scores = [judge(question, answer) for _ in range(10)]

# Calculate variance
variance = np.var(scores)
# Low variance = consistent judge
```

---

## Real-World Judge Use Cases

### RAG Answer Evaluation

See RAG Evaluation skill.

### Chatbot Response Quality

**Criteria:**
- Helpfulness
- Relevance
- Safety
- Tone

### Content Moderation

**Criteria:**
- Toxicity
- Hate speech
- Misinformation
- Spam

### Translation Quality

**Criteria:**
- Accuracy
- Fluency
- Preserves meaning

### Summarization Quality

**Criteria:**
- Completeness
- Conciseness
- Accuracy

---

## Limitations

### Judge Can Be Wrong (Validate with Humans)

**Always:**

```
Spot-check judge results with human evaluation
Don't blindly trust the judge
```

### Expensive (API Costs)

**Cost:**

```
1,000 evaluations × $0.01 = $10
10,000 evaluations × $0.01 = $100

Can add up quickly
```

### Judge Bias (Needs Careful Prompting)

See "Reducing Judge Bias" section.

### Not Suitable for All Tasks

See "When NOT to Use" section.

---

## Implementation

### Judge Prompt Templates

See "Judge Prompt Template" section.

### Multi-Judge Aggregation

See "Multi-Judge Ensemble" section.

### Calibration Scripts

```python
from scipy.stats import pearsonr

def calibrate_judge(judge_fn, test_set):
    """
    test_set: List of (question, answer, human_score)
    """
    judge_scores = []
    human_scores = []

    for question, answer, human_score in test_set:
        judge_score = judge_fn(question, answer)
        judge_scores.append(judge_score)
        human_scores.append(human_score)

    correlation, p_value = pearsonr(human_scores, judge_scores)

    return {
        "correlation": correlation,
        "p_value": p_value,
        "judge_scores": judge_scores,
        "human_scores": human_scores
    }
```

---

## Summary

### Quick Reference

**LLM-as-Judge:** Use LLMs to evaluate other LLM outputs

**Why:**
- Fast and cheap vs human eval
- Scales to thousands of examples
- High correlation with humans (>0.8)

**When to Use:**
- Subjective quality
- Complex rubrics
- Large-scale evaluation

**When NOT:**
- Objective correctness
- Math/code (use computation)
- Safety-critical (use humans)

**Judge Models:**
- GPT-4 (best quality)
- Claude (excellent reasoning)
- GPT-3.5 (cheaper)
- Open-source (free but lower quality)

**Prompt Patterns:**
- Single-answer grading
- Pairwise comparison (more reliable)
- Multi-aspect (rubric)
- Chain-of-thought (increases reliability)

**Bias Reduction:**
- Position bias: Randomize order
- Length bias: Normalize or prompt
- Self-preference: External judge

**Multi-Judge:**
- Use multiple judges
- Aggregate (majority vote, average)
- Increases reliability

**Cost Optimization:**
- Cheap judge for filtering
- Expensive judge for borderline cases
- Cache results

**Calibration:**
- Compare to human scores
- Target correlation >0.7
- Adjust prompt if low

**Limitations:**
- Can be wrong (validate with humans)
- Expensive (API costs)
- Biased (careful prompting required)