# RAG Evaluation Metrics - Complete Guide

## 🎯 Overview

Dingo's RAG evaluation metrics system is based on best practices from the [RAGAS paper](https://arxiv.org/abs/2309.15217), DeepEval, and TruLens, providing comprehensive RAG system evaluation capabilities.

### ✅ Supported Metrics (5/5)

| Metric | Evaluation Dimension | Required Fields | Source |
|--------|---------------------|-----------------|--------|
| **Faithfulness** | Answer Faithfulness | user_input, response, retrieved_contexts | RAGAS |
| **Answer Relevancy** | Answer Relevance | user_input, response | RAGAS |
| **Context Relevancy** | Context Relevance | user_input, retrieved_contexts | RAGAS + DeepEval + TruLens |
| **Context Recall** | Context Recall | user_input, retrieved_contexts, reference | RAGAS |
| **Context Precision** | Context Precision | user_input, retrieved_contexts, reference | RAGAS |

## 🚀 Quick Start

### 1. Run Examples

```bash
# Dataset mode - batch evaluation (recommended)
python examples/rag/dataset_rag_eval_baseline.py

# SDK mode - single evaluation
python examples/rag/sdk_rag_eval.py

# Simulate RAG system and evaluate
python examples/rag/e2e_RAG_eval_with_mockRAG_fiqa.py
```

### 2. SDK Mode - Single Evaluation

```python
import os
from dingo.config.input_args import EvaluatorLLMArgs, EmbeddingConfigArgs
from dingo.io.input import Data
from dingo.model.llm.rag.llm_rag_faithfulness import LLMRAGFaithfulness

# Configure LLM
LLMRAGFaithfulness.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "deepseek-chat"),
)

# Prepare data
data = Data(
    data_id="example_1",
    prompt="What is machine learning?",
    content="Machine learning is a branch of AI that enables computers to learn from data.",
    context=[
        "Machine learning is a subfield of AI.",
        "ML systems learn from data without explicit programming."
    ]
)

# Evaluate
result = LLMRAGFaithfulness.eval(data)

# View results
print(f"Score: {result.score}/10")
print(f"Passed: {not result.status}")  # status=False means passed
print(f"Reason: {result.reason[0]}")
```

### 3. Dataset Mode - Batch Evaluation

```python
from dingo.config import InputArgs
from dingo.exec import Executor

# Configuration
llm_config = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
}

llm_config_embedding = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
    "embedding_config": {  # ⭐ Required for Answer Relevancy
        "model": "text-embedding-3-large",
        "api_url": "https://api.openai.com/v1",
        "key": "YOUR_API_KEY"
    },
    "strictness": 3,
    "threshold": 5
}

input_data = {
    "task_name": "rag_evaluation",
    "input_path": "test/data/fiqa.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {"good": True, "bad": True, "all_labels": True}
    },
    "evaluator": [
        {
            "fields": {
                "prompt": "user_input",
                "content": "response",
                "context": "retrieved_contexts",
                "reference": "reference"
            },
            "evals": [
                {"name": "LLMRAGFaithfulness", "config": llm_config},
                {"name": "LLMRAGAnswerRelevancy", "config": llm_config_embedding},
                {"name": "LLMRAGContextRelevancy", "config": llm_config},
                {"name": "LLMRAGContextRecall", "config": llm_config},
                {"name": "LLMRAGContextPrecision", "config": llm_config}
            ]
        }
    ]
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()
```

## 📋 Data Format

### Required Fields

| Metric | user_input | response | retrieved_contexts | reference | Notes |
|--------|-----------|----------|-------------------|-----------|-------|
| **Faithfulness** | ✅ | ✅ | ✅ | - | Measures if answer is based on context |
| **Answer Relevancy** | ✅ | ✅ | - | - | Measures if answer addresses the question |
| **Context Relevancy** | ✅ | - | ✅ | - | Measures if retrieved contexts are relevant |
| **Context Recall** | ✅ | - | ✅ | ✅ | Measures if all needed info is retrieved |
| **Context Precision** | ✅ | - | ✅ | ✅ | Measures ranking quality of retrieved contexts |

### Data Example (JSONL)

```jsonl
{"user_input": "What is deep learning?", "response": "Deep learning uses neural networks...", "retrieved_contexts": ["Deep learning is a subset of ML...", "Deep learning is used for image recognition..."]}
{"user_input": "Python features?", "response": "Python is concise and has rich libraries.", "retrieved_contexts": ["Python has clean syntax.", "Python has NumPy and other libraries."], "reference": "Python has clean syntax and a rich ecosystem."}
```

## ⚙️ Configuration

### Configurable Parameters

| Parameter | Applicable Metrics | Default | Description |
|-----------|-------------------|---------|-------------|
| `threshold` | All metrics | 5.0 | Pass threshold (0-10) |
| `strictness` | Answer Relevancy | 3 | Number of questions to generate (1-5) |
| `embedding_config` | Answer Relevancy | - | **Required**: includes `model`, `api_url`, `key` |

### Embedding Configuration (Answer Relevancy)

`LLMRAGAnswerRelevancy` **requires `embedding_config`**:

**Option 1: Cloud LLM + Cloud Embedding**

```python
"config": {
    "model": "deepseek-chat",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.deepseek.com",
    "embedding_config": {  # ⭐ Required
        "model": "text-embedding-3-large",
        "api_url": "https://api.deepseek.com",
        "key": "YOUR_API_KEY"
    },
    "strictness": 3,
    "threshold": 5
}
```

**Option 2: Cloud LLM + Local Embedding (Recommended: Cost-effective)**

```python
"config": {
    "model": "deepseek-chat",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.deepseek.com",
    "embedding_config": {  # ⭐ Independent embedding service
        "model": "BAAI/bge-m3",
        "api_url": "http://localhost:8000/v1",  # Local vLLM/Xinference
        "key": "dummy-key"
    },
    "strictness": 3,
    "threshold": 5
}
```

**Deploy Local Embedding (vLLM)**:

```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-m3 \
  --port 8000 \
  --host 0.0.0.0
```

**What happens if not configured?**

Runtime exception:

```
ValueError: Embedding model not initialized. Please configure 'embedding_config' in your LLM config with:
  - model: embedding model name (e.g., 'BAAI/bge-m3')
  - api_url: embedding service URL
  - key: API key (optional for local services)
```

## 📊 Metric Details

### 1️⃣ Faithfulness (Answer Faithfulness)

**Evaluation Goal**: Measure if the answer is entirely based on retrieved context, avoiding hallucinations

**Calculation**:
1. Break down answer into independent statements (claims)
2. Judge if each statement is supported by context
3. Faithfulness score = (Supported statements / Total statements) × 10

**Formula**:
```
Faithfulness = (Context-supported claims / Total claims) × 10
```

**Recommended Threshold**: 7 (out of 10)

---

### 2️⃣ Answer Relevancy (Answer Relevance)

**Evaluation Goal**: Measure if the answer directly addresses the user question

**Calculation**:
1. Generate N reverse questions from the answer (questions inferred by LLM from the answer)
2. Calculate cosine similarity between embeddings of generated questions and original question
3. Answer Relevancy = Average of all similarities

**Formula**:
```
Answer Relevancy = (1/N) × Σ cosine_sim(E_gi, E_o)

Where:
- N: Number of generated questions, default 3 (adjustable via strictness parameter)
- E_gi: Embedding of the i-th generated question
- E_o: Embedding of the original question
```

**⚠️ Important**: This metric **requires `embedding_config`**:
- `model`: Embedding model name (e.g., `text-embedding-3-large`, `BAAI/bge-m3`)
- `api_url`: Embedding service address
- `key`: API key (optional for local services)

**Recommended Threshold**: 5 (out of 10)

---

### 3️⃣ Context Relevancy (Context Relevance)

**Evaluation Goal**: Measure if retrieved contexts are relevant to the question

**Calculation**:
Uses a **Dual-Judge System** from NVIDIA research:

**Judge 1 Scoring**:
- **0** = Context completely irrelevant
- **1** = Context partially relevant
- **2** = Context fully relevant

**Judge 2 Scoring**:
- Uses different prompt wording for another perspective
- Same 0-2 scoring standard
- Purpose: Reduce single-prompt bias

**Final Score**:
```
Context Relevancy = (Relevant contexts / Total contexts) × 10

Where:
- Relevant context: Average score from both judges ≥ threshold (default 1.0)
- Irrelevant context: Average score < threshold
```

**Recommended Threshold**: 5 (out of 10)

---

### 4️⃣ Context Recall (Context Recall)

**Evaluation Goal**: Measure if all needed information is retrieved (requires reference answer)

**Calculation**:
1. Extract independent statements from reference answer
2. Judge if each statement can be attributed from retrieved contexts
3. Recall = (Context-supported reference statements / Total reference statements) × 10

**Formula**:
```
Context Recall = (Context-supported reference claims / Total reference claims) × 10
```

**Note**: **Requires reference answer (reference)**, typically used in evaluation phase

**Recommended Threshold**: 5 (out of 10)

---

### 5️⃣ Context Precision (Context Precision)

**Evaluation Goal**: Measure ranking quality of retrieval results, whether relevant docs are at the top (requires reference answer)

**Calculation**:
1. For each position k, judge if the context is relevant (supports reference answer)
2. Calculate Precision@k for each position
3. Use relevance indicator (v_k) for weighted sum

**Formula**:
```
Context Precision = Σ(Precision@k × v_k) / Total relevant items in top K

Where:
- K: Total retrieved documents, e.g., 5 documents
- k: Current position (1st, 2nd, 3rd, ..., K-th)
- v_k: Relevance indicator, 0 (irrelevant) or 1 (relevant)
- Precision@k: Precision in first k documents, 0.0 to 1.0
- Precision@k = Relevant count in first k / k
```

**Note**: **Requires reference answer (reference)** to judge which contexts are relevant

**Recommended Threshold**: 5 (out of 10)

## 🌟 Best Practices

### 1. Metric Combinations

**Complete Evaluation** (5 metrics):
```python
"evals": [
    {"name": "LLMRAGFaithfulness"},       # Detect hallucinations
    {"name": "LLMRAGAnswerRelevancy"},    # Check answer relevance
    {"name": "LLMRAGContextRelevancy"},   # Check context noise
    {"name": "LLMRAGContextRecall"},      # Evaluate retrieval completeness
    {"name": "LLMRAGContextPrecision"}    # Evaluate retrieval ranking
]
```

**Production Environment** (no reference needed):
```python
"evals": [
    {"name": "LLMRAGFaithfulness"},       # ⭐ Most important: prevent hallucinations
    {"name": "LLMRAGAnswerRelevancy"},    # Ensure direct answers
    {"name": "LLMRAGContextRelevancy"}    # Check retrieval noise
]
```

**Evaluation Phase** (requires reference):
```python
"evals": [
    {"name": "LLMRAGContextRecall"},      # Evaluate retrieval completeness
    {"name": "LLMRAGContextPrecision"}    # Evaluate retrieval ranking
]
```

### 2. Threshold Adjustment

Adjust thresholds (default 5) based on scenario:

- **Strict scenarios** (finance, medical): threshold 7-8
- **General scenarios** (Q&A systems): threshold 5-6
- **Loose scenarios** (exploratory search): threshold 3-4

### 3. Iterative Optimization

1. **Initial Evaluation**: Evaluate current system with all 5 metrics
2. **Identify Issues**:
   - **Low Faithfulness** → Generation model produces hallucinations
     - Optimize: Adjust generation prompts, use stronger models, enhance fact-checking
   - **Low Answer Relevancy** → Answer off-topic or contains irrelevant info
     - Optimize: Improve generation prompts, limit answer length, enhance question understanding
   - **Low Context Relevancy** → Retrieval introduces noise
     - Optimize: Improve retrieval algorithm, adjust similarity threshold, improve embedding model
   - **Low Context Recall** → Retrieval misses important info
     - Optimize: Increase Top-K, improve query rewriting, expand knowledge base
   - **Low Context Precision** → Relevant docs ranked lower
     - Optimize: Improve ranking algorithm, adjust reranker, improve relevance calculation
3. **Targeted Optimization**: Adjust components based on issues
4. **Re-evaluate**: Verify optimization effects
5. **Continuous Monitoring**: Monitor key metrics in production

### 4. Important Notes

- **LLM Dependency**: All metrics depend on LLM API, requiring correct API key and endpoint
- **Embedding Dependency**:
  - Answer Relevancy **requires `embedding_config`**: `model`, `api_url`, `key`
  - Can use cloud services (OpenAI, DeepSeek) or local deployment (vLLM, Xinference)
  - Not configuring will throw exception: `ValueError: Embedding model not initialized...`
- **Cost Considerations**: Evaluation generates API costs, recommendations:
  - Development: Sample evaluation (50-100 samples)
  - Production: Use key metrics only (Faithfulness, Answer Relevancy, Context Relevancy)
  - Evaluation: Full evaluation of all metrics
- **Reference Requirements**:
  - Context Recall and Context Precision **require** reference
  - Other three metrics don't need reference
  - Reference mainly used in evaluation phase, production usually doesn't need it

## 📖 For More Details

See the [Chinese version](rag_evaluation_metrics_zh.md) for comprehensive examples and detailed explanations.