# RAG Evaluation Metrics - Complete Guide

## 🎯 Overview

Dingo's RAG evaluation metrics follow best practices from the [RAGAS paper](https://arxiv.org/abs/2309.15217), DeepEval, and TruLens, and cover both the retrieval and the generation side of a RAG system.

### ✅ Supported Metrics (5/5)

| Metric | Evaluation Dimension | Required Fields | Source |
|--------|---------------------|-----------------|--------|
| **Faithfulness** | Answer Faithfulness | user_input, response, retrieved_contexts | RAGAS |
| **Answer Relevancy** | Answer Relevance | user_input, response | RAGAS |
| **Context Relevancy** | Context Relevance | user_input, retrieved_contexts | RAGAS + DeepEval + TruLens |
| **Context Recall** | Context Recall | user_input, retrieved_contexts, reference | RAGAS |
| **Context Precision** | Context Precision | user_input, retrieved_contexts, reference | RAGAS |

## 🚀 Quick Start

### 1. Run Examples

```bash
# Dataset mode - batch evaluation (recommended)
python examples/rag/dataset_rag_eval_baseline.py

# SDK mode - single evaluation
python examples/rag/sdk_rag_eval.py

# Simulate RAG system and evaluate
python examples/rag/e2e_RAG_eval_with_mockRAG_fiqa.py
```

### 2. SDK Mode - Single Evaluation

```python
import os

from dingo.config.input_args import EvaluatorLLMArgs
from dingo.io.input import Data
from dingo.model.llm.rag.llm_rag_faithfulness import LLMRAGFaithfulness

# Configure LLM
LLMRAGFaithfulness.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "deepseek-chat"),
)

# Prepare data
data = Data(
    data_id="example_1",
    prompt="What is machine learning?",
    content="Machine learning is a branch of AI that enables computers to learn from data.",
    context=[
        "Machine learning is a subfield of AI.",
        "ML systems learn from data without explicit programming."
    ]
)

# Evaluate
result = LLMRAGFaithfulness.eval(data)

# View results
print(f"Score: {result.score}/10")
print(f"Passed: {not result.status}")  # status=False means passed
print(f"Reason: {result.reason[0]}")
```

### 3. Dataset Mode - Batch Evaluation

```python
from dingo.config import InputArgs
from dingo.exec import Executor

# Configuration
llm_config = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
}

llm_config_embedding = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
    "embedding_config": {  # ⭐ Required for Answer Relevancy
        "model": "text-embedding-3-large",
        "api_url": "https://api.openai.com/v1",
        "key": "YOUR_API_KEY"
    },
    "strictness": 3,
    "threshold": 5
}

input_data = {
    "task_name": "rag_evaluation",
    "input_path": "test/data/fiqa.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {"good": True, "bad": True, "all_labels": True}
    },
    "evaluator": [
        {
            # Map Dingo's internal field names to the dataset's column names
            "fields": {
                "prompt": "user_input",
                "content": "response",
                "context": "retrieved_contexts",
                "reference": "reference"
            },
            "evals": [
                {"name": "LLMRAGFaithfulness", "config": llm_config},
                {"name": "LLMRAGAnswerRelevancy", "config": llm_config_embedding},
                {"name": "LLMRAGContextRelevancy", "config": llm_config},
                {"name": "LLMRAGContextRecall", "config": llm_config},
                {"name": "LLMRAGContextPrecision", "config": llm_config}
            ]
        }
    ]
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()
```

## 📋 Data Format

### Required Fields

| Metric | user_input | response | retrieved_contexts | reference | Notes |
|--------|-----------|----------|-------------------|-----------|-------|
| **Faithfulness** | ✅ | ✅ | ✅ | - | Measures if answer is based on context |
| **Answer Relevancy** | ✅ | ✅ | - | - | Measures if answer addresses the question |
| **Context Relevancy** | ✅ | - | ✅ | - | Measures if retrieved contexts are relevant |
| **Context Recall** | ✅ | - | ✅ | ✅ | Measures if all needed info is retrieved |
| **Context Precision** | ✅ | - | ✅ | ✅ | Measures ranking quality of retrieved contexts |

### Data Example (JSONL)

```jsonl
{"user_input": "What is deep learning?", "response": "Deep learning uses neural networks...", "retrieved_contexts": ["Deep learning is a subset of ML...", "Deep learning is used for image recognition..."]}
{"user_input": "Python features?", "response": "Python is concise and has rich libraries.", "retrieved_contexts": ["Python has clean syntax.", "Python has NumPy and other libraries."], "reference": "Python has clean syntax and a rich ecosystem."}
```

## ⚙️ Configuration

### Configurable Parameters

| Parameter | Applicable Metrics | Default | Description |
|-----------|-------------------|---------|-------------|
| `threshold` | All metrics | 5.0 | Pass threshold (0-10) |
| `strictness` | Answer Relevancy | 3 | Number of questions to generate (1-5) |
| `embedding_config` | Answer Relevancy | - | **Required**: includes `model`, `api_url`, `key` |

### Embedding Configuration (Answer Relevancy)

`LLMRAGAnswerRelevancy` **requires `embedding_config`**:

**Option 1: Cloud LLM + Cloud Embedding**

```python
"config": {
    "model": "deepseek-chat",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.deepseek.com",
    "embedding_config": {  # ⭐ Required
        "model": "text-embedding-3-large",
        "api_url": "https://api.deepseek.com",
        "key": "YOUR_API_KEY"
    },
    "strictness": 3,
    "threshold": 5
}
```

**Option 2: Cloud LLM + Local Embedding (Recommended: Cost-effective)**

```python
"config": {
    "model": "deepseek-chat",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.deepseek.com",
    "embedding_config": {  # ⭐ Independent embedding service
        "model": "BAAI/bge-m3",
        "api_url": "http://localhost:8000/v1",  # Local vLLM/Xinference
        "key": "dummy-key"
    },
    "strictness": 3,
    "threshold": 5
}
```

**Deploy Local Embedding (vLLM)**:

```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model BAAI/bge-m3 \
    --port 8000 \
    --host 0.0.0.0
```

**What happens if not configured?**

A runtime exception is raised:

```
ValueError: Embedding model not initialized. Please configure 'embedding_config' in your LLM config with:
  - model: embedding model name (e.g., 'BAAI/bge-m3')
  - api_url: embedding service URL
  - key: API key (optional for local services)
```

## 📊 Metric Details

### 1️⃣ Faithfulness (Answer Faithfulness)

**Evaluation Goal**: Measure whether the answer is based entirely on the retrieved context, avoiding hallucinations

**Calculation**:
1. Break the answer down into independent statements (claims)
2. Judge whether each statement is supported by the context
3. Faithfulness score = (Supported statements / Total statements) × 10

**Formula**:
```
Faithfulness = (Context-supported claims / Total claims) × 10
```

**Recommended Threshold**: 7 (out of 10)

---

### 2️⃣ Answer Relevancy (Answer Relevance)

**Evaluation Goal**: Measure whether the answer directly addresses the user question

**Calculation**:
1. Generate N reverse questions from the answer (questions the LLM infers from the answer)
2. Calculate the cosine similarity between the embedding of each generated question and the embedding of the original question
3. Answer Relevancy = Average of all similarities

**Formula**:
```
Answer Relevancy = (1/N) × Σ cosine_sim(E_gi, E_o)

Where:
- N: Number of generated questions, default 3 (adjustable via strictness parameter)
- E_gi: Embedding of the i-th generated question
- E_o: Embedding of the original question
```

**⚠️ Important**: This metric **requires `embedding_config`**:
- `model`: Embedding model name (e.g., `text-embedding-3-large`, `BAAI/bge-m3`)
- `api_url`: Embedding service address
- `key`: API key (optional for local services)

**Recommended Threshold**: 5 (out of 10)

---

### 3️⃣ Context Relevancy (Context Relevance)

**Evaluation Goal**: Measure whether the retrieved contexts are relevant to the question

**Calculation**: Uses a **Dual-Judge System** from NVIDIA research:

**Judge 1 Scoring**:
- **0** = Context completely irrelevant
- **1** = Context partially relevant
- **2** = Context fully relevant

**Judge 2 Scoring**:
- Uses different prompt wording for another perspective
- Same 0-2 scoring standard
- Purpose: Reduce single-prompt bias

**Final Score**:
```
Context Relevancy = (Relevant contexts / Total contexts) × 10

Where:
- Relevant context: Average score from both judges ≥ threshold (default 1.0)
- Irrelevant context: Average score < threshold
```

**Recommended Threshold**: 5 (out of 10)

---

### 4️⃣ Context Recall

**Evaluation Goal**: Measure whether all needed information is retrieved (requires reference answer)

**Calculation**:
1. Extract independent statements from the reference answer
2. Judge whether each statement can be attributed to the retrieved contexts
3. Recall = (Context-supported reference statements / Total reference statements) × 10

**Formula**:
```
Context Recall = (Context-supported reference claims / Total reference claims) × 10
```

**Note**: **Requires a reference answer (`reference`)**, so it is typically used in the evaluation phase

**Recommended Threshold**: 5 (out of 10)

---

### 5️⃣ Context Precision

**Evaluation Goal**: Measure the ranking quality of retrieval results, i.e. whether relevant documents appear at the top (requires reference answer)

**Calculation**:
1. For each position k, judge whether the context is relevant (supports the reference answer)
2. Calculate Precision@k for each position
3. Sum Precision@k over all positions, weighted by the relevance indicator v_k, and divide by the number of relevant items

**Formula**:
```
Context Precision = Σ(Precision@k × v_k) / Total relevant items in top K

Where:
- K: Total retrieved documents, e.g., 5 documents
- k: Current position (1st, 2nd, 3rd, ..., K-th)
- v_k: Relevance indicator, 0 (irrelevant) or 1 (relevant)
- Precision@k: Precision in first k documents, 0.0 to 1.0
- Precision@k = Relevant count in first k / k
```

**Note**: **Requires a reference answer (`reference`)** to judge which contexts are relevant

**Recommended Threshold**: 5 (out of 10)

## 🌟 Best Practices

### 1. Metric Combinations

**Complete Evaluation** (5 metrics):
```python
"evals": [
    {"name": "LLMRAGFaithfulness"},      # Detect hallucinations
    {"name": "LLMRAGAnswerRelevancy"},   # Check answer relevance
    {"name": "LLMRAGContextRelevancy"},  # Check context noise
    {"name": "LLMRAGContextRecall"},     # Evaluate retrieval completeness
    {"name": "LLMRAGContextPrecision"}   # Evaluate retrieval ranking
]
```

**Production Environment** (no reference needed):
```python
"evals": [
    {"name": "LLMRAGFaithfulness"},      # ⭐ Most important: prevent hallucinations
    {"name": "LLMRAGAnswerRelevancy"},   # Ensure direct answers
    {"name": "LLMRAGContextRelevancy"}   # Check retrieval noise
]
```

**Evaluation Phase** (requires reference):
```python
"evals": [
    {"name": "LLMRAGContextRecall"},     # Evaluate retrieval completeness
    {"name": "LLMRAGContextPrecision"}   # Evaluate retrieval ranking
]
```

### 2. Threshold Adjustment

Adjust thresholds (default 5) based on scenario:
- **Strict scenarios** (finance, medical): threshold 7-8
- **General scenarios** (Q&A systems): threshold 5-6
- **Loose scenarios** (exploratory search): threshold 3-4

### 3. Iterative Optimization

1. **Initial Evaluation**: Evaluate the current system with all 5 metrics
2. **Identify Issues**:
   - **Low Faithfulness** → Generation model produces hallucinations
     - Optimize: Adjust generation prompts, use stronger models, enhance fact-checking
   - **Low Answer Relevancy** → Answer is off-topic or contains irrelevant info
     - Optimize: Improve generation prompts, limit answer length, enhance question understanding
   - **Low Context Relevancy** → Retrieval introduces noise
     - Optimize: Improve retrieval algorithm, adjust similarity threshold, improve embedding model
   - **Low Context Recall** → Retrieval misses important info
     - Optimize: Increase Top-K, improve query rewriting, expand knowledge base
   - **Low Context Precision** → Relevant docs are ranked too low
     - Optimize: Improve ranking algorithm, adjust reranker, improve relevance calculation
3. **Targeted Optimization**: Adjust the components responsible for the identified issues
4. **Re-evaluate**: Verify the effect of each optimization
5. **Continuous Monitoring**: Monitor key metrics in production

### 4. Important Notes

- **LLM Dependency**: All metrics call an LLM API and need a correct API key and endpoint
- **Embedding Dependency**:
  - Answer Relevancy **requires `embedding_config`**: `model`, `api_url`, `key`
  - Can use cloud services (OpenAI, DeepSeek) or local deployment (vLLM, Xinference)
  - Omitting it raises an exception: `ValueError: Embedding model not initialized...`
- **Cost Considerations**: Evaluation incurs API costs. Recommendations:
  - Development: Sample evaluation (50-100 samples)
  - Production: Use key metrics only (Faithfulness, Answer Relevancy, Context Relevancy)
  - Evaluation: Full evaluation of all metrics
- **Reference Requirements**:
  - Context Recall and Context Precision **require** `reference`
  - The other three metrics do not need `reference`
  - `reference` is mainly used in the evaluation phase; production usually does not need it

## 📖 For More Details

See the [Chinese version](rag_evaluation_metrics_zh.md) for comprehensive examples and detailed explanations.
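
### Worked Sketch of the Scoring Formulas

To make the formulas in the Metric Details section concrete, here is a minimal, standalone sketch of the arithmetic only. It is not Dingo's implementation: the function names (`ratio_score`, `cosine_sim`, `answer_relevancy`, `context_precision`) are hypothetical, the LLM judgments and embeddings are assumed to be already available as plain numbers, and the zero-denominator handling is an assumption. Answer Relevancy and Context Precision are returned on the 0-1 scale exactly as the formulas are written; how they are mapped onto the 0-10 score is not shown here.

```python
import math
from typing import Sequence


def ratio_score(supported: int, total: int) -> float:
    """Faithfulness / Context Recall style score: (supported / total) x 10."""
    if total == 0:
        return 0.0  # assumption: with no claims, nothing can be supported
    return supported / total * 10


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy(generated: Sequence[Sequence[float]],
                     original: Sequence[float]) -> float:
    """Mean cosine similarity between each generated-question embedding (E_gi)
    and the original-question embedding (E_o)."""
    if not generated:
        return 0.0
    return sum(cosine_sim(g, original) for g in generated) / len(generated)


def context_precision(relevance: Sequence[int]) -> float:
    """Sum of Precision@k x v_k over positions k, divided by the relevant count.

    `relevance` holds v_k (0 or 1) for each retrieved context, in rank order.
    """
    hits = 0
    weighted = 0.0
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k
        weighted += (hits / k) * v_k  # Precision@k counts only at relevant positions
    return weighted / hits if hits else 0.0


if __name__ == "__main__":
    print(ratio_score(3, 4))             # 3 of 4 claims supported
    print(context_precision([1, 0, 1]))  # relevant contexts at ranks 1 and 3
    print(context_precision([0, 1, 1]))  # same contexts, ranked lower
```

Comparing the last two calls shows why Context Precision is a ranking metric: both lists contain two relevant contexts, but the list that places one of them first scores higher.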