# Dingo Factuality Assessment - Complete Guide

This guide introduces how to use integrated factuality assessment features in Dingo to evaluate factual accuracy of LLM-generated content.

## 🎯 Feature Overview

Factuality assessment evaluates whether LLM-generated responses contain factual errors or unverifiable claims. Particularly useful for:

- **Content Quality Control**: Verify accuracy of generated content
- **Knowledge Base Validation**: Ensure knowledge base information is accurate
- **Training Data Filtering**: Filter out factually incorrect training samples
- **Real-time Output Verification**: Check factual accuracy of model outputs

## 🔧 Core Principles

### Evaluation Process

1. **Claim Extraction**: Break down response into independent factual claims
2. **Fact Verification**: Verify each claim against reference materials or knowledge base
3. **Score Calculation**: Calculate overall factuality score
4. **Issue Identification**: Identify specific factual errors

### Scoring Mechanism

- **Score Range**: 0.0 - 10.0
- **Score Meaning**:
  - 8.0-10.0 = High factual accuracy
  - 5.0-7.9 = Moderate accuracy, some errors
  - 0.0-4.9 = Low accuracy, significant errors
- **Default Threshold**: 5.0 (configurable)

## 📋 Usage Requirements

### Data Format Requirements

```python
from dingo.io.input import Data

data = Data(
    data_id="test_1",
    prompt="User's question",  # Original question (optional)
    content="LLM's response",  # Response to assess
    context=["Reference material 1", "Reference material 2"]  # Reference materials (optional but recommended)
)
```

## 🚀 Quick Start

### SDK Mode - Single Assessment

```python
import os

from dingo.config.input_args import EvaluatorLLMArgs
from dingo.io.input import Data
from dingo.model.llm.llm_factcheck import LLMFactCheck

# Configure LLM
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
    threshold=5.0
)

# Prepare data
data = Data(
    data_id="test_1",
    prompt="When was Python released?",
    content="Python was released in 1991 by Guido van Rossum.",
    context=["Python was created by Guido van Rossum.", "Python was first released in 1991."]
)

# Execute assessment
result = LLMFactCheck.eval(data)

# View results
print(f"Score: {result.score}/10")
print(f"Has issues: {result.status}")  # True = below threshold, False = passed
print(f"Reason: {result.reason[0]}")
```

### Dataset Mode - Batch Assessment

```python
from dingo.config import InputArgs
from dingo.exec import Executor

input_data = {
    "task_name": "factuality_assessment",
    "input_path": "test/data/responses.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {"good": True, "bad": True, "all_labels": True}
    },
    "evaluator": [
        {
            "fields": {
                "prompt": "question",
                "content": "response",
                "context": "references"
            },
            "evals": [
                {
                    "name": "LLMFactCheck",
                    "config": {
                        "model": "gpt-4o-mini",
                        "key": "YOUR_API_KEY",
                        "api_url": "https://api.openai.com/v1",
                        "threshold": 5.0
                    }
                }
            ]
        }
    ]
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()

print(f"Total: {summary.total}")
print(f"Passed: {summary.num_good}")
print(f"Issues: {summary.num_bad}")
print(f"Pass rate: {summary.score}%")
```

### Data File Format (JSONL)

```jsonl
{"question": "When was Python released?", "response": "Python was released in 1991 by Guido van Rossum.", "references": ["Python was created by Guido van Rossum.", "Python first appeared in 1991."]}
{"question": "What is the capital of France?", "response": "The capital of France is Paris.", "references": ["Paris is the capital and largest city of France."]}
```

## ⚙️ Configuration Options

### Threshold Adjustment

```python
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1",
    model="gpt-4o-mini",
    threshold=5.0  # Range: 0.0-10.0
)
```

**Threshold Recommendations**:

- **Strict scenarios** (medical, legal): threshold 7.0-8.0
- **General scenarios** (Q&A, documentation): threshold 5.0-6.0
- **Loose scenarios** (creative content, brainstorming): threshold 3.0-4.0

### Model Selection

```python
# Option 1: GPT-4 (highest accuracy, higher cost)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="gpt-4o",
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1"
)

# Option 2: GPT-4o-mini (balanced, recommended)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="gpt-4o-mini",
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1"
)

# Option 3: Alternative LLM (DeepSeek, etc.)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="deepseek-chat",
    key="YOUR_API_KEY",
    api_url="https://api.deepseek.com"
)
```

## 📊 Output Format

### SDK Mode Output

```python
result = LLMFactCheck.eval(data)

# Basic information
result.score   # Score: 0.0-10.0
result.status  # Has issues: True (below threshold) / False (passed)
result.label   # Labels: ["QUALITY_GOOD.FACTCHECK_PASS"] or ["QUALITY_BAD.FACTCHECK_FAIL"]
result.reason  # Detailed reasons
result.metric  # Metric name: "LLMFactCheck"
```

**Output Example (Passed)**:

```python
result.score = 8.5
result.status = False  # False = passed
result.label = ["QUALITY_GOOD.FACTCHECK_PASS"]
result.reason = ["Factual accuracy assessment passed (score: 8.5/10). All claims verified: Python was released in 1991, Creator is Guido van Rossum."]
```

**Output Example (Failed)**:

```python
result.score = 3.2
result.status = True  # True = failed
result.label = ["QUALITY_BAD.FACTCHECK_FAIL"]
result.reason = ["Factual accuracy assessment failed (score: 3.2/10). Errors detected: Python was not released in 1995 (correct: 1991)"]
```

## 🌟 Best Practices

### 1. Provide High-quality Reference Materials

**Good References**:

```python
context = [
    "Python was created by Guido van Rossum and first released in February 1991.",
    "Python is an interpreted, high-level programming language.",
    "Python 2.0 was released in 2000, Python 3.0 was released in 2008."
]
```

**Poor References**:

```python
context = [
    "Python",  # Too brief
    "Python is a programming language"  # Lacks details
]
```

### 2. Suitable Use Cases

**✅ Suitable for**:

- Verifiable factual claims (dates, names, numbers, events)
- Historical facts
- Technical specifications
- Statistical data

**❌ Not suitable for**:

- Subjective opinions
- Future predictions
- Creative content
- Open-ended questions

### 3. Combined Use with Other Metrics

```python
"evaluator": [
    {
        "fields": {
            "prompt": "user_input",
            "content": "response",
            "context": "retrieved_contexts"
        },
        "evals": [
            {"name": "LLMRAGFaithfulness"},    # Answer faithfulness
            {"name": "LLMFactCheck"},          # Factual accuracy
            {"name": "RuleHallucinationHHEM"}  # Hallucination detection
        ]
    }
]
```

### 4. Iterative Optimization

1. **Initial Testing**: Use default threshold (5.0)
2. **Analyze Results**: Review false positives and false negatives
3. **Adjust Threshold**: Fine-tune based on business requirements
4. **Re-validate**: Test with new threshold

## 📈 Metric Comparison

| Metric | Purpose | Score Range | Requires Reference | Best For |
|--------|---------|-------------|-------------------|----------|
| **Factuality** | Verify factual accuracy | 0-10 | Optional (recommended) | Fact verification, knowledge base validation |
| **Faithfulness** | Check if based on context | 0-10 | Yes | RAG systems, prevent hallucinations |
| **Hallucination** | Detect contradictions with context | 0-1 | Yes | Fast hallucination detection |

**Recommendations**:

- **RAG evaluation**: Combine Faithfulness + Hallucination + Factuality
- **Content generation**: Use Factuality alone
- **Real-time verification**: Prioritize Hallucination (fast) or Faithfulness

## ❓ FAQ

### Q1: Difference between Factuality and Faithfulness?

- **Factuality**: Verifies if content is factually correct (can use external knowledge)
- **Faithfulness**: Checks if response is based on provided context (only looks at context-response relationship)

### Q2: What if no reference materials provided?

LLM will use its internal knowledge for verification, but accuracy may be lower.

**Recommendation**: Always provide reference materials for best results.

### Q3: How to handle domain-specific facts?

1. Provide domain-specific reference materials in `context`
2. Use domain-specific LLM models
3. Lower threshold to reduce false positives

### Q4: How to interpret scores?

- **8.0-10.0**: High accuracy, all or most facts verified
- **5.0-7.9**: Moderate accuracy, some errors or unverifiable claims
- **3.0-4.9**: Low accuracy, multiple errors
- **0.0-2.9**: Very low accuracy, serious factual errors

## 📖 Related Documents

- [RAG Evaluation Metrics Guide](rag_evaluation_metrics.md)
- [Hallucination Detection Guide](hallucination_detection_guide.md)
- [Response Quality Evaluation](../README.md#evaluation-metrics)

## 📝 Example Scenarios

### Scenario 1: Verify Historical Facts

```python
data = Data(
    content="Python was released in 1991 by Guido van Rossum at CWI in the Netherlands.",
    context=[
        "Python was created by Guido van Rossum.",
        "Python was first released in February 1991.",
        "Guido van Rossum began working on Python at CWI."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: High score (>8.0), all facts verified
```

### Scenario 2: Detect Factual Errors

```python
data = Data(
    content="Python was released in 1995 by James Gosling.",  # Wrong year and author
    context=[
        "Python was created by Guido van Rossum.",
        "Python was first released in 1991."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: Low score (<4.0), multiple errors detected
```

### Scenario 3: Assess Partially Correct Content

```python
data = Data(
    content="Python 3.0 was released in 2008. It introduced many breaking changes and removed backward compatibility with Python 2.x.",
    context=[
        "Python 3.0 was released on December 3, 2008.",
        "Python 3.0 was not backward compatible with Python 2.x series."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: High score (7-9), facts mostly correct with minor imprecisions
```

### Scenario 4: Handle Unverifiable Claims

```python
data = Data(
    content="Python will become the most popular programming language in 2030.",  # Future prediction
    context=["Python is currently one of the most popular programming languages."]
)

result = LLMFactCheck.eval(data)
# Expected: Moderate score (4-6), future prediction cannot be verified
```
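
### Scenario 5: Map Scores to Accuracy Bands

When reviewing many results, it can help to turn a raw score into the accuracy bands listed in the FAQ and to apply the pass/fail threshold in one place. The sketch below is a minimal illustration and is not part of Dingo: `interpret_score` and `ScoreReading` are hypothetical names introduced here. It assumes that a score exactly equal to the threshold counts as passed, which is inferred from the "below threshold" wording used for `result.status`.

```python
from typing import NamedTuple


class ScoreReading(NamedTuple):
    """Hypothetical container for an interpreted score (not a Dingo type)."""
    band: str        # "high", "moderate", "low", or "very low"
    has_issue: bool  # mirrors result.status: True means below threshold


def interpret_score(score: float, threshold: float = 5.0) -> ScoreReading:
    """Map a 0.0-10.0 factuality score to a band and a pass/fail flag.

    Hypothetical helper, not a Dingo API. Band boundaries follow the
    FAQ in this guide; the default threshold is the documented 5.0.
    """
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score must be within 0.0-10.0, got {score}")

    if score >= 8.0:
        band = "high"
    elif score >= 5.0:
        band = "moderate"
    elif score >= 3.0:
        band = "low"
    else:
        band = "very low"

    # Assumption: only scores strictly below the threshold are flagged.
    return ScoreReading(band=band, has_issue=score < threshold)


if __name__ == "__main__":
    # Scores taken from the output examples earlier in this guide.
    print(interpret_score(8.5))
    print(interpret_score(3.2))
    # A stricter threshold, as recommended for medical or legal content.
    print(interpret_score(7.5, threshold=8.0))
```

With an actual assessment, the same helper could be called as `interpret_score(result.score, threshold=5.0)`. Keeping the band boundaries in one function makes it easier to adjust them alongside the threshold during iterative optimization.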