# Dingo Factuality Assessment - Complete Guide

This guide explains how to use Dingo's integrated factuality assessment to evaluate the factual accuracy of LLM-generated content.

## 🎯 Feature Overview

Factuality assessment evaluates whether LLM-generated responses contain factual errors or unverifiable claims. It is particularly useful for:

- **Content Quality Control**: Verify accuracy of generated content
- **Knowledge Base Validation**: Ensure knowledge base information is accurate
- **Training Data Filtering**: Filter out factually incorrect training samples
- **Real-time Output Verification**: Check factual accuracy of model outputs

## 🔧 Core Principles

### Evaluation Process

1. **Claim Extraction**: Break the response down into independent factual claims
2. **Fact Verification**: Verify each claim against the reference materials or a knowledge base
3. **Score Calculation**: Calculate overall factuality score
4. **Issue Identification**: Identify specific factual errors
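
The four steps above can be sketched as a toy pipeline. This is an illustration only and not Dingo's implementation: in Dingo the extraction and verification are performed by the configured LLM, whereas the hypothetical helpers `extract_claims`, `verify_claim`, and `assess` below use sentence splitting and verbatim string matching as stand-ins.

```python
import re


def extract_claims(response: str) -> list[str]:
    """Step 1 (stand-in): treat every sentence as one factual claim."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def verify_claim(claim: str, references: list[str]) -> bool:
    """Step 2 (stand-in): verified only if a reference contains the claim verbatim."""
    needle = claim.rstrip(".!?").lower()
    return any(needle in ref.lower() for ref in references)


def assess(response: str, references: list[str]) -> tuple[float, list[str]]:
    claims = extract_claims(response)
    verdicts = [(claim, verify_claim(claim, references)) for claim in claims]
    # Step 3: share of verified claims, scaled to the 0-10 range
    verified = sum(1 for _, ok in verdicts if ok)
    score = 10.0 * verified / len(claims) if claims else 0.0
    # Step 4: list the claims that could not be verified
    issues = [claim for claim, ok in verdicts if not ok]
    return round(score, 1), issues


score, issues = assess(
    "Python was first released in 1991. Python was created by James Gosling.",
    ["Python was first released in 1991.", "Python was created by Guido van Rossum."],
)
print(score, issues)
```

A real verifier has to handle paraphrases and partial support, which is why the actual assessment delegates these steps to an LLM.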

### Scoring Mechanism

- **Score Range**: 0.0 - 10.0
- **Score Meaning**:
  - 8.0-10.0 = High factual accuracy
  - 5.0-7.9 = Moderate accuracy, some errors
  - 0.0-4.9 = Low accuracy, significant errors
- **Default Threshold**: 5.0 (configurable)
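
As a minimal sketch, the bands and the threshold check can be written as two small functions. `score_band` and `has_issues` are hypothetical helpers and not part of Dingo; the sketch assumes a score exactly equal to the threshold passes, since only scores below the threshold are described as flagged.

```python
def score_band(score: float) -> str:
    """Map a 0.0-10.0 factuality score to the bands listed above."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 8.0:
        return "high"
    if score >= 5.0:
        return "moderate"
    return "low"


def has_issues(score: float, threshold: float = 5.0) -> bool:
    """True when the score falls below the threshold (assumed strict comparison)."""
    return score < threshold


print(score_band(8.5), has_issues(8.5))
print(score_band(3.2), has_issues(3.2))
```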

## 📋 Usage Requirements

### Data Format Requirements

```python
from dingo.io.input import Data

data = Data(
    data_id="test_1",
    prompt="User's question",  # Original question (optional)
    content="LLM's response",  # Response to assess
    context=["Reference material 1", "Reference material 2"]  # Reference materials (optional but recommended)
)
```

## 🚀 Quick Start

### SDK Mode - Single Assessment

```python
import os
from dingo.config.input_args import EvaluatorLLMArgs
from dingo.io.input import Data
from dingo.model.llm.llm_factcheck import LLMFactCheck

# Configure LLM
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
    threshold=5.0
)

# Prepare data
data = Data(
    data_id="test_1",
    prompt="When was Python released?",
    content="Python was released in 1991 by Guido van Rossum.",
    context=["Python was created by Guido van Rossum.", "Python was first released in 1991."]
)

# Execute assessment
result = LLMFactCheck.eval(data)

# View results
print(f"Score: {result.score}/10")
print(f"Has issues: {result.status}")  # True = below threshold, False = passed
print(f"Reason: {result.reason[0]}")
```

### Dataset Mode - Batch Assessment

```python
from dingo.config import InputArgs
from dingo.exec import Executor

input_data = {
    "task_name": "factuality_assessment",
    "input_path": "test/data/responses.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {"good": True, "bad": True, "all_labels": True}
    },
    "evaluator": [
        {
            "fields": {
                "prompt": "question",
                "content": "response",
                "context": "references"
            },
            "evals": [
                {
                    "name": "LLMFactCheck",
                    "config": {
                        "model": "gpt-4o-mini",
                        "key": "YOUR_API_KEY",
                        "api_url": "https://api.openai.com/v1",
                        "threshold": 5.0
                    }
                }
            ]
        }
    ]
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()

print(f"Total: {summary.total}")
print(f"Passed: {summary.num_good}")
print(f"Issues: {summary.num_bad}")
print(f"Pass rate: {summary.score}%")
```

### Data File Format (JSONL)

```jsonl
{"question": "When was Python released?", "response": "Python was released in 1991 by Guido van Rossum.", "references": ["Python was created by Guido van Rossum.", "Python first appeared in 1991."]}
{"question": "What is the capital of France?", "response": "The capital of France is Paris.", "references": ["Paris is the capital and largest city of France."]}
```
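
A file in this shape can be produced with the standard library alone. The sketch below writes the two example records and reads them back; `write_jsonl` and `read_jsonl` are hypothetical helper names, and the temporary output path is only for illustration.

```python
import json
import tempfile
from pathlib import Path

records = [
    {
        "question": "When was Python released?",
        "response": "Python was released in 1991 by Guido van Rossum.",
        "references": [
            "Python was created by Guido van Rossum.",
            "Python first appeared in 1991.",
        ],
    },
    {
        "question": "What is the capital of France?",
        "response": "The capital of France is Paris.",
        "references": ["Paris is the capital and largest city of France."],
    },
]


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read the file back, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


out_path = Path(tempfile.mkdtemp()) / "responses.jsonl"
write_jsonl(out_path, records)
loaded = read_jsonl(out_path)
print(len(loaded), "records written to", out_path)
```

The keys (`question`, `response`, `references`) must match the right-hand side of the `fields` mapping in the evaluator configuration.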

## ⚙️ Configuration Options

### Threshold Adjustment

```python
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1",
    model="gpt-4o-mini",
    threshold=5.0  # Range: 0.0-10.0
)
```

**Threshold Recommendations**:
- **Strict scenarios** (medical, legal): threshold 7.0-8.0
- **General scenarios** (Q&A, documentation): threshold 5.0-6.0
- **Loose scenarios** (creative content, brainstorming): threshold 3.0-4.0
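
One way to keep these recommendations in a single place is a small lookup table. This is a sketch only: the scenario keys and the `pick_threshold` helper are hypothetical, and the ranges are copied from the list above.

```python
# Recommended threshold ranges (low, high) per scenario type
RECOMMENDED_THRESHOLDS = {
    "strict": (7.0, 8.0),   # medical, legal
    "general": (5.0, 6.0),  # Q&A, documentation
    "loose": (3.0, 4.0),    # creative content, brainstorming
}


def pick_threshold(scenario: str, conservative: bool = False) -> float:
    """Return the lower bound of the range, or the upper bound if conservative."""
    low, high = RECOMMENDED_THRESHOLDS[scenario]
    return high if conservative else low


print(pick_threshold("strict"))
print(pick_threshold("general", conservative=True))
```

The returned value can then be passed as `threshold` when building the evaluator configuration.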

### Model Selection

```python
# Option 1: GPT-4o (highest accuracy, higher cost)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="gpt-4o",
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1"
)

# Option 2: GPT-4o-mini (balanced, recommended)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="gpt-4o-mini",
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1"
)

# Option 3: Alternative LLM (DeepSeek, etc.)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="deepseek-chat",
    key="YOUR_API_KEY",
    api_url="https://api.deepseek.com"
)
```

## 📊 Output Format

### SDK Mode Output

```python
result = LLMFactCheck.eval(data)

# Basic information
result.score          # Score: 0.0-10.0
result.status         # Has issues: True (below threshold) / False (passed)
result.label          # Labels: ["QUALITY_GOOD.FACTCHECK_PASS"] or ["QUALITY_BAD.FACTCHECK_FAIL"]
result.reason         # Detailed reasons
result.metric         # Metric name: "LLMFactCheck"
```

**Output Example (Passed)**:
```python
result.score = 8.5
result.status = False  # False = passed
result.label = ["QUALITY_GOOD.FACTCHECK_PASS"]
result.reason = ["Factual accuracy assessment passed (score: 8.5/10). All claims verified: Python was released in 1991, Creator is Guido van Rossum."]
```

**Output Example (Failed)**:
```python
result.score = 3.2
result.status = True  # True = failed
result.label = ["QUALITY_BAD.FACTCHECK_FAIL"]
result.reason = ["Factual accuracy assessment failed (score: 3.2/10). Errors detected: Python was not released in 1995 (correct: 1991)"]
```
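
Because `status` is `True` when the assessment fails, it is easy to read it the wrong way round. The sketch below turns a result into a one-line summary. `ResultStub` is a stand-in dataclass that only mirrors the fields listed above so the example runs without an API call; it is not Dingo's result class, and `summarize` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class ResultStub:
    """Stand-in with the documented result fields (not Dingo's class)."""
    score: float
    status: bool
    label: list
    reason: list
    metric: str = "LLMFactCheck"


def summarize(result) -> str:
    # status is True when the score is below the threshold, i.e. the check failed
    verdict = "FAIL" if result.status else "PASS"
    first_reason = result.reason[0] if result.reason else "no reason given"
    return f"[{result.metric}] {verdict} {result.score:.1f}/10: {first_reason}"


failed = ResultStub(
    score=3.2,
    status=True,
    label=["QUALITY_BAD.FACTCHECK_FAIL"],
    reason=["Errors detected"],
)
print(summarize(failed))
```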

## 🌟 Best Practices

### 1. Provide High-quality Reference Materials

**Good References**:
```python
context = [
    "Python was created by Guido van Rossum and first released in February 1991.",
    "Python is an interpreted, high-level programming language.",
    "Python 2.0 was released in 2000, Python 3.0 was released in 2008."
]
```

**Poor References**:
```python
context = [
    "Python",  # Too brief
    "Python is a programming language"  # Lacks details
]
```

### 2. Suitable Use Cases

**✅ Suitable for**:
- Verifiable factual claims (dates, names, numbers, events)
- Historical facts
- Technical specifications
- Statistical data

**❌ Not suitable for**:
- Subjective opinions
- Future predictions
- Creative content
- Open-ended questions

### 3. Combined Use with Other Metrics

```python
# Fragment of the input_data dict shown in Dataset Mode
"evaluator": [
    {
        "fields": {
            "prompt": "user_input",
            "content": "response",
            "context": "retrieved_contexts"
        },
        "evals": [
            {"name": "LLMRAGFaithfulness"},       # Answer faithfulness
            {"name": "LLMFactCheck"},             # Factual accuracy
            {"name": "RuleHallucinationHHEM"}     # Hallucination detection
        ]
    }
]
```

### 4. Iterative Optimization

1. **Initial Testing**: Use default threshold (5.0)
2. **Analyze Results**: Review false positives and false negatives
3. **Adjust Threshold**: Fine-tune based on business requirements
4. **Re-validate**: Test with new threshold
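
Steps 2 and 3 can be made concrete with a small threshold sweep over manually reviewed samples. The sketch below is an assumption-laden illustration: `sweep_thresholds` is a hypothetical helper, the `reviewed` pairs are made-up numbers rather than real evaluation results, and a sample counts as flagged when its score is strictly below the threshold.

```python
def sweep_thresholds(samples, thresholds):
    """samples: (score, is_actually_wrong) pairs from manual review.

    Returns (threshold, false_positives, false_negatives) per threshold.
    """
    rows = []
    for t in thresholds:
        # Flagged (score below threshold) although the content was correct
        false_pos = sum(1 for score, wrong in samples if score < t and not wrong)
        # Passed (score at or above threshold) although the content was wrong
        false_neg = sum(1 for score, wrong in samples if score >= t and wrong)
        rows.append((t, false_pos, false_neg))
    return rows


# Made-up review data for illustration only
reviewed = [(9.1, False), (7.4, False), (6.2, True), (5.5, False), (4.8, True), (3.0, True)]

table = sweep_thresholds(reviewed, [4.0, 5.0, 6.0, 7.0])
for threshold, fp, fn in table:
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

Raising the threshold trades false negatives for false positives, so the right value depends on which kind of error is more costly for the use case.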

## 📈 Metric Comparison

| Metric | Purpose | Score Range | Requires Reference | Best For |
|--------|---------|-------------|-------------------|----------|
| **Factuality** | Verify factual accuracy | 0-10 | Optional (recommended) | Fact verification, knowledge base validation |
| **Faithfulness** | Check if based on context | 0-10 | Yes | RAG systems, prevent hallucinations |
| **Hallucination** | Detect contradictions with context | 0-1 | Yes | Fast hallucination detection |

**Recommendations**:
- **RAG evaluation**: Combine Faithfulness + Hallucination + Factuality
- **Content generation**: Use Factuality alone
- **Real-time verification**: Prioritize Hallucination (fast) or Faithfulness

## ❓ FAQ

### Q1: What is the difference between Factuality and Faithfulness?

- **Factuality**: Verifies if content is factually correct (can use external knowledge)
- **Faithfulness**: Checks if response is based on provided context (only looks at context-response relationship)

### Q2: What if no reference materials are provided?

The LLM falls back on its internal knowledge for verification, but accuracy may be lower. **Recommendation**: Always provide reference materials for best results.

### Q3: How to handle domain-specific facts?

1. Provide domain-specific reference materials in `context`
2. Use a domain-specific LLM
3. Lower the threshold to reduce false positives (correct content flagged as failing)

### Q4: How to interpret scores?

- **8.0-10.0**: High accuracy, all or most facts verified
- **5.0-7.9**: Moderate accuracy, some errors or unverifiable claims
- **3.0-4.9**: Low accuracy, multiple errors
- **0.0-2.9**: Very low accuracy, serious factual errors

## 📖 Related Documents

- [RAG Evaluation Metrics Guide](rag_evaluation_metrics.md)
- [Hallucination Detection Guide](hallucination_detection_guide.md)
- [Response Quality Evaluation](../README.md#evaluation-metrics)

## 📝 Example Scenarios

### Scenario 1: Verify Historical Facts

```python
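# Assumes LLMFactCheck.dynamic_config is already set, as in Quick Start
from dingo.io.input import Data
from dingo.model.llm.llm_factcheck import LLMFactCheck

data = Data(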
    content="Python was released in 1991 by Guido van Rossum at CWI in the Netherlands.",
    context=[
        "Python was created by Guido van Rossum.",
        "Python was first released in February 1991.",
        "Guido van Rossum began working on Python at CWI."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: High score (>8.0), all facts verified
```

### Scenario 2: Detect Factual Errors

```python
data = Data(
    content="Python was released in 1995 by James Gosling.",  # Wrong year and author
    context=[
        "Python was created by Guido van Rossum.",
        "Python was first released in 1991."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: Low score (<4.0), multiple errors detected
```

### Scenario 3: Assess Partially Correct Content

```python
data = Data(
    content="Python 3.0 was released in 2008. It introduced many breaking changes and removed backward compatibility with Python 2.x.",
    context=[
        "Python 3.0 was released on December 3, 2008.",
        "Python 3.0 was not backward compatible with Python 2.x series."
    ]
)

result = LLMFactCheck.eval(data)
# Expected: Fairly high score (7-9); the core facts are supported, but "many breaking changes" is not directly covered by the references
```

### Scenario 4: Handle Unverifiable Claims

```python
data = Data(
    content="Python will become the most popular programming language in 2030.",  # Future prediction
    context=["Python is currently one of the most popular programming languages."]
)

result = LLMFactCheck.eval(data)
# Expected: Moderate score (4-6), future prediction cannot be verified
```