# Dingo Hallucination Detection - Complete Guide

This guide explains how to use Dingo's integrated hallucination detection, which supports two detection methods: **HHEM-2.1-Open local model** (recommended) and **GPT-based cloud detection**.

## 🎯 Feature Overview

Hallucination detection evaluates whether an LLM-generated response contradicts the provided reference context. It is particularly useful for:

- **RAG System Evaluation**: Detect consistency between generated responses and retrieved documents
- **SFT Data Quality Assessment**: Verify factual accuracy of responses in training data
- **LLM Output Verification**: Real-time detection of hallucination issues in model outputs

## 🔧 Core Principles

### Evaluation Process

1. **Data Preparation**: Provide the response to check and its reference context
2. **Consistency Analysis**: Judge whether the response is consistent with each context
3. **Score Calculation**: Calculate the overall hallucination score
4. **Threshold Judgment**: Flag the response if its score exceeds the configured threshold

### Scoring Mechanism

- **Score Range**: 0.0 - 1.0
- **Score Meaning**:
  - 0.0 = No hallucination
  - 1.0 = Complete hallucination
- **Default Threshold**: 0.5 (configurable)
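
The score calculation and threshold judgment described above can be sketched as follows. This is a minimal illustration, not Dingo's implementation: `aggregate_score` and `is_flagged` are hypothetical helpers, averaging the per-context scores is an assumption, and so is treating a score strictly above the threshold as flagged.

```python
from statistics import mean


def aggregate_score(per_context_scores: list[float]) -> float:
    """Combine per-context hallucination scores into one score (assumed: mean)."""
    if not per_context_scores:
        raise ValueError("at least one context score is required")
    return mean(per_context_scores)


def is_flagged(score: float, threshold: float = 0.5) -> bool:
    """Flag the response when its score exceeds the threshold (assumed: strict >)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0.0, 1.0]")
    return score > threshold


scores = [0.9, 0.7]  # hypothetical per-context scores
overall = aggregate_score(scores)
print(f"overall={overall:.2f} flagged={is_flagged(overall)}")
```

Because a higher score means more hallucination, lowering the threshold flags more responses and is therefore the stricter setting.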

## 📋 Usage Requirements

### Data Format Requirements

```python
from dingo.io.input import Data

data = Data(
    data_id="test_1",
    prompt="User's question",  # Original question (optional)
    content="LLM's response",  # Response to detect
    context=["Reference context 1", "Reference context 2"]  # Reference context (required)
)
```

### Supported Context Formats

```python
# Method 1: String list
context = ["Context 1", "Context 2", "Context 3"]

# Method 2: Single string
context = "Complete context text"

# Method 3: Dict with passages key
context = {"passages": ["Context 1", "Context 2"]}
```
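
If your own pipeline produces contexts in mixed shapes, it can help to normalize them before building `Data` objects. The sketch below shows one way to reduce the three formats above to a list of strings. `normalize_context` is a hypothetical helper written for this guide, not part of Dingo, and Dingo may handle these formats differently internally.

```python
from typing import Union

ContextInput = Union[str, list, dict]


def normalize_context(context: ContextInput) -> list[str]:
    """Return the context as a list of strings, whichever format was given."""
    if isinstance(context, str):
        # Method 2: a single string becomes a one-element list.
        return [context]
    if isinstance(context, dict):
        # Method 3: passages are assumed to live under the "passages" key.
        return [str(p) for p in context.get("passages", [])]
    if isinstance(context, list):
        # Method 1: already a list; coerce items to strings.
        return [str(p) for p in context]
    raise TypeError(f"unsupported context type: {type(context).__name__}")


print(normalize_context({"passages": ["Context 1", "Context 2"]}))
```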

## 🚀 Quick Start

### Method 1: HHEM-2.1-Open Local Model (Recommended ⭐)

**Advantages**:
- ✅ Fast inference
- ✅ No API costs
- ✅ Data privacy
- ✅ Can run offline

**Installation**:

```bash
# Install extra dependencies
pip install "dingo-python[hhem]"  # quotes keep shells such as zsh from expanding the brackets

# Or install dependencies manually
pip install sentence-transformers torch
```

**Usage**:

```python
from dingo.config.input_args import EvaluatorRuleArgs
from dingo.io.input import Data
from dingo.model.rule.rule_hallucination_hhem import RuleHallucinationHHEM

# Configure (first run will auto-download model ~400MB)
RuleHallucinationHHEM.dynamic_config = EvaluatorRuleArgs(
    threshold=0.5  # Hallucination threshold, lower = stricter (more responses flagged)
)

# Prepare data
data = Data(
    data_id="test_1",
    content="Paris is the capital of Germany.",  # Response to detect
    context=["Paris is the capital of France."]  # Reference context
)

# Execute detection
result = RuleHallucinationHHEM.eval(data)

# View results
print(f"Score: {result.score}")  # 0.0-1.0, higher = more hallucination
print(f"Has issues: {result.status}")  # True = has hallucination, False = no hallucination
print(f"Reason: {result.reason}")
```

**Output Example**:

```
Score: 0.85
Has issues: True
Reason: ['Hallucination detected (score: 0.85, threshold: 0.5). Inconsistent parts: Paris is capital of Germany (context states: Paris is capital of France)']
```

### Method 2: GPT-based Cloud Detection

**Advantages**:
- ✅ No local model download needed
- ✅ High-quality detection with powerful LLM
- ✅ Easy integration

**Usage**:

```python
import os
from dingo.config.input_args import EvaluatorLLMArgs
from dingo.io.input import Data
from dingo.model.llm.llm_hallucination import LLMHallucination

# Configure LLM
LLMHallucination.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
    threshold=0.5
)

# Prepare data
data = Data(
    data_id="test_1",
    content="Paris is the capital of Germany.",
    context=["Paris is the capital of France."]
)

# Execute detection
result = LLMHallucination.eval(data)

# View results
print(f"Score: {result.score}")
print(f"Has issues: {result.status}")
print(f"Reason: {result.reason}")
```

## 📊 Batch Processing

### Dataset Mode

```python
from dingo.config import InputArgs
from dingo.exec import Executor

input_data = {
    "task_name": "hallucination_detection",
    "input_path": "test/data/rag_responses.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {
            "good": True,
            "bad": True,
            "all_labels": True
        }
    },
    "evaluator": [
        {
            "fields": {
                "content": "response",
                "context": "retrieved_contexts"
            },
            "evals": [
                {
                    "name": "RuleHallucinationHHEM",  # Or "LLMHallucination"
                    "config": {"threshold": 0.5}
                }
            ]
        }
    ]
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()

print(f"Total: {summary.total}")
print(f"Issues: {summary.num_bad}")
print(f"Pass rate: {summary.score}%")
```

### Data File Format (JSONL)

```jsonl
{"response": "Paris is the capital of France.", "retrieved_contexts": ["Paris is the capital of France.", "France is in Western Europe."]}
{"response": "Python was created by Guido van Rossum.", "retrieved_contexts": ["Python was designed by Guido van Rossum.", "Python was first released in 1991."]}
```
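
Before launching a batch run, it can be worth checking that every line of the input file parses and carries the fields mapped in the `fields` config. The following is a minimal sketch using only the standard library. `validate_jsonl_lines` is a hypothetical helper, and the field names are taken from the example above.

```python
import json

REQUIRED_FIELDS = ("response", "retrieved_contexts")


def validate_jsonl_lines(lines):
    """Parse JSONL lines; return (records, errors) where errors are (line_no, message)."""
    records, errors = [], []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((line_no, "record is not a JSON object"))
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            errors.append((line_no, f"missing fields: {missing}"))
            continue
        records.append(record)
    return records, errors


sample = [
    '{"response": "Paris is the capital of France.", "retrieved_contexts": ["Paris is the capital of France."]}',
    '{"response": "No contexts here."}',
]
records, errors = validate_jsonl_lines(sample)
print(len(records), errors)
```

To check a real file, pass an open file handle, for example `validate_jsonl_lines(open(path, encoding="utf-8"))`.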

## ⚙️ Configuration Options

### Threshold Adjustment

```python
# Method 1: Rule-based (HHEM)
RuleHallucinationHHEM.dynamic_config = EvaluatorRuleArgs(
    threshold=0.5  # Range: 0.0-1.0
)

# Method 2: LLM-based
LLMHallucination.dynamic_config = EvaluatorLLMArgs(
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1",
    model="gpt-4o-mini",
    threshold=0.5  # Range: 0.0-1.0
)
```

**Threshold Recommendations**:
- **Strict scenarios** (finance, medical): 0.3-0.4
- **General scenarios** (Q&A systems): 0.5-0.6
- **Loose scenarios** (creative content): 0.7-0.8

### Device Selection (HHEM Only)

```python
# Auto-select (default: uses GPU if available)
RuleHallucinationHHEM.dynamic_config = EvaluatorRuleArgs()

# Force CPU
RuleHallucinationHHEM.device = "cpu"

# Force GPU
RuleHallucinationHHEM.device = "cuda"

# Specific GPU
RuleHallucinationHHEM.device = "cuda:0"
```

## 📈 Performance Comparison

| Feature | HHEM-2.1-Open | GPT-based |
|---------|---------------|-----------|
| **Speed** | Fast (~50ms/sample) | Slower (~1-2s/sample) |
| **Cost** | Free | API costs |
| **Accuracy** | High (F1: 0.84) | Very High |
| **Privacy** | Local, secure | Data sent to API |
| **Deployment** | Needs model download (~400MB) | Needs API key |
| **Offline** | ✅ Supported | ❌ Requires network |

**Recommendations**:
- **Production environment**: HHEM-2.1-Open (fast, free, private)
- **High-precision scenarios**: GPT-based (highest accuracy)
- **Offline scenarios**: HHEM-2.1-Open (can run completely offline)
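
To make the speed difference concrete, the per-sample latencies in the table can be turned into a rough runtime estimate. This is back-of-the-envelope arithmetic only: it assumes sequential processing, ignores parallelism (such as `max_workers`), model loading, and network variance, and the dataset size of 10,000 samples is a hypothetical figure.

```python
def sequential_runtime_seconds(num_samples: int, seconds_per_sample: float) -> float:
    """Estimate total runtime when samples are processed one after another."""
    return num_samples * seconds_per_sample


num_samples = 10_000  # hypothetical dataset size

hhem = sequential_runtime_seconds(num_samples, 0.05)    # ~50ms/sample from the table
gpt_low = sequential_runtime_seconds(num_samples, 1.0)  # ~1s/sample from the table
gpt_high = sequential_runtime_seconds(num_samples, 2.0)  # ~2s/sample from the table

print(f"HHEM: about {hhem / 60:.1f} minutes")
print(f"GPT-based: about {gpt_low / 3600:.1f} to {gpt_high / 3600:.1f} hours")
```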

## 🌟 Best Practices

### 1. Context Quality

**Good Context**:
```python
context = [
    "Paris is the capital of France, located in northern France.",
    "France is a country in Western Europe with a population of about 67 million."
]
```

**Poor Context**:
```python
context = [
    "Paris",  # Too short, lacks information
    "France has many cities."  # Too vague
]
```

### 2. Handling Multiple Contexts

```python
# When multiple contexts are given, the system checks consistency against each one
data = Data(
    content="Paris is the capital of France and the largest city in France.",
    context=[
        "Paris is the capital of France.",  # Supports first half
        "Paris is the largest city in France."  # Supports second half
    ]
)
```

### 3. Iterative Optimization

1. **Initial Testing**: Use default threshold (0.5)
2. **Analyze Results**: Check for false positives/negatives
3. **Adjust Threshold**: Refine based on business needs
4. **Verify Effects**: Re-test with new threshold
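
The loop above can be sketched as a small threshold sweep over a hand-labeled sample. Everything here is illustrative: `confusion_counts` is a hypothetical helper, the scores and labels are made-up values, and flagging on a score strictly above the threshold is an assumption.

```python
def confusion_counts(scores, labels, threshold):
    """Return (false_positives, false_negatives) at the given threshold.

    labels[i] is True when a human reviewer judged response i to be hallucinated.
    """
    false_positives = sum(
        1 for s, y in zip(scores, labels) if s > threshold and not y
    )
    false_negatives = sum(
        1 for s, y in zip(scores, labels) if s <= threshold and y
    )
    return false_positives, false_negatives


# Made-up detector scores and human labels for five responses
scores = [0.10, 0.35, 0.45, 0.62, 0.85]
labels = [False, False, True, True, True]

for threshold in (0.3, 0.5, 0.7):
    fp, fn = confusion_counts(scores, labels, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

On this toy sample, lowering the threshold trades missed hallucinations for extra false alarms, which is the trade-off to weigh against business needs.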

### 4. Integration with RAG Evaluation

```python
# Fragment of the input_data dict shown in Dataset Mode
"evaluator": [
    {
        "fields": {
            "prompt": "user_input",
            "content": "response",
            "context": "retrieved_contexts"
        },
        "evals": [
            {"name": "LLMRAGFaithfulness"},       # Faithfulness (based on LLM)
            {"name": "RuleHallucinationHHEM"},    # Hallucination (model-based)
            {"name": "LLMRAGAnswerRelevancy"}     # Answer relevance
        ]
    }
]
```

## ❓ FAQ

### Q1: HHEM or GPT-based: which should I choose?

- **Production/large-scale**: HHEM (fast, free, private)
- **High-precision evaluation**: GPT-based (highest accuracy, but has costs)
- **Offline scenarios**: HHEM (can run completely offline)

### Q2: Why does HHEM download a model on first run?

HHEM uses a Sentence-Transformers model (~400MB) that is downloaded and cached automatically on first run. Subsequent runs load it directly from the cache, so no re-download is needed.

### Q3: What if the model download fails?

```bash
# Download manually into the default Hugging Face cache
huggingface-cli download vectara/hallucination_evaluation_model

# Or use mirror
export HF_ENDPOINT=https://hf-mirror.com
```

### Q4: How should I interpret scores?

- **0.0-0.3**: Low hallucination risk, response highly consistent with context
- **0.3-0.5**: Moderate risk, some parts may be inconsistent, needs attention
- **0.5-0.7**: High risk, significant inconsistencies, needs review
- **0.7-1.0**: Severe hallucination, response seriously contradicts context
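
For reporting, the bands above can be mapped to labels with a small helper. This is a sketch: `risk_band` is a hypothetical function, and because the listed ranges share their endpoints, it assumes each boundary value belongs to the higher band (so 0.3 counts as moderate).

```python
def risk_band(score: float) -> str:
    """Map a hallucination score in [0.0, 1.0] to a risk label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0.0, 1.0]")
    if score < 0.3:
        return "low"
    if score < 0.5:
        return "moderate"
    if score < 0.7:
        return "high"
    return "severe"


for score in (0.1, 0.3, 0.55, 0.85):
    print(score, risk_band(score))
```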

## 📖 Related Documents

- [RAG Evaluation Metrics Guide](rag_evaluation_metrics.md)
- [Factuality Assessment Guide](factuality_assessment_guide.md)
- [HHEM Paper](https://arxiv.org/abs/2406.09053)

## 📝 Example Scenarios

### Scenario 1: Detect Factual Errors

```python
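from dingo.io.input import Data
from dingo.model.rule.rule_hallucination_hhem import RuleHallucinationHHEM

data = Data(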
    content="Python was released in 1995 by James Gosling.",  # Wrong: year and author
    context=["Python was created by Guido van Rossum and first released in 1991."]
)

result = RuleHallucinationHHEM.eval(data)
# Expected: High score (>0.7), detected as having hallucination
```

### Scenario 2: Detect Partial Hallucination

```python
data = Data(
    content="Machine learning is a branch of AI. It was invented in the 1950s by Alan Turing.",  # First sentence correct, second incorrect
    context=["Machine learning is a subfield of artificial intelligence."]
)

result = RuleHallucinationHHEM.eval(data)
# Expected: Moderate score (0.4-0.6), partial hallucination
```

### Scenario 3: Verify No Hallucination

```python
data = Data(
    content="Deep learning is a subset of machine learning that uses multi-layer neural networks.",
    context=["Deep learning is part of machine learning, characterized by using multi-layer neural networks."]
)

result = RuleHallucinationHHEM.eval(data)
# Expected: Low score (<0.3), no hallucination
```