# SOAP Note Evaluation Leaderboard

Benchmark comparing clinical SOAP note generation models on 300 synthetic doctor-patient dialogues.

---

## Overall Rankings

| Rank | Model | Composite | Safety | Evidence | Coverage | Generalist |
|:----:|-------|:---------:|:------:|:--------:|:--------:|:----------:|
| 1 | **gpt-5.2** | **4.723** | 4.543 | 4.358 | 4.954 | 4.824 |
| 2 | gemini-3-pro-preview | 4.699 | 4.541 | 4.357 | 4.864 | 4.848 |
| 3 | Omi-SOAP-edge-v1* | 4.654 | 4.547 | 4.421 | 4.862 | 4.595 |
| 4 | Kimi-K2-Thinking | 4.546 | 4.217 | 3.890 | 4.906 | 4.828 |
| 5 | claude-opus-4-5 | 4.543 | 4.202 | 3.870 | 4.947 | 4.793 |
| 6 | GPT-5 | 4.285 | 3.805 | 3.316 | 4.843 | 4.646 |

*All scores 0-5 scale. Higher = better.*

*\*Omi scores averaged across 5 evaluations with different judge panels.*

---

## Hallucination Risk Analysis

Using **Omi-SOAP-edge-v1 as baseline (1.0x risk)**:

| Model | Major Halls/Note | Major Risk vs Omi | Minor Halls/Note | Minor Risk vs Omi |
|-------|:----------------:|:-----------:|:----------------:|:-----------:|
| **gpt-5.2** | 0.114 | **0.89x** | 0.358 | 1.85x |
| **gemini-3-pro-preview** | 0.127 | **0.99x** | 0.280 | 1.45x |
| Omi-SOAP-edge-v1 | 0.128 | 1.00x | 0.193 | 1.00x |
| Kimi-K2-Thinking | 0.351 | 2.74x | 0.382 | 1.98x |
| claude-opus-4-5 | 0.397 | 3.10x | 0.191 | 0.99x |
| GPT-5 | 0.553 | 4.32x | 0.382 | 1.98x |

### Key Findings

**Major Hallucinations (clinical fabrications - high risk):**
- gpt-5.2 is **safest** (0.89x Omi baseline); gemini-3-pro-preview is on par with Omi (0.99x)
- Kimi-K2-Thinking has **2.7x more** major errors than Omi
- claude-opus-4-5 has **3.1x more** major errors than Omi
- GPT-5 has **4.3x more** major errors than Omi

**Minor Hallucinations (wording/citation issues - low risk):**
- claude-opus-4-5 (0.191/note) and Omi (0.193/note) have the fewest minor errors
- gpt-5.2 trades minor errors for major accuracy (1.85x minor, 0.89x major)
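
The "Risk vs Omi" multipliers are each model's per-note hallucination rate divided by Omi's rate. A minimal sketch of that arithmetic, using the major-hallucination rates from the table; the function and variable names are illustrative and not taken from the evaluation scripts:

```python
# Major hallucinations per note, copied from the table above.
MAJOR_PER_NOTE = {
    "gpt-5.2": 0.114,
    "gemini-3-pro-preview": 0.127,
    "Omi-SOAP-edge-v1": 0.128,
    "Kimi-K2-Thinking": 0.351,
    "claude-opus-4-5": 0.397,
    "GPT-5": 0.553,
}


def risk_vs_baseline(rates, baseline="Omi-SOAP-edge-v1"):
    """Return each model's rate divided by the baseline model's rate."""
    base = rates[baseline]
    return {model: rate / base for model, rate in rates.items()}


for model, ratio in risk_vs_baseline(MAJOR_PER_NOTE).items():
    print(f"{model}: {ratio:.2f}x")
```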

---

## Majority Agreement on Major Errors

*% of dialogues where 2+ judges flagged a major hallucination:*

| Model | Majority Major Rate | Dialogues Flagged |
|-------|:-------------------:|----------------|
| gpt-5.2 | 4.0% | 12 of 300 dialogues |
| Omi-SOAP-edge-v1 | 6.7% | 20 of 300 dialogues |
| gemini-3-pro-preview | 8.0% | 24 of 300 dialogues |
| Kimi-K2-Thinking | 19.3% | 58 of 300 dialogues |
| claude-opus-4-5 | 25.3% | 76 of 300 dialogues |
| GPT-5 | 36.7% | 110 of 300 dialogues |
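
The majority rate can be sketched as follows, assuming each dialogue carries one boolean per judge indicating whether that judge flagged a major hallucination. The input layout and function name are hypothetical; the actual evaluation output format may differ.

```python
def majority_major_rate(flags_per_dialogue, min_judges=2):
    """Fraction of dialogues where at least `min_judges` judges flagged a major hallucination.

    flags_per_dialogue: list of per-dialogue lists of booleans, one entry per judge.
    """
    if not flags_per_dialogue:
        return 0.0
    flagged = sum(1 for flags in flags_per_dialogue if sum(flags) >= min_judges)
    return flagged / len(flags_per_dialogue)


# Toy example with 4 dialogues and 3 judges each (not real benchmark data).
toy = [
    [True, True, False],    # 2 judges agree: counts
    [False, False, True],   # only 1 judge: does not count
    [True, True, True],     # 3 judges agree: counts
    [False, False, False],  # no flags
]
print(f"{majority_major_rate(toy):.1%}")
```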

---

## Head-to-Head Results

All comparisons vs Omi-SOAP-edge-v1:

| Opponent | Omi Wins | Opponent Wins | Ties | H2H Winner |
|----------|:--------:|:-------------:|:----:|:----------:|
| gpt-5.2 | 53 (17.7%) | **96 (32.0%)** | 151 (50.3%) | **gpt-5.2** |
| gemini-3-pro-preview | 56 (18.7%) | **90 (30.0%)** | 154 (51.3%) | **gemini-3-pro** |
| Kimi-K2-Thinking | **100 (33.3%)** | 79 (26.3%) | 121 (40.3%) | **Omi** |
| claude-opus-4-5 | **105 (35.0%)** | 59 (19.7%) | 136 (45.3%) | **Omi** |
| GPT-5 | **148 (49.3%)** | 43 (14.3%) | 109 (36.3%) | **Omi** |

*Tie threshold: |composite diff| < 0.25*
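
A minimal sketch of the per-dialogue win/tie rule, assuming each dialogue has one composite score per model and that the threshold is applied to their difference. The function names are illustrative, not the evaluation script's API.

```python
from collections import Counter

TIE_THRESHOLD = 0.25  # from the note above: |composite diff| < 0.25 is a tie


def h2h_outcome(omi_composite, opponent_composite, threshold=TIE_THRESHOLD):
    """Classify one dialogue as 'omi', 'opponent', or 'tie'."""
    diff = omi_composite - opponent_composite
    if abs(diff) < threshold:
        return "tie"
    return "omi" if diff > 0 else "opponent"


def tally(pairs):
    """Count outcomes over (omi_composite, opponent_composite) pairs."""
    return Counter(h2h_outcome(a, b) for a, b in pairs)


# Toy scores for three dialogues (not real benchmark data).
print(tally([(4.6, 4.5), (4.9, 4.5), (4.0, 4.5)]))
```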

---

## Model Profiles

### Tier 1: Leaders

**gpt-5.2** - Best overall
- Highest composite (4.723) and coverage (4.954)
- Lowest major hallucination rate (0.89x Omi)
- Beats Omi in 32% of dialogues, loses only 17.7%

**gemini-3-pro-preview** - Best readability
- Highest generalist score (4.848) - judges find it most readable
- Nearly identical safety to gpt-5.2
- Beats Omi in 30% of dialogues

### Tier 2: Baseline

**Omi-SOAP-edge-v1** - Conservative & safe
- Second-lowest minor hallucination rate (0.193/note, vs 0.191 for claude-opus-4-5)
- Balanced safety-first approach
- Reference baseline for comparisons

### Tier 3: Tradeoffs

**Kimi-K2-Thinking** - High quality, moderate risk
- Excellent generalist score (4.828) and coverage (4.906)
- But 2.7x more major hallucinations than Omi
- Closest head-to-head among the models Omi beats (Omi wins 33.3% vs Kimi's 26.3%)

**claude-opus-4-5** - Completeness vs accuracy
- Second-highest coverage (4.947, just behind gpt-5.2 at 4.954) - very thorough notes
- But 3.1x more major hallucinations than Omi
- Safety penalty outweighs completeness gains

**GPT-5** - Superseded
- 4.3x more major hallucinations than Omi
- Superseded by gpt-5.2, which ranks #1; GPT-5 remains listed at #6 for reference

---

## Scoring Methodology

### Composite Formula
```
Composite = 0.5 × Safety + 0.3 × Coverage + 0.2 × Generalist
Safety    = 0.7 × Evidence + 0.3 × Numeric
```
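
Substituting Safety into the composite gives effective weights of 0.35 on Evidence and 0.15 on Numeric. The sketch below applies both formulas directly; the function names are illustrative. Plugging in the published gpt-5.2 row reproduces its composite to within rounding, though other rows may differ slightly if scores are averaged per dialogue before reporting.

```python
def safety(evidence, numeric):
    """Safety = 0.7 x Evidence + 0.3 x Numeric."""
    return 0.7 * evidence + 0.3 * numeric


def composite(safety_score, coverage, generalist):
    """Composite = 0.5 x Safety + 0.3 x Coverage + 0.2 x Generalist."""
    return 0.5 * safety_score + 0.3 * coverage + 0.2 * generalist


# gpt-5.2 row from the Overall Rankings table: Safety, Coverage, Generalist.
print(f"{composite(4.543, 4.954, 4.824):.4f}")
```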

### Dimension Definitions
| Dimension | Formula | What It Measures |
|-----------|---------|------------------|
| Evidence | 5 - 1×minor - 3×major | Accuracy of claims |
| Numeric | 5 - 0.5×mismatch - 0.25×omission | Number/dosage accuracy |
| Coverage | (flags_present / 5) × 5 | Vitals, meds, assessment, safety, followup |
| Generalist | mean(factual, complete, readable) | Overall quality judgment |
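
The per-note dimension formulas can be sketched as below. Clamping to the 0-5 range is an assumption based on the stated score scale, not something the table specifies, and the function and flag names are illustrative.

```python
COVERAGE_FLAGS = ("vitals", "meds", "assessment", "safety", "followup")


def clamp(score, low=0.0, high=5.0):
    """Keep a score inside the 0-5 scale (assumed; the table does not state clamping)."""
    return max(low, min(high, score))


def evidence_score(minor, major):
    """Evidence = 5 - 1 x minor - 3 x major."""
    return clamp(5 - 1 * minor - 3 * major)


def numeric_score(mismatches, omissions):
    """Numeric = 5 - 0.5 x mismatch - 0.25 x omission."""
    return clamp(5 - 0.5 * mismatches - 0.25 * omissions)


def coverage_score(present_flags):
    """Coverage = (flags_present / 5) x 5."""
    present = sum(1 for flag in COVERAGE_FLAGS if flag in present_flags)
    return present / len(COVERAGE_FLAGS) * 5


def generalist_score(factual, complete, readable):
    """Generalist = mean(factual, complete, readable)."""
    return (factual + complete + readable) / 3


# One note with 1 minor and 1 major hallucination, 2 numeric mismatches, 3 of 5 sections.
print(evidence_score(1, 1), numeric_score(2, 0), coverage_score({"vitals", "meds", "assessment"}))
```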

### Judge Panels
| Model Evaluated | Judges (3 per eval) |
|-----------------|---------------------|
| gpt-5.2, GPT-5 | gemini_flash, claude_haiku, kimi_instruct |
| claude-opus-4-5 | gemini_flash, gpt5_mini, kimi_instruct |
| gemini-3-pro-preview | gpt5_mini, claude_haiku, kimi_instruct |
| Kimi-K2-Thinking | gemini_flash, gpt5_mini, claude_haiku |

*Same-family judges excluded to prevent bias.*
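
The same-family exclusion can be sketched as a filter over a pool of four judges. The family labels below are assumptions inferred from the model and judge names, not taken from the evaluation code; with them, the filter reproduces the panels in the table above.

```python
# Assumed family labels for judges and evaluated models (hypothetical mapping).
JUDGE_FAMILY = {
    "gemini_flash": "gemini",
    "gpt5_mini": "gpt",
    "claude_haiku": "claude",
    "kimi_instruct": "kimi",
}

MODEL_FAMILY = {
    "gpt-5.2": "gpt",
    "GPT-5": "gpt",
    "claude-opus-4-5": "claude",
    "gemini-3-pro-preview": "gemini",
    "Kimi-K2-Thinking": "kimi",
}


def judges_for(model):
    """Return every judge whose family differs from the evaluated model's family."""
    family = MODEL_FAMILY[model]
    return [judge for judge, fam in JUDGE_FAMILY.items() if fam != family]


for model in MODEL_FAMILY:
    print(model, "->", ",".join(judges_for(model)))
```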

---

## Reproduce Results

```bash
# Generate SOAP notes (output dir auto-derived from model env var)
python scripts/generate_soap.py --model kimi

# Run evaluation
python scripts/run_evaluation.py \
    --a_root data/outputs/omi_soap_edge_v1 \
    --b_root data/outputs/Kimi-K2-Thinking \
    --a_label "Omi-SOAP-edge-v1" \
    --b_label "Kimi-K2-Thinking" \
    --judges gemini_flash,gpt5_mini,claude_haiku
```

---

## Changelog

- **2025-12-14**: Added Kimi-K2-Thinking evaluation. Ranks #4 with 2.7x major hallucination risk vs Omi.
- **2025-12-13**: Full 300-dialogue evaluations for gpt-5.2, gemini-3-pro-preview, claude-opus-4-5. gpt-5.2 takes #1.
- **2025-12-11**: Full 300-dialogue GPT-5 evaluation with 3 judges.
- **2025-09-07**: Initial benchmark with Omi-SOAP-edge-v1.

---

**Maintained by [Omi Health B.V.](https://www.omi.health)**