# SOAP Note Evaluation Leaderboard Benchmark comparing clinical SOAP note generation models on 300 synthetic doctor-patient dialogues. --- ## Overall Rankings | Rank | Model | Composite | Safety | Evidence | Coverage | Generalist | |:----:|-------|:---------:|:------:|:--------:|:--------:|:----------:| | 1 | **gpt-5.2** | **4.723** | 4.543 | 4.358 | 4.954 | 4.824 | | 2 | gemini-3-pro-preview | 4.699 | 4.541 | 4.357 | 4.864 | 4.848 | | 3 | Omi-SOAP-edge-v1* | 4.654 | 4.547 | 4.421 | 4.862 | 4.595 | | 4 | Kimi-K2-Thinking | 4.546 | 4.217 | 3.890 | 4.906 | 4.828 | | 5 | claude-opus-4-5 | 4.543 | 4.202 | 3.870 | 4.947 | 4.793 | | 6 | GPT-5 | 4.285 | 3.805 | 3.316 | 4.843 | 4.646 | *All scores 0-5 scale. Higher = better.* **Omi scores averaged across 5 evaluations with different judge panels.* --- ## Hallucination Risk Analysis Using **Omi-SOAP-edge-v1 as baseline (1.0x risk)**: | Model | Major Halls/Note | Risk vs Omi | Minor Halls/Note | Risk vs Omi | |-------|:----------------:|:-----------:|:----------------:|:-----------:| | **gpt-5.2** | 0.114 | **0.89x** | 0.358 | 1.85x | | **gemini-3-pro-preview** | 0.127 | **0.99x** | 0.280 | 1.45x | | Omi-SOAP-edge-v1 | 0.128 | 1.00x | 0.193 | 1.00x | | Kimi-K2-Thinking | 0.351 | 2.74x | 0.382 | 1.98x | | claude-opus-4-5 | 0.397 | 3.10x | 0.191 | 0.99x | | GPT-5 | 0.553 | 4.32x | 0.382 | 1.98x | ### Key Findings **Major Hallucinations (clinical fabrications - high risk):** - gpt-5.2 and gemini-3-pro are **safest** (~0.9-1.0x Omi baseline) - Kimi-K2-Thinking has **2.7x more** major errors than Omi - claude-opus-4-5 has **3.1x more** major errors than Omi - GPT-5 has **4.3x more** major errors than Omi **Minor Hallucinations (wording/citation issues - low risk):** - Omi and Claude have fewest minor errors - gpt-5.2 trades minor errors for major accuracy (2x minor, 0.9x major) --- ## Majority Agreement on Major Errors *% of dialogues where 2+ judges flagged a major hallucination:* | Model | Majority Major Rate | Interpretation | |-------|:-------------------:|----------------| | gpt-5.2 | 4.0% | 12 of 300 dialogues | | Omi-SOAP-edge-v1 | 6.7% | 20 of 300 dialogues | | gemini-3-pro-preview | 8.0% | 24 of 300 dialogues | | Kimi-K2-Thinking | 19.3% | 58 of 300 dialogues | | claude-opus-4-5 | 25.3% | 76 of 300 dialogues | | GPT-5 | 36.7% | 110 of 300 dialogues | --- ## Head-to-Head Results All comparisons vs Omi-SOAP-edge-v1: | Opponent | Omi Wins | Opponent Wins | Ties | H2H Winner | |----------|:--------:|:-------------:|:----:|:----------:| | gpt-5.2 | 53 (17.7%) | **96 (32.0%)** | 151 (50.3%) | **gpt-5.2** | | gemini-3-pro-preview | 56 (18.7%) | **90 (30.0%)** | 154 (51.3%) | **gemini-3-pro** | | Kimi-K2-Thinking | **100 (33.3%)** | 79 (26.3%) | 121 (40.3%) | **Omi** | | claude-opus-4-5 | **105 (35.0%)** | 59 (19.7%) | 136 (45.3%) | **Omi** | | GPT-5 | **148 (49.3%)** | 43 (14.3%) | 109 (36.3%) | **Omi** | *Tie threshold: |composite diff| < 0.25* --- ## Model Profiles ### Tier 1: Leaders **gpt-5.2** - Best overall - Highest composite (4.723) and coverage (4.954) - Lowest major hallucination rate (0.89x Omi) - Beats Omi in 32% of dialogues, loses only 17.7% **gemini-3-pro-preview** - Best readability - Highest generalist score (4.848) - judges find it most readable - Nearly identical safety to gpt-5.2 - Beats Omi in 30% of dialogues ### Tier 2: Baseline **Omi-SOAP-edge-v1** - Conservative & safe - Lowest minor hallucination rate - Balanced safety-first approach - Reference baseline for comparisons ### Tier 3: Tradeoffs **Kimi-K2-Thinking** - High quality, moderate risk - Excellent generalist score (4.828) and coverage (4.906) - But 2.7x more major hallucinations than Omi - Closest competitor to Omi in head-to-head (33% vs 26%) **claude-opus-4-5** - Completeness vs accuracy - Highest coverage (4.947) - most thorough notes - But 3.1x more major hallucinations than Omi - Safety penalty outweighs completeness gains **GPT-5** - Superseded - 4.3x more major hallucinations than Omi - Replaced by gpt-5.2 in rankings --- ## Scoring Methodology ### Composite Formula ``` Composite = 0.5 × Safety + 0.3 × Coverage + 0.2 × Generalist Safety = 0.7 × Evidence + 0.3 × Numeric ``` ### Dimension Definitions | Dimension | Formula | What It Measures | |-----------|---------|------------------| | Evidence | 5 - 1×minor - 3×major | Accuracy of claims | | Numeric | 5 - 0.5×mismatch - 0.25×omission | Number/dosage accuracy | | Coverage | (flags_present / 5) × 5 | Vitals, meds, assessment, safety, followup | | Generalist | mean(factual, complete, readable) | Overall quality judgment | ### Judge Panels | Model Evaluated | Judges (3 per eval) | |-----------------|---------------------| | gpt-5.2, GPT-5 | gemini_flash, claude_haiku, kimi_instruct | | claude-opus-4-5 | gemini_flash, gpt5_mini, kimi_instruct | | gemini-3-pro-preview | gpt5_mini, claude_haiku, kimi_instruct | | Kimi-K2-Thinking | gemini_flash, gpt5_mini, claude_haiku | *Same-family judges excluded to prevent bias.* --- ## Reproduce Results ```bash # Generate SOAP notes (output dir auto-derived from model env var) python scripts/generate_soap.py --model kimi # Run evaluation python scripts/run_evaluation.py \ --a_root data/outputs/omi_soap_edge_v1 \ --b_root data/outputs/Kimi-K2-Thinking \ --a_label "Omi-SOAP-edge-v1" \ --b_label "Kimi-K2-Thinking" \ --judges gemini_flash,gpt5_mini,claude_haiku ``` --- ## Changelog - **2025-12-14**: Added Kimi-K2-Thinking evaluation. Ranks #4 with 2.7x hallucination risk vs Omi. - **2025-12-13**: Full 300-dialogue evaluations for gpt-5.2, gemini-3-pro-preview, claude-opus-4-5. gpt-5.2 takes #1. - **2025-12-11**: Full 300-dialogue GPT-5 evaluation with 3 judges. - **2025-09-07**: Initial benchmark with Omi-SOAP-edge-v1. --- **Maintained by [Omi Health B.V.](https://www.omi.health)**