| Rank | Model | Score | Confidence Interval | Visualization |
|---|---|---|---|---|
| 1 | o1-preview | 0.179404 | 0.179064 [0.155861, 0.200286] | |
| 2 | gpt-4o | 0.178305 | 0.178198 [0.156553, 0.200585] | |
| 3 | deepseek-chat | 0.167105 | 0.166883 [0.144720, 0.191760] | |
| 4 | gemini-2.0-flash-thinking-exp-1219 | 0.164732 | 0.165014 [0.142054, 0.187329] | |
| 5 | claude-3-5-sonnet-latest | 0.155571 | 0.155903 [0.133843, 0.177003] | |
| 6 | gemini-exp-1206 | 0.154884 | 0.154936 [0.133611, 0.179618] | |

| Comparison | Significance |
|---|---|
| o1-preview_vs_gpt-4o | Not significant |
| gpt-4o_vs_deepseek-chat | Not significant |
| deepseek-chat_vs_gemini-2.0-flash-thinking-exp-1219 | Not significant |
| gemini-2.0-flash-thinking-exp-1219_vs_claude-3-5-sonnet-latest | Not significant |
| claude-3-5-sonnet-latest_vs_gemini-exp-1206 | Not significant |

| Rank | Model | Score | Visualization |
|---|---|---|---|
| 1 | o1-preview | 8.8571 | |
| 2 | gemini-exp-1206 | 8.8333 | |
| 3 | deepseek-chat | 8.5000 | |
| 4 | gemini-2.0-flash-thinking-exp-1219 | 8.0455 | |
| 5 | gpt-4o | 7.9231 | |
| 6 | claude-3-5-sonnet-latest | 6.8571 | |

| Rank | Model | Score | Visualization |
|---|---|---|---|
| 1 | gpt-4o | 8.3333 | |
| 2 | deepseek-chat | 8.0000 | |
| 3 | gemini-exp-1206 | 8.0000 | |
| 4 | claude-3-5-sonnet-latest | 7.8889 | |
| 5 | gemini-2.0-flash-thinking-exp-1219 | 7.7500 | |
| 6 | o1-preview | 7.5000 | |

| Rank | Model | Score | Visualization |
|---|---|---|---|
| 1 | gemini-2.0-flash-thinking-exp-1219 | 7.0000 | |
| 2 | claude-3-5-sonnet-latest | 6.8571 | |
| 3 | gemini-exp-1206 | 6.5714 | |
| 4 | gpt-4o | 6.1667 | |
| 5 | o1-preview | 5.8333 | |
| 6 | deepseek-chat | 4.3333 | |

| Rank | Model | Score | Visualization |
|---|---|---|---|
| 1 | gpt-4o | 8.5000 | |
| 2 | deepseek-chat | 7.1667 | |
| 3 | gemini-exp-1206 | 6.7143 | |
| 4 | o1-preview | 6.2000 | |
| 5 | gemini-2.0-flash-thinking-exp-1219 | 6.1429 | |
| 6 | claude-3-5-sonnet-latest | 5.0000 | |

| Rank | Model | Score | Visualization |
|---|---|---|---|
| 1 | o1-preview | 8.8000 | |
| 2 | deepseek-chat | 8.7667 | |
| 3 | gemini-exp-1206 | 8.6111 | |
| 4 | gpt-4o | 8.2121 | |
| 5 | gemini-2.0-flash-thinking-exp-1219 | 8.2069 | |
| 6 | claude-3-5-sonnet-latest | 6.9655 | |

| Rank | Model | Score | Visualization |
|---|---|---|---|
| 1 | gemini-exp-1206 | 9.2500 | |
| 2 | o1-preview | 8.6667 | |
| 3 | deepseek-chat | 8.5000 | |
| 4 | claude-3-5-sonnet-latest | 8.0000 | |
| 5 | gemini-2.0-flash-thinking-exp-1219 | 7.3333 | |
| 6 | gpt-4o | 7.0000 | |
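The tables do not say how the "Significance" column in the pairwise comparison table was computed. As a rough illustration only, the sketch below assumes one simple heuristic: two adjacent models are treated as not significantly different when their confidence intervals from the first table overlap. The function names and the "Significant" label are hypothetical, and interval overlap is a conservative rule of thumb rather than a formal test.

```python
# Confidence interval bounds copied from the first table: model -> (lower, upper).
# Models are listed in rank order (Python dicts preserve insertion order).
intervals = {
    "o1-preview": (0.155861, 0.200286),
    "gpt-4o": (0.156553, 0.200585),
    "deepseek-chat": (0.144720, 0.191760),
    "gemini-2.0-flash-thinking-exp-1219": (0.142054, 0.187329),
    "claude-3-5-sonnet-latest": (0.133843, 0.177003),
    "gemini-exp-1206": (0.133611, 0.179618),
}


def intervals_overlap(a, b):
    """Return True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


def adjacent_comparisons(ranked):
    """Compare each model with the next-ranked one using the overlap heuristic.

    ASSUMPTION: overlap means "Not significant". The "Significant" label is
    hypothetical; it never appears in the source tables.
    """
    models = list(ranked)
    results = {}
    for first, second in zip(models, models[1:]):
        overlap = intervals_overlap(ranked[first], ranked[second])
        results[f"{first}_vs_{second}"] = (
            "Not significant" if overlap else "Significant"
        )
    return results


if __name__ == "__main__":
    for name, verdict in adjacent_comparisons(intervals).items():
        print(f"{name}: {verdict}")
```

Under this assumed rule, every adjacent pair overlaps, which is consistent with the "Not significant" entries reported in the comparison table. That agreement does not confirm that the original analysis used this method.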