name,length_controlled_winrate,win_rate,avg_length,link,samples,filter
gemma-2-9b-it-SimPO,72.3508446939842,65.86422561532919,1833,https://huggingface.co/princeton-nlp/gemma-2-9b-it-SimPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gemma-2-9b-it-SimPO/model_outputs.json,community
OpenPipe MoA GPT-4 Turbo,68.37866250336802,63.15493451236265,1856,https://openpipe.ai/blog/mixture-of-agents,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openpipe-moa-gpt-4-turbo-v1/model_outputs.json,community
gemma-2-9b-it-DPO,67.6620382198043,65.35922380122982,2016,https://huggingface.co/princeton-nlp/gemma-2-9b-it-DPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gemma-2-9b-it-DPO/model_outputs.json,community
Together MoA,65.37996976852163,59.8688062333292,1825,https://github.com/togethercomputer/moa,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Together-MoA/model_outputs.json,community
Storm-7B (best-of-64),61.63789557199839,63.04099075186919,2340,https://huggingface.co/jieliu/Storm-7B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Storm-7B-best-of-64/model_outputs.json,community
Together MoA-Lite,59.1415240989275,56.593045622273294,1968,https://github.com/togethercomputer/moa,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Together-MoA-Lite/model_outputs.json,community
Aligner 2B+GPT-4 Turbo (04/09),58.33130206276722,46.77089325668323,1370,https://github.com/AlignInc/aligner-replication,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/aligner-2b_gpt-4-turbo-2024-04-09/model_outputs.json,community
GPT-4 Omni (05/13),57.45682883335095,51.32757578249279,1873,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt-4o-2024-05-13/model_outputs.json,minimal
Higgs-Llama-3-70B V2,56.76317433000503,68.63519246435168,2657,https://boson.ai/,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/higgs-llama-3-70b-v2/model_outputs.json,community
GPT-4 Turbo (04/09),55.01530093647852,46.11526538763708,1802,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt-4-turbo-2024-04-09/model_outputs.json,minimal
SPPO-Gemma-2-9B-It-PairRM,53.96983730150777,48.23404468746583,1803,https://huggingface.co/UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/SPPO-Gemma-2-9B-It-PairRM/model_outputs.json,community
Llama-3-Instruct-8B-WPO-HB-v2,53.37264268894168,57.33198613024009,2472,https://huggingface.co/wzhouad/Llama3-Instruct-8B-WPO-HB-v2,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Llama-3-Instruct-8B-WPO-HB-v2/model_outputs.json,community
Claude 3.5 Sonnet (06/20),52.36675427146999,40.56021409682828,1488,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/claude-3-5-sonnet-20240620/model_outputs.json,community
Yi-Large Preview,51.894415134099546,57.46724251946292,2335,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/yi-large-preview/model_outputs.json,verified
Storm-7B,50.45110959343775,50.26886905528583,2045,https://huggingface.co/jieliu/Storm-7B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Storm-7B/model_outputs.json,community
GPT-4 Preview (11/06),50.0,50.0,2049,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt4_1106_preview/model_outputs.json,minimal
ExPO + Llama-3-Instruct-8B-SimPO,45.78021783946177,40.63285400856655,1765,https://huggingface.co/chujiezheng/Llama-3-Instruct-8B-SimPO-ExPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Llama-3-Instruct-8B-SimPO-ExPO/model_outputs.json,community
Llama-3-Instruct-8B-SimPO,44.65131348921881,40.52977498461182,1825,https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SimPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Llama-3-Instruct-8B-SimPO/model_outputs.json,community
Nanbeige Plus Chat v0.1,44.45966240337981,56.70300973017392,2587,https://huggingface.co/spaces/Nanbeige/Nanbeige-Plus-Chat-v0.1,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Nanbeige-Plus-Chat-v0.1/model_outputs.json,community
Qwen1.5 110B Chat,43.90555221078692,33.77709527565118,1631,https://huggingface.co/Qwen/Qwen1.5-110B-Chat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Qwen1.5-110B-Chat/model_outputs.json,community
Aligner 2B+Claude 3 Opus,41.823071715247664,34.46337362321739,1669,https://github.com/AlignInc/aligner-replication,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/aligner-2b_claude-3-opus-20240229/model_outputs.json,community
Nanbeige2 16B Chat,40.591286349562864,37.03608605005168,1867,https://huggingface.co/Nanbeige/Nanbeige2-16B-Chat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Nanbeige2-16B-Chat/model_outputs.json,community
Claude 3 Opus (02/29),40.5095080124761,29.10526953334248,1388,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/claude-3-opus-20240229/model_outputs.json,minimal
Llama 3.1 405B Instruct,39.25732749961743,39.10666895419877,1988,https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Meta-Llama-3.1-405B-Instruct-Turbo/model_outputs.json,minimal
SPPO-Llama-3-Instruct-8B-PairRM,38.56280663670214,39.67286090605648,2066,https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/SPPO-Llama-3-Instruct-8B-PairRM/model_outputs.json,community
GPT-4,38.12808974440021,23.576789314782605,1365,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt4/model_outputs.json,verified
Llama 3.1 70B Instruct,38.05512453607286,39.12691443804968,2044,https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Meta-Llama-3.1-70B-Instruct-Turbo/model_outputs.json,minimal
Infinity-Instruct-3M-0625-Llama3-70B,37.97881098506053,24.277231851026183,1294,https://huggingface.co/BAAI/Infinity-Instruct-3M-0625-Llama3-70B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Infinity-Instruct-3M-0625-Llama3-70B/model_outputs.json,community
Aligner 2B+Qwen1.5 72B Chat,36.725868878524274,31.773037737123104,1812,https://github.com/AlignInc/aligner-replication,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/aligner-2b_qwen1.5-72b-chat/model_outputs.json,community
Qwen1.5 72B Chat,36.571754111987296,26.49828339562733,1549,https://huggingface.co/Qwen/Qwen1.5-72B-Chat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Qwen1.5-72B-Chat/model_outputs.json,verified
GPT-4 (03/14),35.30706121640206,22.073258928708075,1371,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt4_0314/model_outputs.json,verified
Ein 70B v0.1,35.029054008520646,24.84472049689441,1467,https://huggingface.co/SF-Foundation/EinBase-70B-v0.1-full,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Ein-70B-v0.1/model_outputs.json,community
Claude 3 Sonnet (02/29),34.87247436243302,25.556325292273296,1420,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/claude-3-sonnet-20240229/model_outputs.json,minimal
FsfairX-Zephyr-Chat-v0.1,34.78744762311656,35.94648644102434,2275,https://huggingface.co/sfairXC/FsfairX-Zephyr-Chat-v0.1,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/FsfairX-Zephyr-Chat-v0.1/model_outputs.json,community
Llama 3 70B Instruct,34.42459717459881,33.17785695886864,1919,https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Meta-Llama-3-70B-Instruct/model_outputs.json,minimal
Mistral Large (24/02),32.65207998531868,21.43877598137888,1362,https://mistral.ai/news/la-plateforme/,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/mistral-large-2402/model_outputs.json,verified
ExPO + SPPO-Mistral7B-PairRM,31.822321960655582,35.4431306716895,2288,https://huggingface.co/chujiezheng/Mistral7B-PairRM-SPPO-ExPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/SPPO-Mistral7B-PairRM-ExPO/model_outputs.json,community
merlinite-7B-AOT,31.721885287042845,29.89635084070223,1855,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/merlinite-7B-AOT/model_outputs.json,community
Infinity-Instruct-3M-0613-Llama3-70B,31.525606214845013,19.265008711394984,1192,https://huggingface.co/BAAI/Infinity-Instruct-3M-0613-Llama3-70B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Infinity-Instruct-3M-0613-Llama3-70B/model_outputs.json,community
Samba CoE v0.2 (best-of-16),31.506544268148147,26.988254318335404,1578,https://coe-1.cloud.snova.ai/,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Samba-CoE-v0.2-best-of-16/model_outputs.json,community
Infinity-Instruct-3M-0625-Mistral-7B,31.42101004652769,21.087714332440324,1305,https://huggingface.co/BAAI/Infinity-Instruct-3M-0625-Mistral-7B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Infinity-Instruct-3M-0625-Mistral-7B/model_outputs.json,community
REBEL-Llama-3-8B-Instruct,31.40409226280724,34.30642383142354,2372,https://huggingface.co/Cornell-AGI/REBEL-Llama-3,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/REBEL-Llama-3-8B-Instruct/model_outputs.json,community
Mixtral 8x22B v0.1,30.878810294279383,22.21017054750302,1445,https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Mixtral-8x22B-Instruct-v0.1/model_outputs.json,verified
SPPO-Mistral7B-PairRM,30.494137965217423,32.2453123637764,2114,https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/SPPO-Mistral7B-PairRM/model_outputs.json,community
GPT-4 (06/13),30.18332231673423,15.75503808763975,1140,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt4_0613/model_outputs.json,verified
Snorkel (Mistral-PairRM-DPO+best-of-16),29.974321613074405,34.8601328912795,2616,https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Snorkel-Mistral-PairRM-DPO-best-of-16/model_outputs.json,community
Contextual AI (KTO-Mistral-PairRM),29.705808939683976,33.227355200024846,2521,https://huggingface.co/ContextualAI/Contextual_KTO_Mistral_PairRM,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Contextual-KTO-Mistral-PairRM/model_outputs.json,verified
PairRM 0.4B+Yi-34B-Chat (best-of-16),28.81484086684313,31.24128294680746,2195,https://huggingface.co/llm-blender/PairRM,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/pairrm-Yi-34B-Chat/model_outputs.json,community
Mistral Medium,28.614337401726104,21.855772543652176,1500,https://mistral.ai/news/la-plateforme/,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/mistral-medium/model_outputs.json,verified
Claude 2,28.155196141629148,17.188240356708075,1069,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/claude-2/model_outputs.json,verified
Samba CoE v0.2,27.62426735006872,21.847378669267083,1469,https://coe-1.cloud.snova.ai/,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Samba-CoE-v0.2/model_outputs.json,community
Infinity-Instruct-3M-0625-Llama3-8B,27.518835489680203,19.364378673728307,1336,https://huggingface.co/BAAI/Infinity-Instruct-3M-0625-Llama3-8B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Infinity-Instruct-3M-0625-Llama3-8B/model_outputs.json,community
Claude,27.289504443727107,16.98534361236025,1082,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/claude/model_outputs.json,verified
ExPO + InternLM2 Chat 20B,27.225759480731792,46.185367468861,3335,https://huggingface.co/chujiezheng/internlm2-chat-20b-ExPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/internlm2-chat-20b-ExPO/model_outputs.json,community
Yi 34B Chat,27.19054787762733,29.65994671879504,2123,https://huggingface.co/01-ai/Yi-34B-Chat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Yi-34B-Chat/model_outputs.json,verified
ExPO + Starling LM 7B beta,26.411156713811028,29.600851847906423,2215,https://huggingface.co/chujiezheng/Starling-LM-7B-beta-ExPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Starling-LM-7B-beta-ExPO/model_outputs.json,community
Snorkel (Mistral-PairRM-DPO),26.39144645733206,30.220052700671644,2736,https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Snorkel-Mistral-PairRM-DPO/model_outputs.json,community
ExPO + Tulu-2-DPO-70B,25.72330817134933,22.98061970610497,1738,https://huggingface.co/chujiezheng/tulu-2-dpo-70b-ExPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/tulu-2-dpo-70b-ExPO/model_outputs.json,community
Claude Instant 1.2,25.61225902543337,16.12739962159006,1112,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/claude-instant-1.2/model_outputs.json,community
Infinity-Instruct-3M-0613-Mistral-7B,25.501557794727287,15.747828130770788,1180,https://huggingface.co/BAAI/Infinity-Instruct-3M-0613-Mistral-7B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Infinity-Instruct-3M-0613-Mistral-7B/model_outputs.json,community
DBRX Instruct,25.37544974044448,18.44834898407453,1450,https://huggingface.co/databricks/dbrx-instruct,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/dbrx-instruct/model_outputs.json,verified
Claude 2.1,25.251943886133027,15.733506736409938,1096,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/claude-2.1/model_outputs.json,verified
Nanbeige2 8B Chat,25.24207090175315,39.35450207219922,2709,https://huggingface.co/Nanbeige/Nanbeige2-8B-Chat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Nanbeige2-8B-Chat/model_outputs.json,community
XwinLM 70b V0.1,24.649686057119272,21.812957073875776,1775,https://github.com/Xwin-LM/Xwin-LM,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/xwinlm-70b-v0.1/model_outputs.json,community
Gemini Pro,24.38177610802152,18.177644540571432,1456,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gemini-pro/model_outputs.json,minimal
Qwen1.5 14B Chat,23.89664677030536,18.645814361932988,1607,https://huggingface.co/Qwen/Qwen1.5-14B-Chat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Qwen1.5-14B-Chat/model_outputs.json,verified
Mixtral 8x7B v0.1,23.68848260134481,18.25531762637268,1465,https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Mixtral-8x7B-Instruct-v0.1/model_outputs.json,minimal
Evo v2 7B,23.35770570204821,20.834113022583853,1754,https://evolusion.ai,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/evo-v2-7b/model_outputs.json,community
Ghost 8B Beta (d0x5),23.117114799655457,29.140548966689888,2430,https://ghost-x.org/docs/models/ghost-8b-beta,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/ghost-8b-beta-disl-0x5/model_outputs.json,community
Llama 3 8B Instruct,22.91878467313347,22.56990260931677,1899,https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Meta-Llama-3-8B-Instruct/model_outputs.json,minimal
Samba CoE v0.1,22.865837334795227,16.835501870062114,1316,https://coe-1.cloud.snova.ai/,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Samba-CoE-v0.1/model_outputs.json,community
GPT 3.5 Turbo (06/13),22.720189163383225,14.13239070746584,1328,,,verified
ExPO + InternLM2 Chat 7B,22.66748024879648,28.067817437082898,2390,https://huggingface.co/chujiezheng/internlm2-chat-7b-ExPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/internlm2-chat-7b-ExPO/model_outputs.json,community
GPT 3.5 Turbo (06/13),22.35251298054288,14.09579857390062,1331,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt-3.5-turbo-0613/model_outputs.json,community
Infinity-Instruct-3M-0625-Qwen2-7B,21.87399673499932,15.322182555525842,1315,https://huggingface.co/BAAI/Infinity-Instruct-3M-0625-Qwen2-7B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Infinity-Instruct-3M-0625-Qwen2-7B/model_outputs.json,community
PairRM 0.4B+Tulu 2+DPO 70B (best-of-16),21.428403975507223,18.638962967441,1607,https://huggingface.co/llm-blender/PairRM,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/pairrm-tulu-2-70b/model_outputs.json,community
Tulu 2+DPO 70B,21.238610038371124,15.982854374136648,1418,https://huggingface.co/allenai/tulu-2-dpo-70b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/tulu-2-dpo-70b/model_outputs.json,verified
Llama 3.1 8B Instruct,20.85398744758185,21.841523410839937,2181,https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Meta-Llama-3.1-8B-Instruct-Turbo/model_outputs.json,minimal
Mistral-7B-ReMax-v0.1,20.55136770233589,15.999331369031056,1478,https://huggingface.co/ziniuli/Mistral-7B-ReMax-v0.1,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Mistral-7B-ReMax-v0.1/model_outputs.json,community
Infinity-Instruct-3M-0625-Yi-1.5-9B,20.538372631222003,16.203844277153284,1449,https://huggingface.co/BAAI/Infinity-Instruct-3M-0625-Yi-1.5-9B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Infinity-Instruct-3M-0625-Yi-1.5-9B/model_outputs.json,community
ExPO + Starling LM 7B alpha,19.4741654606294,18.17975592036216,1821,https://huggingface.co/chujiezheng/Starling-LM-7B-alpha-ExPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Starling-LM-7B-alpha-ExPO/model_outputs.json,community
GPT 3.5 Turbo (11/06),19.30058903498905,9.177964561962735,796,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt-3.5-turbo-1106/model_outputs.json,verified
LMCocktail-10.7B-v1,18.950710386651053,13.153430917391304,1203,https://huggingface.co/Yhyu13/LMCocktail-10.7B-v1,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/LMCocktail-10.7B-v1/model_outputs.json,community
InternLM2 Chat 20B,18.748739485433603,21.74915450048448,2373,https://huggingface.co/internlm/internlm2-chat-20b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/internlm2-chat-20b-ppo/model_outputs.json,community
GPT 3.5 Turbo (03/01),18.09324155198033,9.622453295105588,827,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt-3.5-turbo-0301/model_outputs.json,verified
XwinLM 13b V0.1,17.918937898189796,17.42793475019876,1894,https://github.com/Xwin-LM/Xwin-LM,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/xwinlm-13b-v0.1/model_outputs.json,community
DeepSeek LLM 67B Chat,17.843384089909343,12.093422264919258,1151,https://huggingface.co/deepseek-ai/deepseek-llm-67b-chat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/deepseek-llm-67b-chat/model_outputs.json,community
GPT-3.5,17.72780108286588,8.462446504415423,1018,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt35_turbo_instruct/model_outputs.json,community
ExPO + Tulu-2-DPO-13B,17.591404469940848,15.551405429399557,1649,https://huggingface.co/chujiezheng/tulu-2-dpo-13b-ExPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/tulu-2-dpo-13b-ExPO/model_outputs.json,community
WizardLM 70B,17.575060737493747,14.383896086782608,1545,https://huggingface.co/WizardLM/WizardLM-70B-V1.0,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/wizardlm-70b/model_outputs.json,community
Vicuna 33B v1.3,17.574575310874923,12.705947921540371,1479,https://huggingface.co/lmsys/vicuna-33b-v1.3,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/vicuna-33b-v1.3/model_outputs.json,verified
PairRM 0.4B+Tulu 2+DPO 13B (best-of-16),17.40520369795085,13.831901016757762,1454,https://huggingface.co/llm-blender/PairRM,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/pairrm-tulu-2-13b/model_outputs.json,community
Conifer-7B-DPO,17.11249588276248,11.31358564916222,1253,https://github.com/ConiferLM/Conifer,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Conifer-7B-DPO/model_outputs.json,community
Mistral 7B v0.2,17.111251846021165,14.722772657714286,1676,https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Mistral-7B-Instruct-v0.2/model_outputs.json,minimal
Evo 7B,16.489386004239325,15.577437399527952,1774,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/evo-7b/model_outputs.json,community
Humpback LLaMa2 70B,16.249164231428974,10.121771502645965,1107,https://arxiv.org/abs/2308.06259,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/humpback-llama2-70b/model_outputs.json,community
OpenHermes-2.5-Mistral (7B),16.248577696674843,10.340415705751552,1107,https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/OpenHermes-2.5-Mistral-7B/model_outputs.json,verified
DEITA 7B v1.0,16.05901353966741,12.646639472385097,1417,https://github.com/hkust-nlp/deita,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/deita-7b-v1.0/model_outputs.json,community
JinaChat,15.866004049505932,7.786130393366459,676,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/jina-chat/model_outputs.json,community
TempNet-LLaMA2-Chat-70B-v0.1,15.831162778430024,15.051894420220444,1830,https://github.com/zhqiu/TempNet,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/TempNet-LLaMA2-Chat-70B-v0.1/model_outputs.json,community
CausalLM-14B,15.72032518895564,11.146160869950313,1391,https://huggingface.co/CausalLM/14B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/causallm-14b/model_outputs.json,community
PairRM 0.4B+Zephyr 7B Beta (best-of-16),15.529867294986612,12.84127825562733,1487,https://huggingface.co/llm-blender/PairRM,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/pairrm-zephyr-7b-beta/model_outputs.json,community
Qwen1.5 7B Chat,14.748431044267305,11.770927069605952,1594,https://huggingface.co/Qwen/Qwen1.5-7B-Chat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Qwen1.5-7B-Chat/model_outputs.json,verified
Mistral-ORPO-Beta,14.716749430705242,12.565408794559003,1636,https://huggingface.co/kaist-ai/mistral-orpo-beta,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/mistral-orpo-beta/model_outputs.json,community
Starling LM 7B alpha,14.690471079424972,14.24592352162733,1895,https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Starling-LM-7B-alpha/model_outputs.json,community
LLaMA2 Chat 70B,14.689648588392544,13.88825834374378,1790,https://ai.meta.com/llama/,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/llama-2-70b-chat-hf/model_outputs.json,verified
OpenChat V3.1 13B,14.50338795683784,11.082230489416148,1484,https://github.com/imoneoi/openchat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openchat-v3.1-13b/model_outputs.json,community
WizardLM 13B V1.2,14.462590694316631,12.027480342770186,1635,https://huggingface.co/WizardLM/WizardLM-13B-V1.2,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/wizardlm-13b-v1.2/model_outputs.json,community
UltraLM 13B V2.0 (best-of-16),14.198987566645036,13.853373471242236,1720,https://huggingface.co/openbmb/UltraRM-13b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/ultralm-13b-v2.0-best-of-16/model_outputs.json,community
ExPO + Zephyr 7B Beta,14.001211980232686,11.06111683239833,1405,https://huggingface.co/HuggingFaceH4/zephyr-7b-beta,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/zephyr-7b-beta-ExPO/model_outputs.json,community
WizardLM 13B V1.1,13.91572059284851,11.233909572857142,1525,https://huggingface.co/WizardLM/WizardLM-13B-V1.1,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/wizardlm-13b-v1.1/model_outputs.json,community
ExPO + Zephyr 7B Alpha,13.573089356781388,10.55935434569986,1248,https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/zephyr-7b-alpha-ExPO/model_outputs.json,community
Zephyr 7B Beta,13.203198493136666,10.992885755354038,1444,https://huggingface.co/HuggingFaceH4/zephyr-7b-beta,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/zephyr-7b-beta/model_outputs.json,community
Dolphin 2.2.1 Mistral 7B,13.121477650433736,9.039799728223604,1130,https://huggingface.co/cognitivecomputations/dolphin-2.2.1-mistral-7b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/dolphin-2.2.1-mistral-7b/model_outputs.json,community
Humpback LLaMa 65B,12.799859995893623,9.425139047801242,1232,https://arxiv.org/abs/2308.06259,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/humpback-llama-65b/model_outputs.json,community
OpenBudddy-LLaMA2-70B-v10.1,12.572173272324846,8.096422096285714,1077,https://huggingface.co/OpenBuddy/openbuddy-llama2-70b-v10.1-bf16,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openbuddy-llama2-70b-v10.1/model_outputs.json,community
OpenBuddy-LLaMA-65B-v8,12.469356289070015,8.77065015089441,1162,https://huggingface.co/OpenBuddy/openbuddy-llama-65b-v8-bf16,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openbuddy-llama-65b-v8/model_outputs.json,community
Qwen 14B Chat,12.378741790737235,7.502333484720497,1013,https://huggingface.co/Qwen/Qwen-14B-Chat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Qwen-14B-Chat/model_outputs.json,community
GPT-4 (Adversarial),12.188764057640531,3.7383373713788814,68,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt4_gamed/model_outputs.json,community
CUT 13B,12.154781753927743,10.779089202496897,1637,https://github.com/wwxu21/CUT,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/cut-13b/model_outputs.json,community
OpenChat V2-W 13B,12.03042777097436,9.615344158447204,1566,https://github.com/imoneoi/openchat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openchat-v2-w-13b/model_outputs.json,community
Vicuna 13B v1.5 (together),11.685356965300867,6.958275369254574,1071,https://huggingface.co/lmsys/vicuna-13b-v1.5,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/vicuna-13b-v1.5-togetherai/model_outputs.json,community
ExPO + Tulu-2-DPO-7B,11.675059099417426,11.529221038762383,1742,https://huggingface.co/chujiezheng/tulu-2-dpo-7b-ExPO,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/tulu-2-dpo-7b-ExPO/model_outputs.json,community
Tulu 2+DPO 13B,11.554479428088396,10.119788388347828,1614,https://huggingface.co/allenai/tulu-2-dpo-13b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/tulu-2-dpo-13b/model_outputs.json,community
Claude2 Alpaca 13B,11.498898213160734,7.437351324770187,1127,https://github.com/Lichang-Chen/claude2-alpaca,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/claude2-alpaca-13b/model_outputs.json,community
Minotaur 13B,11.46525131683203,5.738963669079602,881,https://huggingface.co/openaccess-ai-collective/minotaur-13b-fixed,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/minotaur-13b/model_outputs.json,community
airoboros 65B,11.007642406363166,9.38895014967702,1512,https://huggingface.co/jondurbin/airoboros-65b-gpt4-1.2,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/airoboros-65b/model_outputs.json,community
Cohere Command,10.893020886573929,12.901455209677016,1983,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/cohere/model_outputs.json,verified
Vicuna 13B v1.3,10.843164943694475,7.137240386509318,1132,https://huggingface.co/lmsys/vicuna-13b-v1.3,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/vicuna-13b-v1.3/model_outputs.json,verified
XwinLM 7b V0.1,10.812205627329451,11.245651737801245,1894,https://github.com/Xwin-LM/Xwin-LM,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/xwinlm-7b-v0.1/model_outputs.json,community
airoboros 33B,10.719002678100868,9.053160396124223,1514,https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.2,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/airoboros-33b/model_outputs.json,community
PlatoLM 7B,10.543402072797148,6.320828058468243,1344,https://huggingface.co/FreedomIntelligence/PlatoLM-7B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/platolm-7b/model_outputs.json,community
Vicuna 13B v1.5,10.484438298504218,6.722122014857143,1061,https://huggingface.co/lmsys/vicuna-13b-v1.5,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/vicuna-13b-v1.5/model_outputs.json,community
Gemma Instruct (7B),10.425760403690134,6.937294379677018,1115,https://huggingface.co/google/gemma-7b-it,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gemma-7b-it/model_outputs.json,verified
OpenChat V2 13B,10.399607338483346,8.435075644708077,1564,https://github.com/imoneoi/openchat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openchat-v2-13b/model_outputs.json,community
Zephyr 7B Alpha,10.289760888704258,8.352663968198758,1302,https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/zephyr-7b-alpha/model_outputs.json,community
OpenBuddy-LLaMA-30B-v7.1,10.214494991204496,6.130014613975155,968,https://huggingface.co/OpenBuddy/openbuddy-llama-30b-v7.1-bf16,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openbuddy-llama-30b-v7.1/model_outputs.json,community
UltraLM 13B (best-of-16),9.87608881694948,11.307314947751552,1980,https://huggingface.co/openbmb/UltraRM-13b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/ultralm-13b-best-of-16/model_outputs.json,community
LLaMA 33B OASST SFT,9.866412143759783,4.770390991565217,748,https://huggingface.co/OpenAssistant/oasst-sft-7-llama-30b-xor,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/oasst-sft-llama-33b/model_outputs.json,verified
WizardLM 13B,9.82815076877079,5.878152589354039,985,https://huggingface.co/WizardLM/WizardLM-13B-1.0,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/wizardlm-13b/model_outputs.json,verified
Nous Hermes 13B,9.717863417764642,5.411878933180125,844,https://huggingface.co/NousResearch/Nous-Hermes-13b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/nous-hermes-13b/model_outputs.json,verified
Vicuna 13B,9.222060023704104,5.831103184496894,1037,https://huggingface.co/lmsys/vicuna-13b-delta-v1.1,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/vicuna-13b/model_outputs.json,verified
Tulu 2+DPO 7B,9.200265611470332,8.19751538447205,1663,https://huggingface.co/allenai/tulu-2-dpo-7b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/tulu-2-dpo-7b/model_outputs.json,community
OpenBudddy-LLaMA2-13B-v11.1,9.159089775016035,6.174716489490684,1057,https://huggingface.co/OpenBuddy/openbuddy-llama2-13b-v11.1-bf16,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openbuddy-llama2-13b-v11.1/model_outputs.json,community
UltraLM 13B V2.0,9.129018444208118,7.504622955739131,1399,https://github.com/thunlp/UltraChat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/ultralm-13b-v2.0/model_outputs.json,community
Davinci001,9.025728852143091,2.764005231108344,296,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/text_davinci_001/model_outputs.json,verified
OpenBuddy-Falcon-40B-v9,8.988936477935635,5.955742846322981,1089,https://huggingface.co/OpenBuddy/openbuddy-falcon-40b-v9-bf16,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openbuddy-falcon-40b-v9/model_outputs.json,community
OpenChat-13B,8.806053491170802,8.022386010881988,1632,https://github.com/imoneoi/openchat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openchat-13b/model_outputs.json,community
TempNet-LLaMA2-Chat-13B-v0.1,8.57835531105755,7.728405066035775,1540,https://github.com/zhqiu/TempNet,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/TempNet-LLaMA2-Chat-13B-v0.1/model_outputs.json,community
LLaMA2 Chat 13B,8.436014548885215,7.702309957875775,1513,https://ai.meta.com/llama/,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/llama-2-13b-chat-hf/model_outputs.json,verified
Guanaco 65B,8.252916991586922,6.858494513378882,1249,https://huggingface.co/timdettmers/guanaco-65b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/guanaco-65b/model_outputs.json,verified
OpenCoderPlus-15B,8.152410155715494,7.40622245099379,1628,https://github.com/imoneoi/openchat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/opencoderplus-15b/model_outputs.json,community
LLaMA 33B OASST RLHF,7.970921837335629,6.296434785813666,1079,https://huggingface.co/OpenAssistant/oasst-rlhf-2-llama-30b-7k-steps-xor,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/oasst-rlhf-llama-33b/model_outputs.json,verified
OpenChat8192-13B,7.897061734563998,7.472766807962733,1664,https://github.com/imoneoi/openchat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openchat8192-13b/model_outputs.json,community
Phi-2 DPO,7.770894620325308,7.757095701776398,1687,https://huggingface.co/lxuechen/phi-2-dpo,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/phi-2-dpo/model_outputs.json,verified
MiniChat 1.5 3B,7.701632821534051,6.553443052819875,1545,https://huggingface.co/GeneZC/MiniChat-1.5-3B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/minichat-1.5-3b/model_outputs.json,community
Vicuna 7B v1.5,7.616892731870527,4.797493939167703,1083,https://huggingface.co/lmsys/vicuna-7b-v1.5,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/vicuna-7b-v1.5/model_outputs.json,community
LLaMA2 Chat 7B Evol70k-NEFT,7.533052655504213,7.602383512198759,1612,https://github.com/neelsjain/NEFTune,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/llama-2-chat-7b-evol70k-neft/model_outputs.json,community
Recycled WizardLM 7B V2.0,7.521609955340597,7.337129370484472,1583,https://github.com/tianyi-lab/Reflection_Tuning,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/recycled-wizardlm-7b-v2.0/model_outputs.json,community
Vicuna 7B v1.3,7.156460956443475,4.6425118574534165,1110,https://huggingface.co/lmsys/vicuna-7b-v1.3,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/vicuna-7b-v1.3/model_outputs.json,verified
Alpaca Farm PPO Sim (GPT-4) 7B,7.121808101560879,3.450341987080745,511,https://huggingface.co/tatsu-lab/alpaca-farm-ppo-sim-gpt4-20k-wdiff,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/alpaca-farm-ppo-sim-gpt4-20k/model_outputs.json,verified
UltraLM 13B,7.108191361311167,5.074590380484472,1087,https://github.com/thunlp/UltraChat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/ultralm-13b/model_outputs.json,community
Baize-v2 13B,7.012247205044542,4.590545330645964,930,https://huggingface.co/project-baize/baize-v2-13b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/baize-v2-13b/model_outputs.json,community
Recycled WizardLM 7B V1.0,6.9014773220018215,6.632749960459629,1494,https://github.com/tianyi-lab/Reflection_Tuning,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/recycled-wizardlm-7b-v1.0/model_outputs.json,community
Ghost 7B Alpha,6.851138067048422,6.111136224334925,1681,https://huggingface.co/ghost-x/ghost-7b-alpha,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/ghost-7b-alpha/model_outputs.json,community
Alpaca Farm PPO Human 7B,6.418603294911531,4.100426814981367,803,https://huggingface.co/tatsu-lab/alpaca-farm-ppo-human-wdiff,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/alpaca-farm-ppo-human/model_outputs.json,verified
Vicuna 7B,6.277217738516609,4.16261116226087,1044,https://huggingface.co/lmsys/vicuna-7b-delta-v1.1,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/vicuna-7b/model_outputs.json,verified
Alpaca 7B,5.875487163278986,2.591450540223603,396,https://huggingface.co/tatsu-lab/alpaca-7b-wdiff,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/alpaca-7b/model_outputs.json,minimal
Phi-2 SFT,5.853787690603355,3.977567775217392,1068,https://huggingface.co/lxuechen/phi-2-sft,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/phi-2-sft/model_outputs.json,verified
TempNet-LLaMA2-Chat-7B-v0.1,5.739613836715224,5.430143264670806,1512,https://github.com/zhqiu/TempNet,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/TempNet-LLaMA2-Chat-7B-v0.1/model_outputs.json,community
MiniChat 
3B,5.729332875896306,3.0071507063602487,868,https://huggingface.co/GeneZC/MiniChat-3B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/minichat-3b/model_outputs.json,community Guanaco 33B,5.690019090866207,5.002493724956522,1311,https://huggingface.co/timdettmers/guanaco-33b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/guanaco-33b/model_outputs.json,verified Falcon 40B Instruct,5.6075325447394455,3.3429188224720505,662,https://huggingface.co/tiiuae/falcon-40b-instruct,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/falcon-40b-instruct/model_outputs.json,verified Gemma Instruct (2B),5.437453620377121,3.4019714381366457,1041,https://huggingface.co/google/gemma-2b-it,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gemma-2b-it/model_outputs.json,verified LLaMA2 Chat 7B,5.354821279508294,4.961339547167702,1479,https://ai.meta.com/llama/,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/llama-2-7b-chat-hf/model_outputs.json,verified OpenBuddy-Falcon-7b-v6,4.8261244822302976,3.521174371975156,1152,https://huggingface.co/OpenBuddy/openbuddy-falcon-7b-v6-bf16,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openbuddy-falcon-7b-v6/model_outputs.json,community Phi 2,4.398682270855682,2.350209543026152,626,https://huggingface.co/microsoft/phi-2,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/phi-2/model_outputs.json,community Baize-v2 7B,4.382564905021367,3.404814977515528,1127,https://huggingface.co/project-baize/baize-v2-7b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/baize-v2-7b/model_outputs.json,community ChatGLM2-6B,4.35928292679035,2.7621847964596284,1027,https://huggingface.co/THUDM/chatglm2-6b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/chatglm2-6b/model_outputs.json,community Pythia 12B 
SFT,4.221361861408184,2.5780902809689445,913,https://huggingface.co/OpenAssistant/pythia-12b-sft-v8-7k-steps,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/pythia-12b-mix-sft/model_outputs.json,verified Falcon 7B Instruct,4.036937566812824,2.146617553167702,478,https://huggingface.co/tiiuae/falcon-7b-instruct,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/falcon-7b-instruct/model_outputs.json,verified Pythia 12B OASST SFT,3.270102114456748,1.790114083180124,726,https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/oasst-sft-pythia-12b/model_outputs.json,verified Guanaco 13B,3.003787329611614,3.469596859739131,1774,https://huggingface.co/timdettmers/guanaco-13b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/guanaco-13b/model_outputs.json,verified Guanaco 7B,2.871116813131697,2.880002266173913,1364,https://huggingface.co/timdettmers/guanaco-7b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/guanaco-7b/model_outputs.json,verified Qwen1.5 1.8B Chat,2.588498849185137,3.70555681579365,2673,https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Qwen1.5-1.8B-Chat/model_outputs.json,verified Baichuan-13B-Chat,2.062170253598568,1.9921455615279504,1727,https://huggingface.co/baichuan-inc/Baichuan-13B-Chat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/baichuan-13b-chat/model_outputs.json,community