| Model | Environment Setup: Pass@ Example Usage§ | Implementation: Pass@ Accept. Test¶ | Implementation: Pass@ Unit Test¶ | Acceptance Testing: Oracle Test§ | Unit Testing: Oracle Test§ | Unit Testing: Coverage$ |
|---|---|---|---|---|---|---|
| GPT-3.5-Turbo | 33.3 | 4.2 | 4.3 | 11.7 | 28.7 | 24.6(61.4) |
| GPT-4-Turbo-1106 | 41.7 | 6.9 | 6.8 | 25.9 | 33.6 | 36.7(66.7) |
| GPT-4-Turbo-0125 | 41.7 | 7.1 | 8.0 | 29.2 | 36.5 | 33.2(66.3) |
| CodeLlama-7B-Instruct | 8.3 | 0.0 | 0.0 | 0.0 | 3.0 | 3.6(71.0) |
| CodeLlama-13B-Instruct | 25.0 | 0.6 | 0.0 | 0.0 | 5.1 | 8.6(57.6) |
| CodeLlama-34B-Instruct | 16.7 | 0.6 | 0.5 | 4.5 | 21.1 | 25.4(72.6) |
| DeepSeek-Coder-1.3B-Instruct | 8.3 | 0.0 | 0.1 | 0.0 | 5.6 | 2.7(27.0) |
| DeepSeek-Coder-6.7B-Instruct | 25.0 | 2.9 | 3.9 | 20.5♡ | 23.5 | 28.2(70.6) |
| DeepSeek-Coder-33B-Instruct | 16.7 | 4.4 | 5.5 | 13.6 | 32.8 | 35.7(79.4) |

| Model | w/ Tie: General Principles† | w/ Tie: Faithfulness‡ | w/o Tie: General Principles | w/o Tie: Faithfulness |
|---|---|---|---|---|
| GPT-4-Turbo-0125 | 97.9 | 97.9 | 100.0 | 100.0 |
| GPT-4-Turbo-1106 | 91.7 | 85.4 | 100.0 | 100.0 |
| CodeLlama-7B-Instruct | 4.2 | 8.3 | 4.2 | 4.5 |
| CodeLlama-13B-Instruct | 18.8 | 14.6 | 10.5 | 5.3 |
| CodeLlama-34B-Instruct | 39.6 | 33.3 | 33.3 | 21.4 |
| DeepSeek-Coder-1.3B-Instruct | 16.7 | 16.7 | 5.5 | 5.6 |
| DeepSeek-Coder-6.7B-Instruct | 35.4 | 35.4 | 31.6 | 29.4 |
| DeepSeek-Coder-33B-Instruct | 52.1 | 50.0 | 53.8 | 50.0 |
| Agree w/ Human Majority | 60.4 | 51.6 | 79.2 | 83.2 |