# DevBench: Towards LLMs based Automated Software Development

👋 Overview | 📖 Benchmarking | ⚙️ Setup | 🚀 Usage | 🔎 Citation | 📄 License

**📬 Contact**: libowen.ne@gmail.com, chao.peng@acm.org

📝 Check out our paper [HERE](https://arxiv.org/abs/2403.08604)!

## 👋 Overview

- **DevBench** is a comprehensive benchmark designed to evaluate LLMs across various stages of the software development lifecycle, including *software design*, *environment setup*, *implementation*, *acceptance testing*, and *unit testing*. By integrating these interconnected steps under a single framework, DevBench offers a holistic perspective on the potential of LLMs for automated software development.

- The DevBench dataset comprises **22 curated repositories** across **4 programming languages** (Python, C/C++, Java, JavaScript), covering a wide range of domains such as machine learning, databases, web services, and command-line utilities.

- **DevBench** includes a comprehensive and automatic evaluation suite for all tasks involved. We provide extensive acceptance and unit test cases for the *implementation* task 🤗. Additionally, we utilize [LLM-as-a-Judge](./llm_judge/README.md) for evaluating the *software design* task 👩🏽‍⚖️. Further details on our task specifications can be found [here](./benchmark_data/README.md).

- We have developed a baseline agent system based on the popular multi-agent software development system, [ChatDev](https://github.com/OpenBMB/ChatDev). Special thanks to our collaborators at ChatDev!

## 📖 Benchmarking Code LLMs

### Evaluation results of the coding tasks on DevBench.

| Model | Environment Setup: Pass@ Example Usage§ | Implementation: Pass@ Accept. Test¶ | Implementation: Pass@ Unit Test¶ | Acceptance Testing: Oracle Test§ | Unit Testing: Oracle Test§ | Unit Testing: Coverage$ |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5-Turbo | *33.3* | 4.2 | 4.3 | 11.7 | 28.7 | 24.6 (61.4) |
| GPT-4-Turbo-1106 | *41.7* | 6.9 | 6.8 | 25.9 | 33.6 | 36.7 (66.7) |
| GPT-4-Turbo-0125 | *41.7* | 7.1 | 8.0 | 29.2 | 36.5 | 33.2 (66.3) |
| CodeLlama-7B-Instruct | *8.3* | 0.0 | 0.0 | 0.0 | 3.0 | 3.6 (71.0) |
| CodeLlama-13B-Instruct | *25.0* | 0.6 | 0.0 | 0.0 | 5.1 | 8.6 (57.6) |
| CodeLlama-34B-Instruct | *16.7* | 0.6 | 0.5 | 4.5 | 21.1 | 25.4 (72.6) |
| DeepSeek-Coder-1.3B-Instruct | *8.3* | 0.0 | 0.1 | 0.0 | 5.6 | 2.7 (27.0) |
| DeepSeek-Coder-6.7B-Instruct | *25.0* | 2.9 | 3.9 | 20.5♡ | 23.5 | 28.2 (70.6) |
| DeepSeek-Coder-33B-Instruct | *16.7* | 4.4 | 5.5 | 13.6 | 32.8 | 35.7 (79.4) |

- *Italic figures*: test cases for the Environment Setup task are scarce compared to other tasks, so these results are more affected by randomness.
- §: results are averaged across all repositories and weighted uniformly.
- ¶: results are averaged across all repositories and weighted by the number of code lines.
- $: the figure on the left is averaged across all repositories and weighted uniformly, giving the overall score. The figure in parentheses is averaged uniformly across valid repositories only, that is, those for which the model generated executable testing code.
- ♡: the model generated meaningless but executable testing code.

### Evaluation results of the software design on DevBench.

The code for the software design evaluation can be found [here](./llm_judge/)👩🏽‍⚖️.

| Model | w/ Tie: General Principles† | w/ Tie: Faithfulness‡ | w/o Tie: General Principles | w/o Tie: Faithfulness |
| --- | --- | --- | --- | --- |
| GPT-4-Turbo-0125 | 97.9 | 97.9 | 100.0 | 100.0 |
| GPT-4-Turbo-1106 | 91.7 | 85.4 | 100.0 | 100.0 |
| CodeLlama-7B-Instruct | 4.2 | 8.3 | 4.2 | 4.5 |
| CodeLlama-13B-Instruct | 18.8 | 14.6 | 10.5 | 5.3 |
| CodeLlama-34B-Instruct | 39.6 | 33.3 | 33.3 | 21.4 |
| DeepSeek-Coder-1.3B-Instruct | 16.7 | 16.7 | 5.5 | 5.6 |
| DeepSeek-Coder-6.7B-Instruct | 35.4 | 35.4 | 31.6 | 29.4 |
| DeepSeek-Coder-33B-Instruct | 52.1 | 50.0 | 53.8 | 50.0 |
| Agree w/ Human Majority | 60.4 | 51.6 | 79.2 | 83.2 |

Win rate of pairwise comparison against GPT-3.5-Turbo on Software Design on a subset of DevBench, where results are averaged uniformly across repositories and sub-tasks. †: the general principles metric. ‡: the faithfulness metric. w/ Tie: inconsistent results are considered a tie. We also report agreement with the human majority.

## 🐳 Set Up with Docker

For a secure and isolated environment, we offer Docker support for DevBench. Please refer to our detailed [Installation Guide](./wiki.md#installation-guide).

## 🚀 Usage

### 1. Prepare the environment variables

Add your DevBench directory to your PYTHONPATH variable.

```
export PYTHONPATH="${PYTHONPATH}:${path_to_devbench}"
```

To run the `benchmark_data/java/Actor_relationship_game` repo, configure your TMDB key.

```
export TMDB_API_KEY=${your_TMDB_key}
```

### 2. Prepare the chat models

#### OpenAI GPT models

Set your OpenAI API key as an environment variable.

```
export OPENAI_API_KEY="your_OpenAI_API_key"
```

#### Open source models

To deploy open source models, please refer to [lmdeploy](https://github.com/InternLM/lmdeploy/tree/main) or [vllm](https://github.com/vllm-project/vllm). After deployment, configure the IP address in `open_source_model.json`. The codellama and deepseek-coder models are already integrated into our experiments, so you only need to fill in the IP address in the form `{"model_name": $model_ip_address}`. For example:

```
{
    "codellama-7b-instruct": "",
    "codellama-13b-instruct": "",
    "codellama-34b-instruct": "",
    "deepseek-coder-1.3b-instruct": "",
    "deepseek-coder-6.7b-instruct": "",
    "deepseek-coder-33b-instruct": "$model_ip_address"
}
```

For additional models, add a new field as shown below.

```
{
    "customized-model": {"$model_name": "$model_ip_address"}
}
```

### 3. Run the agent system

#### Run script

```
cd agent_system/baseline
python run.py --config Implementation --input_path ../../benchmark_data/python/TextCNN/ --model gpt-4-turbo-new --model_source openai --review execution --evaluate
```

#### Parameters

- config (*str*) - Specifies the DevBench task: `SoftwareDesign` | `EnvironmentSetup` | `Implementation` | `AcceptanceTesting` | `UnitTesting`.
- input_path (*str*) - Specifies the repo path.
- project_name (*str*) - Specifies the repo name. If empty, defaults to the last segment of `input_path` (i.e., `input_path.split('/')[-1]`).
- model (*str*) - Specifies the name of the language model: `gpt-3.5-turbo` | `gpt-4` | `gpt-4-32k` | `gpt-4-turbo` | `claude-2` | `claude-2.1` | `codellama-7b` | `codellama-13b` | `codellama-34b` | `deepseek-coder-1.3b` | `deepseek-coder-6.7b` | `deepseek-coder-33b` | `customized-model`.
- customized_model_name (Optional, *str*) - Specifies the custom model name when the `model` parameter is set to `customized-model`.
- model_source (*str*) - Specifies whether the model is an open source model or an OpenAI closed source model: `open_source` | `openai`.
- review (*str*) - Specifies the review mode: `none` | `normal` | `execution`.
  - `none`: a single forward pass of Coding.
  - `normal`: Coding and CodeReview in alternation, with CodeReview lacking program execution feedback.
  - `execution`: Coding and CodeReview in alternation, with CodeReview including program execution feedback.
- read_src_code (*bool*) - Whether to read source code in the AcceptanceTesting and UnitTesting tasks.
- evaluate (*bool*) - Whether to run the evaluation at the end. The evaluation for the software design task can be found [here](./llm_judge/).
- temperature (*float*) - The sampling temperature.
- top_p (*float*) - The `top_p` sampling value.

When using the `normal` or `execution` review mode, set the `cyclenum` parameter in `CompanyConfig/{task_name}/ChatChainConfig.json` to the number of review rounds. The default is 2.

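When several tasks need to be run one after another, the invocation from the run script above can be assembled programmatically. The following is a minimal sketch, assuming the flags behave as described in the parameter list; the helper name `build_run_command` and the constants are hypothetical and not part of DevBench.

```python
import shlex

# Values copied from the parameter list above.
TASKS = (
    "SoftwareDesign",
    "EnvironmentSetup",
    "Implementation",
    "AcceptanceTesting",
    "UnitTesting",
)
REVIEW_MODES = ("none", "normal", "execution")
MODEL_SOURCES = ("open_source", "openai")


def build_run_command(config, input_path, model, model_source,
                      review="none", evaluate=False):
    """Return the argument list for one `run.py` call (hypothetical helper)."""
    if config not in TASKS:
        raise ValueError(f"unknown config: {config!r}")
    if review not in REVIEW_MODES:
        raise ValueError(f"unknown review mode: {review!r}")
    if model_source not in MODEL_SOURCES:
        raise ValueError(f"unknown model source: {model_source!r}")

    cmd = [
        "python", "run.py",
        "--config", config,
        "--input_path", input_path,
        "--model", model,
        "--model_source", model_source,
        "--review", review,
    ]
    # `--evaluate` is passed as a bare switch, as in the run script above.
    if evaluate:
        cmd.append("--evaluate")
    return cmd


if __name__ == "__main__":
    # Print one shell-ready command per coding task for a single repo.
    for task in ("Implementation", "AcceptanceTesting", "UnitTesting"):
        cmd = build_run_command(
            task,
            "../../benchmark_data/python/TextCNN/",
            model="gpt-4-turbo-new",
            model_source="openai",
            review="execution",
            evaluate=True,
        )
        print(shlex.join(cmd))
```

The sketch only prints the commands; run them from the baseline agent directory, or pass each list to `subprocess.run` if you prefer to launch them from Python.
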
## 🔎 Citation

```
@article{li2024devbench,
  title={DevBench: A Comprehensive Benchmark for Software Development},
  author={Li, Bowen and Wu, Wenhan and Tang, Ziwei and Shi, Lin and Yang, John and Li, Jinyang and Yao, Shunyu and Qian, Chen and Hui, Binyuan and Zhang, Qicheng and others},
  journal={arXiv preprint arXiv:2403.08604},
  year={2024}
}
```

## 📄 License

- Source Code Licensing: Our project's source code is licensed under the Apache 2.0 License. This license permits the use, modification, and distribution of the code, subject to certain conditions outlined in the Apache 2.0 License.
- Data Licensing: The related data utilized in our project is licensed under CC BY 4.0, which allows anyone to copy, distribute, transmit, adapt and make commercial use of the dataset.