---
source: newsletter
source_url: auto
ingested: 2026-06-30
---

Title: RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades

URL Source: http://arxiv.org/abs/2605.15846

Published Time: Wed, 20 May 2026 00:42:56 GMT

Authors: [Xinbo Xu](https://arxiv.org/search/cs?searchtype=author&query=Xu,+X), [Ruihan Yang](https://arxiv.org/search/cs?searchtype=author&query=Yang,+R), [Haiyang Shen](https://arxiv.org/search/cs?searchtype=author&query=Shen,+H), [Wendong Xu](https://arxiv.org/search/cs?searchtype=author&query=Xu,+W), [Bofei Gao](https://arxiv.org/search/cs?searchtype=author&query=Gao,+B), [Ruoyu Wu](https://arxiv.org/search/cs?searchtype=author&query=Wu,+R), [Kean Shi](https://arxiv.org/search/cs?searchtype=author&query=Shi,+K), [Weichu Xie](https://arxiv.org/search/cs?searchtype=author&query=Xie,+W), [Xuanzhong Chen](https://arxiv.org/search/cs?searchtype=author&query=Chen,+X), [Ming Wu](https://arxiv.org/search/cs?searchtype=author&query=Wu,+M), [Jason Zeng](https://arxiv.org/search/cs?searchtype=author&query=Zeng,+J), [Michael Heinrich](https://arxiv.org/search/cs?searchtype=author&query=Heinrich,+M), [Elvis Zhang](https://arxiv.org/search/cs?searchtype=author&query=Zhang,+E), [Liang Chen](https://arxiv.org/search/cs?searchtype=author&query=Chen,+L), [Kuan Li](https://arxiv.org/search/cs?searchtype=author&query=Li,+K), [Baobao Chang](https://arxiv.org/search/cs?searchtype=author&query=Chang,+B)

[View PDF](https://arxiv.org/pdf/2605.15846) | [HTML (experimental)](https://arxiv.org/html/2605.15846v2)

> Abstract: Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus predominantly on single-issue bug fixes from Python repositories, with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale. To address this gap, we present RoadmapBench, a benchmark of 115 long-horizon coding tasks grounded in real open-source version upgrades across 17 repositories and 5 programming languages. Each task places the agent on a source-version code snapshot and provides a multi-target roadmap instruction requiring it to implement the functionality introduced in the target version, with a median modification of 3,700 lines across 51 files. We conduct a systematic evaluation on thirteen frontier models and find that even the strongest, Claude-Opus-4.7, resolves only 39.1% of tasks, while the weakest achieves merely 5.2%, in stark contrast to existing bug-fix benchmarks, suggesting that long-horizon software development remains a largely unsolved problem.

Comments: 30 pages, 15 figures
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as: [arXiv:2605.15846](https://arxiv.org/abs/2605.15846) [cs.SE]
(or [arXiv:2605.15846v2](https://arxiv.org/abs/2605.15846v2) [cs.SE] for this version)
[https://doi.org/10.48550/arXiv.2605.15846](https://doi.org/10.48550/arXiv.2605.15846)

arXiv-issued DOI via DataCite

## Submission history

From: Xinbo Xu

**[[v1]](https://arxiv.org/abs/2605.15846v1)** Fri, 15 May 2026 11:00:33 UTC (4,936 KB)

**[v2]** Tue, 19 May 2026 08:10:44 UTC (4,935 KB)