---
title: "programbench swe agent benchmark"
source: newsletter
source_url: https://programbench.com/
source_feed: TLDR AI (newsletter)
ingested: 2026-05-08
review_value: 7
review_confidence: 7
review_verdict: strong
stars: 4
sha256: de38d71dd3226599
type: raw
created: 2026-05-10
updated: 2026-05-10
tags: []
---
[Leaderboard](/) [Paper](https://arxiv.org/abs/2605.03546) [GitHub](https://github.com/facebookresearch/ProgramBench) [Team](/team/)
[Blog](/blog/) [HuggingFace](https://huggingface.co/programbench) [DockerHub](https://hub.docker.com/orgs/programbench/repositories)
# ./ProgramBench
Can language models rebuild programs from scratch?
Given only a compiled binary and its documentation, agents must architect and implement a complete codebase that reproduces the original program's behavior. 
[John Yang](https://john-b-yang.github.io/)*, [Kilian Lieret](https://www.lieret.net/)*, [Jeffrey Ma](https://18jeffreyma.github.io/), [Parth Thakkar](https://thakkarparth007.github.io/), [Dmitrii Pedchenko](https://scholar.google.com/citations?user=izGGAnQAAAAJ&hl=en), [Sten Sootla](https://scholar.google.com/citations?user=UAx_woYAAAAJ&hl=en), [Emily McMilin](https://2dot71mily.github.io/), [Pengcheng Yin](https://pengcheng.in/), [Rui Hou](https://scholar.google.com/citations?user=PKHKqX0AAAAJ&hl=en), [Gabriel Synnaeve](https://syhw.github.io/), [Diyi Yang](https://cs.stanford.edu/~diyiy/index.html), [Ofir Press](https://ofir.io/)
Meta Superintelligence Labs • Stanford University • Harvard University
## Leaderboard
Evaluated with [mini-SWE-agent](https://github.com/swe-agent/mini-swe-agent/) · 200 tasks · [See extended results →](/extended/)
| # | Model | Provider | Agent | Resolved | Almost resolved |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | mini-SWE-agent | 0% | 3.0% |
| 2 | Claude Opus 4.6 | Anthropic | mini-SWE-agent | 0% | 2.5% |
| 3 | Claude Sonnet 4.6 | Anthropic | mini-SWE-agent | 0% | 1.0% |
| 4 | GPT 5.4 | OpenAI | mini-SWE-agent | 0% | 0.0% |
| 5 | Gemini 3.1 Pro | Google | mini-SWE-agent | 0% | 0.0% |
| 6 | Gemini 3 Flash | Google | mini-SWE-agent | 0% | 0.0% |
| 7 | Claude Haiku 4.5 | Anthropic | mini-SWE-agent | 0% | 0.0% |
| 8 | GPT 5.4 mini | OpenAI | mini-SWE-agent | 0% | 0.0% |
| 9 | GPT 5 mini | OpenAI | mini-SWE-agent | 0% | 0.0% |

**Resolved:** The number of fully solved instances as measured by the hidden behavioral tests. Note that behavioral tests can never cover all possible inputs. The behavioral tests of ProgramBench can be easily extended should any false positives arise.

**Almost resolved:** Instances where the agent's solution solves ≥ 95% of all behavioral tests. [See extended results](/extended/).

Sorting: Resolved → Almost resolved → Avg. pass rate. [See extended results →](/extended/)
## About ProgramBench
In each task, the agent receives an executable and its documentation, and it must re-implement the given executable. It does not get access to _any_ of the executable's source code, it cannot decompile the executable, and it cannot use the internet. There are [200 tasks](/tasks/) in total covering different program complexities, ranging from small terminal utilities like [jq](/task/jqlang__jq.b33a763/) and [ripgrep](/task/burntsushi__ripgrep.3b7fd44/) to massive software projects like the [PHP interpreter](/task/php__php-src.c891263/), [FFmpeg](/task/ffmpeg__ffmpeg.360a402/), and [SQLite](/task/sqlite__sqlite.839433d/).
The agent must choose a language, design the architecture, write all source code, and produce a build script. Every design decision is the model's to make.
Once the agent submits a program, our test suite compares the candidate program's behavior against the original program. A candidate program passes only if all tests for that task pass.
Our test suite is generated via agent-driven fuzzing, and it comprises more than 248,000 total behavioral tests for our [200 tasks](/tasks/).
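One way to picture a single behavioral test is to run the reference and the candidate on the same input and compare what comes out. The sketch below is an illustration under stated assumptions, not the ProgramBench harness: the names `Observation`, `observe`, `behaves_the_same`, and `task_resolved` are hypothetical, and treating "behavior" as exit code plus stdout is an assumption made for this example.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What one run of a program looks like from the outside (assumed: exit code + stdout)."""
    exit_code: int
    stdout: bytes


def observe(command: list[str], stdin: bytes = b"", timeout: float = 10.0) -> Observation:
    """Run `command` with the given stdin and record its observable behavior."""
    result = subprocess.run(command, input=stdin, capture_output=True, timeout=timeout)
    return Observation(result.returncode, result.stdout)


def behaves_the_same(reference: list[str], candidate: list[str],
                     args: list[str], stdin: bytes = b"") -> bool:
    """One behavioral test: same arguments and stdin must give the same observation."""
    return observe(reference + args, stdin) == observe(candidate + args, stdin)


def task_resolved(reference: list[str], candidate: list[str],
                  cases: list[tuple[list[str], bytes]]) -> bool:
    """A candidate passes a task only if every behavioral test passes."""
    return all(behaves_the_same(reference, candidate, args, stdin) for args, stdin in cases)


if __name__ == "__main__":
    # Stand-ins for real executables: tiny Python one-liners that transform stdin.
    reference = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    faithful = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
    unfaithful = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
    cases = [([], b"hello\n"), ([], b""), ([], b"Mixed Case\n")]
    print(task_resolved(reference, faithful, cases))    # True
    print(task_resolved(reference, unfaithful, cases))  # False
```

Note that the unfaithful stand-in still agrees with the reference on the empty input, which is why a single passing test says little and the all-tests-pass rule is the strict part.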
### How is ProgramBench different from other whole-repo generation projects?
**Free-form implementation.** Unlike most existing coding benchmarks, ProgramBench gives no hints or structure. There are no method signatures to fill in, no class skeletons, no product requirement documents, and no natural-language descriptions of the intended file layout. Agents must decide for themselves what abstractions to introduce, how to decompose functionality across modules, and what interfaces to expose. These are exactly the software design capabilities that prior benchmarks tend to abstract away.
**More than a case study.** Some prior and concurrent work on program recreation has reported success on a single large instance or on small collections of handcrafted instances. In contrast, ProgramBench gives a more diverse signal with 200 instances.
We discuss related work in detail in section 6 of the [paper](https://arxiv.org/abs/2605.03546).
### Can tasks in ProgramBench be fully solved at all?
Yes. The agent can run the given program with any input and observe exactly what it does, so there's nothing hidden that can't be discovered through experimentation. The benchmark is hard, but it's solvable by design: all the reference executables pass our test suites. Read more in our [blog post](/blog/is-programbench-impossible/).
### Why are ProgramBench scores so low?
Building a program from scratch is a fundamentally challenging task. Agents do currently make partial progress on many tasks (see the [extended results](/extended/) for details), but fully passing every test is still out of reach.
**Agents truly have to architect.** One reason is that, unlike other whole-repo generation projects, we give the agent no hints or structure, so it has to architect its own solution (see "How is ProgramBench different from other whole-repo generation projects?").
**No harness tuning.** Other recent and concurrent work also performed substantial harness tuning for a single task or a handful of tasks. We deliberately avoid this, since headline scores from a tuned harness on a curated handful of tasks can substantially overstate how capable agents really are at building software from scratch. Instead, ProgramBench is evaluated with a single generic harness across the entire task set.
**Cleanroom implementation.** We take substantial precautions to prevent cheating. Agents run in sandboxed containers without internet access, so they cannot retrieve the original source code or obtain any other form of help.
**No decompilation.** See "Why and how do you block decompilation?"
We review related work in section 6 of the [paper](https://arxiv.org/abs/2605.03546). We also discuss cheating in the FAQ below and in section 4.1.
### Is your agent scaffold sufficient to solve all tasks?
**Widely adopted baseline.** We use [mini-SWE-agent](https://github.com/swe-agent/mini-swe-agent/) because it is both widely adopted as a baseline by other benchmarks (SWE-bench Verified, SWE-bench Multilingual, Terminal-bench) and deliberately minimal in its scaffolding, reducing confounds between model capability and harness design. Most other agents (like Claude Code, with apparently several hundred thousand lines of code) are also constantly changing in non-transparent ways, while mini-SWE-agent will allow for apples-to-apples performance comparison of models for the foreseeable future.
**Almost no runtime limitations.** With very few exceptions, models submit their solutions deliberately rather than exceeding our generous time or step limits, and they never exhaust their context window. Because we do not limit total cost, our runs have cost up to $5k (for Sonnet 4.5).
**Varying degrees of difficulty.** ProgramBench deliberately includes tasks of varying difficulty, from very short repositories of only a few thousand lines of code to extremely large ones. We therefore believe that the extremely low scores are more a signal of inadequate model capabilities than an indicator that only multi-agent systems can solve our tasks. Nonetheless, we would be excited to be one of the first systematic benchmarks that includes tasks that can only be solved by multi-agent systems.
**Kicking off a new scaffold race.** We believe that [mini-SWE-agent](https://github.com/swe-agent/mini-swe-agent/) is the right choice of baseline and that it can absolutely solve (some of) the tasks. However, we'd be more than excited if ProgramBench kicks off a new scaffold race! We will be opening submissions soon.
### Can agents cheat?
Agents run in sandboxed containers with no internet access, execute-only permissions on the binary, and no access to decompilation tools. In early trials without these restrictions, models found shortcuts like cloning source repositories from GitHub or downloading code through package managers. Read more in our [blog post](/blog/is-programbench-impossible/) and in section 4.1 of the [paper](https://arxiv.org/abs/2605.03546).
### Why and how do you block decompilation?
The executable given to the agent has execute permission only, not read permission. Any operation other than execution (such as running a decompiler, a disassembler, `objdump`, `strings`, or `hexdump`) will therefore fail.
We do this because we want ProgramBench to answer the question "How well can LMs build programs from scratch", rather than "How well can LMs patch together bits of decompiled code".
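To make the permission idea concrete, here is a minimal sketch that inspects a file's POSIX permission bits and reports whether it is execute-only. The mode `0o111` (`--x--x--x`), the file name `program`, and the helper names are assumptions for illustration; how ProgramBench actually configures its containers is not described here. Also keep in mind that permission bits do not restrict a privileged (root) process, so a setup like this only works when the agent runs as an unprivileged user.

```python
import os
import stat
import tempfile

READ_WRITE_BITS = (stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH |
                   stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
EXECUTE_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def permission_bits(path: str) -> int:
    """Return the permission bits of `path`, e.g. 0o111."""
    return stat.S_IMODE(os.stat(path).st_mode)


def is_execute_only(path: str) -> bool:
    """True if some execute bit is set and no read or write bit is set."""
    mode = permission_bits(path)
    return bool(mode & EXECUTE_BITS) and not (mode & READ_WRITE_BITS)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        binary = os.path.join(tmp, "program")  # hypothetical stand-in for a task executable
        with open(binary, "wb") as f:
            f.write(b"placeholder")            # not a real binary
        os.chmod(binary, 0o111)                # assumed execute-only mode
        print(oct(permission_bits(binary)), is_execute_only(binary))
        os.chmod(binary, 0o755)                # readable again
        print(oct(permission_bits(binary)), is_execute_only(binary))
```

With an execute-only mode, tools that need to open the file for reading are denied by the kernel before they can inspect a single byte, while the kernel itself can still load and run the program.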
### How is the leaderboard sorted? What's the primary metric?
The primary metric that should be reported for ProgramBench is the number of fully resolved instances. We currently report "almost resolved" (at least 95% of test cases pass) as an additional point of reference while the scores on our primary metric are low. The leaderboard is sorted by fully resolved first, almost resolved second, and average test pass rate third.
For a detailed understanding of model performance, we recommend the plot at the [detailed leaderboard](/extended/). See also: "Have you considered other metrics?"
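The three-level sort can be sketched as a tuple-valued sort key. Everything below is illustrative: the class names, the model names, and the per-task numbers are made up, and "almost resolved" is taken to mean at least 95% of a task's tests pass (which therefore includes fully resolved tasks).

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    passed: int  # behavioral tests passed on this task
    total: int   # behavioral tests available for this task


@dataclass
class ModelResult:
    name: str
    outcomes: list[TaskOutcome]

    @property
    def resolved(self) -> float:
        """Fraction of tasks where every test passes."""
        return sum(o.passed == o.total for o in self.outcomes) / len(self.outcomes)

    @property
    def almost_resolved(self) -> float:
        """Fraction of tasks where at least 95% of tests pass (integer arithmetic)."""
        return sum(o.passed * 100 >= o.total * 95 for o in self.outcomes) / len(self.outcomes)

    @property
    def avg_pass_rate(self) -> float:
        """Mean of the per-task pass rates."""
        return sum(o.passed / o.total for o in self.outcomes) / len(self.outcomes)


def rank(results: list[ModelResult]) -> list[ModelResult]:
    """Sort by resolved, then almost resolved, then average pass rate (all descending)."""
    return sorted(results,
                  key=lambda m: (m.resolved, m.almost_resolved, m.avg_pass_rate),
                  reverse=True)


if __name__ == "__main__":
    results = [
        ModelResult("model-a", [TaskOutcome(96, 100), TaskOutcome(50, 100)]),
        ModelResult("model-b", [TaskOutcome(90, 100), TaskOutcome(90, 100)]),
        ModelResult("model-c", [TaskOutcome(100, 100), TaskOutcome(10, 100)]),
    ]
    print([m.name for m in rank(results)])  # ['model-c', 'model-a', 'model-b']
```

In this made-up data, `model-b` has the highest average pass rate yet ranks last, because it neither resolves nor almost resolves any task.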
### How do I submit to the leaderboard?
A public submission portal is coming soon. Follow the authors ([John](https://x.com/jyangballin), [Kilian](https://x.com/KLieret)) for updates.
### Why do you not allow internet?
We have extensively studied different inference settings, including allowing internet access. We find that allowing internet access leads to an abundance of cheating, which requires an LM-as-a-judge to flag and disqualify solutions. This makes the benchmark less reliable, especially because defining exactly what counts as cheating when source code can be obtained online is not as clear-cut as it might seem.
However, except for instances that contained cheating, we did not observe a dramatic improvement of scores when allowing internet.
Find our ablations in section 4.1 of the [paper](https://arxiv.org/abs/2605.03546) and John's [explanation](https://x.com/jyangballin/status/2051829957199011974).
### Have you considered other metrics? E.g., average number of tests passed?
Yes, we've decided on our current resolved metric after a lot of thought. Our initial question was "Can LMs build programs from scratch?", and the most relevant metric is the fraction of programs that can be fully built. Reporting an average test pass rate would be extremely misleading, because every instance includes very simple tests (such as checking for the existence of flags, checking what happens if you call the executable with `--help` etc.).
We've also thought about using a more relaxed metric like "almost resolved". However, relaxing to ≥95% or even ≥99% of tests passed is also problematic. First, for some of our tasks, we have almost 15k tests; even 1% of that is still roughly 150 tests. Second, even a single failed test can indicate severe issues with a program. Therefore, "almost resolved" only serves as additional orientation until the main "resolved" metric has enough signal to differentiate all models.
However, all auxiliary metrics are still useful for diagnosing and improving models and scaffolds! They're just not the right metric as a benchmark. Check the [extended results](/extended/) for more information. See also: "How is the leaderboard sorted?"
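The threshold arithmetic behind the relaxed metrics is easy to check. The helper below is a hypothetical illustration that assumes a round 15,000 tests for one task (the text only says "almost 15k") and uses integer arithmetic to compute how many tests may fail while a pass-rate threshold is still met.

```python
def allowed_failures(total_tests: int, min_pass_percent: int) -> int:
    """Largest number of failing tests that still satisfies the pass-rate threshold."""
    if total_tests < 0 or not 0 <= min_pass_percent <= 100:
        raise ValueError("total_tests must be >= 0 and min_pass_percent in [0, 100]")
    # failures <= total * (100 - p) / 100, rounded down to a whole number of tests
    return total_tests * (100 - min_pass_percent) // 100


if __name__ == "__main__":
    for percent in (95, 99, 100):
        print(percent, allowed_failures(15_000, percent))
    # 95 750
    # 99 150
    # 100 0
```

Under these assumptions, a 95% threshold tolerates 750 failing tests on such a task and a 99% threshold still tolerates 150; only the fully resolved criterion tolerates none.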
### What about contamination/memorization?
All tasks are pulled from open-source repositories. As such, they were certainly seen during training by the LMs evaluated here. However, we are currently **not** worried about this:
**Zero scores.** Even if memorization is at play (it doesn't look like it to us), it is by far insufficient to score highly. Scores seem to be much more correlated with systematic efforts at implementing functionality.
**Different-language ablation.** In section 4.1 of the [paper](/static/paper.pdf), we introduce an alternative inference setting where models are forced to implement the program in a different language than the original, effectively bypassing memorization (especially because memorization is extremely prompt-dependent). The scores did not change drastically, leading to the conclusion that scores are currently not significantly correlated with memorization of the original source code.
Direct memorization is extremely simple to measure and we are actively monitoring it. Should it ever become a clear problem, we might mix up our inference settings to prevent it. For now, it was more important to make sure all tasks are solvable and natural, so we removed any constraints not deemed necessary to prevent cheating.
## [Task Instances](/tasks/)
- [junegunn/fzf](/task/junegunn__fzf.b56d614/) · :cherry_blossom: A command-line fuzzy finder · ★ 79,721 · go · Best score: 82%
- [jesseduffield/lazygit](/task/jesseduffield__lazygit.1d0db51/) · simple terminal UI for git commands · ★ 76,901 · go · Best score: 56%
- [BurntSushi/ripgrep](/task/burntsushi__ripgrep.3b7fd44/) · ripgrep recursively searches directories for a regex pattern while respecting your gitignore · ★ 62,855 · rs · Best score: 80%
- [FFmpeg/FFmpeg](/task/ffmpeg__ffmpeg.360a402/) · Mirror of https://git.ffmpeg.org/ffmpeg.git · ★ 59,217 · c · Best score: 5%
- [sharkdp/bat](/task/sharkdp__bat.f822bd0/) · A cat(1) clone with wings. · ★ 58,487 · rs · Best score: 33%
- [typst/typst](/task/typst__typst.88356d0/) · A markup-based typesetting system that is powerful and easy to learn. · ★ 52,957 · rs · Best score: 28%
- [jgm/pandoc](/task/jgm__pandoc.5caad90/) · Universal markup converter · ★ 43,632 · hs · Best score: 14%
- [sharkdp/fd](/task/sharkdp__fd.40d8eb3/) · A simple, fast and user-friendly alternative to 'find' · ★ 42,668 · rs · Best score: 78%
- [php/php-src](/task/php__php-src.c891263/) · The PHP Interpreter · ★ 40,030 · c · Best score: 5%
- [duckdb/duckdb](/task/duckdb__duckdb.bdb65ec/) · DuckDB is an analytical in-process SQL database management system · ★ 37,657 · cpp · Best score: 12%
- [ajeetdsouza/zoxide](/task/ajeetdsouza__zoxide.67ca1bc/) · A smarter cd command. Supports all major shells. · ★ 35,994 · rs · Best score: 76%
- [jqlang/jq](/task/jqlang__jq.b33a763/) · Command-line JSON processor · ★ 34,541 · c · Best score: 90%
- [dandavison/delta](/task/dandavison__delta.acd758f/) · A syntax-highlighting pager for git, diff, grep, rg --json, and blame output · ★ 30,445 · rs · Best score: 37%
- [sharkdp/hyperfine](/task/sharkdp__hyperfine.327d5f4/) · A command-line benchmarking tool · ★ 27,960 · rs · Best score: 54%
- [ggreer/the_silver_searcher](/task/ggreer__the_silver_searcher.a61f178/) · A code-searching tool similar to ack, but faster. · ★ 27,080 · c · Best score: 59%
- [facebook/zstd](/task/facebook__zstd.1168da0/) · Zstandard - Fast real-time compression algorithm · ★ 27,013 · c · Best score: 69%
- [facebookresearch/fastText](/task/facebookresearch__fasttext.1142dc4/) · Library for fast text representation and classification. · ★ 26,511 · cpp · Best score: 76%
- [robertdavidgraham/masscan](/task/robertdavidgraham__masscan.b99d433/) · TCP port scanner, spews SYN packets asynchronously, scanning entire Internet in under 5 minutes. · ★ 25,544 · c · Best score: 57%
- [tree-sitter/tree-sitter](/task/tree-sitter__tree-sitter.5e23cca/) · An incremental parsing system for programming tools · ★ 24,953 · rs · Best score: 37%
- [FiloSottile/age](/task/filosottile__age.706dfc1/) · A simple, modern and secure encryption tool (and Go library) with small explicit keys, no config options, and UNIX-style composability. · ★ 22,077 · go · Best score: 63%
- [rust-lang/mdBook](/task/rust-lang__mdbook.37273ba/) · Create book from markdown files. Like Gitbook but implemented in Rust · ★ 21,541 · rs · Best score: 55%
- [jarun/nnn](/task/jarun__nnn.cb2c535/) · n³ The unorthodox terminal file manager · ★ 21,506 · c · Best score: 98%
- [antonmedv/fx](/task/antonmedv__fx.86d0d34/) · Terminal JSON viewer & processor · ★ 20,433 · go · Best score: 76%
- [mikefarah/yq](/task/mikefarah__yq.602586d/) · yq is a portable command-line YAML, JSON, XML, CSV, TOML, HCL and properties processor · ★ 15,281 · go · Best score: 39%
- [Y2Z/monolith](/task/y2z__monolith.8702e66/) · ⬛️ CLI tool and library for saving complete web pages as a single HTML file · ★ 15,024 · rs · Best score: 51%
- [direnv/direnv](/task/direnv__direnv.02040c7/) · unclutter your .profile · ★ 14,998 · go · Best score: 62%
- [google/brotli](/task/google__brotli.b3dc9cc/) · Brotli compression format · ★ 14,673 · c · Best score: 91%
- [tomnomnom/gron](/task/tomnomnom__gron.88a6234/) · Make JSON greppable! · ★ 14,424 · go · Best score: 90%
- [XAMPPRocky/tokei](/task/xampprocky__tokei.505d648/) · Count your code, quickly. · ★ 14,300 · rs · Best score: 70%
- [ast-grep/ast-grep](/task/ast-grep__ast-grep.dde0fe0/) · ⚡A CLI tool for code structural search, lint and rewriting. Written in Rust · ★ 13,541 · rs · Best score: 12%
- [cheat/cheat](/task/cheat__cheat.b8098dc/) · cheat allows you to create and view interactive cheatsheets on the command-line. It was designed to help remind *nix system administrators of options for commands that they use frequently, but not frequently enough to remember. · ★ 13,278 · go · Best score: 60%
- [jonas/tig](/task/jonas__tig.8334123/) · Text-mode interface for git · ★ 13,200 · c · Best score: 84%
- [ninja-build/ninja](/task/ninja-build__ninja.cc60300/) · a small build system with a focus on speed · ★ 12,895 · cpp · Best score: 72%
- [Canop/broot](/task/canop__broot.d6c798e/) · A new way to see and navigate directory trees : https://dystroy.org/broot · ★ 12,619 · rs · Best score: 67%
- [orf/gping](/task/orf__gping.26eb5b9/) · Ping, but with a graph · ★ 12,433 · rs · Best score: 78%
- [svenstaro/genact](/task/svenstaro__genact.16f96e3/) · 🌀 A nonsense activity generator · ★ 11,995 · rs · Best score: 59%
- [lz4/lz4](/task/lz4__lz4.1519f46/) · Extremely Fast Compression algorithm · ★ 11,781 · c · Best score: 83%
- [o2sh/onefetch](/task/o2sh__onefetch.e5958ce/) · Command-line Git information tool · ★ 11,745 · rs · Best score: 82%
- [bootandy/dust](/task/bootandy__dust.62bf1e1/) · A more intuitive version of du in rust · ★ 11,609 · rs · Best score: 71%
- [ekzhang/bore](/task/ekzhang__bore.8e059cd/) · 🕳 bore is a simple CLI tool for making tunnels to localhost · ★ 11,075 · rs · Best score: 69%
- [BurntSushi/xsv](/task/burntsushi__xsv.f430466/) · A fast CSV command line toolkit written in Rust. · ★ 10,757 · rs · Best score: 83%
- [bellard/quickjs](/task/bellard__quickjs.d7ae12a/) · Public repository of the QuickJS Javascript Engine. · ★ 10,565 · c · Best score: 4%
- [hatoo/oha](/task/hatoo__oha.8dc6349/) · Ohayou(おはよう), HTTP load generator, inspired by rakyll/hey with tui animation. · ★ 10,201 · rs · Best score: 73%
- [tstack/lnav](/task/tstack__lnav.ee34494/) · Log file navigator · ★ 10,200 · cpp · Best score: 13%
- [sharkdp/hexyl](/task/sharkdp__hexyl.2e26437/) · A command-line hex viewer · ★ 10,086 · rs · Best score: 83%
- [lua/lua](/task/lua__lua.c6b4848/) · A copy of the Lua development repository, as seen by the Lua team. Mirrored irregularly. All communication should be through the Lua mailing list https://www.lua.org/lua-l.html · ★ 9,908 · c · Best score: 43%
- [johnkerl/miller](/task/johnkerl__miller.8d85b46/) · Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON · ★ 9,842 · go · Best score: 23%
- [sqlite/sqlite](/task/sqlite__sqlite.839433d/) · Official Git mirror of the SQLite source tree · ★ 9,434 · c · Best score: 67%
- [boyter/scc](/task/boyter__scc.515f91c/) · Sloc, Cloc and Code: scc is a very fast accurate code counter with complexity calculations and COCOMO estimates written in pure Go · ★ 8,320 · go · Best score: 38%
- [ariga/atlas](/task/ariga__atlas.6d81150/) · Declarative schema migrations with schema-as-code workflows · ★ 8,311 · go · Best score: 55%
- [pemistahl/grex](/task/pemistahl__grex.fa3e8ed/) · A command-line tool and Rust library with Python bindings for generating regular expressions from user-provided test cases · ★ 8,103 · rs · Best score: 74%
- [htop-dev/htop](/task/htop-dev__htop.523600b/) · htop - an interactive process viewer · ★ 8,021 · c · Best score: 85%
- [peco/peco](/task/peco__peco.4e58dad/) · Simplistic interactive filtering tool · ★ 7,881 · go · Best score: 77%
- [bensadeh/tailspin](/task/bensadeh__tailspin.6278437/) · 🌀 A log file highlighter · ★ 7,793 · rs · Best score: 76%
- [ducaale/xh](/task/ducaale__xh.4a6e44f/) · Friendly and fast tool for sending HTTP requests · ★ 7,754 · rs · Best score: 50%
- [svenstaro/miniserve](/task/svenstaro__miniserve.8449e8b/) · 🌟 For when you really just want to serve some files over HTTP right now! · ★ 7,561 · rs · Best score: 79%
- [mgdm/htmlq](/task/mgdm__htmlq.6e31bc8/) · Like jq, but for HTML. · ★ 7,520 · rs · Best score: 94%
- [parcel-bundler/lightningcss](/task/parcel-bundler__lightningcss.aa2ed1e/) · An extremely fast CSS parser, transformer, bundler, and minifier written in Rust. · ★ 7,515 · rs · Best score: 54%
- [universal-ctags/ctags](/task/universal-ctags__ctags.243595e/) · A maintained ctags implementation · ★ 7,149 · c · Best score: 13%
- [chmln/sd](/task/chmln__sd.87d1ba5/) · Intuitive find & replace CLI (sed alternative) · ★ 7,072 · rs · Best score: 91%
- [ogham/dog](/task/ogham__dog.721440b/) · A command-line DNS client. · ★ 6,640 · rs · Best score: 84%
- [danmar/cppcheck](/task/danmar__cppcheck.0a5b103/) · static analysis of C/C++ code · ★ 6,599 · cpp · Best score: 15%
- [doxygen/doxygen](/task/doxygen__doxygen.966d98e/) · Official doxygen git repository · ★ 6,422 · c · Best score: 34%
- [sharkdp/pastel](/task/sharkdp__pastel.b60e899/) · A command-line tool to generate, analyze, convert and manipulate colors · ★ 6,334 · rs · Best score: 77%
- [BLAKE3-team/BLAKE3](/task/blake3-team__blake3.15e83a5/) · the official Rust and C implementations of the BLAKE3 cryptographic hash function · ★ 6,178 · rs · Best score: 98%
- [Nukesor/pueue](/task/nukesor__pueue.8b9d6fe/) · :stars: Manage your shell commands. · ★ 6,154 · rs · Best score: 15%
- [OSGeo/gdal](/task/osgeo__gdal.0847f12/) · GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats. · ★ 5,875 · cpp · Best score: 25%
- [Byron/dua-cli](/task/byron__dua-cli.8570c15/) · View disk space usage and delete unwanted data, fast. · ★ 5,794 · rs · Best score: 87%
- [dundee/gdu](/task/dundee__gdu.ede21d2/) · Fast disk usage analyzer with console interface written in Go · ★ 5,578 · go · Best score: 70%
- [eradman/entr](/task/eradman__entr.8e2e8b4/) · Run arbitrary commands when files change · ★ 5,551 · c · Best score: 89%
- [LuaJIT/LuaJIT](/task/luajit__luajit.a553b3d/) · Mirror of the LuaJIT git repository · ★ 5,518 · c · Best score: 71%
- [mgechev/revive](/task/mgechev__revive.201451e/) · 🔥 ~6x faster, stricter, configurable, extensible, and beautiful drop-in replacement for golint · ★ 5,486 · go · Best score: 46%
- [cweill/gotests](/task/cweill__gotests.2a672c5/) · Automatically generate Go test boilerplate from your source code. · ★ 5,294 · go · Best score: 62%
- [cordx56/rustowl](/task/cordx56__rustowl.655bc5c/) · Visualize Ownership and Lifetimes in Rust · ★ 5,113 · rs · Best score: 75%
- [abishekvashok/cmatrix](/task/abishekvashok__cmatrix.5c082c6/) · Terminal based "The Matrix" like implementation · ★ 5,042 · c · Best score: 97%
- [quinn-rs/quinn](/task/quinn-rs__quinn.bb359cc/) · Async-friendly QUIC implementation in Rust · ★ 5,041 · rs · Best score: 62%
- [alecthomas/chroma](/task/alecthomas__chroma.8d04def/) · A general purpose syntax highlighter in pure Go · ★ 4,910 · go · Best score: 16%
- [anordal/shellharden](/task/anordal__shellharden.6a6ffd4/) · The corrective bash syntax highlighter · ★ 4,778 · rs · Best score: 82%
- [yoav-lavi/melody](/task/yoav-lavi__melody.f4af9b4/) · Melody is a language that compiles to regular expressions and aims to be more readable and maintainable · ★ 4,748 · rs · Best score: 79%
- [sayanarijit/xplr](/task/sayanarijit__xplr.1751065/) · A hackable, minimal, fast TUI file explorer · ★ 4,735 · rs · Best score: 60%
- [hpjansson/chafa](/task/hpjansson__chafa.dd4d4c1/) · 📺🗿 Terminal graphics for the 21st century. · ★ 4,648 · c · Best score: 58%
- [jhspetersson/fselect](/task/jhspetersson__fselect.c3559ca/) · Find files with SQL-like queries · ★ 4,420 · rs · Best score: 44%
- [ivanceras/svgbob](/task/ivanceras__svgbob.6d00ad9/) · Convert your ascii diagram scribbles into happy little SVG · ★ 4,182 · rs · Best score: 41%
- [multiprocessio/dsq](/task/multiprocessio__dsq.c3ae0ba/) · Commandline tool for running SQL queries against JSON, CSV, Excel, Parquet, and more. · ★ 3,867 · go · Best score: 80%
- [rcoh/angle-grinder](/task/rcoh__angle-grinder.9c2fc88/) · Slice and dice logs on the command line · ★ 3,727 · rs · Best score: 38%
- [rs/curlie](/task/rs__curlie.5dfcbb1/) · The power of curl, the ease of use of httpie. · ★ 3,637 · go · Best score: 89%
- [antonmedv/walk](/task/antonmedv__walk.bf802ef/) · Terminal file manager · ★ 3,598 · go · Best score: 74%
- [JohannesKaufmann/html-to-markdown](/task/johanneskaufmann__html-to-markdown.3006818/) · ⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules. · ★ 3,586 · go · Best score: 86%
- [TheZoraiz/ascii-image-converter](/task/thezoraiz__ascii-image-converter.d05a757/) · A cross-platform command-line tool to convert images into ascii art and print them on the console. Now supports braille art! · ★ 3,284 · go · Best score: 64%
- [hairyhenderson/gomplate](/task/hairyhenderson__gomplate.05eb3aa/) · A flexible commandline tool for template rendering. Supports lots of local and remote datasources. · ★ 3,135 · go · Best score: 75%
- [ip7z/7zip](/task/ip7z__7zip.839151e/) · 7-Zip · ★ 2,967 · cpp · Best score: 34%
- [madler/pigz](/task/madler__pigz.fe4894f/) · A parallel implementation of gzip for modern multi-processor, multi-core machines. · ★ 2,924 · c · Best score: 83%
- [tinycc/tinycc](/task/tinycc__tinycc.9b8765d/) · Unofficial mirror of mob development branch · ★ 2,843 · c · Best score: 13%
- [raviqqe/muffet](/task/raviqqe__muffet.a882908/) · Fast website link checker in Go · ★ 2,597 · go · Best score: 88%
- [segmentio/chamber](/task/segmentio__chamber.5f93f5f/) · CLI for managing secrets · ★ 2,588 · go · Best score: 82%
- [astaxie/bat](/task/astaxie__bat.17d1080/) · Go implement CLI, cURL-like tool for humans · ★ 2,563 · go · Best score: 72%
- [zk-org/zk](/task/zk-org__zk.10d93d5/) · Plain text note-taking assistant · ★ 2,542 · go · Best score: 43%
- [kisielk/errcheck](/task/kisielk__errcheck.dacab89/) · errcheck checks that you checked errors. · ★ 2,480 · go · Best score: 80%
- [mkj/dropbear](/task/mkj__dropbear.75f699b/) · Dropbear SSH · ★ 2,231 · c · Best score: 58%
- [noborus/trdsql](/task/noborus__trdsql.d8c5ff6/) · CLI tool that can execute SQL queries on CSV, LTSV, JSON, YAML and TBLN. Can output to various formats. · ★ 2,159 · go · Best score: 67%
- [sheepla/pingu](/task/sheepla__pingu.926d475/) · 🐧ping command but with pingu · ★ 2,087 · go · Best score: 97%
- [go-critic/go-critic](/task/go-critic__go-critic.9aea378/) · The most opinionated Go source code linter for code audit. · ★ 2,041 · go · Best score: 42%
- [OSGeo/PROJ](/task/osgeo__proj.75d455c/) · PROJ - Cartographic Projections and Coordinate Transformations Library · ★ 1,974 · cpp · Best score: 74%
- [noborus/ov](/task/noborus__ov.b96c2ba/) · 🎑Feature-rich terminal-based text viewer. It is a so-called terminal pager. · ★ 1,935 · go · Best score: 88%
- [samtools/samtools](/task/samtools__samtools.aa823b5/) · Tools (written in C using htslib) for manipulating next-generation sequencing data · ★ 1,886 · c · Best score: 14%
- [gabotechs/dep-tree](/task/gabotechs__dep-tree.60a95a2/) · Tool for helping developers keep their code bases clean and decoupled. It allows visualising a code base complexity using a 3d force-directed graph of files and the dependencies between them. · ★ 1,706 · go · Best score: 65%
- [cmatsuoka/figlet](/task/cmatsuoka__figlet.202a0a8/) · Claudio's FIGlet tree · ★ 1,606 · c · Best score: 78%
- [lh3/seqtk](/task/lh3__seqtk.94e7070/) · Toolkit for processing sequences in FASTA/Q formats · ★ 1,537 · c · Best score: 67%
- [tukaani-project/xz](/task/tukaani-project__xz.1007bf0/) · XZ Utils · ★ 1,522 · c · Best score: 36%
- [skeema/skeema](/task/skeema__skeema.6a76243/) · Declarative pure-SQL schema management for MySQL and MariaDB · ★ 1,361 · go · Best score: 76%
- [mfridman/tparse](/task/mfridman__tparse.2416b4b/) · CLI tool for summarizing go test output. Pipe friendly. CI/CD friendly. · ★ 1,246 · go · Best score: 78%
- [lfos/calcurse](/task/lfos__calcurse.49180d5/) · A text-based calendar and scheduling application · ★ 1,243 · c · Best score: 54%
- [hooklift/gowsdl](/task/hooklift__gowsdl.2a06cec/) · WSDL2Go code generation as well as its SOAP proxy · ★ 1,219 · go · Best score: 86%
- [guumaster/hostctl](/task/guumaster__hostctl.d6d9699/) · Your dev tool to manage /etc/hosts like a pro! · ★ 1,216 · go · Best score: 83%
- [rs/jplot](/task/rs__jplot.2a54bcc/) · iTerm2 expvar/JSON monitoring tool · ★ 1,178 · go · Best score: 89%
- [naggie/dstask](/task/naggie__dstask.ff57396/) · Git powered terminal-based todo/note manager -- markdown note page per task. Single binary! · ★ 1,157 · go · Best score: 59%
- [sigoden/argc](/task/sigoden__argc.04a08f1/) · A Bash CLI framework, also a Bash command runner. · ★ 1,135 · rs · Best score: 44%
- [sibprogrammer/xq](/task/sibprogrammer__xq.b89f681/) · Command-line XML and HTML beautifier and content extractor · ★ 1,109 · go · Best score: 76%
- [xorg62/tty-clock](/task/xorg62__tty-clock.f2f847c/) · Clock using lib ncurses · ★ 1,105 · c · Best score: 84%
- [unhappychoice/gittype](/task/unhappychoice__gittype.34b72d0/) · A CLI code-typing game that turns your source code into typing challenges · ★ 1,075 · rs · Best score: 91%
- [eudoxia0/hashcards](/task/eudoxia0__hashcards.48aa136/) · A plain text-based spaced repetition system. · ★ 1,071 · rs · Best score: 56%
- [rvben/rumdl](/task/rvben__rumdl.2d75c4d/) · Fast Markdown linter and formatter written in Rust · ★ 1,051 · rs · Best score: 41%
- [sclevine/yj](/task/sclevine__yj.8016400/) · CLI - Convert between YAML, TOML, JSON, and HCL. Preserves map order. · ★ 1,041 · go · Best score: 74%
- [arq5x/bedtools2](/task/arq5x__bedtools2.dd57059/) · bedtools - the swiss army knife for genome arithmetic · ★ 1,029 · c · Best score: 39%
- [cslarsen/jp2a](/task/cslarsen__jp2a.61d205f/) · Converts jpg images to ASCII · ★ 1,021 · c · Best score: 56%
- [blacknon/hwatch](/task/blacknon__hwatch.edfcb62/) · A modern alternative to the watch command, records the differences in execution results and can check this differences at after. · ★ 1,016 · rs · Best score: 81%
- [eliukblau/pixterm](/task/eliukblau__pixterm.1a93fd5/) · Draw images in your ANSI terminal with true color · ★ 1,014 · go · Best score: 75%
- [Canop/rhit](/task/canop__rhit.ae90bcb/) · A nginx log explorer · ★ 1,006 · rs · Best score: 53%
- [stathissideris/ditaa](/task/stathissideris__ditaa.f2286c4/) · ditaa is a small command-line utility that can convert diagrams drawn using ascii art ('drawings' that contain characters that resemble lines like | / - ), into proper bitmap graphics. · ★ 1,005 · java · Best score: 20%
- [rbakbashev/elfcat](/task/rbakbashev__elfcat.52f8cc7/) · ELF visualizer. Generates HTML files from ELF binaries. · ★ 990 · rs · Best score: 98%
- [nuta/nsh](/task/nuta__nsh.bdd0702/) · A command-line shell like fish, but POSIX compatible. · ★ 966 · rs · Best score: 84%
- [dalance/amber](/task/dalance__amber.69a0f52/) · A code search / replace tool · ★ 941 · rs · Best score: 71%
- [pls-rs/pls](/task/pls-rs__pls.4e1ae50/) · pls is a prettier and powerful ls(1) for the pros. · ★ 932 · rs · Best score: 62%
- [Esubaalew/run](/task/esubaalew__run.0fb9dec/) · Universal multi-language runner and smart REPL written in Rust. · ★ 919 · rs · Best score: 85%
- [chirlu/sox](/task/chirlu__sox.42b3557/) · SoX, Swiss Army knife of sound processing · ★ 913 · c · Best score: 38%
- [clog-tool/clog-cli](/task/clog-tool__clog-cli.7066cba/) · Generate beautiful changelogs from your Git commit history · ★ 912 · rs · Best score: 93%
- [tarka/xcp](/task/tarka__xcp.5e5b448/) · An extended `cp` · ★ 911 · rs · Best score: 93%
- [oppiliappan/eva](/task/oppiliappan__eva.41ae245/) · a calculator REPL, similar to bc(1) · ★ 907 · rs · Best score: 89%
- [git-bahn/git-graph](/task/git-bahn__git-graph.87b4473/) · Command line tool to show clear git graphs arranged for your branching model · ★ 904 · rs · Best score: 80%
- [gromacs/gromacs](/task/gromacs__gromacs.665ea4c/) · Public/backup repository of the GROMACS molecular simulation toolkit. Please do not mine the metadata blindly; we use https://gitlab.com/gromacs/gromacs for code review and issue tracking. · ★ 901 · cpp · Best score: 9%
- [sirwart/ripsecrets](/task/sirwart__ripsecrets.34c9e03/) · A command-line tool to prevent committing secret keys into your source code · ★ 901 · rs · Best score: 73%
- [Drew-Alleman/DataSurgeon](/task/drew-alleman__datasurgeon.d257cee/) · Quickly Extracts IP's, Email Addresses, Hashes, Files, Credit Cards, Social Security Numbers and a lot More From Text · ★ 890 · rs · Best score: 74%
- [alexpovel/srgn](/task/alexpovel__srgn.89f943b/) · A grep-like tool which understands source code syntax and allows for manipulation in addition to search · ★ 889 · rs · Best score: 69%
- [kyoheiu/felix](/task/kyoheiu__felix.95df390/) · tui file manager with vim-like key mapping · ★ 888 · rs · Best score: 49%
- [oppiliappan/statix](/task/oppiliappan__statix.e9df54c/) · lints and suggestions for the nix programming language · ★ 882 · rs · Best score: 43%
- [nachoparker/dutree](/task/nachoparker__dutree.44e877d/) · a tool to analyze file system usage written in Rust · ★ 871 · rs · Best score: 90%
- [simeg/eureka](/task/simeg__eureka.df3796c/) · 💡 CLI tool to input and store your ideas without leaving the terminal · ★ 867 · rs · Best score: 79%
- [kyoh86/richgo](/task/kyoh86__richgo.313114f/) · Enrich `go test` outputs with text decorations. · ★ 863 · go · Best score: 85%
- [rochacbruno/marmite](/task/rochacbruno__marmite.7d4bc2d/) · Markdown makes sites - A Static Site Generator for Blogs · ★ 837 · rs · Best score: 45%
- [rust-embedded/svd2rust](/task/rust-embedded__svd2rust.1760b5e/) · Generate Rust register maps (`struct`s) from SVD files · ★ 835 · rs · Best score: 73%
- [konradsz/igrep](/task/konradsz__igrep.aa75630/) · Interactive Grep · ★ 827 · rs · Best score: 74%
- [nikolassv/bartib](/task/nikolassv__bartib.6b9b5ce/) · A simple timetracker for the command line. It saves a log of all tracked activities as a plaintext file and allows you to create flexible reports. · ★ 827 · rs · Best score: 87%
- [yassinebridi/serpl](/task/yassinebridi__serpl.c48a9d7/) · A simple terminal UI for search and replace, ala VS Code. · ★ 824 · rs · Best score: 61%
- [riquito/tuc](/task/riquito__tuc.16fb471/) · When cut doesn't cut it · ★ 820 · rs · Best score: 93%
- [ecumene/rust-sloth](/task/ecumene__rust-sloth.051c559/) · A 3D software rasterizer... for the terminal! · ★ 818 · rs · Best score: 53%
- [crowdagger/crowbook](/task/crowdagger__crowbook.ea214d7/) · Converts books written in Markdown to HTML, LaTeX/PDF and EPUB · ★ 813 · rs · Best score: 60%
- [WGUNDERWOOD/tex-fmt](/task/wgunderwood__tex-fmt.3f1aef6/) · An extremely fast LaTeX formatter written in Rust · ★ 789 · rs · Best score: 81%
- [Stranger6667/jsonschema](/task/stranger6667__jsonschema.d52e881/) · A high-performance JSON Schema validator for Rust · ★ 770 · rs · Best score: 52%
- [rhysd/kiro-editor](/task/rhysd__kiro-editor.4157485/) · A small terminal UTF-8 text editor written in Rust 📝🦀 · ★ 761 · rs · Best score: 93%
- [astro/deadnix](/task/astro__deadnix.d590041/) · Scan Nix files for dead code · ★ 745 · rs · Best score: 86%
- [sstadick/hck](/task/sstadick__hck.b66c751/) · A sharp cut(1) clone. · ★ 738 · rs · Best score: 96%
- [trasta298/keifu](/task/trasta298__keifu.3331426/) · Git genealogy, untangled. A TUI for navigating commit graphs with color and clarity. · ★ 729 · rs · Best score: 67%
- [AmmarAbouZor/tui-journal](/task/ammarabouzor__tui-journal.2b4540d/) · Your journal app if you live in a terminal · ★ 722 · rs · Best score: 71%
- [incu6us/goimports-reviser](/task/incu6us__goimports-reviser.81bd549/) · Right imports sorting & code formatting tool (goimports alternative) · ★ 715 · go · Best score: 86%
- [yaa110/nomino](/task/yaa110__nomino.f892499/) · Batch rename utility for developers · ★ 710 · rs · Best score: 80%
- [wfxr/csview](/task/wfxr__csview.8ac4de0/) · 📠 Pretty and fast csv viewer for cli with cjk/emoji support. · ★ 694 · rs · Best score: 96%
- [chmln/handlr](/task/chmln__handlr.90e78ba/) · A better xdg-utils · ★ 693 · rs · Best score: 91%
- [Miserlou/Loop](/task/miserlou__loop.209927c/) · UNIX's missing `loop` command · ★ 692 · rs · Best score: 95%
- [KSXGitHub/parallel-disk-usage](/task/ksxgithub__parallel-disk-usage.96978ed/) · Highly parallelized, blazing fast directory tree analyzer · ★ 689 · rs · Best score: 86%
- [hush-shell/hush](/task/hush-shell__hush.560c33a/) · Hush is a unix shell based on the Lua programming language · ★ 688 · rs · Best score: 83%
- [zevv/duc](/task/zevv__duc.a58fa4e/) · Dude, where are my bytes: Duc, a library and suite of tools for inspecting disk usage · ★ 682 · c · Best score: 83%
- [altdesktop/i3-style](/task/altdesktop__i3-style.f93821b/) · 🎨 Make your i3 config a little more stylish. · ★ 678 · rs · Best score: 80%
- [wintermute-cell/ngrrram](/task/wintermute-cell__ngrrram.8ea13c3/) · A TUI tool to help you type faster and learn new layouts. Includes a free cat. · ★ 674 · rs · Best score: 84%

[ psampaz/go-mod-outdated Find outdated dependencies of your Go projects. go-mod-outdated provides a table view of the go list -u -m -json all command which lists all dependencies of a Go project and their available minor and patch updates. It also provides a way to filter indirect dependencies and dependencies without updates. 
669  go Best score: 98% ](/task/psampaz__go-mod-outdated.bb79367/) [ wfxr/code-minimap 🛰 A high performance code minimap render. 660  rs Best score: 89% ](/task/wfxr__code-minimap.0ddeea5/) [ kaushiksrini/parqeye Peek inside Parquet files right from your terminal 654  rs Best score: 59% ](/task/kaushiksrini__parqeye.8072121/) [ stacked-git/stgit Stacked Git 652  rs Best score: 20% ](/task/stacked-git__stgit.430027d/) [ Isona/dirble Fast directory scanning and scraping tool 632  rs Best score: 67% ](/task/isona__dirble.e2dea9f/) [ YS-L/flamelens Flamegraph viewer in the terminal 622  rs Best score: 59% ](/task/ys-l__flamelens.0b4dc33/) [ mookid/diffr Yet another diff highlighting tool 612  rs Best score: 85% ](/task/mookid__diffr.2152742/) [ shashwatah/jot ⚡Rapid note management for the terminal. 609  rs Best score: 85% ](/task/shashwatah__jot.a92aad8/) [ Epistates/treemd A (TUI/CLI) markdown navigator with tree-based structural navigation. 603  rs Best score: 55% ](/task/epistates__treemd.825c6dd/) [ pier-cli/pier A CLI to organize and run short Unix shell scripts 596  rs Best score: 84% ](/task/pier-cli__pier.5e1bde9/) [ jrnxf/thokr ✨ sleek typing tui with visualized results and historical logging 595  rs Best score: 82% ](/task/jrnxf__thokr.09375ef/) [ ismaelgv/rnr A command-line tool to batch rename files and directories 581  rs Best score: 82% ](/task/ismaelgv__rnr.fc0733b/) [ sitkevij/hex 🔮 Futuristic take on hexdump, made in Rust.  
563  rs Best score: 92% ](/task/sitkevij__hex.61ae69b/) [ brocode/fblog Small command-line JSON Log viewer 561  rs Best score: 86% ](/task/brocode__fblog.3b54330/) [ codesnap-rs/codesnap 🦀️📸 Pure Rust tool to generate beautiful code snapshots, provide CLI and Library 557  rs Best score: 59% ](/task/codesnap-rs__codesnap.f81e4f3/) [ foriequal0/git-trim Automatically trims your branches whose tracking remote refs are merged or stray 548  rs Best score: 65% ](/task/foriequal0__git-trim.07c2f50/) [ axodotdev/oranda 🎁 generate beautiful landing pages for your developer tools  542  rs Best score: 54% ](/task/axodotdev__oranda.27d60c7/) [ elkowar/pipr A tool to interactively write shell pipelines. 541  rs Best score: 57% ](/task/elkowar__pipr.fae0b17/) [ paradigmxyz/solar Blazingly fast, modular and contributor friendly Solidity compiler, written in Rust 539  rs Best score: 43% ](/task/paradigmxyz__solar.5190d0e/) [ Lymphatus/caesium-clt Caesium Command Line Tools - Lossy/lossless image compression tool  537  rs Best score: 92% ](/task/lymphatus__caesium-clt.a529b2e/) [ agourlay/zip-password-finder Find the password of protected ZIP files. 534  rs Best score: 98% ](/task/agourlay__zip-password-finder.704700d/) [ rust-ethereum/ethabi Encode and decode smart contract invocations 525  rs Best score: 91% ](/task/rust-ethereum__ethabi.b1710ad/) [ ArthurSonzogni/json-tui A JSON terminal UI made in C++ 438  cpp Best score: 71% ](/task/arthursonzogni__json-tui.17a22b6/) [ tomarrell/wrapcheck A Go linter to check that errors from external packages are wrapped 374  go Best score: 81% ](/task/tomarrell__wrapcheck.c058da1/) [ NikolaDucak/caps-log A small TUI journaling tool. 
📖 370  cpp Best score: 62% ](/task/nikoladucak__caps-log.2cf2d1e/) [ mibk/dupl a tool for code clone detection 367  go Best score: 85% ](/task/mibk__dupl.1bf052b/) [ HaliteChallenge/Halite @twosigma's first artificial intelligence programming challenge 202  cpp Best score: 80% ](/task/halitechallenge__halite.822cfb6/)
## Citation
    @misc{yang2026programbenchlanguagemodelsrebuild,
          title={ProgramBench: Can Language Models Rebuild Programs From Scratch?},
          author={John Yang and Kilian Lieret and Jeffrey Ma and Parth Thakkar and Dmitrii Pedchenko and Sten Sootla and Emily McMilin and Pengcheng Yin and Rui Hou and Gabriel Synnaeve and Diyi Yang and Ofir Press},
          year={2026},
          eprint={2605.03546},
          archivePrefix={arXiv},
          primaryClass={cs.SE},
          url={https://arxiv.org/abs/2605.03546},
    }
Copyright © 2026 Meta Platforms, Inc.