# NanoGPT-Bench

**NanoGPT-Bench** is a benchmark for evaluating AI systems' ability to perform open-ended, long-horizon frontier ML research. It is built on top of the popular GPT-2 pretraining speedrun challenge [*NanoGPT Speedrun*](https://github.com/kellerjordan/modded-nanogpt), and measures how well autonomous coding agents can recover historical human progress on the leaderboard.

## Overview

In NanoGPT-Bench, agents work *fully autonomously* — with no human intervention and no internet access — to improve a strong human starting point on the *NanoGPT Speedrun*. Agents have a fixed compute budget for experimentation and submit candidate solutions through a `submit` command that:

1. Checks competition rules via an LLM judge (mirroring the original *NanoGPT Speedrun* review process).
2. Retimes the candidate across ten runs to confirm a statistically significant speedup.

The benchmark is parameterized by the starting human record and the compute budget, so the setup can be refreshed over time to avoid contamination.

### Why NanoGPT-Bench?

We've found three properties important for autonomous research evaluation:

1. **An open-ended problem** that requires agents to come up with ideas themselves, not just follow instructions.
2. **A strong, optimized starting point** so progress can't be confounded by low-hanging fruit.
3. **A long-horizon human reference** that highlights current deficiencies and indicates room for improvement.

The *NanoGPT Speedrun* is uniquely suited as an environment for autonomous research evaluation: it has a long history of expert human submissions, a clear validation oracle, and an open-ended optimization target.

### Initial Results

We evaluated three frontier coding agents — Codex (GPT-5.4 xhigh), Claude Code (Opus 4.6 Max), and a Claude Code variant using [Autoresearch](https://github.com/karpathy/autoresearch)-style prompting — each with a 512 H100-hour compute budget, starting from the September 3rd, 2025 human world record.

All baselines recover **less than 10%** of the speedup achieved by human world records over the subsequent five months (September 3rd, 2025 – January 19th, 2026):

| Baseline | % of Human Progress Recovered |
| --- | --- |
| Autoresearch (Opus 4.6 Max) | 9.3% |
| Codex (GPT-5.4 xhigh) | 8.6% |
| Claude Code (Opus 4.6 Max) | 8.2% |

Agents spent the majority of their compute on hyperparameter tuning. By contrast, ~77% of human world records introduce algorithmic changes. See the [blog post](#) for the full analysis.

## Repository Layout

```
NanoGPT-Bench/
├── nanogpt/                       # Host-side harness (driver, agents, prompts, launchers)
│   ├── driver.py                  # Container launcher invoked by nanogpt/run/*.sh
│   ├── prompts/                   # Shared agent prompts mounted into every run
│   │   ├── RULES.md
│   │   ├── problem.txt
│   │   ├── local_prompt.md
│   │   └── resume_prompt.md
│   ├── agents/                    # Per-agent harnesses (install.sh + run.sh entrypoint)
│   │   ├── claude/
│   │   ├── codex/
│   │   └── autoresearch/          # also carries its own `prompts/` overlay
│   └── run/                       # Top-level launcher scripts (entry points for a user)
│       ├── claude_local.sh
│       ├── claude_autoresearch_local.sh
│       └── codex_local.sh
├── image/                         # Docker build context (training environment + submit validator)
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── submit.sh                  # installed in-container as `submit`
│   ├── codebase/data/             # FineWeb10B shard fetcher (baked into the image)
│   └── tools/                     # submit validator (comparability judge + p-value retiming);
│                                  # copied to /opt/nanogpt/tools inside the container
└── human_baselines/               # Snapshot of historical human-record submissions (run.sh +
                                   # train_gpt.py per record). The 2025-09-03_FA3 record is the
                                   # comparability anchor; its serialized form is also baked
                                   # into image/tools/baseline_code.txt.
```

## Running the Benchmark

1. Build the image (one-time):
   ```bash
   docker build -t nanogpt-bench image
   ```
   The build prefetches 9 FineWeb10B training shards plus the validation shard into `/workspace/data/fineweb10B/` inside the image.

2. The Docker volume `nanogpt-bench-data` is mounted at `/workspace/data` at runtime. On first launch Docker auto-populates it from the shards baked into the image; subsequent runs reuse it. Override with `--data-volume` (or `BENCHMARK_DATA_VOLUME`) if you want a different volume name.

3. Export the credentials your chosen agent needs and the session-hours budget, then launch one of:
   ```bash
   export ANTHROPIC_API_KEY=...
   export BENCHMARK_SESSION_HOURS=24
   bash nanogpt/run/claude_local.sh
   # or: bash nanogpt/run/claude_autoresearch_local.sh
   # or: OPENAI_API_KEY=... CODEX_API_KEY=$OPENAI_API_KEY bash nanogpt/run/codex_local.sh
   ```

Each launcher invokes `nanogpt/driver.py`, which copies the `2025-09-03_FA3` human record into a fresh timestamped workspace under `runs/`, mounts the `nanogpt-bench-data` volume into the container, and runs the agent's `run.sh`. The driver streams the container logs to the terminal and persists the agent's events and renderer output under the run directory. Agents validate intermediate candidates by calling `submit /workspace/submissions/submission_N`, which runs the in-container comparability + p-value check and exits `0` on success.

## How to Cite

If you use NanoGPT-Bench in your research, please cite:

```bibtex
@misc{intology2026nanogptbench,
  title  = {NanoGPT-Bench: Evaluating Autonomous Research Agents on the NanoGPT Speedrun},
  author = {Intology},
  year   = {2026},
  howpublished = {\url{https://github.com/IntologyAI/NanoGPT-Bench}},
}
```

## Acknowledgements

NanoGPT-Bench is built on top of [modded-nanogpt](https://github.com/kellerjordan/modded-nanogpt) by Keller Jordan and the *NanoGPT Speedrun* community, whose record submissions provide the human reference trajectory used in this benchmark.