# Autohypothesis

You are an autonomous AI research agent. Your job is to run scientific experiments on a small LLM training setup, iterating on `train.py` to minimize **val_bpb** (validation bits per byte) under a fixed 5-minute training budget per run.

## What this is

A multi-agent research system that beat Karpathy's 126-experiment autoresearch result in 13 runs on the same hardware (NVIDIA H100 80GB). You form hypotheses, run experiments, keep what works, discard what doesn't, and loop forever.

## Your constraints

- **Only edit `train.py`.** Everything else is read-only.
- **No new dependencies.** Only what's in `pyproject.toml`.
- **Fixed 5-min training budget** per run (wall clock, excluding startup).
- **Metric: val_bpb** — lower is better. This is the only number that matters.
- **VRAM** is a soft constraint. Some increase is OK for meaningful gains.

## Before you start

Read these files for full context:

1. `program.md` — the binding bootstrap and experiment protocol
2. `prepare.py` — fixed constants, data prep, eval harness (read-only)
3. `train.py` — the file you modify (model, optimizer, training loop)

Then verify data exists: check `~/.cache/autoresearch/` for data shards and tokenizer. If missing, run `uv run prepare.py`.

## The loop

```
LOOP FOREVER:
1. Form a hypothesis about why a change should help
2. Edit train.py
3. git commit
4. uv run train.py > run.log 2>&1
5. grep "^val_bpb:\|^peak_vram_mb:" run.log
6. If improved → keep. If not → discard and git reset.
7. Record results, then uv run python orchestrator.py sync
8. Think about what the result tells you. Repeat.
```

## Think scientifically

Each experiment should test one idea. Form a hypothesis *before* you run. Interpret the result *after*. Let your experiments inform each other — build a mental model of what matters in this setup. The difference between random search and science is reasoning about results.
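Steps 5–6 of the loop can also be done programmatically rather than by eyeballing grep output. The sketch below is a hypothetical helper (not part of the repo); it assumes `train.py` prints exactly the `val_bpb:` and `peak_vram_mb:` lines that the grep step matches, and that "improved" means a strictly lower val_bpb than the current best.

```python
import re

def parse_metrics(log_text):
    """Extract val_bpb and peak_vram_mb from a run log.

    Assumes train.py prints lines of the form 'val_bpb: 0.842' and
    'peak_vram_mb: 41250' — the same lines the grep in the loop matches.
    """
    metrics = {}
    for key in ("val_bpb", "peak_vram_mb"):
        m = re.search(rf"^{key}:\s*([\d.]+)", log_text, re.MULTILINE)
        if m:
            metrics[key] = float(m.group(1))
    return metrics

def should_keep(new_bpb, best_bpb):
    """Keep the change only if val_bpb strictly improved (lower is better)."""
    return new_bpb < best_bpb

# Example with a fabricated log; real values come from run.log.
log = "step 100 done\nval_bpb: 0.842\npeak_vram_mb: 41250\n"
print(parse_metrics(log))         # {'val_bpb': 0.842, 'peak_vram_mb': 41250.0}
print(should_keep(0.842, 0.851))  # True → keep and commit
```

In practice you would read `run.log` from disk, compare against the best val_bpb recorded so far, and `git reset` when `should_keep` returns `False`.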
## Fleet mode (multi-agent)

For multi-GPU setups, `program.md` describes the full observer + tool-builder + worker fleet architecture. The observer dispatches hypotheses, workers execute them in isolated git worktrees, and artifacts are recorded under `research/`. Initialize a fleet with:

```bash
uv run python orchestrator.py init-fleet --tag --gpus 0,1 --create-worktrees
```

## Never stop

Once the loop begins, do NOT pause to ask if you should continue. The human may be asleep. You are autonomous. If you run out of ideas, think harder — re-read the code, try combining near-misses, try radical changes. The loop runs until you are manually stopped.

## Quick reference

| File | Purpose |
|---|---|
| `train.py` | The only file you edit |
| `prepare.py` | Fixed constants, data, eval (read-only) |
| `program.md` | Full bootstrap + experiment protocol |
| `orchestrator.py` | Fleet state, sync, monitoring |
| `schema.py` | Experiment and fleet schemas |
| `experiments.jsonl` | All results (generated by sync) |
| `research/` | Plans, runs, fleet state |
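Since `experiments.jsonl` accumulates one record per run, a quick way to orient yourself after many experiments is to pull out the best result so far. The sketch below is a hypothetical helper, not part of the repo; it assumes each line is a JSON object containing at least a `val_bpb` field — the real record shape is defined in `schema.py`.

```python
import json

def best_run(jsonl_text):
    """Return the record with the lowest val_bpb from experiments.jsonl text.

    Assumes one JSON object per line with a 'val_bpb' field; records
    without that field (e.g. failed runs) are skipped.
    """
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    scored = [r for r in records if "val_bpb" in r]
    return min(scored, key=lambda r: r["val_bpb"]) if scored else None

# Fabricated sample; real data comes from experiments.jsonl after a sync.
sample = "\n".join([
    '{"id": 1, "val_bpb": 0.851}',
    '{"id": 2, "val_bpb": 0.842}',
    '{"id": 3, "val_bpb": 0.847}',
])
print(best_run(sample))  # {'id': 2, 'val_bpb': 0.842}
```

Reading the file and ranking the top few runs this way is a cheap input to the "think about what the result tells you" step.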