Define the problem and metric. Get breakthrough results.
Define your problem and evaluation criteria — EurekAgent coordinates off-the-shelf CLI agents to propose diverse approaches, implement them, run experiments, and iterate. Human intervention is optional but supported at every step.
## 📰 News

- **2026/06/13**: EurekAgent has been accepted to the BAAI Agent4S workshop! Join us for our presentation at the BAAI conference on June 13th, 2026 in Beijing. Slides will be available soon.
- **2026/06/12**: v0.1.0 released!

## 🔍 Overview

We present **EurekAgent**, an agent system for metric-driven autonomous scientific discovery. Define your problem and evaluation criteria: EurekAgent coordinates off-the-shelf CLI agents to propose diverse approaches, implement them, run experiments, and iterate. Human intervention is optional but supported at every step.

https://github.com/user-attachments/assets/c5b45b20-7eec-454e-98c3-6880bcec878b

### Highlights

- **Environment engineering first**: provides strong CLI agents with the resources, constraints, artifacts, budgets, and human interfaces needed for reliable autonomous discovery.
- **End-to-end research loop**: proposes approaches, implements code, evaluates submissions, and iterates toward better results.
- **Problem-defined evaluation**: uses your `INSTRUCTION.md`, `SUBMISSION_FORMAT.md`, and private `evaluate.py` as the source of truth.
- **Isolated execution**: runs agent work and grading in separate Docker containers for secure, sandboxed experiments.
- **Resumable long runs**: flexibly interrupt and resume a run from persisted state.
- **User-friendly interfaces**: optionally chat with agents through the TUI, and track live cost stats, score evolution, and full session logs in the web monitor.

## 🚀 Quick Start

### 1. Install Docker and Node.js 22+

**Docker**: follow the [official guide](https://docs.docker.com/engine/install/) for your platform.
Then add your user to the `docker` group:

```bash
sudo usermod -aG docker $USER
# Check that the user was added to the docker group
groups $USER
```

Log out and back in (or run `newgrp docker`) so the new group membership takes effect in your shell.

**Node.js 22+**: the agent container is built on the `node:22-bookworm` image, so install a matching **Node.js 22+** runtime on the host as well (from [nodejs.org](https://nodejs.org/) or via [nvm](https://github.com/nvm-sh/nvm)) and confirm:

```bash
nvm install 22
node --version  # must be v22 or newer
```

### 2. Install Claude Code

EurekAgent drives the experiment loop through [Claude Code](https://docs.claude.com/en/docs/claude-code/overview). It runs both on your **host** (for the `/generate-inputs` skill and problem authoring) and **inside the agent container** (preinstalled by the Docker image below).

**a) Install Claude Code on the host (requires Node.js 22+ from Step 1):**

```bash
npm install -g @anthropic-ai/claude-code
claude --version  # sanity check
```

**b) Authenticate and point Claude Code at your model endpoint.** EurekAgent forwards these into the agent container, so configure them once in `~/.claude/settings.json` under the `"env"` block:

```json
{
  "env": {
    "ANTHROPIC_AUTH_TOKEN": "YOUR_KEY_HERE",
    "ANTHROPIC_BASE_URL": "YOUR_BASE_URL_HERE",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-5.1",
    "API_TIMEOUT_MS": "3000000"
  },
  "model": "sonnet"
}
```

### 3. Install Python dependencies

```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc

# Clone and enter the project
git clone https://github.com/THU-Team-Eureka/EurekAgent.git && cd EurekAgent

# Install uv-managed Python 3.12
uv python install 3.12.12
```

### 4. Pull the base image and build the container

```bash
docker pull node:22-bookworm
bash docker/build.sh
```

Verify the image is available:

```bash
docker images | grep eureka-agent
```

If you are behind a proxy or `docker pull` fails, see the [Docker troubleshooting guide](assets/TROUBLESHOOTING.md).

### 5. (Recommended) Configure MCP servers for web access

During a run the agent can search the web for problem context and read live pages. These MCP servers are **optional**; when absent, the agent falls back to Claude Code's built-in `WebSearch`. `web-search-prime` is intended for GLM users; users of other model providers can skip it or configure their preferred search MCP.

**a) [`web-search-prime`](https://docs.z.ai/devpack/mcp/search-mcp-server): structured web search, for GLM users only.**

```bash
claude mcp add -s user -t http web-search-prime https://api.z.ai/api/mcp/web_search_prime/mcp --header "Authorization: Bearer YOUR_KEY_HERE"
```

**b) [`playwright`](https://github.com/microsoft/playwright-mcp): fetch and read actual webpage content.**

```bash
claude mcp add playwright npx @playwright/mcp@latest
npx playwright install chromium  # pre-install the headless browser
```

EurekAgent ships a Playwright config at `.claude/playwright-mcp.json` (headless Chromium, sandbox flags, timeouts). It is mounted read-only into the agent container automatically. Edit that file to match your network (e.g. add a proxy) if needed.

### 6. Run an example

```bash
bash examples/circle_packing/run.sh
```

## 🧠 Setting Up a New Problem

You can use the `/generate-inputs` skill in Claude Code to interactively generate all required files (`INSTRUCTION.md`, `SUBMISSION_FORMAT.md`, `evaluate.py`, `run.sh`) from a natural language description of your problem. Just type `/generate-inputs` and follow the prompts.

Each problem lives in its own directory under `examples/`. You need the following files:

### Required Files

| File | Purpose | Required? |
|------|---------|-----------|
| `INSTRUCTION.md` | Problem description for the LLM agent | Yes |
| `SUBMISSION_FORMAT.md` | JSON schema for candidates and score semantics | Yes |
| `hidden_eval_dir/evaluate.py` | Private evaluator with `grade_submission` and `is_better` | Yes |
| `initial.py` | Starting code for the agent | Recommended |
| `run.sh` | Convenience script to launch a run | Recommended |

### evaluate.py Specification

The evaluator is the single source of truth for scoring and comparison. It must define two functions:

#### `grade_submission(submission_path: str, context: dict) -> dict`

Called by the secure grader server to score a candidate submission.

- **Parameters**:
  - `submission_path`: path to the JSON file the agent submitted
  - `context`: dict with `workspace_root`, `approach_id`, `metadata`
- **Returns** a dict with:
  - `score` (float): the raw objective value. Do NOT negate. Return the value as-is (e.g., the C5 value for a minimization problem, or sum of radii for a maximization problem).
  - `valid` (bool): whether the submission is valid
  - `message` (str): human-readable feedback
  - `opt_target_met` (bool, optional): whether an optimization target was met
  - `public_metrics` (dict, optional): additional metrics for display
- **Invalid submissions**: return a score that can never be "best". Use `float("inf")` for minimization problems, `float("-inf")` for maximization, or `float("inf")` for approach-target problems.

#### `is_better(new_score: float, old_score: float) -> bool`

Defines which score is better. Called by the system to compare scores for ranking, best-result tracking, and display.

- **Returns**: `True` if `new_score` represents a better result than `old_score`
- **Examples**:
  - Minimization: `return new_score < old_score`
  - Maximization: `return new_score > old_score`
  - Approach target (e.g., π): `return abs(new_score - 3.14159) < abs(old_score - 3.14159)`

Both functions are **required**.
The system will fail at startup if either is missing.

### INSTRUCTION.md

Must clearly state:

- The optimization objective and its direction (minimize, maximize, approach target, etc.)
- Constraints and validation rules
- Known best results (if any) or target score
- The contract for the `run()` function

### SUBMISSION_FORMAT.md

Must describe:

- Required JSON keys and their types
- Score semantics (e.g., "Score is the raw C5 value. Lower is better.")
- Invalid submission behavior

### run.sh

A convenience script. Must pass at minimum:

- `--problem`: path to INSTRUCTION.md
- `--hidden-eval-dir`: path to the directory containing evaluate.py
- `--submission-format`: path to SUBMISSION_FORMAT.md
- `--model`: the model to use
- Time budget flags: `--propose-time-limit-per-session` and `--implement-time-limit-per-session`

Example:

```bash
cd "$(dirname "$0")/../.."
uv run python -m src \
  --model glm-5.1 \
  --problem examples/my_problem/INSTRUCTION.md \
  --hidden-eval-dir examples/my_problem/hidden_eval_dir \
  --submission-format examples/my_problem/SUBMISSION_FORMAT.md \
  --initial-code examples/my_problem/initial.py \
  --propose-time-limit-per-session "20 minutes" \
  --implement-time-limit-per-session "120 minutes" \
  --max-num-approaches 3 \
  --max-loops 5 \
  --gpus auto \
  --adapter-mode "pty"
```

GPU selection defaults to `--gpus auto`. For CPU-only runs, pass `--gpus none`. For a Linux NVIDIA server, pass explicit IDs such as `--gpus 0,1` if you want to restrict the run to a subset of GPUs.

## 💡 Useful Tips

### Best practices for new problems

- Design evaluators defensively: consider obvious reward-hacking paths, invalid outputs, hidden-test leakage, tolerance abuse, filesystem side effects, and score tampering.
- Include the current SOTA, best known score, or target score in `INSTRUCTION.md` so agents know what result they are trying to beat.
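As a rough illustration of the defensive-evaluator advice, here is a minimal sketch of an `evaluate.py` that follows the `grade_submission` / `is_better` contract from the evaluate.py specification. It is not taken from the EurekAgent repository. The toy problem (maximize a sum of radii), the JSON key `radii`, the `MAX_ITEMS` limit, and the helper names `_invalid` and `_is_positive_finite` are all assumptions made for this example.

```python
import json
import math

# Hypothetical toy problem: maximize the sum of a list of radii.
# The "radii" key and MAX_ITEMS are illustrative assumptions, not EurekAgent rules.
MAX_ITEMS = 32
INVALID_SCORE = float("-inf")  # maximization: an invalid result can never be "best"


def _invalid(message: str) -> dict:
    """Build the result dict for a rejected submission."""
    return {"score": INVALID_SCORE, "valid": False, "message": message}


def _is_positive_finite(x) -> bool:
    """Accept only real numbers that are finite and > 0 (bools are rejected)."""
    if isinstance(x, bool) or not isinstance(x, (int, float)):
        return False
    try:
        return math.isfinite(x) and x > 0
    except OverflowError:  # integer too large to convert to float
        return False


def grade_submission(submission_path: str, context: dict) -> dict:
    """Score one candidate. Returns a result dict instead of raising."""
    # `context` is unused in this toy example.
    try:
        with open(submission_path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, ValueError) as exc:  # missing file, bad encoding, bad JSON
        return _invalid(f"could not read submission: {exc}")

    radii = data.get("radii") if isinstance(data, dict) else None
    if not isinstance(radii, list) or not 0 < len(radii) <= MAX_ITEMS:
        return _invalid(f"'radii' must be a list of 1 to {MAX_ITEMS} numbers")
    if not all(_is_positive_finite(r) for r in radii):
        # Python's json module parses NaN and Infinity, so they are rejected here.
        return _invalid("every radius must be a finite positive number")

    score = float(sum(radii))
    if not math.isfinite(score):
        return _invalid("sum of radii overflowed")
    return {
        "score": score,  # raw objective value, not negated
        "valid": True,
        "message": f"sum of radii = {score:.6f}",
        "public_metrics": {"count": len(radii)},
    }


def is_better(new_score: float, old_score: float) -> bool:
    """Maximization: a larger raw score is better."""
    return new_score > old_score
```

Because invalid submissions score `float("-inf")`, `is_better` can never rank them above a valid result. A real evaluator would add the problem's own constraint checks at the point where this sketch only validates types and ranges.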
### Monitor & snapshots

- **Live monitor**: starts automatically in the background during a run (disable with `--no-monitor`, pick a port with `--monitor-port`). It prints a `Web monitor: http://127.0.0.1: