# AcademiClaw

**When Students Set Challenges for AI Agents**

_A bilingual benchmark of 80 long-horizon tasks drawn from real undergraduate academic workflows that current AI agents fail to solve._

[![arXiv](https://img.shields.io/badge/arXiv-2605.02661-b31b1b.svg)](https://arxiv.org/abs/2605.02661)
[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Paper-yellow)](https://huggingface.co/papers/2605.02661)
[![License](https://img.shields.io/badge/License-Apache%202.0%20%2B%20third--party-blue.svg)](#license)
[![Tasks](https://img.shields.io/badge/tasks-80-informational)](academiclaw/)
[![Languages](https://img.shields.io/badge/languages-EN%20%7C%20ZH-informational)](#dataset-composition)
[![GPU Tasks](https://img.shields.io/badge/GPU%20tasks-16-green)](#dataset-composition)
[![Models](https://img.shields.io/badge/evaluated%20models-6-purple)](#main-results)
[![Leaderboard](https://img.shields.io/badge/%F0%9F%8F%86%20Leaderboard-live-ff6b35?style=flat-square)](https://gair-nlp.github.io/AcademiClaw/)

**Paper:** [arXiv:2605.02661](https://arxiv.org/abs/2605.02661) · [HuggingFace Papers](https://huggingface.co/papers/2605.02661)

### 🌐 [**Explore the interactive leaderboard →**](https://gair-nlp.github.io/AcademiClaw/)

_Live dashboard at <https://gair-nlp.github.io/AcademiClaw/>: compare six frontier models head-to-head on all 80 tasks, drill into per-task scores, tool-usage breakdowns, S1-S5 safety audits, and cost per task._
---

## What is AcademiClaw?

AcademiClaw is a bilingual (English + Chinese) benchmark designed to measure AI agents on **the academic work their end users already struggle to get done**. Every task was contributed by an undergraduate student who had already attempted it with at least one mainstream AI coding agent (Claude Code, Codex, Cursor) and found the agent unable to finish the job without extensive hand-holding.

From **230 student-submitted candidates**, a two-round expert review distilled the final set to **80 high-quality tasks** spanning **six categories** and **25+ professional domains**, from olympiad-level reasoning and full-stack system debugging to GPU-intensive reinforcement learning and long-context literary comprehension. Experiments on six frontier models show that **even the best one solves only 55% of the tasks**, with striking per-category capability gaps that aggregate scores hide.
_Figure: AcademiClaw evaluation pipeline._

Each task runs in an isolated Docker sandbox built from a two-layer image hierarchy. The agent operates freely with tools (read / write / edit / exec / search / browser). A task-specific rubric then scores the output via six complementary techniques, producing a 0-100 score.
---

## Highlights

- **80 tasks, real academic provenance**: 49 English + 31 Chinese, curated from student-submitted problems that had already defeated mainstream AI agents
- **Six-category taxonomy, 25+ domains**: balanced across research, ML engineering, software engineering, STEM reasoning, language/creativity, and applied domains
- **Mixed compute requirements**: 64 CPU-only + 16 GPU-required tasks, routed automatically
- **Reproducible by construction**: every task runs in an isolated two-layer Docker image; dependencies are pinned
- **Multi-dimensional evaluation**: 3-6 orthogonal scoring dimensions per task, combining six complementary techniques (pattern matching, code execution, LLM-as-Judge, vision LLM, end-to-end browser testing, structured-output validation) plus a five-category safety audit
- **Multi-agent harness**: ships with adapters for Claude Code, any OpenAI-compatible endpoint (`openclaw`), and a manual mode
- **Released evaluation trajectories**: full sanitized conversation logs and rubric scores for six frontier models are included in the repo

---

## Dataset composition
_Figure: Task distribution across six primary categories and 25+ professional domains._
| Category | Tasks | Representative examples |
| --- | :---: | --- |
| Research & Analysis | 21 | ESP32-S3 multi-peripheral firmware analysis (I2S / I2C / SPI); environment-stripped F1 driver advantage estimation |
| ML & AI Engineering | 17 | Ascend NPU multilingual ASR deployment (fairseq2); isotropic SVD multi-task model merging (Iso-C / Iso-CTS) |
| Software Engineering | 17 | BVH-accelerated Monte Carlo path tracing renderer; incident forensics with obfuscated payload decryption |
| STEM Reasoning | 11 | CMO 2024 proof; IOL 2025 linguistics olympiad; constraint-satisfaction murder-mystery deduction |
| Language & Creativity | 7 | Classical-to-modern Chinese lyric adaptation; funk-track Locking dance choreography with musical analysis |
| Applied & Domain-Specific | 7 | Riichi mahjong shanten and tile-acceptance calculator; multi-constraint travel itinerary synthesis |

Additional splits:

| Split | Count | Notes |
| --- | :---: | --- |
| English (`en_*`) | 49 | |
| Chinese (`zh_*`) | 31 | _Natively_ Chinese (tonal prosody, Shuangpin, Chinese essay rubrics, ...), not translated instructions |
| GPU-required | 16 | CUDA-dependent training, rendering, or inference |
| CPU-only | 64 | |
| **Total** | **80** | |

---

## Repository layout

```
academiclaw/                      80 task directories (en_* + zh_*)
  <task>/
    description.json              task metadata: id, name, deliverables, upstream provenance
    Dockerfile | Dockerfile.cuda  task-specific image spec (inherits from base)
    eval_task.py                  harness entry point
    workspace/query.md            the task prompt the agent reads
    context/                      read-only reference material (third-party content retains its original license)
    eval/rubric.py                scoring logic (pattern match / exec / LLM judge / vision / Playwright / schema)
    openclaw/<model>/
      conversation_log.json       full sanitized agent trajectory
      meta_eval.json              rubric scores + per-dimension breakdown
  QUERY_CATALOG.md                auto-generated per-task index (English)
  QUERY_SUMMARY_ZH.md             auto-generated per-task index (Chinese)
docker/                           shared base images
  Dockerfile                      CPU base (agencybench-sandbox: Ubuntu 24.04 + Python 3.11 + Node.js 22)
  Dockerfile.cuda                 GPU base (agencybench-sandbox-cuda: +CUDA 12.2 + cuDNN)
  docker-entrypoint.sh
  requirements.txt
assets/                           figures used in this README
run_in_docker.sh                  runs a single task end-to-end
build_all_images.sh               parallel build of per-task images (CPU / GPU / both)
batch_eval.sh                     parallel dispatch of run_in_docker.sh across tasks
analysis/                         post-evaluation aggregation scripts
  token_efficiency.py             rank models by avg tokens / task
  aggregate_report.py             Markdown report matching the paper appendix
  pricing.example.json            optional cost-estimation config
.env.example                      credentials template (copy to .env and fill in)
```

Each task's `openclaw/<model>/` directory ships the sanitized conversation log and rubric score for the six evaluated models. Per-attempt workspace snapshots (`attempt_*/`) are not included in the repository to keep size manageable.

---

## Quick start

### Prerequisites

- Docker ≥ 20.10
- A POSIX shell (`bash` / `zsh`)
- For GPU tasks: NVIDIA Container Toolkit and a visible NVIDIA device
- ImageMagick / fonts / extra runtimes are added per-task by each `Dockerfile`; you don't need to install them on the host

### 1. Build the base images

```bash
docker build -t agencybench-sandbox -f docker/Dockerfile docker/
docker build -t agencybench-sandbox-cuda -f docker/Dockerfile.cuda docker/   # only if you will run GPU tasks
```

### 2. Configure credentials

```bash
cp .env.example .env
$EDITOR .env
```

`.env.example` is documented inline. The minimum you need to fill in to run the shipped `openclaw` backend:

| Variable | Meaning |
| --- | --- |
| `OPENCLAW_API_KEY` / `OPENCLAW_BASE_URL` / `OPENCLAW_MODEL` | OpenAI-compatible endpoint that will play the role of the agent |
| `EVAL_TEXT_API_KEY` / `EVAL_TEXT_API_BASE_URL` / `EVAL_TEXT_MODEL` | LLM-as-Judge model used by some rubrics (we used GPT-5.2 in the paper) |
| `EVAL_VISION_API_KEY` / `EVAL_VISION_API_BASE_URL` / `EVAL_VISION_MODEL` | Vision judge for tasks whose rubric grades rendered output |
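If you want to sanity-check your `.env` before launching any containers, a minimal pre-flight script like the one below works. It is a sketch, not part of the shipped harness; it only checks the variable names listed in the table above (add the `ANTHROPIC_*` keys if you plan to use the `claude_code` backend).

```python
# check_env.py -- hedged pre-flight check; not part of the harness
from pathlib import Path

REQUIRED = [
    "OPENCLAW_API_KEY", "OPENCLAW_BASE_URL", "OPENCLAW_MODEL",
    "EVAL_TEXT_API_KEY", "EVAL_TEXT_API_BASE_URL", "EVAL_TEXT_MODEL",
    "EVAL_VISION_API_KEY", "EVAL_VISION_API_BASE_URL", "EVAL_VISION_MODEL",
]

# Parse KEY=VALUE lines from .env, skipping blanks and comments
env = {}
for raw in Path(".env").read_text().splitlines():
    line = raw.strip()
    if line and not line.startswith("#") and "=" in line:
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()

missing = [key for key in REQUIRED if not env.get(key)]
print("missing or empty keys:", ", ".join(missing) if missing else "none")
```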
### 3. Run a single task

```bash
./run_in_docker.sh academiclaw/en_graph_algorithms
```

Useful flags:

| Flag | Effect |
| --- | --- |
| `-a <agent>` | agent backend: `claude_code`, `openclaw`, `manual` |
| `-m <model>` | override the model name configured in `.env` |
| `-n <attempts>` | max retry attempts (default 1) |
| `-e` | evaluation-only (don't re-run the agent; just score an existing workspace) |
| `-r` | force rebuild of the task's image |
| `-g` / `--no-gpu` | override GPU auto-detection |
| `--debug` | drop into an interactive shell inside the container |

Output lands in `academiclaw/<task>/<agent>/<model>/`: `attempt_<n>/`, `conversation_log.json`, `meta_eval.json`.

---

## Batch: build + evaluate the whole suite

### Build all 80 per-task images in parallel

```bash
./build_all_images.sh                 # default: 8 parallel builds
./build_all_images.sh -j 16           # wider parallelism
./build_all_images.sh --rebuild       # force rebuild everything
./build_all_images.sh --cpu-only      # skip the 16 GPU tasks
./build_all_images.sh --gpu-only      # only GPU tasks
./build_all_images.sh --only en_cmo_proof,zh_huaxue_jingsai
```

The script first ensures the two base images exist, then walks `academiclaw/*/` and builds an `agencybench-academiclaw-<task>` image for every task that ships its own Dockerfile. GPU / CPU base selection follows the four-heuristic classifier from the paper (Dockerfile `FROM` line, presence of `.cu` sources, env-var hints, query-text keywords). Build logs land in `/tmp/academiclaw_build_logs/`.

### Evaluate the whole suite

```bash
# Uses the repo-root .env; copies it into each task dir just for the run,
# and cleans up on exit.
./batch_eval.sh                                 # 8-wide, openclaw backend, all 80 tasks
./batch_eval.sh -j 16 --agents openclaw         # wider parallelism
./batch_eval.sh --agents claude_code,openclaw   # two agents back-to-back per task
./batch_eval.sh --resume                        # skip tasks with an existing meta_eval.json
./batch_eval.sh --cpu-only                      # skip the 16 GPU tasks
./batch_eval.sh --gpu-only                      # only GPU tasks; assigns one GPU per container
./batch_eval.sh --only en_cmo_proof,zh_huaxue_jingsai
./batch_eval.sh --dry-run                       # print the plan without executing
```

Per-run logs go to `logs/<run>/<task>_<agent>_<model>.log`. A summary of successes / failures prints at the end; failed tasks can be re-run via `--resume`.

> **Tip: multiple models.** To sweep several models of the same agent
> family, run `batch_eval.sh` once per model with a distinct `.env` that
> sets `OPENCLAW_MODEL=...`, pointing each run at a different `--env`
> file:
>
> ```bash
> for M in claude-opus-4-6 claude-sonnet-4-6 gpt-5.4; do
>   OPENCLAW_MODEL=$M envsubst < .env.tmpl > ".env.$M"
>   ./batch_eval.sh --env ".env.$M" --resume
> done
> ```
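Each finished run leaves a `meta_eval.json` under `academiclaw/<task>/openclaw/<model>/`. Before aggregating, a short script like the sketch below can flag tasks that are still missing a result (candidates for `--resume`) or fell below the 75-point pass threshold. The top-level `score` field name is an assumption; check your own `meta_eval.json` files for the exact schema.

```python
# scan_results.py -- hedged sketch; the "score" field name is an assumption
import json
from pathlib import Path

MODEL = "gpt-5.4"        # whichever OPENCLAW_MODEL you evaluated
PASS_THRESHOLD = 75      # unified pass threshold used by the rubrics

for task_dir in sorted(p for p in Path("academiclaw").iterdir() if p.is_dir()):
    meta_path = task_dir / "openclaw" / MODEL / "meta_eval.json"
    if not meta_path.exists():
        print(f"MISSING  {task_dir.name}")            # rerun with --resume
        continue
    meta = json.loads(meta_path.read_text())
    score = meta.get("score", 0)                      # assumed field name
    verdict = "PASS" if score >= PASS_THRESHOLD else "fail"
    print(f"{verdict:8} {task_dir.name}  {score}")
```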
### 4. Aggregate metrics across models

Once `meta_eval.json` files exist under `academiclaw/<task>/openclaw/<model>/`, two scripts in `analysis/` turn them into reports (see [`analysis/README.md`](analysis/README.md) for full details):

```bash
# Rank models by token consumption
python3 analysis/token_efficiency.py --output reports/token_efficiency.json

# Full Markdown report: efficiency, tool use, safety (S1-S5),
# correlations, timeouts, per-task variation, key findings
python3 analysis/aggregate_report.py \
  --pricing analysis/pricing.example.json \
  --output reports/aggregate_report.md
```

The safety scores themselves are computed **per-task inside each container** by the `SafetyScorer` embedded in every `eval_task.py` (a rule engine over tool-call traces, adjudicated by an LLM judge) and written into `meta_eval.json` at the key `safety_scores`. The scripts above only aggregate; rerunning them after more evaluations finish updates the report without re-running any container.

Pricing is optional; `pricing.example.json` ships the list prices in effect during the paper experiments, but cloud rates change, so **verify before using for cost estimation**.

---

## Evaluation framework

### Multi-dimensional rubrics (six complementary techniques)

All rubrics produce scores on a unified **0-100 scale**, and a task counts as **passed when the score reaches 75 or above**. Each task defines its own `eval/rubric.py` implementing `evaluate(answer_dir) → (score, report)` with **3-6 orthogonal scoring dimensions that sum to 100 points**. Rather than relying on a single criterion, the rubrics draw on six complementary verification techniques:

| Technique | What it does |
| --- | --- |
| **Pattern matching** | Regular expressions, keyword detection, and AST parsing verify structural properties of code and text. |
| **Code execution** | Compiles agent-produced programs (C++, Python, ...), runs unit tests, and compares outputs with reference solutions. |
| **LLM-as-Judge** | Evaluates open-ended deliverables (reports, analyses, creative writing) against a structured sub-rubric. Backed by a deterministic heuristic fallback when the external model is unavailable. |
| **Vision LLM** | Compares rendered graphics, charts, or GUI screenshots against reference images. |
| **End-to-end browser testing** | Uses Playwright to launch agent-produced web apps in a headless browser, interact with dynamic elements, and capture screenshots for pixel-level comparison. |
| **Structured-output validation** | JSON schema checks, CSV programmatic verification, BibTeX parsing with fuzzy title matching, Excel cell inspection. |

All LLM-as-Judge rubrics in the paper were scored by a **single fixed judge model** (we used GPT-5.2) to keep per-dimension scores comparable across evaluated agents.
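To make the `evaluate(answer_dir) → (score, report)` contract concrete, here is a deliberately minimal, hypothetical rubric with three dimensions summing to 100 points (one pattern-matching check, one structured-output check, one code-execution check). It sketches the interface shape only; the deliverable file names are invented, and real rubrics in `eval/` are richer and may call the LLM or vision judges.

```python
# eval/rubric.py -- illustrative sketch only; file names below are hypothetical
import json
import subprocess
from pathlib import Path


def evaluate(answer_dir: str) -> tuple[float, dict]:
    answer = Path(answer_dir)
    dims = {}

    # Dimension 1 (30 pts, pattern matching): the written deliverable exists and is non-trivial
    report_md = answer / "report.md"                               # hypothetical deliverable
    dims["report_present"] = 30 if report_md.exists() and len(report_md.read_text()) > 200 else 0

    # Dimension 2 (30 pts, structured-output validation): results.json parses and has the expected key
    dims["structured_output"] = 0
    try:
        data = json.loads((answer / "results.json").read_text())   # hypothetical file
        dims["structured_output"] = 30 if "accuracy" in data else 15
    except (FileNotFoundError, json.JSONDecodeError):
        pass

    # Dimension 3 (40 pts, code execution): the agent's script runs cleanly
    dims["executes"] = 0
    try:
        proc = subprocess.run(["python", str(answer / "solve.py")],  # hypothetical entry point
                              capture_output=True, timeout=120)
        dims["executes"] = 40 if proc.returncode == 0 else 0
    except subprocess.TimeoutExpired:
        pass

    score = float(sum(dims.values()))                              # dimensions sum to 100
    return score, {"dimensions": dims, "passed": score >= 75}
```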
### Five-category safety audit

A rule-based scorer inspects the agent's tool-call trajectory along five risk axes, with optional LLM verification for ambiguous cases:

- **S1: Destructive operations** (unauthorized file deletion, system modification)
- **S2: Information leakage** (unintended data exposure)
- **S3: Boundary compliance** (adherence to the task's stated constraints)
- **S4: Privilege escalation** (action beyond the agent's intended scope)
- **S5: Supply-chain risks** (installing unvetted packages, executing untrusted code)

Per-category scores are combined via weighted aggregation into a single 0-100 safety score.

### Trajectory logging

An API logging proxy captures every LLM call (token counts, latency, estimated cost) alongside the full conversation trace (tool invocations with arguments and results). This enables post-hoc analysis of tool-use patterns and cost-efficiency trade-offs.

---

## Sandbox architecture

Each task is distributed as a self-contained package: a natural-language prompt (`workspace/query.md`), optional reference materials (`context/`), and structured metadata (`description.json`). **The evaluation rubric is withheld from the agent throughout execution.**

Containers follow a **two-layer image hierarchy**:

- **Base (CPU)**: `agencybench-sandbox`, Ubuntu 24.04 with Python 3.11, Node.js 22, the OpenClaw CLI, `build-essential`, `supervisord`, and `fonts-noto-cjk` for bilingual rendering.
- **Base (GPU)**: `agencybench-sandbox-cuda`, which inherits the CPU base and layers CUDA 12.2 + cuDNN on top.
- **Per-task**: inherits from the appropriate base and adds task-specific dependencies (Playwright, PyTorch, JDK, LaTeX, ...).

**GPU-requirement detection** uses four heuristics combined by logical OR: the task's `Dockerfile` `FROM` directive; presence of `.cu` files under `context/`; CUDA-related env vars; keyword classification of `query.md`. Multi-GPU hosts receive one device per container via deterministic hashing (see the sketch at the end of this section).

**Workspace isolation**: a recursive snapshot is taken before the agent runs; another is taken after. The rubric operates on the structural diff, so scoring reflects only the agent's own work, not pre-existing files.
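The GPU-routing logic described above is easy to approximate. The sketch below re-implements the four OR-combined heuristics and the deterministic device assignment in plain Python; it illustrates the idea but is not the classifier shipped in `build_all_images.sh` / `run_in_docker.sh`, and the keyword list is an assumption.

```python
# gpu_routing_sketch.py -- illustrative only, not the harness's actual classifier
import hashlib
import re
from pathlib import Path

CUDA_HINT = re.compile(r"\b(cuda|cudnn|gpu|nvidia)\b", re.IGNORECASE)  # assumed keyword list


def needs_gpu(task_dir: Path) -> bool:
    """OR of the four heuristics described above."""
    dockerfiles = [p for p in (task_dir / "Dockerfile", task_dir / "Dockerfile.cuda") if p.exists()]
    lines = [line for p in dockerfiles for line in p.read_text().splitlines()]

    # 1. Dockerfile FROM directive mentions a CUDA base image
    from_cuda = any(l.strip().upper().startswith("FROM") and "cuda" in l.lower() for l in lines)
    # 2. CUDA sources bundled with the reference material
    cu_sources = (task_dir / "context").is_dir() and any((task_dir / "context").rglob("*.cu"))
    # 3. CUDA-related env vars declared in the image spec
    env_hint = any(l.strip().upper().startswith("ENV") and "CUDA" in l.upper() for l in lines)
    # 4. keyword classification of the task prompt
    query = task_dir / "workspace" / "query.md"
    query_hint = query.exists() and bool(CUDA_HINT.search(query.read_text()))

    return from_cuda or cu_sources or env_hint or query_hint


def assign_gpu(task_id: str, n_gpus: int) -> int:
    """Deterministic one-device-per-container assignment on multi-GPU hosts."""
    return int(hashlib.sha256(task_id.encode()).hexdigest(), 16) % n_gpus
```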
---

## Main results

> 🏆 **Interactive leaderboard:** <https://gair-nlp.github.io/AcademiClaw/>
> Sort by cost, safety, or efficiency; expand any row for per-task
> scores, tool-call breakdown, and the full S1-S5 safety audit.

| Model | Avg Score | Pass (%) | Tokens / task (K) | Tools / task | Time (s) | Safety |
| --- | :---: | :---: | :---: | :---: | :---: | :---: |
| Claude Opus 4.6 | **71.9** | **55.0** | 1 425 | 33 | 673 | 87.4 |
| Claude Sonnet 4.6 | 68.3 | **55.0** | 1 562 | 26 | 662 | **88.7** |
| GPT-5.4 | 65.6 | 42.5 | **525** | **19** | **240** | 87.5 |
| Gemini 3.1 Pro | 64.3 | 43.8 | 2 857 | 57 | 822 | 74.9 |
| Qwen3.5-397B-A17B† | 64.7 | 40.0 | 970 | 26 | – | 80.8 |
| MiniMax M2.7 | 63.1 | 37.5 | 1 663 | 37 | 686 | 86.5 |

† Self-hosted open-source model; wall-clock latency not directly comparable. Efficiency metrics are per-task averages; safety is a weighted aggregate of five audit dimensions. Full per-task numbers and domain breakdowns are in the paper.

Two findings worth highlighting:

**1. What a task tests matters more than which model attempts it.** Cross-category mean scores range from 76.9 (Language & Creativity) down to 50.6 (STEM Reasoning), a 26.3-point gap, whereas the cross-model mean ranges only from 71.9 (Opus) to 63.1 (MiniMax). On competition-level tasks (`zh_huaxue_jingsai`, `en_fullstack_debug`) every model collapses to 23-27 with near-zero variance, indicating systematic capability gaps rather than stochastic errors.

_Figure: Domain-level quality by model and category._
_Figure: Pairwise model score-vector correlations._

**2. Frontier models occupy distinct capability phenotypes.** Pairwise Pearson r between the six models' per-task score vectors ranges from 0.27 (GPT-5.4 vs. Gemini) to 0.73 (Qwen3.5 vs. MiniMax). The spread is statistically significant (Fisher z, p = 6.5 × 10⁻⁵), which means these models are not ranked along a single scalar ability axis; they excel on complementary subsets of tasks.

Token consumption also correlates essentially zero with task score (pooled Pearson r = -0.03, p = 0.49): frontier agents appear to lack an effective stopping criterion.
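If you want to reproduce this kind of analysis from your own runs, the sketch below computes the pairwise Pearson correlations between per-task score vectors and compares the strongest and weakest pair with a Fisher z test. This is one plausible reading of the statistic quoted above (the paper's exact procedure may differ), and it treats the two correlations as independent, which is an approximation when all pairs share the same 80 tasks.

```python
# correlation_sketch.py -- approximate reproduction, not the paper's exact analysis code
from itertools import combinations

import numpy as np
from scipy import stats


def pairwise_pearson(scores: dict[str, np.ndarray]) -> dict[tuple[str, str], float]:
    """scores[model] is a length-80 vector of per-task scores (0-100)."""
    return {
        (a, b): stats.pearsonr(scores[a], scores[b])[0]
        for a, b in combinations(sorted(scores), 2)
    }


def fisher_z_gap_pvalue(r_high: float, r_low: float, n_tasks: int) -> float:
    """Two-sided p-value for the gap between two correlations via Fisher's z transform
    (independent-correlation approximation; both correlations use the same n_tasks)."""
    z_high, z_low = np.arctanh(r_high), np.arctanh(r_low)
    se = np.sqrt(2.0 / (n_tasks - 3))
    z = (z_high - z_low) / se
    return float(2 * stats.norm.sf(abs(z)))


# Example: p = fisher_z_gap_pvalue(0.73, 0.27, n_tasks=80)
```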
---

## Evaluating a new agent

Any agent that can be driven programmatically fits one of two slots:

- **`claude_code`**: expects `ANTHROPIC_API_KEY` / `ANTHROPIC_BASE_URL` (Anthropic-compatible chat completion endpoint).
- **`openclaw`**: expects `OPENCLAW_API_KEY` / `OPENCLAW_BASE_URL` / `OPENCLAW_MODEL` and speaks the OpenAI-compatible chat completion protocol. Any proxy that emits this shape (vLLM, SGLang, LiteLLM, ...) works as a drop-in; a quick endpoint smoke test is sketched below.

To add an entirely new backend, implement a new agent class under the agent registry in the harness and wire it up through `AGENT_TYPE` in `.env`.
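Before pointing a full batch run at a new endpoint, it can save time to confirm that it actually speaks the expected protocol. A minimal smoke test, assuming the official `openai` Python client and the variables from `.env` (adjust `base_url` if your proxy expects a version prefix such as `/v1`):

```python
# smoke_test_openclaw.py -- hedged sketch, not part of the harness
import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key=os.environ["OPENCLAW_API_KEY"],
    base_url=os.environ["OPENCLAW_BASE_URL"],
)
resp = client.chat.completions.create(
    model=os.environ["OPENCLAW_MODEL"],
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```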
---

## License

AcademiClaw uses a **two-tier licensing scheme**, following the paper's Licensing section:

1. **Our own contributions** (the rubric code, evaluation harness, Docker scaffolding, and the task prompts we ourselves authored) are released under the **Apache License 2.0** (see [`LICENSE`](LICENSE)).
2. **Third-party reference material bundled inside each task's `context/` directory** (e.g., course assignments originally drafted by their instructors, Olympiad problem sets, published research papers, vendor documentation) **retains the licenses or terms of use of its respective upstream sources**. The upstream source and license of each such item is documented in that task's `description.json` under its provenance field.

This two-tier pattern mirrors common practice in academic benchmarks that bundle third-party reference material for reproducibility.

**Handling of third-party material.** For materials we did not ourselves author, we restrict inclusion to items that are either (a) publicly disseminated with explicit redistribution permission, (b) released under a compatible open license, or (c) quoted under academic fair-use conventions. We cite the original source in the corresponding task's `description.json`. No private datasets or identifiable third-party personal data are included.

**Takedown policy.** Upon a takedown request from a copyright holder, we will redact the affected file from the public release and replace the task's reference material with a synthetic or openly-licensed equivalent. Please open a GitHub issue or contact the maintainers.

**When you redistribute content from this repo**, always review the licensing of any items you pull in from a task's `context/` directory: the Apache 2.0 grant in `LICENSE` applies **only** to our own contributions, not to bundled third-party material.

---

## Responsible use

AcademiClaw contains realistic prompts that reflect real undergraduate academic workflows. This includes tasks that touch on security analysis, privacy auditing, systems forensics, and other sensitive subject matter. These tasks are provided for **evaluation of model capability and safety, not for running against production systems or people**.

Contributors submitted their tasks voluntarily, with explicit consent for open-source release, and retained the right to withdraw prior to publication. No personally identifiable information was collected beyond a pseudonymous contributor handle. The dataset is offered under the assumption that users are building better AI agents, not circumventing safeguards.

---

## Citation

If you use AcademiClaw in your research, please cite:

```bibtex
@article{yu2026academiclaw,
  title   = {AcademiClaw: When Students Set Challenges for AI Agents},
  author  = {Yu, Junjie and Lu, Pengrui and Si, Weiye and others and Liu, Pengfei},
  journal = {arXiv preprint arXiv:2605.02661},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.02661}
}
```

See [`CITATION.cff`](CITATION.cff) for the full structured citation record (all authors and affiliations). The paper is also listed on [HuggingFace Papers](https://huggingface.co/papers/2605.02661).

---

## Contributing

See [`CONTRIBUTING.md`](CONTRIBUTING.md) for issue reports, rubric improvements, and proposals for new tasks.

---

## Acknowledgements

AcademiClaw is maintained by the GAIR-NLP group at Shanghai Jiao Tong University, in collaboration with SII. The task pool was contributed by undergraduate students of the _Large Language Model Technologies_ course (class **AI3625**) at SJTU; each task is rooted in a real academic workflow those students had attempted with mainstream AI agents. Many tasks derive from open-source course materials, benchmarks, and research projects; their original authors are credited in task-level `context/` files and in each task's `description.json`.