# Community Testing Guide

Help us stress-test the world's first fully autonomous research pipeline, across every domain.

Star the Repo · Quick Start · Feedback Template · Chinese Testing Guide · Japanese Testing Guide
---
## Welcome, Tester!
**AutoResearchClaw** is a fully autonomous academic paper generation pipeline. You give it a research idea; it handles everything else: literature search, experiment design, code generation, experiment execution, paper writing, peer review, and final delivery. **23 stages, zero human intervention.**
We're looking for testers from **all disciplines and backgrounds**: machine learning, NLP, computer vision, reinforcement learning, bioinformatics, physics, social sciences, and beyond. The more diverse the testing, the better the pipeline becomes.
**Your mission:** Run the pipeline with your own research idea, inspect the output, and submit a detailed feedback report. That's it. Every piece of feedback directly shapes the next version.
---
## Table of Contents
1. [Prerequisites](#prerequisites)
2. [Installation & Setup](#installation--setup)
3. [Running the Pipeline](#running-the-pipeline)
4. [Inspecting the Output](#inspecting-the-output)
5. [Feedback Report Requirements](#feedback-report-requirements)
6. [Feedback Template](#feedback-template)
7. [FAQ](#faq)
---
## Prerequisites
| Item | Minimum | Recommended |
|------|---------|-------------|
| OS | macOS / Linux / WSL2 | Linux (Ubuntu 22.04+) |
| Python | 3.11+ | 3.11 or 3.12 |
| Disk | 500 MB | 2 GB+ |
| RAM | 8 GB | 16 GB+ |
| GPU | Not required (sandbox mode) | NVIDIA GPU + CUDA 12.x (docker mode) |
| Network | Required (LLM API + literature search) | Stable connection |
| LLM API Key | **Required** | OpenAI or Anthropic |
### About API Keys
The pipeline calls a large language model (LLM) at every stage: writing, coding, reviewing, and more. You'll need an API key from **OpenAI** or **Anthropic**.
> **We strongly recommend using the most capable models available for the best results:**
>
> | Provider | Recommended Model | Fallback |
> |----------|------------------|----------|
> | **OpenAI** | **GPT-5.4** (best) | GPT-5.1 or GPT-4.1 |
> | **Anthropic** | **Claude Opus 4.6** (best) | Claude Sonnet 4.6 |
>
> Using a top-tier model significantly improves paper quality, code correctness, and experiment design. Older models (e.g., GPT-4o) may produce noticeably weaker output.
---
## Installation & Setup
### Always Use the Latest Version
> **This project is under active development.** The codebase is updated frequently, and different versions can produce significantly different results.
>
> **Before every test run, always pull the latest code:**
>
> ```bash
> cd AutoResearchClaw
> git pull origin main
> pip install -e . # Re-install to pick up changes
> ```
>
> Record your version for the feedback report:
> ```bash
> git log --oneline -1
> ```
---
### Option A: Claude Code (Fastest, Recommended)
If you have [Claude Code](https://claude.ai/claude-code) (Anthropic's CLI tool), just paste this:
```
Please clone and install AutoResearchClaw:
https://github.com/aiming-lab/AutoResearchClaw.git
If already cloned, run git pull origin main to update to the latest version first.
Then create a config file with:
- LLM: OpenAI with gpt-5.4 (or Anthropic Claude Opus 4.6)
- Experiment mode: sandbox (local execution)
- Research topic: ""
- Auto-approve all gate stages
My API key is: sk-xxxx (set it as an environment variable, don't hardcode it)
```
Claude Code will handle cloning, dependencies, configuration, and execution automatically.
### Option B: Manual Installation
```bash
# 1. Clone the repo
git clone https://github.com/aiming-lab/AutoResearchClaw.git
cd AutoResearchClaw
# 2. Create a virtual environment
python3 -m venv .venv
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate # Windows (prefer WSL2)
# 3. Install
pip install -e .
# 4. Verify
researchclaw --help
```
### Configuration
```bash
cp config.researchclaw.example.yaml config.yaml
```
Edit `config.yaml`; the key fields are:
```yaml
# === Project ===
project:
  name: "my-test"
  mode: "full-auto"

# === Research Topic: describe your idea in English ===
research:
  topic: "Your research idea in 1-2 sentences"
  domains:
    - "machine-learning"       # Options: nlp, cv, rl, graph-learning, etc.

# === LLM: use the strongest model you have access to! ===
#
# Option 1: OpenAI (GPT-5.4 recommended)
llm:
  provider: "openai-compatible"
  base_url: "https://api.openai.com/v1"
  api_key_env: "OPENAI_API_KEY"
  primary_model: "gpt-5.4"     # Best available
  fallback_models:
    - "gpt-5.1"
    - "gpt-4.1"

# Option 2: Anthropic Claude (Claude Opus 4.6 recommended)
# llm:
#   provider: "openai-compatible"
#   base_url: "https://api.anthropic.com/v1"
#   api_key_env: "ANTHROPIC_API_KEY"
#   primary_model: "claude-opus-4-6"
#   fallback_models:
#     - "claude-sonnet-4-6"

# === Experiment ===
experiment:
  mode: "sandbox"              # sandbox = local execution (recommended)
  time_budget_sec: 600         # Max seconds per experiment run
  max_iterations: 10
  metric_key: "primary_metric"
  metric_direction: "minimize" # or "maximize"
```
### Set Your API Key
```bash
# OpenAI users:
export OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxx"
# Anthropic users:
export ANTHROPIC_API_KEY="sk-ant-xxxxxxxxxxxxxxxxxxxxxxxx"
# Optional: Semantic Scholar API key (speeds up literature search)
export S2_API_KEY="your-s2-key"
```
> **Security:** Never hardcode API keys in files. Use `api_key_env` in the config to reference an environment variable.
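The `api_key_env` indirection can be illustrated with a short sketch. Note that `resolve_api_key` is a hypothetical helper for illustration, not part of the pipeline's actual code:

```python
import os

def resolve_api_key(api_key_env: str) -> str:
    # Hypothetical helper illustrating the `api_key_env` indirection:
    # the config names an environment variable, and the key itself is
    # looked up at runtime so it never appears in any file.
    key = os.environ.get(api_key_env)
    if not key:
        raise RuntimeError(
            f"Environment variable {api_key_env} is not set; "
            "export it before running the pipeline."
        )
    return key
```

Because the config only ever names the variable (e.g. `OPENAI_API_KEY`), the secret itself never touches version control or your feedback attachments.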
---
## Running the Pipeline
### Quick Start
```bash
source .venv/bin/activate
export OPENAI_API_KEY="sk-xxxx" # or ANTHROPIC_API_KEY
researchclaw run --config config.yaml --auto-approve
```
### With a Specific Topic
```bash
researchclaw run \
--config config.yaml \
--topic "Investigating the effect of curriculum learning on image classification with adaptive difficulty scheduling" \
--auto-approve
```
### Expected Runtime
| Mode | Estimated Time | Notes |
|------|---------------|-------|
| sandbox | 30 min to 2 hours | Depends on experiment complexity & API speed |
| docker (GPU) | 1 to 4 hours | For heavier deep learning experiments |
The terminal shows real-time progress. **No manual intervention needed**; sit back and let it run.
### How to Know It's Done
You'll see output like:
```
[Stage 23/23] Deliverables packaged
Pipeline complete. Deliverables at: artifacts/rc-20260315-XXXXXX-YYYY/deliverables/
```
### If It Gets Interrupted
The pipeline supports checkpointing; just resume:
```bash
researchclaw run --config config.yaml --resume
```
---
## Inspecting the Output
After completion, find your results in `artifacts/rc-YYYYMMDD-HHMMSS-/deliverables/`.
### Deliverables
| File / Directory | Description |
|-----------------|-------------|
| `paper_final.md` | Final paper in Markdown (5,000–6,500 words) |
| `paper.tex` | Conference-ready LaTeX source (directly compilable) |
| `references.bib` | BibTeX bibliography (verified citations) |
| `code/main.py` | Auto-generated experiment code |
| `code/requirements.txt` | Python dependencies for experiments |
| `charts/` | Result visualization charts (PNG) |
| `verification_report.json` | Citation integrity verification report |
| `manifest.json` | Deliverable manifest with metadata |
### What to Check
1. **Paper Content** (`paper_final.md` or `paper.tex`)
- Is the title relevant to the topic?
- Does the abstract clearly state problem, method, and results?
- Does Related Work cite key papers in the field?
- Is the method description technically correct?
- Is the experiment design sound (datasets, baselines, metrics)?
- Are results meaningful (not all zeros, not NaN)?
- Are conclusions consistent with experimental findings?
2. **Experiment Code** (`code/main.py`)
- Can it run independently?
- Does it use real datasets (not randomly generated fake data)?
- Does it implement what the paper describes?
- Are hyperparameters reasonable?
3. **Charts** (`charts/`)
- Are they readable and clean?
- Are axis labels correct?
- Does the data match the paper's claims?
4. **References** (`references.bib`)
- Do the cited papers actually exist?
- Are citations relevant to the discussion?
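One practical way to spot-check references is to pull every `title` field out of `references.bib` and search for each title on Google Scholar or Semantic Scholar. A rough sketch; the simple regex assumes single-line titles without nested braces, which covers most auto-generated entries:

```python
import re

def bib_titles(bibtex: str) -> list[str]:
    # Extract the title field of each BibTeX entry for manual
    # spot-checking. Assumes one-line, non-nested titles.
    return re.findall(r'title\s*=\s*[{"](.+?)[}"]\s*,?', bibtex)
```

Run it over the file contents, then verify a sample of the returned titles actually exist as published papers.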
### Auto-Generated Quality Report
The pipeline produces a quality assessment at `stage-20/quality_report.json` containing:
- `score_1_to_10`: automated quality score
- `verdict`: accept / reject recommendation
- `strengths`: what went well
- `weaknesses`: identified issues
- `required_actions`: suggested improvements
Please reference this in your feedback, and add your own expert judgment.
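A quick way to condense the report for your feedback is a few lines of Python. This sketch assumes the field names listed above; adjust if your version of the report differs:

```python
import json
from pathlib import Path

def summarize_quality_report(path: str) -> str:
    # Condense stage-20/quality_report.json into a few lines suitable
    # for pasting into the feedback report.
    report = json.loads(Path(path).read_text())
    lines = [
        f"score:   {report.get('score_1_to_10', '?')}/10",
        f"verdict: {report.get('verdict', '?')}",
    ]
    for w in report.get("weaknesses", []):
        lines.append(f"weakness: {w}")
    return "\n".join(lines)
```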
---
## Feedback Report Requirements
**Your feedback is the single most important input for improving this project.** Please be thorough and honest; critical feedback is just as valuable as praise.
### What to Submit
| # | Item | Details |
|---|------|---------|
| F1 | **Feedback Report** (use template below) | Markdown format, named `feedback_.md` |
| F2 | **Full Output Directory** | Zip the entire `artifacts/rc-XXXXXX/` directory |
| F3 | **Config File** | Your `config.yaml` (**remove API keys first!**) |
| F4 | **Terminal Log** (optional but helpful) | Copy of the terminal output during the run |
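Before zipping and submitting, make sure no secret has leaked into `config.yaml` or your terminal log. A minimal redaction sketch, assuming OpenAI/Anthropic-style `sk-...` keys:

```python
import re

def redact_secrets(text: str) -> str:
    # Mask anything shaped like an OpenAI/Anthropic key (sk-...).
    # Run this over config.yaml and the terminal log before submitting.
    return re.sub(r"sk-[A-Za-z0-9_-]+", "sk-REDACTED", text)
```

If you use another provider or a proxy with a different key format, extend the pattern accordingly.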
### The Four Dimensions of Feedback
#### (a) Quality Assessment
From your domain expertise:
- If this were a paper in your field, what level would it reach? (top venue / mid-tier / workshop / unpublishable)
- How does the writing compare to papers you normally read?
- Is the method technically correct? Any obvious errors?
- Is the experiment design reasonable?
#### (b) Improvement Suggestions
- Which stage produced the weakest output? (literature search / experiment design / code generation / paper writing)
- Any obvious code errors or poor design choices?
- Specific suggestions for improving the paper structure or writing?
#### (c) Pipeline Design Assessment
- Are the 23 stages well-designed? Any redundant or missing steps?
- Is the iterative experiment refinement effective?
- Is the LLM guidance at each stage appropriate?
#### (d) Bug Reports
Please report any issues you find, as specifically as possible:
- **Writing bugs:** grammar errors, repeated paragraphs, contradictions, references to non-existent figures
- **Code bugs:** runtime errors, logic errors, data handling issues
- **Result bugs:** all-zero results, NaN values, unreasonable metrics
- **Pipeline bugs:** stages getting stuck, unexpected crashes, resource exhaustion
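For result bugs in particular, a quick sanity pass over the final metrics often catches broken runs. A sketch, assuming a flat name-to-value dict (e.g. parsed from the deliverables):

```python
import math

def suspicious_metrics(metrics: dict[str, float]) -> list[str]:
    # Flag values that usually indicate a broken run: NaN or inf
    # anywhere, or every metric exactly zero.
    flags = [k for k, v in metrics.items()
             if math.isnan(v) or math.isinf(v)]
    if metrics and all(v == 0 for v in metrics.values()):
        flags.append("ALL_ZERO")
    return flags
```

Anything this flags is worth a row in the Result Bugs table.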
---
## Feedback Template
Copy the template below, fill it out, and save as `feedback_.md`:
````markdown
# AutoResearchClaw: Test Feedback Report
## Basic Information
- **Tester Name:**
- **Domain / Field:** (e.g., Computer Vision / NLP / Reinforcement Learning / Bioinformatics / ...)
- **Test Date:**
- **Code Version:** (output of `git log --oneline -1`, e.g., `44151b1 fix: Phase 3 regression test findings`)
- **Research Topic (English):**
- **LLM Model Used:** (e.g., gpt-5.4 / gpt-5.1 / claude-opus-4-6 / claude-sonnet-4-6)
- **Experiment Mode:** (sandbox / docker)
- **Total Runtime:** (~X minutes)
- **Completed All 23 Stages?:** Yes / No (if No, which stage failed?)
---
## 1. Quality Assessment (Score: 1-10)
**My Score:** X / 10
### 1.1 Overall Paper Quality
- What level paper does this correspond to? (top venue / mid-tier / workshop / unpublishable)
- Reason for score:
### 1.2 Section-by-Section Assessment
| Section | Score (1-10) | Comments |
|---------|-------------|----------|
| Title | | |
| Abstract | | |
| Introduction | | |
| Related Work | | |
| Method | | |
| Experiment Design | | |
| Results & Analysis | | |
| Conclusion | | |
| References | | |
| Charts / Figures | | |
| Code Quality | | |
### 1.3 Comparison with Human-Written Papers
- Compared to papers you normally read/write, where are the gaps?
- Anything surprisingly good?
---
## 2. Improvement Suggestions
### 2.1 Top Issues (list 3-5, in priority order)
1.
2.
3.
### 2.2 Code Issues
- Can the code run independently?
- Does it use real datasets and baselines?
- Specific code issues (if any):
### 2.3 Writing Issues
- Is the paper structure reasonable?
- Is the technical description accurate?
- Specific writing issues (if any):
---
## 3. Pipeline Design Assessment
### 3.1 Pipeline Flow
- Is the 23-stage design reasonable?
- Any redundant or missing steps?
### 3.2 Experiment Execution
- Is the experiment design sound? (dataset choices, comparison methods, metrics)
- Is the iterative refinement effective?
### 3.3 LLM Usage
- How well did the LLM perform at each stage?
- Any obvious "hallucinations" or unreasonable outputs?
---
## 4. Bug Reports
### 4.1 Writing Bugs
| # | Location (section/paragraph) | Description | Severity (High/Med/Low) |
|---|------------------------------|-------------|------------------------|
| W1 | | | |
| W2 | | | |
### 4.2 Code Bugs
| # | File / Line | Description | Severity (High/Med/Low) |
|---|-------------|-------------|------------------------|
| C1 | | | |
| C2 | | | |
### 4.3 Result Bugs
| # | Description | Affected Metrics/Charts | Severity (High/Med/Low) |
|---|-------------|------------------------|------------------------|
| R1 | | | |
| R2 | | | |
### 4.4 Pipeline Bugs
| # | Stage | Description | Severity (High/Med/Low) |
|---|-------|-------------|------------------------|
| P1 | | | |
| P2 | | | |
---
## 5. Additional Comments
(Free-form: any observations, ideas, or suggestions you think would be valuable)
---
## Attachments Checklist
- [ ] Feedback report (`feedback_.md`)
- [ ] Full output directory (`artifacts/rc-XXXXXX.zip`)
- [ ] Config file (`config.yaml`, API keys removed)
- [ ] Terminal log (optional)
````
---
## FAQ
### Q1: Can I test without a GPU?
**Yes!** Use `experiment.mode: "sandbox"`; the pipeline runs experiments on your CPU. The experiments will be simpler, but still enough for a full end-to-end test.
### Q2: How much does an API call cost?
A full pipeline run costs roughly **$5–15** in API fees, depending on the model, number of revision iterations, and experiment complexity. Top-tier models (GPT-5.4, Claude Opus 4.6) cost a bit more but produce significantly better results.
### Q3: What if the pipeline crashes mid-run?
Resume from the checkpoint:
```bash
researchclaw run --config config.yaml --resume
```
### Q4: Can I use a non-English research topic?
We recommend describing your topic in **English**. The pipeline's prompts, literature search, and paper generation are all English-based. If your idea is originally in another language, please translate it first.
### Q5: What kind of research topic should I pick?
Choose a **specific research question in a field you know well**; that way you can meaningfully assess whether the output is technically correct. Tips:
- Pick topics with clear experimental validation (classification, regression, RL tasks, etc.)
- Avoid overly broad or abstract topics (e.g., "AGI", "general intelligence")
- Be specific: *"Investigating the effect of data augmentation strategies on few-shot learning for medical image classification"*
### Q6: How do I use Docker mode? (Advanced)
If you have an NVIDIA GPU with Docker + NVIDIA Container Toolkit:
```bash
# 1. Build the experiment image
docker build -t researchclaw/experiment:latest researchclaw/docker/
# 2. Update config.yaml:
# experiment:
# mode: "docker"
# docker:
# gpu_enabled: true
# memory_limit_mb: 8192
# network_policy: "setup_only" # recommended default
# 3. Run
researchclaw run --config config.yaml --auto-approve
```
Docker mode uses a three-phase execution model: pip install (network on) → setup.py (network on) → experiment (network off). The image includes pre-cached datasets (CIFAR-10/100, MNIST, FashionMNIST, STL-10, SVHN) so standard benchmarks work without network access.
### Q7: I tested before โ what should I do for a re-test?
**Always pull the latest code** before each test:
```bash
cd AutoResearchClaw
git pull origin main
pip install -e .
```
Then verify your version:
```bash
git log --oneline -1
```
Different versions can produce very different results. Always note the commit hash in your feedback report.
### Q8: Where do I submit my feedback?
Submit your feedback report and attachments through one of these channels:
- **GitHub Issues:** [Open an issue](https://github.com/aiming-lab/AutoResearchClaw/issues) with the label `feedback`
- **Pull Request:** Submit your `feedback_.md` to the `community-feedback/` directory
- **Email:** Contact the project maintainers (see repo for details)
---
## We Need Testers from Every Field
The pipeline has been tested primarily on ML topics so far. We especially welcome testers from:
- **Bioinformatics & Computational Biology**
- **Chemistry & Materials Science**
- **Statistics & Applied Mathematics**
- **Robotics & Control Systems**
- **NLP & Computational Linguistics**
- **Computer Vision & Graphics**
- **Reinforcement Learning & Game Theory**
- **Medical AI & Healthcare**
- **Graph Learning & Network Science**
- **Financial ML & Econometrics**
- **Remote Sensing & Geospatial AI**
...and any other field where computational experiments are involved!
---
## Thank You
Every piece of feedback, big or small, directly improves AutoResearchClaw. Thank you for being part of this journey.
If you find this project interesting, please give us a star on GitHub!