# GDPVal Benchmark

Benchmark for evaluating OpenSpace on [GDPVal](https://huggingface.co/datasets/openai/gdpval) (220 occupational tasks across 44 occupations and 9 sectors). Measures token savings from skill accumulation by running each task twice:

- **Phase 1**: Cold start. Skills accumulate as tasks run sequentially.
- **Phase 2**: Warm start. Re-run all tasks with the full Phase 1 skill library.

Evaluation uses [ClawWork](https://github.com/HKUDS/ClawWork)'s LLM evaluator (same rubrics, same 0.6 payment cliff).

## Project Layout

```
parent/
├── OpenSpace/                        ← this repo
│   └── gdpval_bench/                 ← this directory
└── ClawWork/                         ← required
    ├── eval/meta_prompts/            ← evaluation rubrics
    └── livebench/data/agent_data/    ← ClawWork agent results (for leaderboard)
```

## Setup

```bash
pip install -e . && pip install datasets
git clone https://github.com/HKUDS/ClawWork.git ../ClawWork
pip install -r gdpval_bench/requirements-eval.txt
export OPENROUTER_API_KEY="sk-or-..."
export EVALUATION_API_KEY="sk-..."
```

## Run

```bash
python -u -m gdpval_bench.run_benchmark \
    --task-list gdpval_bench/tasks_50.json \
    --model openrouter/qwen/qwen3.5-plus-02-15 \
    --use-clawwork-productivity \
    --clawwork-root ../ClawWork \
    --resume
```

Key flags: `--phase1-only`, `--phase2-only`, `--no-eval`, `--concurrency N`, `--max-tasks N`, `--prefetch-only`, `--dry-run`.

## Included Data

```
skills/                    # evolved skills
.openspace/openspace.db    # skill & tool quality DB (auto-generated during evolution)
```

`skills/` contains the full skill library produced by evolution; each subdirectory holds a `SKILL.md`. `.openspace/openspace.db` tracks skill lineage, tool quality records, and execution analyses accumulated across benchmark runs.
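To check what the bundled skill library contains, a minimal sketch like the following lists every subdirectory of `skills/` that holds a `SKILL.md`. It relies only on the layout described above; the helper name `list_skills` and the default path are illustrative assumptions, not part of OpenSpace.

```python
from pathlib import Path


def list_skills(skills_root="skills"):
    """Return sorted names of subdirectories that contain a SKILL.md file.

    Hypothetical helper: assumes one subdirectory per skill, each
    holding a SKILL.md, as described in the Included Data section.
    """
    root = Path(skills_root)
    if not root.is_dir():
        # No skill library at this path: report an empty library.
        return []
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / "SKILL.md").is_file()
    )


if __name__ == "__main__":
    names = list_skills()
    print(f"{len(names)} skills found")
    for name in names:
        print(" -", name)
```

Run it from the directory that contains `skills/`, or pass a different root to `list_skills`.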
## Output

```
results//
├── phase1_results.jsonl     # per-task Phase 1
├── phase2_results.jsonl     # per-task Phase 2
├── comparison.jsonl         # Phase 1 vs Phase 2 deltas
├── summary.json             # aggregate statistics
├── skills_snapshot.json     # skills after Phase 1
├── config.json              # run config
├── workspace/               # agent working directories
└── recordings/              # execution trajectories
```

## Analyze

```bash
python -m gdpval_bench.calc_subset_performance
```

Produces a leaderboard (OpenSpace vs ClawWork agents), a head-to-head comparison, and a token savings breakdown.

## Task List

- **[`tasks_50.json`](tasks_50.json)**: 50 task IDs (deterministic subset of GDPVal-220).
- **[`tasks_50_full.jsonl`](tasks_50_full.jsonl)**: Full task data downloaded from [HuggingFace](https://huggingface.co/datasets/openai/gdpval). One JSON object per line with all original fields (`task_id`, `sector`, `occupation`, `prompt`, `reference_files`, `reference_file_urls`, `rubric_json`, etc.). Covers all 9 sectors and 44 occupations.
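
To check the sector and occupation coverage of the subset, the following sketch reads `tasks_50_full.jsonl` and counts tasks per field value. It uses only the `sector` and `occupation` fields named above; the helper names `load_tasks` and `count_by` are hypothetical, and the file path assumes the script is run from the repository root.

```python
import json
from collections import Counter
from pathlib import Path


def load_tasks(path):
    """Read a JSONL file: one JSON object per non-empty line."""
    tasks = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                tasks.append(json.loads(line))
    return tasks


def count_by(tasks, field):
    """Count tasks per value of `field` (e.g. "sector" or "occupation")."""
    return Counter(task.get(field, "<missing>") for task in tasks)


if __name__ == "__main__":
    # Assumed location: adjust if the file lives elsewhere.
    path = Path("gdpval_bench/tasks_50_full.jsonl")
    if path.exists():
        tasks = load_tasks(path)
        print(f"{len(tasks)} tasks")
        for sector, n in count_by(tasks, "sector").most_common():
            print(f"{n:3d}  {sector}")
```

Passing `"occupation"` instead of `"sector"` to `count_by` gives the per-occupation breakdown.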