--- name: gaia-submission description: Walk through a complete GAIA benchmark→submit flow — from key resolution through HAL-compatible package generation argument-hint: "[level] [limit] [models]" allowed-tools: Bash mcp__claude-flow__memory_store mcp__claude-flow__memory_search mcp__claude-flow__memory_list mcp__claude-flow__hooks_post_task mcp__claude-flow__hooks_pre_task --- # GAIA Submission Skill Walk Claude Code through every step needed to go from a clean environment to a signed, HAL-compatible submission package ready to upload to the Princeton GAIA leaderboard. ## When to use When the user wants to: - Run a benchmark and submit results to the HAL leaderboard - Package an existing results file into a submission archive - Confirm their environment is ready for a benchmark run ## Prerequisites Before starting, confirm these are available: | Requirement | Check | |-------------|-------| | `ANTHROPIC_API_KEY` | `echo ${ANTHROPIC_API_KEY:0:8}…` (should show `sk-ant-…`) | | `HF_TOKEN` | `echo ${HF_TOKEN:0:5}…` (should show `hf_…`) | | Node.js 20+ | `node --version` | | CLI built | `node v3/@claude-flow/cli/bin/cli.js --version` | ## Phase 1 — Validate environment ```bash # Run all pre-flight checks /gaia validate ``` If any check fails, resolve it before continuing. ## Phase 2 — Estimate cost and confirm Ask the user for their configuration: - Level (default: 1) - Question limit (default: 53 for a quick run, 165 for the full L1 set) - Models (default: `claude-sonnet-4-6`) - Self-consistency voting (default: 1; use 3 for L2/L3) ```bash /gaia cost --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING ``` If projected cost > $5, show the estimate and ask: "This run will cost approximately $X. Proceed? (y/N)" ## Phase 3 — Run the benchmark ```bash /gaia run --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING ``` While running, progress is reported every 5 questions: ``` [12/53] 22.7% (5 passed of 22 scored) — est. remaining: $0.18 ``` Store the run summary in memory for history tracking: ```bash npx @claude-flow/cli@latest memory store \ --namespace gaia-runs \ --key "run-$(date +%Y%m%d-%H%M)" \ --value '{"level":$LEVEL,"model":"$MODEL","total":$TOTAL,"passed":$PASSED,"pass_rate":$RATE,"est_cost_usd":$COST}' ``` ## Phase 4 — Package for submission ```bash /gaia submit --results=~/.cache/ruflo/gaia/results-latest.json ``` This produces: ``` submission--/ ├── results.jsonl ← HAL-compatible, one JSON per line ├── trajectories.jsonl ← full agent traces ├── metadata.json ← harness info, model, tool catalogue ├── manifest.md.json ← Ed25519-signed witness └── README.md ← human summary + leaderboard comparison ``` ## Phase 5 — Compare and report ```bash /gaia leaderboard --level=$LEVEL /gaia history ``` Interpret the gap between ruflo's score and the leaderboard top-10. Identify the primary failure mode (tool gap, reasoning miss, extraction bug) using the `/gaia-debugging` skill if needed. ## Phase 6 — Persist learnings ```bash npx @claude-flow/cli@latest hooks post-task \ --task-id "gaia-submission-$(date +%Y%m%d)" \ --success true \ --train-neural true ``` Store any discovered patterns: ```bash npx @claude-flow/cli@latest memory store \ --namespace gaia-patterns \ --key "submission-notes-$(date +%Y%m%d)" \ --value "Level $LEVEL, $MODEL: $NOTES" ``` ## Extensibility note This skill is intentionally structured to be benchmark-agnostic. The phase headers (validate → estimate → run → package → compare → learn) apply to SWE-bench, WebArena, and HumanEval with only phase 3-4 details changing.