---
name: gaia-architecture-comparison
description: Side-by-side comparison of ruflo vs HAL vs other GAIA harnesses — capability gaps, design decisions, and improvement roadmap
argument-hint: "[--focus=tools|routing|memory|cost]"
allowed-tools: Bash Read mcp__claude-flow__memory_search mcp__claude-flow__memory_store
---

# GAIA Architecture Comparison Skill

Compare ruflo's GAIA benchmark harness against the Princeton HAL reference implementation and other open-source harnesses to understand capability gaps and prioritize improvements.

## When to use

- Planning the next iteration of GAIA work
- Evaluating which architectural change has the highest pass-rate ROI
- Onboarding a new contributor to the benchmark codebase

## Architecture overview

### ruflo harness (current)

```
gaia-bench run
  └─ gaia-loader.ts   — HF dataset download + cache
  └─ gaia-agent.ts    — multi-turn Anthropic Messages loop
  └─ gaia-tools/      — web_search, file_read, web_browse, image_describe, python_exec
  └─ gaia-voting.ts   — Track A self-consistency (N attempts → majority vote)
  └─ gaia-hardness/   — Track Q difficulty predictor (ADR-136)
  └─ gaia-judge.ts    — two-stage LLM-as-judge scorer
```

### HAL reference (Princeton)

HAL uses a similar loop but with:

- OpenAI function calling as the tool interface
- BrowserBase / Playwright for real browser automation
- Code interpreter sandbox (Jupyter kernel)
- Larger token budget per turn (4096+)
- Full 300-question evaluation set

### Key differences

| Dimension | ruflo | HAL reference | Gap |
|-----------|-------|---------------|-----|
| Question count | 53 (partial L1) | 300 (full L1) | Use `--limit 165` for full L1 |
| Web search | DuckDuckGo / Google CSE | BrowserBase live | Add Playwright or Browserless |
| Code execution | python_exec stub | Real Jupyter kernel | Implement real sandbox |
| Image OCR | image_describe (Gemini) | GPT-4V / Gemini | Functionally equivalent |
| File handling | file_read | Full PDF/XLSX/ZIP parser | Expand file_read |
| Self-consistency | voting.ts (Track A) | Not in reference | ruflo advantage |
| Hardness routing | predictor.ts (Track Q) | Not in reference | ruflo advantage |
| Memory | AgentDB HNSW | None | ruflo advantage |
| Pass-rate L1 | ~20.8% (iter 23) | 74.6% (HAL Sonnet 4.5) | ~54 pp gap |

## Gap analysis

### Primary gaps (high impact)

1. **Real code execution**: many L2/L3 questions require running Python to compute a numerical answer. The current `python_exec` tool is a stub. Implementing a real sandbox (E2B, Pyodide, or subprocess) is the single highest-ROI change.

2. **Full question set**: running 53/300 L1 questions underestimates true pass-rate because the first 53 skew easier. Run `--limit 165` (full L1) for a comparable HAL score.

3. **Real browser**: `web_browse` currently fetches raw HTML. Replacing it with Playwright/Browserless for JavaScript-rendered pages would unlock many web navigation questions.

### Secondary gaps (medium impact)

4. **Structured file parsing**: PDF, XLSX, and ZIP attachments require dedicated parsers. `file_read` currently handles plain text and images only.

5. **Turn budget**: 12 turns may be insufficient for complex multi-step questions. HAL uses up to 20 turns for L3.

6. **System prompt tuning**: HAL's system prompt is more elaborate and explicitly instructs the model to use tools before answering.

### ruflo advantages

7. **Self-consistency voting** (Track A): running N attempts per question and taking the majority answer reduces variance on borderline questions. HAL does not implement this.

8. **Hardness routing** (Track Q): routing each question to an appropriate model and turn budget based on predicted difficulty. This reduces cost on easy questions while providing more resources for hard ones.

9. **AgentDB memory**: storing patterns across runs enables the agent to recall successful strategies for similar question types.
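
The self-consistency step in item 7 can be sketched as a majority vote over normalized answers. This is a minimal illustration and not the code in `gaia-voting.ts`: the names `normalizeAnswer`, `VoteResult`, and `majorityVote` are hypothetical, and both the normalization (trim, lowercase, collapse whitespace) and the tie-breaking rule (the first answer to reach the highest count wins) are assumptions.

```typescript
// Hypothetical sketch of Track A voting; the real gaia-voting.ts may
// normalize answers and break ties differently.

interface VoteResult {
  answer: string; // winning normalized answer
  votes: number;  // attempts that agreed with the winner
  total: number;  // attempts cast
}

// Assumed normalization: trim, lowercase, collapse internal whitespace.
function normalizeAnswer(raw: string): string {
  return raw.trim().toLowerCase().replace(/\s+/g, " ");
}

// Return the most frequent normalized answer, or null if there are no attempts.
// On a tie, the answer that reached the top count first is kept.
function majorityVote(attempts: string[]): VoteResult | null {
  if (attempts.length === 0) return null;

  // Null-prototype object so answers such as "constructor" count correctly.
  const counts: { [answer: string]: number } = Object.create(null);
  let best = "";
  let bestVotes = 0;

  for (const raw of attempts) {
    const answer = normalizeAnswer(raw);
    counts[answer] = (counts[answer] || 0) + 1;
    if (counts[answer] > bestVotes) {
      best = answer;
      bestVotes = counts[answer];
    }
  }
  return { answer: best, votes: bestVotes, total: attempts.length };
}

// Three attempts, two of which agree after normalization.
console.log(majorityVote(["Paris", " paris ", "Lyon"]));
```

The `votes / total` ratio could also serve as a rough agreement signal, for example to flag questions where no answer wins a clear majority.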

## Improvement roadmap

| Priority | Change | Expected Lift | Effort |
|----------|--------|---------------|--------|
| P0 | Real python_exec sandbox (E2B) | +15-25 pp | High |
| P0 | Full 165-Q L1 evaluation | Accurate baseline | Low |
| P1 | Playwright-based web_browse | +5-10 pp | Medium |
| P1 | PDF/XLSX file parser | +3-8 pp | Medium |
| P2 | Increase max-turns to 20 for L2/L3 | +2-5 pp | Low |
| P2 | System prompt tuning (iter 30 research) | +2-5 pp | Low |
| P3 | Google Grounding via Gemini (iter 32) | +3-7 pp | Medium |
| P3 | Multi-provider routing (Gemini Flash for cheap Q's) | Cost reduction | Medium |

## Loading context from past research

```bash
npx @claude-flow/cli@latest memory search \
  --namespace gaia-patterns \
  --query "architecture comparison HAL benchmark"
```

## Storing comparison findings

```bash
npx @claude-flow/cli@latest memory store \
  --namespace gaia-patterns \
  --key "architecture-comparison-$(date +%Y%m%d)" \
  --value "HAL gap: 54pp. Primary: python_exec stub. Secondary: browser, file parsing."
```
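
If the stored finding is assembled programmatically rather than typed by hand, the key and value might be derived as in the sketch below. It shows how the "54pp" figure follows from the two pass rates in the comparison table (74.6 minus 20.8, rounded). The helpers `formatDateKey` and `buildComparisonEntry` are hypothetical and are not part of ruflo or the claude-flow CLI. The date is formatted in UTC, whereas `date +%Y%m%d` uses the local time zone.

```typescript
// Hypothetical helpers for building the memory entry; not a claude-flow API.

interface ComparisonEntry {
  key: string;
  value: string;
}

// Format a date as YYYYMMDD. Uses UTC, unlike `date +%Y%m%d` (local time).
function formatDateKey(d: Date): string {
  const pad = (n: number): string => (n < 10 ? "0" : "") + n;
  return `${d.getUTCFullYear()}${pad(d.getUTCMonth() + 1)}${pad(d.getUTCDate())}`;
}

// Derive the gap in percentage points from the two pass rates and
// assemble the key/value pair in the same shape as the shell command above.
function buildComparisonEntry(
  rufloPassRate: number,
  halPassRate: number,
  date: Date
): ComparisonEntry {
  const gapPp = Math.round(halPassRate - rufloPassRate);
  return {
    key: `architecture-comparison-${formatDateKey(date)}`,
    value: `HAL gap: ${gapPp}pp. Primary: python_exec stub. Secondary: browser, file parsing.`,
  };
}

// Pass rates taken from the comparison table; the date is an arbitrary example.
const entry = buildComparisonEntry(20.8, 74.6, new Date(Date.UTC(2025, 0, 15)));
console.log(entry.key);
console.log(entry.value);
```

Deriving the gap from the measured pass rates keeps the stored note in step with the table when either number changes.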