# Research Papers

This directory captures research ideas developed through analysis of real Claude Code usage logs from the devboy-tools production system. Each paper builds on empirical data collected via [devboy-tools-agent-usage](https://github.com/meteora-pro/devboy-tools-agent-usage).

## Papers

| # | Title | Status | Core Idea |
|---|-------|--------|-----------|
| [1](./paper-1-trimtree.md) | TrimTree: Priority-Driven Pagination for LLM Tool Responses | draft | Knapsack-based item selection within token budget; p₁ metric |
| [2](./paper-2-mckp-format-adaptive.md) | Format-Adaptive Tree Encoding via Multi-Choice Knapsack | draft | Per-subtree format selection (CSV / table / key:value) to minimize tokens |
| [3](./paper-3-context-enrichment.md) | Context Enrichment Hypothesis in Agentic Tool Pipelines | draft | Thin initial context → more follow-up enrichment calls; measurement & mitigation |
| [4](./paper-4-notebook-parquet.md) | Dataset-as-Context: Queryable Parquet Artifacts for LLM Agents | draft | Replace paginated tool calls with notebook + Parquet; LLM queries what it needs |

## Research Arc

Papers 1–3 address **push-based** optimization: the server decides what to send and how. Paper 4 shifts to **pull-based**: the LLM receives a queryable dataset and extracts exactly what it needs via generated code.

```
Paper 1: binary knapsack       → which items to include (fixed format)
Paper 2: multi-choice knapsack → which items + which format per subtree
Paper 3: enrichment detection  → measure & reduce reactive follow-up calls
Paper 4: pull-based query      → LLM writes queries against Parquet artifacts
```

## Shared Empirical Foundation

All papers are grounded in the same dataset: real Claude Code sessions from devboy GitLab MCP pipeline (523 sessions, 10,644 MCP tool responses analyzed).

Key baseline findings:

- `get_merge_request_diffs`: P90 = 35k chars ≈ 10k tokens; 28% exceed 8k token budget
- `get_epics`: P90 = 43k chars ≈ 12k tokens; 37% exceed 8k token budget
- After large responses: agents always produce text (next turn), never paginate
- Context enrichment: thin issues (< 200 chars/item) → 43% of turns add `get_issue` calls; rich issues (1.5k–4k chars/item) → only 2% do. Pearson r = −0.28

## Candidate Benchmarks

| Benchmark | Papers | Why |
|-----------|--------|-----|
| SWE-bench Verified (500 tasks) | 1, 3 | Ground truth which file/issue needed → measure p₁ |
| τ-bench (retail + airline) | 1, 3 | Pass/fail per task → measure E[tool_calls] |
| ACON eval suite | 2 | Direct baseline: also compresses agent observations |
| FRAMES (824 multi-hop) | 1 | Multi-hop → E[chunks] measurement |
| MCPAgentBench (180 tasks) | 2, 4 | MCP-native; Token Efficiency metric |
| CostBench (travel domain) | 1, 4 | Cost-optimality = min tool calls to goal |
| LLMLingua-2 eval suite | 2 | Standard token-savings measurement |
| ToolBench RapidAPI dataset | all | 16k real REST responses → unit test fixtures |
| Terminal-Bench / tbench.ai (89 tasks) | 1, 2 | Software eng + ML + sysadmin tasks; custom adapter → log tool calls for token savings |
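
### Sanity-checking the token budget figures

The baseline findings convert response sizes from characters to tokens (35k chars ≈ 10k tokens, 43k chars ≈ 12k tokens), which implies roughly 3.5 characters per token. The sketch below reproduces that arithmetic and the 8k budget check. The flat ratio is an assumption inferred from those two figures, not a tokenizer measurement, and the function names and sample sizes are hypothetical.

```python
# Assumption: a flat chars-per-token ratio inferred from the README's figures
# (35k chars ≈ 10k tokens, 43k chars ≈ 12k tokens). Real tokenizers vary by content.
CHARS_PER_TOKEN = 3.5
TOKEN_BUDGET = 8_000  # the 8k token budget quoted in the baseline findings


def estimate_tokens(char_count: int, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Rough token estimate from a character count."""
    return round(char_count / chars_per_token)


def exceeds_budget(char_count: int, budget: int = TOKEN_BUDGET) -> bool:
    """True when the estimated token count is strictly over the budget."""
    return estimate_tokens(char_count) > budget


def share_over_budget(char_counts: list[int], budget: int = TOKEN_BUDGET) -> float:
    """Fraction of responses whose estimated size exceeds the budget."""
    if not char_counts:
        return 0.0
    over = sum(1 for count in char_counts if exceeds_budget(count, budget))
    return over / len(char_counts)


if __name__ == "__main__":
    print(estimate_tokens(35_000))  # P90 of get_merge_request_diffs
    print(estimate_tokens(43_000))  # P90 of get_epics
    # Hypothetical response sizes in characters, for illustration only.
    print(share_over_budget([4_000, 12_000, 30_000, 45_000]))
```

Under this ratio the 8k budget corresponds to about 28k characters, so both P90 values sit above it. Swapping in a real tokenizer count for `estimate_tokens` would tighten the estimate without changing the rest of the check.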