# Paper 4: Dataset-as-Context — Queryable Parquet Artifacts for LLM Agents

**Status:** draft (early concept)  
**Target venue:** NeurIPS 2026 / ICML 2026  
**Authors:** Andrei Mazniak

---

## Problem

Papers 1–3 optimize **push-based** context delivery: the server selects and encodes what to
send. All three still produce a static snapshot — the LLM reads it passively.

But data has structure and volume that no static snapshot can efficiently represent. A project
with 1,247 issues cannot be summarized in 4k tokens without dropping rows or fields that a
later question may need.

The key insight: **LLM agents can write code**. Instead of giving the agent a pre-selected
subset of data, give it a queryable dataset and let it retrieve exactly what it needs.

## Core Idea

Replace paginated tool calls with a **notebook + Parquet artifact** pair:

```
API Response (1247 issues, full data)
    ↓ [devboy-tools serializer]
issues.parquet (binary, columnar, ~50KB vs ~5MB JSON)
    +
notebook_header.py (schema + 5 sample rows + helper functions)
    ↓ [LLM generates query]
df.filter(pl.col('priority') == 'high').select('id', 'title', 'due_date').head(20)
    ↓ [code executor]
20 rows × 3 columns = ~300 tokens (vs ~5MB full JSON response)
```

## Paradigm Shift

| Approach | Who decides what to fetch | Format | Computed fields |
|----------|--------------------------|--------|-----------------|
| Raw API (no opt.) | Server | JSON | No |
| TrimTree (Paper 1) | Server (knapsack) | Markdown/JSON | No |
| MCKP (Paper 2) | Server (knapsack + format) | Adaptive Markdown | No |
| **Notebook+Parquet (Paper 4)** | **LLM (code)** | Query result | **Yes** |

The LLM becomes an active participant in data retrieval, not a passive consumer.

## Token Economics

For 1,000 issues with 7 fields:

| What the LLM sees | Tokens |
|-------------------|--------|
| Full JSON | ~45,000 |
| TrimTree (top-20, Markdown table) | ~800 |
| Notebook header (schema + 5 rows) | ~200 |
| LLM query code | ~50 |
| Query result (20 rows × 3 cols) | ~300 |
| **Total (notebook approach)** | **~550** |

Token cost is proportional to what the LLM actually needs, not what exists.
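
The totals above can be checked with a few lines of arithmetic. The sketch below uses only the table's own estimates (none of them are measured values) and derives the ratios they imply. The `saving` helper and the assumption that the header stays in context across follow-up queries are illustrative, not part of the design.

```python
# Token estimates copied from the table above (estimates, not measurements).
FULL_JSON = 45_000
TRIMTREE = 800
NOTEBOOK_PARTS = {
    "notebook_header": 200,  # schema + 5 sample rows
    "query_code": 50,        # LLM-generated query
    "query_result": 300,     # 20 rows x 3 columns
}

notebook_total = sum(NOTEBOOK_PARTS.values())

# Assumption: the header is sent once and stays in context,
# so a follow-up query costs only its code plus its result.
per_extra_query = NOTEBOOK_PARTS["query_code"] + NOTEBOOK_PARTS["query_result"]


def saving(baseline: int, candidate: int) -> float:
    """Fraction of baseline tokens avoided by the candidate."""
    return 1 - candidate / baseline


print(f"notebook total:      {notebook_total} tokens")
print(f"per follow-up query: {per_extra_query} tokens")
print(f"saving vs full JSON: {saving(FULL_JSON, notebook_total):.1%}")
print(f"saving vs TrimTree:  {saving(TRIMTREE, notebook_total):.1%}")
```

With these numbers a single-query exchange comes to 550 tokens, and each follow-up query adds about 350 tokens if the header is assumed to remain in context.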

## Computed Fields

This is the capability that no static approach can match: the LLM computes fields on demand.

```python
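from datetime import datetime
import polars as pl

# Fields that don't exist in the raw API. df is the polars frame loaded by the notebook
# header; comments_count, linked_issues_count, and sprint are assumed to be in the dataset.
now = datetime.now()
df = df.with_columns(
    days_open=(pl.lit(now) - pl.col('created_at')).dt.total_days(),
    is_overdue=pl.col('due_date') < pl.lit(now),
    complexity=pl.col('comments_count') * pl.col('linked_issues_count'),
)

# Aggregations (current_sprint is a placeholder supplied by the caller):
df.group_by('assignee').agg(pl.col('story_points').sum())
df.filter(pl.col('sprint') == current_sprint)['status'].value_counts()

# Cross-dataset joins (replacing enrichment tool calls from Paper 3).
# issue_id is an assumed foreign-key column name in comments and linked_issues.
issues.join(comments, left_on='id', right_on='issue_id').join(
    linked_issues, left_on='id', right_on='issue_id'
)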
```

## Auto-Generated Notebook Header

devboy-tools (Rust) is intended to generate the notebook header automatically from any API response:

```python
# === Issues Dataset | project: devboy | branch: DEV-570 ===
# Generated: 2026-04-21 | Rows: 1247 | Source: get_issues(project_id=42)
from datetime import datetime
import polars as pl
df = pl.read_parquet('/tmp/ctx/issues_abc123.parquet')

# Schema:
# id: Int64 | title: Utf8 | status: Utf8 | priority: Utf8
# assignee: Utf8 | created_at: Datetime | due_date: Datetime | story_points: Float64

# Sample (5 rows):
# 524  Fix login bug      open  high   @alex  2026-03-01  2026-04-15  3.0
# 531  Upgrade deps       done  low    @mika  2026-01-10  2026-02-01  1.0

# Cross-references:
# issues.id ← mr_issues.issue_id → mrs.id   (join available)

# Helpers:
def high_priority(): return df.filter(pl.col('priority') == 'high')
def overdue(): return df.filter(pl.col('due_date') < pl.lit(datetime.now()))
```
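
To make the header layout concrete, here is a minimal sketch of how such a header could be rendered from a schema and a few sample rows. The generator described above is written in Rust; this Python stand-in is for illustration only, and `render_header` and all of its parameters are hypothetical names, not part of devboy-tools.

```python
from datetime import date


def render_header(title, source, parquet_path, schema, sample_rows, total_rows, generated):
    """Render a notebook header as Python source text.

    schema: dict of column name -> type name, in column order.
    sample_rows: list of tuples aligned with the schema's column order.
    """
    lines = [
        f"# === {title} ===",
        f"# Generated: {generated.isoformat()} | Rows: {total_rows} | Source: {source}",
        "import polars as pl",
        f"df = pl.read_parquet({parquet_path!r})",
        "",
        "# Schema:",
        "# " + " | ".join(f"{name}: {dtype}" for name, dtype in schema.items()),
        "",
        f"# Sample ({len(sample_rows)} rows):",
    ]
    # One comment line per sample row, values separated by two spaces.
    for row in sample_rows:
        lines.append("# " + "  ".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


header = render_header(
    title="Issues Dataset | project: devboy",
    source="get_issues(project_id=42)",
    parquet_path="/tmp/ctx/issues_abc123.parquet",
    schema={"id": "Int64", "title": "Utf8", "status": "Utf8", "priority": "Utf8"},
    sample_rows=[(524, "Fix login bug", "open", "high"), (531, "Upgrade deps", "done", "low")],
    total_rows=1247,
    generated=date(2026, 4, 21),
)
print(header)
```

Because the schema and samples are emitted as comments, the rendered text stays valid Python and can be prepended directly to the LLM-generated query before execution.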

## Stack

- **Rust serializer**: `arrow2` or `polars` crate — serialize API responses to Parquet
- **Python runtime**: polars (faster than pandas, Rust-native) or DuckDB (SQL dialect)
- **Notebook format**: `.py` script with structured header, or `.ipynb` for richer output
- **Code execution**: sandboxed Python interpreter or DuckDB wasm
- **Cross-dataset joins**: Arrow IPC for zero-copy sharing between datasets

## Relation to Previous Papers

| Paper | Connection |
|-------|-----------|
| Paper 1 | Notebook header is itself a TrimTree output (schema + sample) |
| Paper 2 | Column format selection (int vs string vs category) is MCKP at schema level |
| Paper 3 | JOIN queries replace enrichment tool calls — E[enrichment] → 0 when data is local |

The notebook approach is an orthogonal optimization layer: Papers 1–3 optimize a single
response; Paper 4 eliminates repeated responses by making data local to the agent.

## Experiments

1. **Token savings** — measure tokens(notebook_header + query + result) vs
   tokens(equivalent TrimTree response) across 200 SWE-bench Verified tasks.
   Expected: 60–80% additional savings on top of TrimTree.

2. **Computed field rate** — how often does the LLM generate derived fields?
   Measure on τ-bench: count computed fields per task vs baseline (no Parquet).
   Hypothesis: LLM computes ≥ 1 derived field in > 50% of tasks with Parquet.

3. **E[tool_calls] reduction** — compare total API tool calls per task:
   - Condition A: paginated tool calls (current devboy)
   - Condition B: Parquet artifact + notebook

   Metrics: mean tool calls per task, τ-bench pass rate, CostBench cost-optimality score.

4. **Query correctness** — does the LLM write syntactically and semantically correct queries?
   Measure query execution errors and hallucinated column names.
   Baseline: the same tasks without the notebook header, so the model relies only on its
   prior knowledge of the polars/pandas API.
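
One way the syntax-error and hallucinated-column counts in experiment 4 could be collected is a static check on each generated query before it runs. The sketch below is an assumption about how such a check might look, not the evaluation harness itself: `referenced_columns` and `check_query` are hypothetical helpers, and the check only sees column names passed as string literals to `pl.col(...)`.

```python
import ast


def referenced_columns(query_source: str) -> set[str]:
    """Collect string literals passed to `pl.col(...)` anywhere in the query."""
    found = set()
    for node in ast.walk(ast.parse(query_source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "col"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "pl"
        ):
            for arg in node.args:
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                    found.add(arg.value)
    return found


def check_query(query_source: str, schema_columns: set[str]) -> dict:
    """Classify a query as a syntax error, a schema violation, or ok."""
    try:
        used = referenced_columns(query_source)
    except SyntaxError as exc:
        return {"status": "syntax_error", "detail": str(exc), "hallucinated": []}
    hallucinated = sorted(used - schema_columns)
    status = "hallucinated_column" if hallucinated else "ok"
    return {"status": status, "detail": "", "hallucinated": hallucinated}


# Column names taken from the example schema in the notebook header.
SCHEMA = {"id", "title", "status", "priority", "assignee",
          "created_at", "due_date", "story_points"}

good = "df.filter(pl.col('priority') == 'high').select(pl.col('id'), pl.col('title'))"
bad = "df.filter(pl.col('severity') == 'high')"
print(check_query(good, SCHEMA)["status"])
print(check_query(bad, SCHEMA)["hallucinated"])
```

A static pass like this covers only the syntactic half of the measurement. Semantic correctness would still need the query result compared against a reference answer.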

## Key Claims

1. Notebook+Parquet reduces total tokens-per-task by 60–80% vs the push-based baselines of Papers 1–2 (TrimTree, MCKP) alone
2. LLM computes derived fields in > 50% of tasks when Parquet is available
3. E[enrichment_tool_calls] ≈ 0 with Parquet (replaces Paper 3's enrichment calls via JOIN)
4. Query correctness ≥ 85% on well-typed schemas with helper functions

## Risks and Open Questions

- **Code execution safety**: Parquet queries must run in a sandbox
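- **Schema hallucination**: the LLM may reference non-existent columns; the schema listing and helper functions in the header are intended to mitigate this
- **Query latency**: Parquet read plus query execution is expected to add ~50–200ms per query (estimate, not yet measured)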
- **Model capability**: smaller models (Haiku, 7B) may write incorrect queries — needs eval
- **Large Parquet files**: at what row count does Parquet overhead exceed benefit?
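
The execution-safety and latency risks can both be probed with a small runner that executes each query in a separate process under a timeout and records wall-clock time. This is a minimal sketch under stated assumptions: `run_query` is a hypothetical helper, and process isolation plus a timeout does not restrict file or network access, so it is a starting point for measurement, not a security sandbox.

```python
import subprocess
import sys
import time


def run_query(code: str, timeout_s: float = 5.0) -> dict:
    """Run `code` in a separate Python process; report status, output, and latency.

    Only process isolation and a timeout are applied here. File system and
    network access are NOT restricted, so this is not a sandbox on its own.
    """
    started = time.perf_counter()
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        status = "ok" if proc.returncode == 0 else "error"
        stdout, stderr = proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        status, stdout, stderr = "timeout", "", f"exceeded {timeout_s}s"
    latency_ms = (time.perf_counter() - started) * 1000
    return {"status": status, "stdout": stdout, "stderr": stderr, "latency_ms": latency_ms}


result = run_query("print(sum(range(10)))")
print(result["status"], result["stdout"].strip(), f"{result['latency_ms']:.0f} ms")
```

The recorded `latency_ms` includes interpreter start-up, so a long-lived executor process would be needed to isolate the Parquet read and query time on their own.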

## Implementation Status

- [ ] Rust Parquet serializer for MCP responses (`arrow2` or `polars`)
- [ ] Auto-generated notebook header template
- [ ] Sandboxed Python/DuckDB executor
- [ ] Cross-dataset join registration (issues × MRs × comments)
- [ ] Evaluation harness for token savings measurement

## Related Work

- Code Interpreter (OpenAI): LLM executes code, but on user-uploaded files, not API responses
- TabFact / TAPAS: table fact verification and table QA, with models reasoning directly over structured tables
- DFSQL / Text-to-SQL: generating SQL from natural language
- Gorilla (UC Berkeley): LLM API calls — related to function/tool calling accuracy
- DuckDB in-process analytics: closest practical component