# Paper 2: Format-Adaptive Tree Encoding via Multi-Choice Knapsack **Status:** draft **Target venue:** ACL 2026 / NAACL 2026 **Authors:** Andrei Mazniak --- ## Problem Paper 1 (TrimTree) decides *which* items to include. Paper 2 asks: once we decide to include a subtree, *how should it be encoded*? The same data has wildly different token costs depending on format: | Data shape | JSON | Markdown table | CSV | key:value | |------------|------|---------------|-----|-----------| | 20 issues × 5 fields | ~1800 tokens | ~400 tokens | ~280 tokens | — | | Flat config object | ~200 tokens | — | — | ~80 tokens | | Code diff | ~500 tokens | — | code-fence | ~500 tokens | Current approach (TOON) uses a custom format that LLMs don't natively know. The new approach uses standard formats (Markdown, CSV, key:value) embedded in a Markdown wrapper — formats the LLM understands without training. ## Core Idea Extend TrimTree's binary knapsack to a **Multi-Choice Knapsack Problem (MCKP)**: for each subtree, choose *one format from N options* or skip entirely. ``` For each subtree node i, options j ∈ {skip, kv, csv, table, json, prose}: cost(i, j) = tokens when encoded in format j value(i, j) = information value (same for all non-skip options) MCKP: max Σᵢ value(i, jᵢ) s.t. Σᵢ cost(i, jᵢ) ≤ budget, jᵢ ∈ options(i) ``` ## Structural Parser API responses (from DevBoy MCP) are already Markdown tables. We parse them into a typed tree: ``` pulldown-cmark (Markdown) → typed tree nodes: SectionNode (## heading) TableNode (| col | col |) → array of RowNodes ListNode (- item) → array of ItemNodes BlockquoteNode (> text) → hint/metadata node (zero cost, always included) CodeFenceNode (``` lang) → code/diff node serde_json (JSON) → JsonObjectNode / JsonArrayNode / JsonScalarNode ``` Both parsers produce the same `TreeNode` trait — one MCKP solver handles both. ## Format Selection Rules Data shape → eligible formats: | Node type | Eligible formats (ascending token cost) | |-----------|----------------------------------------| | Array of objects (issues, MRs) | CSV → Markdown table → JSON array | | Flat object (config, metadata) | key:value → JSON object | | Text field (description, body) | truncated string → full string | | Code / diff | code-fence (fixed, always preserve structure) | | Numeric / enum | inline → JSON | | Hint / metadata | blockquote (zero marginal cost) | ## In-Response Hints — Design Guide Response-level hints are short, structured markers embedded in tool output that (a) point to content elided for budget reasons, (b) reference earlier tool results identical to the current one, or (c) recap metadata the agent needs for correct follow-up. They are the primary mechanism by which the 4-layer pipeline reclaims tokens without losing information the agent can still retrieve. ### Relation to prior art | Work | Scope | Granularity | Dynamic (cross-turn)? | |------|-------|-------------|-----------------------| | MCP `ToolAnnotations` (2025-03) | tool definition metadata (readOnlyHint, destructiveHint, …) | per-tool | no — static | | Anthropic *Writing Tools for Agents* (2026) | 25k default budget, pagination / range / filtering, `ResponseFormat` enum | per-response | no — per-call | | ResourceLink *DualResponseToolResult* (arXiv 2510.05968) | preview + resource-link + `QueryMetadata` with `total_count` | per-response | no — per-call | | OpenAI function calling guidance | tool-definition token budget, <100 tools, strict schemas | per-tool | no — static | | **This work** | reference / overflow / format / metadata hints **across turns** | per-response, **session-scoped cache** | **yes** | The MCP Interest Group has an open agenda item on extending annotations to responses but no shipped specification; this paper's hint taxonomy is designed to be forward-compatible. ### Design principles 1. **Markdown blockquote carrier (`> [...]`)** — all in-response hints are single-line blockquotes. Most LLMs treat the `>` prefix as metadata and parse, but ignore, its payload when generating user-facing text. Blockquotes are zero-cost in the MCKP budget (see §Format Selection Rules). 2. **Terseness over prose** — target ≤ 15 tokens per hint (measured in `cl100k_base`). See §Hint Cost Model below. 3. **Semantic precision** — a hint must be falsifiable. `> [ref: tc_42]` claims byte-identity with `tc_42`; `> [near-ref: tc_42, status: pending→success]` claims identity except for the enumerated delta. 4. **Composable with MCP annotations** — response-level hints live in `content.text`, not in `ToolAnnotations`; the two carry orthogonal information and can coexist without protocol changes. 5. **Agent-interpretable without schema** — a new agent seeing `> [ref: tc_42]` for the first time can still retrieve `tc_42` from its context by text match. No client-side registry is required. ### Hint taxonomy Four hint families, emitted by different pipeline layers: | Family | Emitter | Purpose | Example | |--------|---------|---------|---------| | **Reference** | L0 dedup | Byte-identical content already in context | `> [ref: tc_42, byte-identical]` | | **Near-reference** | L0 dedup (Type-2) | Near-duplicate with small enumerated delta | `> [near-ref: tc_42, status: pending→success, duration: +22s]` | | **Overflow** | L2 MCKP / Paper 1 knapsack | Items elided for budget; how to retrieve | `> [+18 more omitted. call get_issues(page=2)]` | | **Format** | L1 / L2 encoders | Non-default encoding of structured payload | `> [encoded: csv from markdown-table | rows=18]` | | **Metadata** | pipeline preamble | Zero-cost context: query echo, total_count, timestamps | `> [query: "status=open" | total=247 | returned=20]` | A single response may carry multiple hints in separate blockquotes, but each hint is independent — clients can process or ignore them individually. ### Format specification All hints share the form ``` > [: ] ``` where `` is one of `ref` / `near-ref` / `+N more` / `encoded` / `query`, and `` is family-specific. The leading `> ` (greater-than, space) and trailing newline are required. #### Reference (L0 hit) ``` > [ref: , byte-identical] ``` - ``: 8-character hex prefix of SHA-256 of the referenced response content. Identifies the prior call without exposing its body. - `byte-identical`: signals to the agent that the content hash matched exactly — the agent should retrieve tokens from its in-context copy of that earlier tool response. Measured cost: **8 tokens** for a 16-char hash, **11 tokens** for an 8-char hash with extra metadata (`cl100k_base`). See §Hint Cost Model. #### Near-reference (L0 Type-2, future work) ``` > [near-ref: , ] ``` Where `` is `key: old_value→new_value` pairs separated by `, `. Intended for high-frequency, near-idempotent endpoints like pipeline polls where only `status` and `duration` change between calls. Measured cost: **13-18 tokens**. Applicable when the delta is ≤ 30 bytes and the full response would have been ≥ 500 bytes. #### Overflow (L2 knapsack / Paper 1) ``` > [+ more. call ] ``` - ``: count of elided items. - ``: exact tool call the agent can issue to retrieve them (e.g. `get_issues(page=2, status="open")`). The paired *header* preceding the blockquote carries total and slice: ``` ## Issues (5 of 23, priority-sorted) ``` Measured cost: **14-22 tokens** depending on signature length. #### Format (L1/L2) ``` > [encoded: | ] ``` Where `` ∈ {`csv_from_md`, `deep_mckp`, `kv`, `csv`, `toon`} and `` summarizes what the agent should expect (e.g. `rows=18, cols=6`). Rarely emitted — most L1/L2 encodings are self-describing. Emit when the format differs markedly from the endpoint's historical default. Measured cost: **12-15 tokens**. #### Metadata (preamble) ``` > [query: "" | total= | returned=] ``` Purely informational. Emitted at the top of responses where the agent's subsequent reasoning benefits from knowing scope (e.g. "got 20 of 247 results — I may need pagination"). Zero-cost policy: emit only if the saved tokens from *avoiding an unnecessary follow-up call* outweigh the hint cost. Empirically, worth emitting for any tool where `total_count > returned_count`. Cost: **10-18 tokens** depending on query length (see Paper 3 for follow-up call reduction measurements). ### Hint cost model Let `t(h)` be the tokenized length of hint `h` under `cl100k_base`. Measurements on our corpus (mean over 100 samples per family): | Hint family | Mean tokens | Minimum | Maximum | |-------------|-------------|---------|---------| | Reference (`> [ref: X, byte-identical]`) | **11** | 8 | 14 | | Near-reference (`> [near-ref: X, s: a→b]`) | 14 | 11 | 22 | | Overflow (`> [+N more. call F]`) | 16 | 12 | 24 | | Format (`> [encoded: F \| rows=N]`) | 13 | 10 | 18 | | Metadata (`> [query: Q \| total=N]`) | 14 | 10 | 20 | The **break-even** point — response size above which emitting a hint saves tokens — is approximately 4× the hint's own cost (because the alternative is a full response with the same information). | Hint family | Break-even response size | |-------------|--------------------------| | Reference | ≥ 44 tokens (~175 chars) | | Near-reference | ≥ 56 tokens (~220 chars) | | Overflow | ≥ 64 tokens (~255 chars) | | Format | ≥ 52 tokens (~210 chars) | Most tool responses exceed these thresholds by 10-100×, so hint emission is nearly always profitable when applicable. ### Semantic guarantees For each hint we state an agent-testable invariant: - **Reference**: if the agent retrieves `tc_X` from its context, the bytes it obtains are byte-identical to what would have been emitted in lieu of the hint. - **Near-reference**: the claimed delta is the *exact* diff; all other bytes match. - **Overflow**: reissuing the printed `` returns the elided items, modulo the data source's own consistency guarantees (no snapshot isolation promised). - **Format**: decoding `` with standard parsers yields the logical content the agent would have received from the raw endpoint. - **Metadata**: the numeric fields are accurate at the time of emission; no freshness guarantee across turns. The pipeline MUST NOT emit a hint whose invariant it cannot satisfy. On content mutation or compaction boundary, the relevant cache entries are invalidated (see §L0 Dedup) before any hint can reference them. ### Compatibility with MCP `ToolAnnotations` The two are strictly orthogonal: | Aspect | MCP `ToolAnnotations` | Paper 2 in-response hints | |--------|-----------------------|---------------------------| | Location | `tool.annotations` field in tool registration | `content.text` body of each result | | Granularity | tool-level, static | response-level, dynamic | | Purpose | safety/UX (auto-approval, confirmation) | token economy, pagination, history reference | | Schema | strongly typed (booleans + title) | free-form blockquote markup | | Trust | advisory; client decides | advisory; client decides | | Backward compat | shipping since 2025-03-26 | forward-compat with open SEP for response annotations | A server MAY emit Paper 2 hints regardless of whether it publishes `ToolAnnotations`; a client MAY parse hints regardless of whether it enforces annotations. Neither mechanism strictly depends on the other. ### Alignment with Anthropic & OpenAI guidance - Anthropic (*Writing Tools for Agents*, 2026): hint-based dedup operationalizes their "steer agents towards more token-efficient tool-use behaviors" guidance at the infrastructure layer — the agent needs no prompting to benefit. - Anthropic `ResponseFormat` enum (detailed/concise): our adaptive configuration (§Adaptive Configuration) provides a data-driven generalization — the right verbosity is learned per-endpoint from telemetry rather than chosen per-call by the agent. - OpenAI *Function Calling* guidance: focuses on tool-definition tokens; complementary with our work on tool-response tokens. - ResourceLink `DualResponseToolResult` (arXiv 2510.05968): overlap on overflow handling. Our Overflow hint format maps directly onto their `total_count` + `returned_count` semantics; a server can emit both for redundancy. ### Anti-patterns Examples of hints the pipeline must not emit: | Anti-pattern | Why | |--------------|-----| | `> [ref: tc_42, last modified 3 min ago]` | Freshness claim without compaction-boundary guarantee | | `> [ref: tc_42]` without the earlier call still in context | Dangling reference; breaks retrievability invariant | | `> [encoded: custom_json_v3]` | Uses a format_id no public parser recognizes | | `> [skip: too big, sorry]` | Unactionable; no recovery path for the agent | | Multi-line blockquote fences | Parser fragility across clients; stay single-line | ### Legacy example from Paper 2 v1 The original overflow example remains valid under the new taxonomy: ```markdown ## Issues (5 of 23, priority-sorted) > [query: "status=open" | total=23 | returned=5] | #524 | Fix login bug | open | @alex | | #531 | Upgrade deps | done | @mika | > [+18 more. call get_issues(page=2)] ``` Three hints in one response: 1. Metadata preamble — query echo + counts. 2. The MCKP-selected `markdown_table` body (not itself a hint). 3. Overflow footer — signal to the agent about continuation. ## Structural Markdown Parser Implementation ```rust // crates/devboy-mcp/src/pipeline/md_tree.rs pub trait TreeNode { fn token_cost(&self, format: Format) -> usize; fn eligible_formats(&self) -> &[Format]; fn encode(&self, format: Format) -> String; fn value(&self) -> f64; } pub enum Format { Skip, KeyValue, Csv, MarkdownTable, Json, Prose, CodeFence } pub struct TableNode { pub headers: Vec, pub rows: Vec> } pub struct SectionNode { pub title: String, pub children: Vec> } // ... ``` ## Experiments 1. **Token savings by data shape** — measure tokens(format_j) / tokens(json_baseline) per node type across ToolBench RapidAPI dataset (16k real responses). Expected: array-of-objects → 70–85% savings with CSV vs JSON. 2. **LLM task accuracy** — does format choice affect LLM accuracy on SWE-bench? Run with: JSON baseline / TOON / MCKP-selected format. Expected: MCKP ≈ JSON accuracy with 60–75% fewer tokens. 3. **Hint effectiveness** — do hints reduce enrichment follow-up calls? Compare E[enrichment_calls] with vs without embedded hints on τ-bench. Connects to Paper 3 empirical baseline. 4. **MCKP vs binary knapsack** — does per-subtree format selection improve p₁ at the same budget vs Paper 1's binary approach? ## Baselines - Raw JSON (no optimization) - TOON format (custom, 3 levels: Full/Standard/Minimal) - LLMLingua-2 (token-level compression, agnostic to structure) - ACON (environment observation compression) ## Key Claims 1. MCKP format selection achieves 60–75% token reduction vs JSON with < 3% accuracy loss 2. CSV is optimal for array-of-objects; key:value for flat objects (token savings > 75%) 3. Embedded LLM hints reduce enrichment follow-up calls by ≥ 25% vs no hints ## Real-World Applicability (Claude Code JSONL Corpus) To ground the MCKP framework in real usage, we analyzed **144,001 tool-call responses** extracted from anonymized Claude Code session logs. Each response was classified by structural shape, scanned for embedded inner formats (diff / log / md / xml / yaml / code), and evaluated against a recursive deep-MCKP cost model. **Extraction pipeline**: `docs/research/scripts/extract_paper2_format_events.py` (anonymizing — emits only shape, size, format counts; never raw content). ### Shape distribution | Shape | Events | % | Total tokens | |-------|-------:|---:|-------------:| | prose (bash / grep / free text) | 76,720 | 53.3% | 35M | | numbered_list (file reads) | 47,822 | 33.2% | 59M | | nested_object | 6,880 | 4.8% | 7.5M | | code_block | 6,334 | 4.4% | 7.9M | | bullet_list | 4,909 | 3.4% | 2.6M | | markdown_table | 1,051 | 0.7% | 1.3M | | flat_object | 183 | 0.1% | 0.1M | | array_of_objects | 98 | 0.1% | 0.1M | Paper 2's target pool — shapes where format selection is meaningful — is 8,212 events (5.7% of corpus). The remaining 94% are text forms (file reads, bash output, prose) that cannot be re-encoded. ### Multi-format (matryoshka) prevalence Of the 8,212 structured responses, **83% of `nested_object` events contain formats at depth ≥ 3** (e.g. `nested_object > url + log + hash` inside string-valued fields). The canonical case: MCP `get_*_pipeline` responses with 13–21 URL fields, 2 log fields, and a commit hash inline, all wrapped in a JSON top level. ### Top inner-format combinations | Container | Inner formats | Events | Avg savings (deep_mckp) | Ktok saved | |-----------|---------------|-------:|------------------------:|-----------:| | nested_object | **log + url + hash** (pipeline sig.) | 1,432 | 18.7% | 263 | | nested_object | log + url | 1,434 | 3.3% | 42 | | nested_object | diff (MR diffs wrapped in JSON) | 151 | 5.1% | 34 | | nested_object | prose + url + log (issue body) | 1,122 | 2.7% | 7 | ### Corpus-level impact - **Applicable events (≥ 20% savings): 1,797** (1.25% of corpus) - **Total tokens saved: 1.55M** (1.35% of baseline) - **Dominant wins**: - `csv_from_md` (markdown table re-encoding): 883 events, **1,120 ktok** (avg 69% savings). Primary beneficiaries: MCP `get_issues` endpoints (avg ~92% savings per call). - `deep_mckp` (recursive per-leaf format selection): 867 events, **426 ktok** (avg 26% savings). Primary beneficiaries: MCP `get_*_pipeline` endpoints. - **Savings bimodal**: two clear peaks — 20-30% bucket (pipelines, 930 events, 408 ktok) and 90%+ bucket (big tables, 411 events, 1,021 ktok). ### Rule set distilled from the corpus Full YAML at `docs/research/data/paper2_format_rules_v2.yaml`. Top 4 rules cover 95.8% of all savings: | Rule | Predicate | Action | Savings (median) | Support | Ktok | |------|-----------|--------|------------------:|--------:|-----:| | R1 | md_table + cols≥2 + chars>8k | csv_from_md | **99.6%** | 200 | **850** | | R8 | nested + log+url+hash | deep_mckp | **20.1%** | 1,432 | **369** | | R3 | md_table + cols 2–7 (small) | csv_from_md | 35.5% | 659 | 158 | | R2 | md_table + cols≥8 | csv_from_md | 97.3% | 177 | 108 | ### Implications for the MCKP algorithm 1. **Markdown-table → CSV is the primary optimization.** The canonical Paper 2 code path should be a streamlined md→CSV re-encoder with automatic emission for any response with `shape == markdown_table AND n_cols ≥ 2`. No multi-choice decision needed in this path — CSV always wins. 2. **Deep recursion is required for nested objects.** Flat top-level MCKP misses 83% of savings on `nested_object` because the benefit lives in string-valued leaves (logs, diffs, URLs). The solver must recurse into string leaves with a format sniffer before deciding emission. 3. **Pipeline responses deserve a template.** The `log+url+hash` trifecta is so consistent (1,432 events, 50% applicable) that a `pipeline_encoder` specialization saves more than a generic MCKP. 4. **Bash does not need Paper 2.** 0.4% of 59k Bash responses are applicable. Optimization effort here is misplaced — focus on MCP. ## Adaptive Configuration A single static configuration for the 4-layer pipeline leaves substantial savings on the table. Agent workloads differ on several axes: the ratio of file-search to MCP calls, endpoint fingerprint distributions, session length, and compaction frequency all vary within an order of magnitude across users even in the same team. This section specifies the metrics the pipeline should collect and the rules by which those metrics drive configuration choices. ### Why defaults are insufficient Three representative profiles from our corpus illustrate the gap: | Profile | Characteristics | Default-config savings | Tuned-config savings | Δ | |---------|------------------|-----------------------:|----------------------:|---:| | **A. File-search heavy** — 65% `Read` calls, duplicate rate 40% | mid-size (100–499 events) | 34% | 41% | +7 pp | | **B. MCP pipeline monitor** — 30% polling calls, deep-nested JSON | small (20–99 events) | 28% | 44% | +16 pp | | **C. Marathon refactor** — 2000+ events, 46 compactions | very long, mixed | 22% | 32% | +10 pp | Defaults were chosen to minimize worst-case regression, not to maximize per-profile savings. A per-profile tuner lifts average savings from 28% to **39%** on the same corpus — a relative improvement of +40%. ### Tuning knobs Twelve knobs span three layers. Each is keyed on telemetry measurable at per-response granularity: | Layer | Knob | Type | Default | |-------|------|------|---------| | L0 | `dedup.lru_size` | int ≥ 1 | 5 | | L0 | `dedup.enabled_per_endpoint` | bool map | all-true | | L0 | `dedup.hint_verbosity` | `terse`/`standard`/`verbose` | `standard` | | L0 | `dedup.near_ref_enabled` | bool | false | | L0 | `dedup.min_body_chars` | int | 200 | | L1 | `templates.active` | list[template_id] | built-in 3 | | L1 | `templates.endpoint_overrides` | map[endpoint → template_id] | empty | | L2 | `mckp.recursion_depth` | int | 5 | | L2 | `mckp.formats_enabled` | set of format_ids | all | | L2 | `mckp.shape_thresholds` | map[shape → min_match] | built-in | | Meta | `telemetry.sample_rate` | float 0.0-1.0 | 1.0 | | Meta | `telemetry.flush_every_n` | int | 25 | ### Metrics-to-knobs decision rules The tuner is rules-based (not ML) for explainability. Each rule reads aggregated telemetry and emits a config delta: ``` R1 per-endpoint dedup: for each endpoint E: if E.dup_rate ≥ 0.30 → dedup.enabled_per_endpoint[E] = true and suggest lru_size = min(10, 1 + round(E.repeat_burst_p90)) if E.dup_rate ≤ 0.05 → dedup.enabled_per_endpoint[E] = false R2 per-endpoint template: for each endpoint E: if E.shape_dominant == "markdown_table" and E.avg_cols ≥ 2: templates.endpoint_overrides[E] = "csv_from_md" elif {"log","url","hash"} ⊆ E.inner_formats_top3: templates.endpoint_overrides[E] = "pipeline_deep_mckp" elif "diff" in E.inner_formats_top3: templates.endpoint_overrides[E] = "mr_diff_fence" R3 LRU sizing: if session_profile.compaction_rate > 0.05 events⁻¹: dedup.lru_size = max(10, ceil(session_profile.avg_events_per_partition * 0.2)) elif session_profile.avg_events < 20: dedup.lru_size = 3 else: dedup.lru_size = 5 R4 MCKP recursion: if nested_object_depth_p95 < 3: mckp.recursion_depth = 3 else: mckp.recursion_depth = 5 R5 min_body_chars: dedup.min_body_chars = clamp(p25(response_chars), 100, 500) ``` Each rule is accompanied by a measured effect size on the corpus (reported in Implementation Status section once instrumented). ### Architecture ``` ┌────────────────────┐ ┌────────────────────┐ │ Pipeline (Rust) │──────►│ TelemetrySink │ │ L0/L1/L2/L3 │ │ (per-session JSONL)│ └─────────┬──────────┘ └──────────┬─────────┘ │ │ ▼ ▼ formatted response ~/.config/devboy/telemetry/ │ ┌──────────┴──────────┐ ▼ │ `devboy tune analyze` │ (weekly/manual) │ │ ▼ │ pipeline_config.toml ◄──────────┘ │ ▼ Pipeline::with_config() (next session picks up tuned knobs) ``` The tuner is **offline**: no runtime reconfiguration mid-session. This keeps the hot path free of control-plane logic and makes behavior reproducible for a given config file. ## Configuration Extensibility The tuner described above mines pipeline telemetry. That presupposes the pipeline is already running and emitting events. To bootstrap a brand-new user *and* to adapt to per-axis variation that the rule set in §Adaptive Configuration cannot express, schema v2 introduces four orthogonal **profile axes** plus a horizontal **hint policy**. ### Why four axes (and not more knobs on one config) The 2026-04-25 evaluation surfaced a class of measurements that no global knob can encode. For example: inline-JSON cells in a Markdown table cost +119% prompt tokens on `glm-5.1` (Anthropic-class tokenizer) but ±0% on local Ollama models (BPE). The encoder choice is not the problem — the *receiving model's tokenizer* is. Likewise, `schema_explainer` hints add +0 p.p. accuracy on capable cloud models but unlock the remaining 5 errors on `gpt-oss:20b`. Capturing this without an explosion of endpoint-specific exceptions requires factoring the configuration along the dimensions that actually vary. ### The four axes ``` ┌─ profiles ──────────────────────────────────────────────────────────┐ │ │ │ tokenizer {anthropic_class | openai_o200k | ollama_bpe | …} │ │ │ chars_per_token, inline_json_cost, toon_overhead │ │ ▼ │ │ llm model_id → tokenizer + context_window + style flags │ │ │ resolves "auto" against SessionContext.model_id │ │ ▼ │ │ agent {default | file_search_heavy | marathon_refactor} │ │ │ priority (latency/balanced/accuracy), recursion_depth, │ │ │ hint_aggressiveness, near_ref_enabled │ │ │ resolves "auto" by classifying session stats │ │ ▼ │ │ data endpoint_pattern → preferred_format + hint_set │ │ first match wins; placeholder variants auto-registered │ │ for unknown observed endpoints │ │ │ └─────────────────────────────────────────────────────────────────────┘ ``` Plus, at the same level as `[profiles]`, a **`[hints]`** policy with per-type rules (`enabled`, `max_per_session`, `applies_to_models`) gates every emit through one chokepoint. Every axis defaults to `active = "auto"` so a user with no explicit overrides still gets a coherent view resolved at session start. ### Resolution chain ``` SessionContext { model_id, stats { event_count, compaction_count, read_share } } │ ▼ profiles.llm.resolve(model_id) ──► LlmProfile (incl. tokenizer ref) │ profiles.tokenizer.get(llm.tokenizer) ──► TokenizerProfile (cost model) │ profiles.agent.resolve(stats) ──► AgentProfile (recursion_depth, …) │ hints ──► HintsConfig (allow(type, model)) │ ▼ EffectiveConfig — single value the LayeredPipeline reads on the hot path ``` The collapse is a pure function: same config + same context → same EffectiveConfig. There is no run-time learning inside the pipeline; the adaptation lives entirely in offline tuning. ### Built-in defaults (excerpt) | Axis | Variant | Where it wins | |---|---|---| | **tokenizer** | `anthropic_class` | `chars_per_token = 3.5`, `inline_json_cost = 2.2`, `toon_overhead = 1.13`. Chosen automatically when the LLM profile points to Sonnet, Opus, or GLM. | | **tokenizer** | `openai_o200k` | `chars_per_token = 4.0`, `toon_overhead = 0.60` (the only tokenizer where TOON's −40% claim holds). | | **tokenizer** | `ollama_bpe` | Used by every local Ollama LLM profile — inline-JSON cells are free here, so `mckp_v2` collapses to `mckp_v1` cost without the data loss. | | **llm** | `glm-5.1` | `tokenizer = anthropic_class`, `context_window = 128k`, `prefer_explicit_keys = true`, `max_inline_nested = 128`. | | **llm** | `claude-sonnet-4.6` | Same tokenizer family, `context_window = 200k`, tighter `max_inline_nested = 64` (cells are expensive on this tokenizer). | | **llm** | `gpt-oss:20b` / `gemma4:26b` | `tokenizer = ollama_bpe`, `context_window = 8 192`, `prefer_explicit_keys = false`, `max_inline_nested = 512`. | | **agent** | `default` | Balanced, `recursion_depth = 5`, `hint_aggressiveness = 0.5`. | | **agent** | `file_search_heavy` | Activates when ≤200 events and read-share ≥0.5. Drops `recursion_depth` to 3 and `hint_aggressiveness` to 0.3 — latency-prioritised. | | **agent** | `marathon_refactor` | Activates when ≥500 events and ≥3 compactions. Lifts `recursion_depth` to 7, enables `near_ref` hints. | | **data** | `gitlab_issues`, `github_pulls`, `k8s_logs`, `mr_diffs` | Pre-shipped endpoint-pattern → preferred-format + hint-set bindings. | | **hints** | `schema_explainer` | **`enabled = false`** — confirmed 0 p.p. lift in 2026-04-25 evaluation; documenting the negative result keeps future tuners from rediscovering it. | | **hints** | `inline_format_hint` | `applies_to_models = ["gpt-oss:20b", "gemma4:26b"]` — only fires for the local models that benefit. | ### CLI: `devboy tune from-claude-logs` When a user has no pipeline telemetry yet (a fresh install, or a new project), the agent already collected most of the necessary signal in its session log. The new subcommand `devboy tune from-claude-logs`: 1. Walks `~/.claude/projects//*.jsonl` (or any `--input-dir`). 2. Counts model ids, tool invocations (read-class vs other), `mcp__*` endpoint hits, sessions, and `/compact` events. 3. Applies three rules — **P1** dominant-model pin (≥80% share), **P2** agent classifier from session stats, **P3** placeholder data-profile registration for every observed `mcp__*` endpoint. 4. Writes (or, with `--dry-run`, prints) the proposed `pipeline_config.toml`. This complements the existing `devboy tune analyze` (which mines emitted telemetry events) and lets a fresh install converge to a near-tuned configuration on first run. ### Skill: `pipeline-tune` `crates/devboy-skills/skills/00-self-bootstrap/pipeline-tune/SKILL.md` is a procedural recipe for an agent driving the CLI on a user's behalf: 1. Sanity-check that `~/.claude/projects` exists and has data. 2. Run with `--dry-run` first; surface the proposed pins to the user. 3. Apply on confirmation; verify with `devboy tune show`. 4. Per-tool refinement: ask the user what shape each `mcp__*` endpoint returns, fill in `preferred_format`. 5. Watch for accuracy regressions in subsequent sessions. The skill is opinionated about anti-patterns: don't enable `schema_explainer`, don't pin a model whose tokenizer profile is wrong, don't run across mixed-project log directories. ### Schema v1 → v2 migration `AdaptiveConfig::load` accepts both schemas. When loading a v1 file the deserializer uses `serde(default)` to populate the new `[profiles]` and `[hints]` sections from the v2 defaults, then bumps `schema_version` in memory. Saving back writes a v2 file. The migration is one-way (v2 cannot be read by v1 binaries) but the on-disk v1 file is left untouched until the user explicitly saves. A future schema version is rejected at load time (`ConfigError::UnsupportedSchemaVersion`) — old binaries won't silently mis-parse a newer file. ## Telemetry & Observability ### Per-response event (atomic unit) Emitted once per tool-result. Schema (stable; additions only): | Field | Type | Description | |-------|------|-------------| | `session_hash` | string(16) | Partition key, sha-256 prefix of session UUID | | `tool_call_id_hash` | string(8) | Identity of this response (for hint refs) | | `tool_name_anon` | string | Anonymized tool name (MCP slugs hashed) | | `endpoint_class` | string | Inferred endpoint (Bash → `git_log` / `curl` / …) | | `response_chars` | int | Raw byte count | | `shape` | enum(9) | `prose` / `numbered_list` / `nested_object` / … | | `inner_formats` | set[string] | Embedded formats detected inside string leaves | | `content_sha_prefix` | bytes(16) | 128-bit content fingerprint (for dup detection) | | `file_path_hash` | string(8)? | For `Read`/`Edit` tools, invalidation key | | `is_dedup_hit` | bool | Was an L0 hint emitted? | | `layer_used` | enum(L0,L1,L2,L3) | Terminal layer in the pipeline decision | | `template_id` | string? | L1 template name if used | | `tokens_baseline` | int | chars / 4 or tiktoken measurement | | `tokens_final` | int | After pipeline encoding | | `context_partition` | int | Monotonic counter, increments on compaction | | `is_sidechain` | bool | Subagent vs main session | | `ts_ms` | int64 | Unix milliseconds | No raw text, no tool arguments, no user-facing strings. Field additions are backward-compatible; the sink writes JSONL and downstream analyzers use field-level projection. ### Per-session summary (rolled up on session close) | Field | Type | Description | |-------|------|-------------| | `session_hash` | string(16) | — | | `total_events` | int | — | | `dedup_hit_rate` | f32 | L0 hits / events | | `l1_hit_rate`, `l2_hit_rate` | f32 | template / generic-mckp fractions | | `avg_response_chars` | f32 | — | | `compaction_count` | int | `context_partition` max | | `endpoint_mix` | map[endpoint → count] | top-20 endpoints | | `total_baseline_tokens` | int | — | | `total_final_tokens` | int | — | | `savings_pct` | f32 | 1 − final/baseline | | `duration_sec` | f32 | — | | `ended_at_ms` | int64 | — | | `sample_rate_applied` | f32 | if session was down-sampled | ### Per-endpoint fingerprint (rolling 30-day window) Aggregated across sessions for each unique endpoint. Used by the tuner; also the unit a team may choose to share with the endpoint's upstream provider. | Field | Type | Description | |-------|------|-------------| | `endpoint_class` | string | — | | `call_count` | int | Within window | | `avg_response_chars` | f32 | — | | `shape_dominant` | string | Most common shape | | `shape_mix` | map[shape → frac] | Full distribution | | `dup_rate` | f32 | Responses that matched an earlier one in same partition | | `format_entropy` | f32 | Shannon entropy of shape mix; low = stable fingerprint | | `inner_formats_top3` | list[string] | Three most common inner formats | | `avg_cols`, `avg_rows` | f32 | When shape is `markdown_table` | | `key_stability_mean` | f32 | When shape is `array_of_objects` | | `first_seen_ms`, `last_seen_ms` | int64 | Window bounds | | `source_sessions` | int | k-anonymity indicator | ### Storage & aggregation format | Tier | Format | Path | Lifecycle | |------|--------|------|-----------| | Raw events | JSONL (append-only) | `~/.config/devboy/telemetry/events/.jsonl` | Kept 30 days, then rotated | | Session summaries | JSONL | `~/.config/devboy/telemetry/sessions.jsonl` | Kept 90 days | | Endpoint fingerprints | Parquet (rolling window) | `~/.config/devboy/telemetry/endpoints.parquet` | Overwritten on each tune pass | | Tuned config | TOML (human-readable) | `~/.config/devboy/pipeline_config.toml` | Version-controlled if desired | ### Privacy & scoping tiers | Tier | Data visibility | Required consent | |------|-----------------|------------------| | **User-local** (default) | Raw events, session summaries, endpoint fingerprints stay on the user's machine. Tuner reads them to produce `pipeline_config.toml`. | Implicit: writing to user's home directory | | **Team-shared** | *Endpoint fingerprints only* (no session-level, no raw events) are uploaded to a team bucket for joint tuning. Fingerprints are already aggregated; sessions are counted but not enumerated. k-anonymity threshold ≥ 5 before an endpoint is shared. | Explicit opt-in via `telemetry.share_endpoints_with_team = true` | | **Provider-shared** | For a specific MCP server, its own endpoints' fingerprints are uploaded to the server provider (e.g., to inform their server-side optimizations). No cross-provider mixing. | Explicit opt-in via `telemetry.share_with_provider[] = true` | No tier shares raw response content, tool arguments, or user prompts. The endpoint_class string, designed to be coarse (e.g., `git_log`, not the full command), is the finest granularity that leaves the user's machine. ## Deployment Patterns ### A. Single developer (default) The full loop runs on one machine with no network dependencies: ``` 1. Developer uses Claude Code; devboy-mcp middleware logs per-response events to ~/.config/devboy/telemetry/events/.jsonl. 2. At session close, a rollup writes the session summary. 3. Weekly (or manual), `devboy tune analyze --days 30` aggregates events and fingerprints, runs rules R1-R5, writes pipeline_config.toml. 4. Next session's Pipeline loads the updated config. ``` No data leaves the machine. The developer's own workload shapes the algorithm's behavior. ### B. Team — shared endpoint fingerprints A team sharing the same MCP deployments (e.g., one GitLab instance) can pool endpoint fingerprints without exposing individual session data: ``` 1. Each member runs User-local as in pattern A. 2. With `telemetry.share_endpoints_with_team = true` enabled, the tuner uploads each member's `endpoints.parquet` to a shared bucket (S3 / GCS / shared FS) after applying k-anonymity (≥ 5 sessions per endpoint before it's included). 3. A team aggregator merges member fingerprints, producing a `team_endpoints.parquet` weighted by session count. 4. Each member's tuner reads `team_endpoints.parquet` and their own `endpoints.parquet`, blending them (default: 70% team / 30% self) before running rules. 5. The resulting `pipeline_config.toml` is checked into the team's repo as `.devboy/pipeline_config.toml` and applied by all members. ``` Benefits: high-volume endpoints in the shared deployment get accurate statistics even for team members who rarely call them. Trade-off: onboarding a member pulls the team config, which may mask their personal workload idiosyncrasies for a week until their own data reweights the blend. ### C. MCP provider — publishing native fingerprints An MCP server operator who wants their tools to be used most efficiently can publish per-endpoint fingerprints as part of the server's metadata: ``` 1. The server instruments responses (or consumes anonymous client-side fingerprints via the Provider-shared opt-in). 2. The server exposes a standard endpoint, e.g., GET /mcp/fingerprints/v1 → list of { endpoint_class, shape_dominant, inner_formats_top3, dup_rate, ... }. 3. Clients' tuners fetch these fingerprints alongside their own telemetry and use them as strong priors (especially for endpoints the client has seen rarely). ``` Benefits: a new user of a server benefits from the aggregate experience of all users on the first session, not after a week of local telemetry. This pattern is forward-compatible with the open MCP SEP on response-level annotations — if standardized, endpoint fingerprints can become part of the protocol rather than a server-specific extension. ### Failure modes & mitigations | Risk | Mitigation | |------|------------| | Tuner produces a config that regresses savings vs default | Every tune pass writes a before/after simulation on the developer's local corpus; if savings drop, the new config is flagged and the old one is retained. | | Team config drifts from reality after deployment change | TTL on team fingerprints (default: 14 days); stale endpoints demoted. | | Small-corpus overfitting | Rules R1/R2 require a minimum `call_count ≥ 20` per endpoint before activating a non-default template. | | Provider publishes biased fingerprints | Client blends provider data at most 40% weight; local telemetry is always the majority vote. | | Sample-rate skew (if telemetry is sampled) | The tuner scales counts by `1/sample_rate_applied` stored per-session. | ## Implementation Status ### Rust crates — shipped in `crates/plugins/format-pipeline/` - [x] **L0 content-hash dedup** — `dedup.rs` (350 lines, 10 tests). SHA-256/128-bit fingerprint, LRU + partition-aware eviction, mutation-aware invalidation (`invalidate_file`), reference-hint renderer. - [x] **Telemetry schema & sinks** — `telemetry.rs` (420 lines, 7 tests). `PipelineEvent` + `SessionSummary` + `TelemetrySink` trait (`JsonlSink` / `NullSink` / `MemorySink`). Thread-safe append-only JSONL with forward-compatible schema. - [x] **AdaptiveConfig** — `adaptive_config.rs` (370 lines, 5 tests). TOML schema, 12 knobs across `dedup` / `templates` / `mckp` / `telemetry` / `endpoint_overrides`. Save/load with schema-version guard. - [x] **Shape classifier** — `shape.rs` (400 lines, 10 tests). Rust port of the Python extractor; 11 shapes + 14 inner-format detectors (url / log / hash / diff / md-table / stack-trace / code-fence / …). - [x] **L1 per-endpoint templates** — `templates.rs` (240 lines, 6 tests). `csv_from_md` (markdown → CSV), `pipeline_deep_mckp` (nested JSON compaction), `mr_diff_fence` (JSON `diffs[]` → markdown fences). Dispatcher: `apply_by_id`. - [x] **L2 generic MCKP router** — `mckp_router.rs` (260 lines, 7 tests). Shape-dispatched format selection with threshold gating (min columns / min items / key stability / min fields) and `json_compact` fallback. - [x] **LayeredPipeline orchestrator** — `layered_pipeline.rs` (420 lines, 9 tests). End-to-end L0 → L1 → L2 → L3 dispatch; `with_telemetry(sink)` emits one `PipelineEvent` per response; `on_compaction_boundary()` hooks into host's compact events. - [x] **Criterion benchmark** — `benches/dedup_bench.rs`. Measured: `content_hash` 138 µs @ 64 KiB / 451 MiB/s, `cache.check(miss,lru=5)` = 2.7 ns, end-to-end (hash + check + insert) = 9 µs @ 4 KiB payload. p99 < 100 µs for 32 KiB. - [x] **`devboy-tune` offline CLI** — `bin/tune.rs` (420 lines, 6 tests). `analyze` subcommand aggregates telemetry JSONL per endpoint, applies rules R1-R5, writes `pipeline_config.toml`. `show` pretty-prints a config. Total: **242 tests** (233 lib + 6 bin + 3 doctests) passing; `cargo clippy --all-targets --all-features` clean. ### Python research scripts — shipped in `docs/research/scripts/` - [x] **`extract_paper2_format_events.py`** — anonymizing extractor: scans `~/.claude/projects/*.jsonl`, emits `paper2_format_events.parquet` with shape, inner formats, content_sha256 prefix, file_path_hash, context_partition, endpoint_class. Never stores raw text. - [x] **`simulate_paper2_pipeline.py`** — replay simulator: processes events in order, applies the 4-layer pipeline with tunable LRU, validates end-to-end savings on real corpus. - [x] **Mined rules** — `docs/research/data/paper2_format_rules_v2.{yaml,csv}` (aggregates only). ### Validation snapshot on our corpus | Metric | Value | |--------|------:| | Events analyzed | 144,658 | | Sessions | 1,529 | | Corpus baseline tokens | 115.57 M | | Saved tokens | **40.46 M (35.01%)** | | L0 hit rate | 29.9% of events | | Hint real cost (cl100k_base) | 8–11 tokens | | Sessions with compactions | 9.4% | | Hash collisions (128-bit SHA-256 prefix) | 0 | | Pipeline-invariant violations | 0 | ### Deferred to follow-up work - [x] **MCP middleware integration** — `crates/devboy-mcp/src/layered.rs` (`SessionPipeline`) wraps `LayeredPipeline` per-session state and is consulted from `handle_tools_call` for every successful response. Mutating tools (`Edit` / `Write` / `MultiEdit` / `NotebookEdit`) fire `invalidate_file` before dispatch; the host advances the partition counter on `notifications/devboy/compact` or the `compact_pipeline_cache` internal tool. Wired in PR for issue #203 follow-up (2026-04-25). - [x] **Near-dup (Type-2) hints** — `crates/plugins/format-pipeline/src/near_ref.rs` plus `DedupCache::insert_with_body` / `find_near_ref`. Emits `> [near-ref: tc_X, status: a→b, …]` for pipeline-polling endpoints when `dedup.near_ref_enabled = true`. Eligibility: scalar-only diffs, matched key sets, ≤ 50 bytes of delta payload, ≥ 500 bytes body. - [x] **Format round-trip correctness tests** — `crates/plugins/format-pipeline/src/round_trip.rs` declares a `DataLoss` profile per encoder and asserts every input key reappears in the encoded output (textually). `mckp_v2` and `json_compact` are gated at `DataLoss::None`; naive `csv` / `csv_from_md` are documented `DataLoss::TopLevel` and therefore cannot be made the production default. - [ ] **LLM comprehension eval (5-model batch)** — partial: 109-question evaluation across {`glm-5.1`, `gpt-oss:20b`, `gemma4:26b`} shipped in §Encoder Bug Postmortem. Outstanding: same fixtures × {`claude-sonnet-4-6`, `claude-haiku-4-5`} on the GPU server. - [ ] **Team- and provider-shared fingerprints** (§Deployment Patterns B/C). Requires shared-bucket protocol and k-anonymity enforcement. - [ ] **TOON → MCKP default switch** — `OutputFormat::Mckp` exists and is wired into all typed-domain transforms; switching the default needs a major-version bump and a one-shot migration warning surfaced in `format_output`. ## Savings Accounting A single number — "we saved X% of tokens" — hides the source of the saving. The layered pipeline has two independent compression mechanisms, and the headline number must report both, separately. ### Two orthogonal sources ``` ┌─────────┐ L0 hits → emit ~10-token reference hint raw response ─►│ L0 │─────────────────────────────────────► output │ dedup │ (≈ 100% saved └────┬────┘ on this branch) │ miss ▼ ┌─────────┐ │ L1 / L2 │ encoder savings — depend on shape & format │ encoder │ (mckp_v2: −16% vs json_compact on our corpus) └────┬────┘ ▼ output ``` The two saving rates are **multiplicative**, not additive: the encoder only sees the fraction of events that did not hit L0. ``` combined_saving = dedup_share × 1.0 + fresh_share × encoder_saving_vs_json = p(L0 hit) × 1.0 + (1 − p(L0 hit)) × encoder_saving ``` ### What the corpus shows Numbers are from the 2026-04-25 Paper 2 evaluation (`o200k_base` tokenizer, 431 records × 109 questions × 3 LLMs). Dedup numbers are from the 144 658-event simulation in §"Validation snapshot on our corpus". | Encoder | Encoder vs `json_compact` | Encoder vs `toon` | Combined w/ L0 dedup (29.9% hit rate) | Accuracy on glm-5.1 | |---|---:|---:|---:|---:| | `json_pretty` | −50% | −20% | **−5%** | (n/a) | | `csv` (naive monolithic) | −4% | +18% | +27% | 66.7% (−31.5 p.p.) | | `markdown_table` (naive) | −23% | +2% | +14% | 66.7% (−31.5 p.p.) | | `toon` | **−26%** | 0% | +12% | 98.2% (=json) | | `mckp` (lossy v1) | +25% | +40% | +47% | 96.3% (−1.9 p.p.) | | `jton_human` (AlphaEvolve v8-v20) | +17% | +34% | +42% | 93.6% (−4.6 p.p.) | | `jton_machine` (AlphaEvolve v27-v31) | **+20%** | **+37%** | **+44%** | 94.5% (−3.7 p.p.) | | **`mckp_v2`** | **+16%** | **+34%** | **+41%** | **98.2% (=json)** | | `mckp_v2` (Rust port) | +17% | +34% | +42% | 97.2% (−1.0 p.p.) | **Key reads:** 1. **TOON is not a saver on this tokenizer.** The advertised −40% holds only on the specific tokenization profile used by `toon-format`'s defaults; on the `o200k_base` family that GPT/Anthropic models share, TOON costs ~26% **more** than `json_compact`. 2. **Naive CSV / Markdown look like they save tokens, but they lose data** — see §Encoder Bug Postmortem. The 4–23% headline saving on table A becomes a −30 p.p. accuracy regression on deep / nested shapes; the saving is bought with answers the model can no longer give. 3. **JTON saves slightly more tokens than `mckp_v2` but costs ~4 p.p. accuracy.** JTON is an LLM-evolved (AlphaEvolve, 32 generations) binary-packed compressor — base64-encoded `struct.pack` blobs inside a JSON wrapper. Modern LLMs *do* parse it surprisingly well (94–95% accuracy on glm-5.1, not the catastrophic <50% one might expect from binary blobs in a prompt), but the regression is measurable: −3.7 p.p. for `jton_machine`, −4.6 p.p. for `jton_human` vs the `json_compact` baseline (98.2%). Per-shape it collapses on `deep` (50% vs 100% for `mckp_v2`) where the binary compaction is most aggressive. 4. **L0 dedup is the dominant saver**, contributing 29.9% of combined savings on our corpus before the encoder runs at all. The encoder contributes the *remaining* fraction multiplicatively. 5. **`mckp_v2` (and its Rust port) is the only encoder that wins both axes**: +16% tokens vs `json_compact`, +34% vs `toon`, no accuracy loss on any of the three evaluated LLMs. The JTON head-to-head confirms this: +4 p.p. extra token saving from JTON costs ~4 p.p. accuracy — a worse trade than mckp_v2's structured-but-readable output. ### Reporting rule Every quoted savings number in this paper, in production telemetry, and in `devboy tune` output **must** distinguish: - `dedup_savings_pct` — fraction of total tokens saved by L0 hint emits; - `encoder_savings_pct` — fraction saved by L1 / L2 encoding (computed over the L0-miss branch only); - `combined_savings_pct` — the multiplicative composition above; - The **baseline** the percentages are taken against (`json_compact`, `json_pretty`, or `toon` — they differ by tens of percentage points). The `tune analyze` and `tune from-claude-logs` CLIs surface all three columns in `print_top_endpoints`. The Paper 2 reproducibility script `scripts/06_savings_breakdown.py` produces a per-format / per-shape / per-baseline matrix from `fixtures/encoded.parquet` and the live comprehension parquets. ## Encoder Bug Postmortem (2026-04-25) During end-to-end LLM-comprehension validation we discovered a class of silent data-loss bugs in the original Python reference encoder. The bug is structural — not a transcription mistake — and it has implications for any format-adaptive system that picks a single best encoding for a mixed-shape input. The Rust production encoder is partially affected; the relevant fix is described below. ### Symptom On `deep`-shape and `nested`-shape records, naive monolithic CSV / Markdown encoding scored 8.3% and 33.3% accuracy respectively, vs. 98.2% / 100% for `json_compact`. A schema-explainer hint prefix added 0 percentage points of recovery. The cost / accuracy frontier looked catastrophic for "compressed" formats. ### Root cause `_find_main_homogeneous_array(obj)` walked every subtree, picked the largest array of objects, and emitted **only that array** as a CSV / Markdown table. Two layers of data loss followed: 1. **Wrapping object's top-level fields are dropped.** If the input is `{"company":"Acme","year":2026,"employees":[...]}`, the encoder emitted a table of `employees` and silently discarded `company` and `year`. Any question whose gold answer references the wrapper ("What company is this?", "What year was this snapshot taken?") becomes unanswerable — not because the LLM is weak, but because the data is no longer in the prompt. 2. **Nested fields inside array elements are dropped.** Inside the chosen array, the encoder filtered to *primitive-only* columns. For `orders=[{id, total, customer:{...}, items:[...]}, ...]`, the table carried only `id` and `total`. Questions about `customer.email` or `items` returned `null`. The first defect cost roughly −30 p.p. accuracy in aggregate; the second cost a further ~−1.9 p.p. on the (smaller) population of nested-in-array questions. ### Fix — `mckp_v2` The corrected encoder applies the same per-subtree dispatch but with two guarantees: - **Top-level non-array fields are pre-rendered** as `key: value` lines (nested values become inline JSON) above the main table. - **The main table uses the union of keys across all elements**; nested-typed cells are emitted as inline JSON in their cell, never dropped. In Rust, `templates::deep_mckp_with_inner_table` implements this and is invoked from `mckp_router::try_deep_mckp` ahead of the compact-JSON fallback. It returns `None` when the input is not an object wrapping a homogeneous array, so the rest of the pipeline is unchanged. `mckp_router::try_array_csv` was already correct — it has used union of keys since PR #204 — so the bug was confined to the `Shape::NestedObject` branch and to the Python research encoder. ### Validation 109-question evaluation across {`glm-5.1`, `gpt-oss:20b`, `gemma4:26b`}: | Encoder | glm-5.1 | gpt-oss:20b | gemma4:26b | Tokens vs `json_compact` | |---|---:|---:|---:|---:| | `json_compact` baseline | 98.2% | 95.4% | 95.4% | — | | `csv` (naive monolithic) | 66.7% | 64.8% | 64.8% | −5% | | `markdown` (naive monolithic) | 66.7% | 64.8% | 64.8% | +23% | | `mckp_v1` (lossy reference) | 96.3% | 94.5% | 94.5% | −25% | | **`mckp_v2`** | **98.2%** | **95.4%** | **95.4%** | **−16%** | After the fix, MCKP per-subtree matches the JSON baseline accuracy on all three models while preserving a 16% token saving overall. Two flips were observed (both `glm-5.1` deep-shape questions whose gold answer referenced a previously-dropped nested field), zero regressions. ### Implications 1. **Any "format X reduces tokens by N%" claim must be paired with an accuracy measurement on the same prompts.** Naive monolithic CSV looks like a 5% token saving on this corpus and is in fact a −30 p.p. accuracy regression. 2. **Hint prefixes cannot recover dropped data.** The hint experiments (`chunk1` vs `hint1`) confirmed this empirically: a one-line schema explainer added 0 lift to CSV / Markdown accuracy because the information was no longer present in the prompt. 3. **Encoders should fail closed on data loss.** A useful regression test is: for each encoder, parse the encoded form back and compare the recovered key set to the input's key set. Any drop is a bug. 4. **Token-cost numbers are tokenizer-dependent and must be reported alongside the tokenizer.** Inline JSON cells cost +119% prompt tokens per question on `glm-5.1` (Anthropic-class) but ±0% on gpt-oss / gemma (BPE). Same encoder, same prompts. The Rust fix and a regression test (`deep_mckp_inner_table_*`) ship in `crates/plugins/format-pipeline/src/templates.rs`. ## Related Work - TOON (devboy internal, Paper 1 baseline) — custom 3-level format - LLMLingua (token dropping) — format-agnostic, loses structure - RECOMP (extractive compression) — sentence-level, not structure-aware - Markdown as LLM format: numerous prompting papers confirm LLMs prefer Markdown - Multi-choice Knapsack Problem: Pisinger 1995, Kellerer et al. 2004