# Prompt Caching and Cost Control

## Purpose

Prompt caching reduces repeated prefill work when multiple model calls share the same prefix. In long-running agents, this can materially reduce input-token cost and time-to-first-token latency.

Treat prompt caching as part of harness architecture, not as a provider afterthought. The context builder, tool registry, instruction manager, compactor, and telemetry layer all affect cache hit rate.

## Core rule: stable prefix, dynamic suffix

Most prompt-cache systems reward exact or near-exact prefix reuse. Design requests so stable content appears first and volatile content appears late.

Recommended ordering:

```text
1. Tool definitions, in deterministic order
2. Static system/developer instructions
3. Stable scoped instructions or skill index
4. Stable reference context likely to be reused
5. Prior conversation or typed event history, append-only where possible
6. Dynamic runtime environment
7. New user message or current task suffix
```

Dynamic values belong near the end:

```text
current date/time
request ID
session ID
working directory
cursor state
fresh search results
latest tool output
user's newest message
```

Do not put changing values at the start of the system prompt.

## Deterministic serialization

Cache stability depends on the byte-level or token-level request shape. Make serialization deterministic:

```text
stable tool order
stable JSON key order
stable schema formatting
stable instruction block order
stable skill listing order
stable whitespace where possible
versioned prompt bundles
versioned tool bundles
```

Avoid nondeterministic middleware that injects trace IDs, timestamps, randomized examples, or variable environment blocks into the stable prefix.

## Multi-turn behavior

Keep conversation and event history append-only until compaction is required.

Good shape:

```text
turn 1: stable_prefix + user_1
turn 2: stable_prefix + user_1 + assistant_1 + user_2
turn 3: stable_prefix + user_1 + assistant_1 + user_2 + assistant_2 + user_3
```

Bad shape:

```text
turn 2: rewritten summary of turn 1 + stable_prefix + user_2
turn 3: reordered tools + rewritten system prompt + user_3
```

Append-only history lets the provider reuse prior prefix work. Rewriting history every turn often destroys cache reuse.

## Compaction and caching

Compaction is often necessary, but it resets or changes the reusable prefix.

Use these rules:

```text
compact only when useful
make compaction boundaries explicit
make the summary itself stable after creation
do not rewrite the summary on every turn
preserve recent high-value messages exactly when possible
prune oversized tool outputs consistently rather than rewriting all history
store bulky artifacts externally and reference them
```

After one cold turn following compaction, the compacted summary can become part of the new stable prefix.

## Tools and schemas

Tool definitions are usually part of the reusable prefix. Tool churn can destroy cache hit rate.

Best practices:

```text
expose only relevant tools
sort tools deterministically
avoid dynamic text inside tool descriptions
version tool sets deliberately
separate stable tool guidance from dynamic tool availability notes
use deferred tool search for large tool inventories
keep structured output schemas stable
```

When a tool changes materially, record a prompt/tool bundle version so cache changes are explainable.

## Provider-specific implementation notes

### OpenAI

OpenAI prompt caching is automatic on supported API requests. Current OpenAI docs describe a minimum prompt length for caching, a `cached_tokens` usage field, and optional retention controls such as extended retention for supported models.

Implementation notes:

```text
log usage.prompt_tokens_details.cached_tokens
keep stable instructions and tools before volatile context
use provider-supported cache keys or retention parameters when appropriate
monitor cache hit rate, cost, and time-to-first-token
avoid overly narrow cache routing keys in low-traffic buckets
```

### Anthropic

Anthropic prompt caching commonly uses explicit cache-control markers or automatic caching, depending on the API path and model. Use provider documentation for the current exact syntax and TTL behavior.

Implementation notes:

```text
place cache markers after stable blocks, not before volatile blocks
respect provider limits on cache breakpoints
choose short or extended TTL based on expected inter-request gaps
monitor cache read and cache write token fields
```

### OpenAI-compatible and self-hosted APIs

OpenAI-compatible APIs vary widely. Some implement prefix caching, some only emulate OpenAI message shapes, and some expose backend-specific controls.

Implementation notes:

```text
test the exact provider and model
verify whether cached-token usage is reported
use tenant-safe cache isolation where supported
monitor backend prefix-cache hit-rate if self-hosted
keep request serialization stable even when cache support is uncertain
```

## Monitoring

Log cache diagnostics on every model call when available:

```json
{
  "request_id": "...",
  "session_id": "...",
  "provider": "openai|anthropic|openai-compatible",
  "model": "...",
  "prompt_bundle_version": "...",
  "tool_bundle_version": "...",
  "system_prompt_hash": "...",
  "tools_hash": "...",
  "input_tokens_new": 0,
  "cache_read_tokens": 0,
  "cache_write_tokens": 0,
  "cached_tokens": 0,
  "output_tokens": 0,
  "time_to_first_token_ms": 0,
  "total_latency_ms": 0,
  "estimated_cost": 0
}
```

Track:

```text
cache hit rate by session
cache hit rate by tenant or segment
unique system prompt hashes per day
unique tool bundle hashes per day
cost split: uncached input, cached input, output
latency split: prefill, time-to-first-token, generation
cache hit rate before and after compaction
```

Alert when a long-prefix agent unexpectedly reports zero cached tokens over many turns, or when stable prompt/tool hashes fragment unexpectedly.

## Cache-killing anti-patterns

Avoid:

```text
timestamp at the start of the system prompt
request ID in the stable prefix
randomized tool order
randomized JSON key order
injecting live environment state before static instructions
including per-user secrets in the prefix
rewriting conversation history every turn
re-summarizing the whole session every turn
changing schema formatting without versioning
putting volatile retrieval results before stable instructions
using overly granular cache keys with low request volume
failing to log cached-token fields
```

## Prompt-cache-aware context builder

A cache-aware context builder should produce two zones:

```text
stable_prefix:
  tool definitions
  static instructions
  scoped stable instructions
  stable skill index
  stable schemas and output contracts

volatile_suffix:
  current task
  dynamic runtime state
  latest observations
  new retrieved snippets
  approval request/response
```

This does not mean all stable content should always be included. Relevance still matters. The best request is both cache-friendly and context-efficient.

## Cost-control checklist

- Keep stable content before volatile content.
- Remove timestamps and request IDs from stable instructions.
- Sort tools and schemas deterministically.
- Log provider cache usage fields.
- Track system and tool hash fragmentation.
- Avoid compaction churn.
- Use long retention only when reuse justifies it.
- Prefer skill and tool progressive disclosure over loading huge inventories.
- Measure cost and latency before and after each prompt/tool bundle change.