---
name: context-engineering
description: Use when designing agent system prompts, optimizing RAG retrieval, or when context is too expensive or slow. Reduces tokens while maintaining quality through strategic positioning and attention-aware design.
context: fork
version: 1.0.0
author: OrchestKit AI Agent Hub
tags: [context, attention, optimization, llm, performance, 2026]
user-invocable: false
---

# Context Engineering

**The discipline of curating the smallest high-signal token set that achieves desired outcomes.**

## Overview

Context engineering goes beyond prompt engineering. While prompts focus on *what* you ask, context engineering focuses on *everything* the model sees—system instructions, tool definitions, documents, message history, and tool outputs.

**Key Insight:** Context windows are constrained not by raw token capacity but by attention mechanics. As context grows, models experience degradation.

## Overview

- Designing agent system prompts
- Optimizing RAG retrieval pipelines
- Managing long-running conversations
- Building multi-agent architectures
- Reducing token costs while maintaining quality

---

## The "Lost in the Middle" Phenomenon

Models pay unequal attention across the context window:

```
Attention
Strength   ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░████████
           ↑                                                      ↑
        START              MIDDLE (weakest attention)           END
```

**Practical Implications:**

| Position | Attention | Best For |
|----------|-----------|----------|
| START | High | System identity, critical instructions, constraints |
| MIDDLE | Low | Background context, optional details |
| END | High | Current task, recent messages, immediate query |

---

## The Five Context Layers

### 1. System Prompts (Identity Layer)

Establishes agent identity at the right "altitude":

```
TOO HIGH (vague):        "You are a helpful assistant"
TOO LOW (brittle):       "Always respond with exactly 3 bullet points..."
OPTIMAL (principled):    "You are a senior engineer who values clarity,
                          tests assumptions, and explains trade-offs"
```

**Best Practices:**
- Define role and expertise level
- State core principles (not rigid rules)
- Include what NOT to do (boundaries)
- Position at START of context

### 2. Tool Definitions (Capability Layer)

Tools steer behavior through descriptions:

```python
# ❌ BAD: Ambiguous - when would you use this?
@tool
def search(query: str) -> str:
    """Search for information."""
    pass

# ✅ GOOD: Clear trigger conditions
@tool
def search_documentation(query: str) -> str:
    """
    Search internal documentation for technical answers.

    USE WHEN:
    - User asks about internal APIs or services
    - Question requires company-specific knowledge
    - Public information is insufficient

    DO NOT USE WHEN:
    - Question is general programming knowledge
    - User explicitly wants external sources
    """
    pass
```

**Rule:** If a human cannot definitively say which tool to use, an agent cannot either.

### 3. Retrieved Documents (Knowledge Layer)

Just-in-time loading beats pre-loading:

```python
# ❌ BAD: Pre-load everything
context = load_all_documentation()  # 50k tokens!

# ✅ GOOD: Progressive disclosure
def build_context(query: str) -> str:
    # Stage 1: Lightweight retrieval (500 tokens)
    summaries = search_summaries(query, top_k=5)

    # Stage 2: Selective deep loading (only if needed)
    if needs_detail(summaries):
        full_docs = load_full_documents(summaries[:2])
        return summaries + full_docs

    return summaries
```

### 4. Message History (Memory Layer)

Treat as scratchpad, not permanent storage:

```python
# Implement sliding window with compression
MAX_MESSAGES = 20
COMPRESSION_TRIGGER = 0.7  # 70% of context budget

def manage_history(messages: list, budget: int) -> list:
    current_tokens = count_tokens(messages)

    if current_tokens > budget * COMPRESSION_TRIGGER:
        # Compress older messages, keep recent
        old = messages[:-5]
        recent = messages[-5:]

        summary = summarize(old)  # Anchored compression
        return [summary] + recent

    return messages
```

### 5. Tool Outputs (Observation Layer)

**Critical Finding:** Tool outputs can reach 83.9% of total context usage!

```python
# ❌ BAD: Return raw output
def search_web(query: str) -> str:
    results = web_search(query)
    return json.dumps(results)  # Could be 10k+ tokens!

# ✅ GOOD: Structured, bounded output
def search_web(query: str) -> str:
    results = web_search(query)

    # Extract only what's needed
    extracted = [
        {
            "title": r["title"],
            "snippet": r["snippet"][:200],  # Truncate
            "url": r["url"]
        }
        for r in results[:5]  # Limit count
    ]

    return json.dumps(extracted)  # ~500 tokens max
```

---

## The 95% Finding

Research shows what actually drives agent performance:

```
┌────────────────────────────────────────────────────────────────┐
│  TOKEN USAGE        ████████████████████████████████████  80%  │
│  TOOL CALLS         █████  10%                                 │
│  MODEL CHOICE       ██  5%                                     │
│  OTHER              ██  5%                                     │
└────────────────────────────────────────────────────────────────┘
```

**Key Insight:** Optimize context efficiency BEFORE switching models.

---

## Context Budget Management

### Token Budget Calculator

```python
def calculate_budget(model: str, task_type: str) -> dict:
    """Calculate optimal token allocation."""

    MAX_CONTEXT = {
        "gpt-4o": 128_000,
        "claude-3": 200_000,
        "llama-3": 128_000,
    }

    # Reserve 20% for response generation
    available = MAX_CONTEXT[model] * 0.8

    # Allocation by task type
    ALLOCATIONS = {
        "chat": {
            "system": 0.05,      # 5%
            "tools": 0.05,       # 5%
            "history": 0.60,    # 60%
            "retrieval": 0.20,  # 20%
            "current": 0.10,    # 10%
        },
        "agent": {
            "system": 0.10,     # 10%
            "tools": 0.15,      # 15%
            "history": 0.30,    # 30%
            "retrieval": 0.25,  # 25%
            "observations": 0.20, # 20%
        },
    }

    alloc = ALLOCATIONS[task_type]
    return {k: int(v * available) for k, v in alloc.items()}
```

### Compression Triggers

```python
COMPRESSION_CONFIG = {
    "trigger_threshold": 0.70,    # Start compressing at 70%
    "target_threshold": 0.50,     # Compress down to 50%
    "preserve_recent": 5,         # Always keep last 5 messages
    "preserve_system": True,      # Never compress system prompt
}
```

---

## Attention-Aware Positioning

### Template Structure

```markdown
[START - HIGH ATTENTION]
## System Identity
You are a {role} specialized in {domain}.

## Critical Constraints
- NEVER {dangerous_action}
- ALWAYS {required_behavior}

[MIDDLE - LOWER ATTENTION]
## Background Context
{retrieved_documents}
{older_conversation_history}

[END - HIGH ATTENTION]
## Current Task
{recent_messages}
{user_query}

## Response Guidelines
{output_format_instructions}
```

### Priority Positioning Rules

1. **Identity & Constraints** → START (immutable)
2. **Critical instructions** → START or END
3. **Retrieved documents** → MIDDLE (expandable)
4. **Conversation history** → MIDDLE (compressible)
5. **Current query** → END (always visible)
6. **Output format** → END (guides generation)

---

## Metrics: Tokens-Per-Task

**Optimize for total task completion, not individual requests:**

```python
@dataclass
class TaskMetrics:
    task_id: str
    total_tokens: int = 0
    request_count: int = 0
    retrieval_tokens: int = 0
    generation_tokens: int = 0

    @property
    def tokens_per_request(self) -> float:
        return self.total_tokens / max(self.request_count, 1)

    @property
    def efficiency_ratio(self) -> float:
        """Lower is better - generation vs total context."""
        return self.generation_tokens / max(self.total_tokens, 1)
```

**Anti-pattern:** Aggressive compression that loses critical details forces expensive re-fetching, consuming MORE tokens overall.

---

## Common Pitfalls

| Pitfall | Problem | Solution |
|---------|---------|----------|
| Token stuffing | "More context = better" | Quality over quantity |
| Flat structure | No priority signaling | Use headers, positioning |
| Static context | Same context for all queries | Dynamic, query-relevant retrieval |
| Ignoring middle | Important info gets lost | Position critically |
| No compression | Context grows unbounded | Sliding window + summarization |

---

## Integration with OrchestKit

### Agent System Prompts

Apply attention-aware positioning to agent definitions:

```markdown
# Agent: backend-system-architect

[HIGH ATTENTION - START]
## Identity
Senior backend architect with 15+ years experience.

## Constraints
- NEVER suggest unvalidated security patterns
- ALWAYS consider multi-tenant isolation

[LOWER ATTENTION - MIDDLE]
## Domain Knowledge
{dynamically_loaded_patterns}

[HIGH ATTENTION - END]
## Current Task
{user_request}
```

### Skill Loading

Progressive skill disclosure:

```python
# Stage 1: Load skill metadata only (~100 tokens)
skill_index = load_skill_summaries()

# Stage 2: Load relevant skill on demand (~500 tokens)
if task_matches("database"):
    full_skill = load_skill("pgvector-search")
```

---

---

## CC 2.1.7: MCP Auto-Discovery and Deferral

### MCP Search Mode

CC 2.1.7 introduces intelligent MCP tool discovery. When context usage exceeds 10% of the effective window, MCPs are automatically deferred to reduce token overhead.

```
Context < 10%:  MCP tools immediately available
Context > 10%:  MCP tools discovered via MCPSearch (deferred loading)

Savings: ~7200 tokens per session average
```

### How Auto-Deferral Works

The context budget monitor tracks usage against the effective window:

1. **Below 10%**: MCP tool definitions loaded in context (~1200 tokens)
2. **Above 10%**: MCP tools deferred, available via MCPSearch on-demand
3. **State file**: `/tmp/claude-mcp-defer-state-{session}.json`

### Best Practices for MCP with Auto-Deferral

1. **Use MCPs early** - Before context fills up
2. **Batch MCP calls** - Multiple queries in one turn
3. **Cache MCP results** - Store retrieved docs in context
4. **Monitor statusline** - Watch for `mcp.deferred: true`

### Checking MCP Deferral State

```bash
cat /tmp/claude-mcp-defer-state-${CLAUDE_SESSION_ID}.json
```


## Related Skills

- `context-compression` - Compression strategies and anchored summarization
- `multi-agent-orchestration` - Context isolation across agents
- `rag-retrieval` - Optimizing retrieved document context
- `prompt-caching` - Reducing redundant context transmission

---

**Version:** 1.0.0 (January 2026)
**Based on:** Context Engineering research, BrowseComp evaluation findings
**Key Metric:** 80% of agent performance variance explained by token usage

## Capability Details

### attention-mechanics
**Keywords:** context window, attention, lost in the middle, token budget
**Solves:**
- Understand lost-in-the-middle effect (high attention at START/END)
- Position critical info strategically
- Optimize tokens-per-task not tokens-per-request

### context-layers
**Keywords:** context anatomy, context structure, five layers
**Solves:**
- Understand 5 context layers (system, tools, docs, history, outputs)
- Implement just-in-time document loading
- Manage tool output truncation

### budget-allocation
**Keywords:** token budget, context budget, allocation
**Solves:**
- Allocate tokens across context layers
- Implement compression triggers at 70% utilization
- Target 50% utilization after compression