---
name: contextual-chunking
description: Contextual Retrieval implementation for RAG - chunks clinical notes with LLM-generated context prepended to each chunk before embedding. Reduces retrieval failure rate by up to 49% per Anthropic research.
---

# Contextual Chunking Skill

## Overview

This skill implements Anthropic's Contextual Retrieval pattern for RAG systems. It chunks clinical notes into fixed-size segments (1000 tokens, 200-token overlap) and generates a 50-100 token contextual summary for each chunk using Phi-4. The context is prepended to each chunk before embedding, significantly improving retrieval accuracy for citation extraction.

## When to Use

Use this skill when:

- Preparing clinical notes for RAG-based summarization
- Creating embeddings for ChromaDB storage
- Improving citation accuracy and reducing hallucinations
- Processing multi-page clinical notes for semantic search

## Research Background

**Anthropic Contextual Retrieval**: Prepending chunk-specific context reduces the top-20-chunk retrieval failure rate by 35% with contextual embeddings alone, and by up to 49% when combined with contextual BM25, relative to standard RAG. The context helps the embedding model understand each chunk's role within the larger document.

## Installation

**IMPORTANT**: This skill has its own isolated virtual environment (`.venv`) managed by `uv`. Do NOT use system Python.

Initialize the skill's environment:

```bash
# From the skill directory
cd .agent/skills/contextual-chunking
uv sync  # Creates .venv and installs dependencies from pyproject.toml
```

Dependencies are in `pyproject.toml`:

- `tiktoken` - Token counting (an approximation for Phi-4, which uses its own tokenizer)

## Usage

**CRITICAL**: Always use `uv run` to execute code with this skill's `.venv`, NOT system Python.

### Basic Chunking with Context

```python
# From .agent/skills/contextual-chunking/ directory
# Run with: uv run python -c "..."

from contextual_chunking import ContextualChunker

# You'll need to import ollama-client separately
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent / "ollama-client"))
from ollama_client import OllamaClient

# Initialize
chunker = ContextualChunker(
    ollama_client=OllamaClient(),
    chunk_size=1000,    # Tokens per chunk
    chunk_overlap=200,  # Overlap between chunks (20%)
    context_size=75     # Context tokens (50-100 range)
)

# Chunk clinical note
clinical_note = "Patient presents with chest pain radiating to left arm..."
enriched_chunks = chunker.chunk_with_context(
    document_text=clinical_note,
    doc_id="note_123"
)

# Each enriched chunk contains:
for chunk in enriched_chunks:
    print(f"Chunk ID: {chunk['id']}")
    print(f"Original text: {chunk['original_text'][:100]}...")
    print(f"Context: {chunk['context']}")
    print(f"Enriched (context + text): {chunk['enriched_text'][:150]}...")
    print(f"Offsets: {chunk['start_offset']}-{chunk['end_offset']}")
    print("---")
```
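For reference, here is a minimal sketch of the fixed-size, overlapping chunking loop the configuration above describes. It is illustrative only (see `contextual_chunking.py` for the real implementation): the helper name `split_into_chunks` is not part of the skill's API, and it assumes tiktoken's `cl100k_base` encoding as a stand-in for Phi-4's tokenizer, so token counts are estimates.

```python
# Illustrative sketch only -- see contextual_chunking.py for the real implementation.
# ASSUMPTION: cl100k_base is used as an approximation of Phi-4's tokenizer.
import tiktoken

def split_into_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Return token-window chunks with approximate character offsets."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    stride = chunk_size - chunk_overlap  # e.g. 800 tokens between chunk starts
    chunks = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + chunk_size]
        chunk_text = enc.decode(window)
        # Approximate character offset: length of the decoded token prefix.
        # The real implementation should track offsets exactly for citations.
        start_offset = len(enc.decode(tokens[:start]))
        chunks.append({
            "text": chunk_text,
            "start_offset": start_offset,
            "end_offset": start_offset + len(chunk_text),
            "token_count": len(window),
        })
        if start + chunk_size >= len(tokens):
            break  # Last window already reached the end of the document
    return chunks
```

Note the stride arithmetic: with a 1000-token window and 200-token overlap, each chunk starts 800 tokens after the previous one, which is what makes a 5000-token note produce about six chunks rather than five.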
### Integration with ChromaDB

```python
from src.skills.chroma_client.chroma_client import ChromaClient

# 1. Chunk with context
enriched_chunks = chunker.chunk_with_context(clinical_note, "note_123")

# 2. Store enriched chunks in ChromaDB
chroma_client = ChromaClient()
chroma_client.add_chunks(
    collection_name="clinical_note_session_456",
    chunks=[chunk['enriched_text'] for chunk in enriched_chunks],
    metadatas=[{
        'chunk_id': chunk['id'],
        'start_offset': chunk['start_offset'],
        'end_offset': chunk['end_offset'],
        'original_text': chunk['original_text']
    } for chunk in enriched_chunks],
    ids=[chunk['id'] for chunk in enriched_chunks]
)
```

## Context Generation Prompt

The LLM generates context with this prompt template:

```
Given the whole document context, provide succinct context (50-100 tokens) to situate this chunk for search retrieval purposes.

Document title/type: Clinical Note
Document context: [First 2000 chars of full document]

Chunk to contextualize: {chunk_text}

Provide ONLY the context (no explanations):
```

**Example Output**:

```
Context: This section describes the patient's presenting symptoms during initial triage, specifically cardiovascular complaints requiring urgent evaluation.
```

## Chunk Structure

Each enriched chunk dictionary contains:

```python
{
    'id': 'note_123_chunk_0',
    'original_text': 'Patient presents with chest pain...',
    'context': 'This section describes presenting symptoms...',
    'enriched_text': 'This section describes presenting symptoms... Patient presents with chest pain...',
    'start_offset': 0,
    'end_offset': 1200,
    'token_count': 1000
}
```

## Configuration

**Parameters**:

- `chunk_size`: Tokens per chunk (default: 1000)
  - Too small: context fragmentation, poor retrieval
  - Too large: embedding quality degrades, slower search
- `chunk_overlap`: Token overlap (default: 200, ~20%)
  - Prevents information loss at chunk boundaries
  - Critical for accurate citation offsets
- `context_size`: Context tokens (default: 75, range: 50-100)
  - Balances informativeness against token cost
  - Generated by the LLM for each chunk

## Best Practices

1. **Token Counting**: Use tiktoken for consistent token counts (approximate for Phi-4)
2. **Context Quality**: Verify the LLM generates succinct, relevant context
3. **Offset Tracking**: Maintain character offsets for citation extraction
4. **Batch Processing**: Generate contexts in batches for efficiency
5. **Cache Contexts**: Store enriched chunks to avoid regeneration

## Performance Considerations

**Chunking a 10-page note (5000 tokens)**:

- Chunks: ~6 (1000-token windows advancing 800 tokens per step after the 200-token overlap)
- Context generation: ~6 LLM calls (~5-10 seconds total)
- Total time: 10-15 seconds (acceptable for offline processing)

**Trade-offs**:

- **Pro**: Substantially better retrieval (up to a 49% reduction in retrieval failures per Anthropic's benchmarks)
- **Pro**: Fewer hallucinations, better citations
- **Con**: Additional LLM inference time
- **Con**: Slightly higher token usage

## Error Handling

- If LLM context generation fails, fall back to an empty context (the chunk remains functional); see the fallback sketch at the end of this document
- If a chunk exceeds the token limit, split it further
- Preserve original text and offsets even if context generation fails

## Integration with RAG Pipeline

**Workflow** (an end-to-end sketch follows the References below):

1. **Chunk**: Use this skill to create enriched chunks
2. **Embed**: Store in ChromaDB (automatic embedding)
3. **Retrieve**: Query ChromaDB for relevant chunks
4. **Extract**: Use the `citation-extraction` skill to validate citations
5. **Cleanup**: Clear the ChromaDB collection after the session

## Implementation

See `contextual_chunking.py` for the full Python implementation.

## References

- [Anthropic Contextual Retrieval](https://www.anthropic.com/news/contextual-retrieval)
- [LangChain RAG Guide](https://python.langchain.com/docs/use_cases/question_answering)
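The error-handling fallback referenced above can be illustrated with a short sketch. The prompt mirrors the template from the Context Generation Prompt section; `ollama_client.generate(prompt)` is an assumed interface (check the ollama-client skill for the actual method name and signature), and the empty-context fallback keeps the pipeline functional when the LLM call fails.

```python
# Sketch of per-chunk context generation with the documented fallback behavior.
# ASSUMPTION: ollama_client.generate(prompt: str) -> str; the real ollama-client
# skill may expose a different method name or signature.

CONTEXT_PROMPT = """Given the whole document context, provide succinct context \
(50-100 tokens) to situate this chunk for search retrieval purposes.

Document title/type: Clinical Note
Document context: {document_head}

Chunk to contextualize: {chunk_text}

Provide ONLY the context (no explanations):"""

def generate_context(ollama_client, document_text: str, chunk_text: str) -> str:
    prompt = CONTEXT_PROMPT.format(
        document_head=document_text[:2000],  # First 2000 chars of full document
        chunk_text=chunk_text,
    )
    try:
        return ollama_client.generate(prompt).strip()
    except Exception:
        # Documented fallback: an empty context still yields a usable chunk,
        # since original text and offsets are preserved regardless.
        return ""
```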
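Finally, a hedged end-to-end sketch of the five-step workflow from the Integration with RAG Pipeline section. The `query()` and `delete_collection()` methods on `ChromaClient` are assumptions inferred from the `add_chunks()` call shown earlier; verify them against the chroma-client skill before use.

```python
# End-to-end sketch of the five-step workflow. Method names marked ASSUMED
# are hypothetical -- verify them against the chroma-client skill.
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent / "ollama-client"))

from contextual_chunking import ContextualChunker
from ollama_client import OllamaClient
from src.skills.chroma_client.chroma_client import ChromaClient

chunker = ContextualChunker(ollama_client=OllamaClient())
chroma_client = ChromaClient()
collection = "clinical_note_session_456"
clinical_note = "Patient presents with chest pain radiating to left arm..."

# 1. Chunk: create context-enriched chunks (this skill)
enriched_chunks = chunker.chunk_with_context(clinical_note, "note_123")

# 2. Embed: store in ChromaDB (embedding happens on insert)
chroma_client.add_chunks(
    collection_name=collection,
    chunks=[c["enriched_text"] for c in enriched_chunks],
    metadatas=[{"chunk_id": c["id"], "original_text": c["original_text"]}
               for c in enriched_chunks],
    ids=[c["id"] for c in enriched_chunks],
)

# 3. Retrieve: fetch the most relevant chunks for a query
# (ASSUMED: a query() method exists on ChromaClient)
results = chroma_client.query(
    collection_name=collection,
    query_text="cardiac symptoms at triage",
    n_results=5,
)

# 4. Extract: pass retrieved chunks to the citation-extraction skill
# (see that skill's documentation for its interface)

# 5. Cleanup: drop the session collection
# (ASSUMED: a delete_collection() method exists on ChromaClient)
chroma_client.delete_collection(collection)
```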