---
name: chunking-strategies
description: Document chunking strategies for RAG systems. Use when implementing document processing pipelines to determine optimal chunking approaches based on document type and retrieval requirements.
---

# Chunking Strategies Skill

This skill provides chunking strategies for RAG document processing.

## Chunking Methods

### 1. Fixed-Size Chunking
```python
def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks
```

### 2. Semantic Chunking
Split on natural boundaries (sentences, paragraphs).

```python
def semantic_chunk(text: str, max_tokens: int = 500):
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = []
    current_tokens = 0

    for para in paragraphs:
        para_tokens = count_tokens(para)
        if current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current_chunk))
            current_chunk = [para]
            current_tokens = para_tokens
        else:
            current_chunk.append(para)
            current_tokens += para_tokens

    if current_chunk:
        chunks.append("\n\n".join(current_chunk))
    return chunks
```

### 3. Recursive Chunking
Hierarchical splitting on multiple separators.

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]

def recursive_chunk(text: str, max_size: int, separators: list[str]):
    if len(text) <= max_size:
        return [text]

    sep = separators[0] if separators else ""
    chunks = []
    parts = text.split(sep)

    for part in parts:
        if len(part) <= max_size:
            chunks.append(part)
        elif len(separators) > 1:
            chunks.extend(recursive_chunk(part, max_size, separators[1:]))
        else:
            chunks.append(part[:max_size])

    return chunks
```

## Chunking by Document Type

| Document Type | Recommended Strategy | Chunk Size |
|---------------|---------------------|------------|
| Technical docs | Semantic (headers) | 500-1000 tokens |
| Legal documents | Semantic (sections) | 1000-2000 tokens |
| Code | Function/class based | 200-500 tokens |
| Conversations | Message boundaries | 100-300 tokens |
| General text | Recursive | 300-500 tokens |

## Chunk Enrichment

```python
@dataclass
class EnrichedChunk:
    content: str
    metadata: dict
    summary: str  # LLM-generated
    keywords: list[str]
    parent_id: str  # For hierarchical retrieval
```

## Best Practices

- Add overlap between chunks (10-20%)
- Preserve semantic boundaries
- Include metadata (source, position)
- Consider hierarchical chunking for long docs
- Test retrieval quality with different sizes