---
name: grepai-chunking
description: Configure code chunking in GrepAI. Use this skill to optimize how code is split for embedding.
---

# GrepAI Chunking Configuration

This skill covers how GrepAI splits code files into chunks for embedding, and how to optimize chunking for your codebase.

## When to Use This Skill

- Optimizing search accuracy
- Adjusting for code style (verbose vs. concise)
- Troubleshooting search results
- Understanding how indexing works

## What is Chunking?

Chunking is the process of splitting source files into smaller segments for embedding:

```
┌─────────────────────────────────────┐
│         Large Source File           │
│         (1000+ tokens)              │
└─────────────────────────────────────┘
                  ↓
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Chunk 1 │ │ Chunk 2 │ │ Chunk 3 │
│ ~512    │ │ ~512    │ │ ~512    │
│ tokens  │ │ tokens  │ │ tokens  │
└─────────┘ └─────────┘ └─────────┘
                  ↓
          Each chunk gets
          its own embedding
```

## Why Chunking Matters

Embedding models have optimal input sizes:
- **Too large chunks:** Less precise search results
- **Too small chunks:** Lost context, fragmented results
- **Just right:** Good balance of precision and context

## Configuration

### Basic Settings

```yaml
# .grepai/config.yaml
chunking:
  size: 512      # Tokens per chunk
  overlap: 50    # Overlap between chunks
```

### Understanding Parameters

#### Chunk Size

The target number of tokens per chunk.

| Size | Effect |
|------|--------|
| 256 | More precise, less context |
| 512 | Balanced (default) |
| 1024 | More context, less precise |

#### Overlap

Tokens shared between adjacent chunks. Preserves context at boundaries.

| Overlap | Effect |
|---------|--------|
| 0 | No overlap, may lose context at boundaries |
| 50 | Standard overlap (default) |
| 100 | More context, larger index |

## Visualization

With size=512 and overlap=50:

```
File: auth.go (1000 tokens)

Chunk 1: tokens 1-512
         ┌────────────────────────────────────┐
         │ func Login(user, pass)...          │
         └────────────────────────────────────┘
                                    ↘
                              50 token overlap
                                    ↙
Chunk 2: tokens 463-974
         ┌────────────────────────────────────┐
         │ ...validate credentials...         │
         └────────────────────────────────────┘
                                    ↘
                              50 token overlap
                                    ↙
Chunk 3: tokens 925-1000
         ┌──────────────┐
         │ ...return    │
         └──────────────┘
```

## Recommended Settings by Language

### Verbose Languages (Java, C#)

```yaml
chunking:
  size: 768    # Larger to capture full methods
  overlap: 75
```

### Concise Languages (Go, Python)

```yaml
chunking:
  size: 512    # Standard size
  overlap: 50
```

### Very Concise (Rust, Zig)

```yaml
chunking:
  size: 384    # Smaller for precise results
  overlap: 40
```

## Recommended Settings by Codebase

### Small Functions (Microservices)

```yaml
chunking:
  size: 384    # Capture individual functions
  overlap: 40
```

### Large Classes (Monolith)

```yaml
chunking:
  size: 768    # Capture more context
  overlap: 100
```

### Mixed Codebase

```yaml
chunking:
  size: 512    # Balanced default
  overlap: 50
```

## How Tokens are Counted

GrepAI uses approximate token counting:
- ~4 characters = 1 token (for English text)
- Code varies based on identifiers and syntax

Example:
```go
func calculateTotal(items []Item) float64 {
    total := 0.0
    for _, item := range items {
        total += item.Price * float64(item.Quantity)
    }
    return total
}
```
≈ 45 tokens

## Impact on Index Size

Larger overlap = more chunks = larger index:

| Size | Overlap | Chunks per 10K tokens | Index Impact |
|------|---------|----------------------|--------------|
| 512 | 0 | ~20 | Smallest |
| 512 | 50 | ~22 | Standard |
| 512 | 100 | ~24 | +10% |
| 256 | 50 | ~44 | +100% |

## Impact on Search Quality

### Too Small Chunks (size: 128)

```
Query: "authentication middleware"

Result: "...c.AbortWithStatus(401)..."
        (Fragment, missing context)
```

### Just Right (size: 512)

```
Query: "authentication middleware"

Result: "func AuthMiddleware() gin.HandlerFunc {
            return func(c *gin.Context) {
                token := c.GetHeader("Authorization")
                if token == "" {
                    c.AbortWithStatus(401)
                    return
                }
                // validate token...
            }
        }"
        (Complete function with context)
```

### Too Large Chunks (size: 2048)

```
Query: "authentication middleware"

Result: "// Multiple unrelated functions...
        func AuthMiddleware()... (your match)
        func LoggingMiddleware()...
        func CORSMiddleware()..."
        (Too much noise)
```

## Experimentation

### Testing Different Settings

1. Try smaller chunks for more precise results:
```yaml
chunking:
  size: 384
  overlap: 40
```

2. Re-index:
```bash
rm .grepai/index.gob
grepai watch
```

3. Test with searches:
```bash
grepai search "your query"
```

4. Adjust and repeat until satisfied.

### Comparing Results

Before changing settings, save a search result:
```bash
grepai search "authentication" > before.txt
```

After changing settings and re-indexing:
```bash
grepai search "authentication" > after.txt
diff before.txt after.txt
```

## Chunk Boundaries

GrepAI tries to split at logical boundaries:
1. Empty lines (function/class boundaries)
2. Closing braces
3. Statement ends

This means actual chunk sizes may vary slightly from the target.

## Best Practices

1. **Start with defaults:** 512/50 works well for most codebases
2. **Adjust based on code style:** Verbose = larger, concise = smaller
3. **Test with real queries:** See what your searches return
4. **Re-index after changes:** Must regenerate embeddings
5. **Consider overlap:** Don't set to 0 unless index size is critical

## Common Issues

❌ **Problem:** Search results are too fragmented
✅ **Solution:** Increase chunk size:
```yaml
chunking:
  size: 768
```

❌ **Problem:** Search results have too much irrelevant context
✅ **Solution:** Decrease chunk size:
```yaml
chunking:
  size: 384
```

❌ **Problem:** Results miss related code at function boundaries
✅ **Solution:** Increase overlap:
```yaml
chunking:
  overlap: 100
```

❌ **Problem:** Index is too large
✅ **Solutions:**
- Decrease overlap
- Increase chunk size
- Add more ignore patterns

## Output Format

Chunking status:

```
✅ Chunking Configuration

   Size: 512 tokens
   Overlap: 50 tokens

   Index Statistics:
   - Total files: 245
   - Total chunks: 1,234
   - Avg chunks/file: 5.0
   - Avg chunk size: 478 tokens

   Recommendations:
   - Current settings are balanced
   - Consider size: 384 for more precise results
   - Consider size: 768 for more context
```