---
name: index-manager
description: Manages MPEP index lifecycle including downloads, building, maintenance, and optimization.
---

# Index Manager Skill

Expert system for managing MPEP search index lifecycle: PDF downloads, index building, maintenance, updates, optimization.

**FOR CLAUDE:** All dependencies installed, system operational.
- Go directly to appropriate phase
- Scripts/tools in mcp_server/
- Use patent-creator CLI when available
- Only run diagnostics if operations fail

## When to Use

Use when building or rebuilding the MPEP index, recovering from corrupted or missing index files, optimizing the index, adding content, or troubleshooting search.

## Index Lifecycle

```
PDFs Not Present -> Download (2-5 min, 500MB)
  -> Extract & Parse (500MB data)
  -> Generate Embeddings (5-10 min GPU, 35-65 min CPU)
  -> Build FAISS + BM25 Indexes
  -> Index Ready (mcp_server/index/)
  -> Maintenance (Verify -> Optimize -> Update)
```
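
The stages above can usually be told apart from the files on disk. A minimal sketch, assuming the `pdfs/` and `mcp_server/index/` paths and the three index file names used elsewhere in this document; `lifecycle_state` and its return labels are hypothetical helpers, not part of the patent-creator CLI:

```python
from pathlib import Path

# File names taken from the Phase 3 listing in this document.
INDEX_FILES = ("mpep_index.faiss", "mpep_metadata.json", "mpep_bm25.pkl")


def lifecycle_state(pdf_dir="pdfs", index_dir="mcp_server/index"):
    """Return a label for the lifecycle stage the installation appears to be in."""
    pdf_dir, index_dir = Path(pdf_dir), Path(index_dir)
    has_pdfs = pdf_dir.is_dir() and any(pdf_dir.glob("*.pdf"))
    present = [name for name in INDEX_FILES if (index_dir / name).is_file()]
    if len(present) == len(INDEX_FILES):
        return "index-ready"
    if present:
        return "index-incomplete"  # some index files are missing: rebuild
    if has_pdfs:
        return "needs-build"
    return "needs-download"


print(lifecycle_state())
```

This only looks at file presence; it does not prove the index loads, so it complements `patent-creator health` and does not replace it.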

## Phase 1: PDF Management

**Check Status:**
```bash
ls pdfs/  # Should show mpep-*.pdf, consolidated_laws.pdf, consolidated_rules.pdf
```
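
The same check can be scripted. A small sketch that reports which of the expected file patterns have no match; the patterns come from the comment above, and `missing_pdf_patterns` is a hypothetical helper name:

```python
from pathlib import Path

# Patterns copied from the `ls pdfs/` comment in this document.
EXPECTED_PATTERNS = ("mpep-*.pdf", "consolidated_laws.pdf", "consolidated_rules.pdf")


def missing_pdf_patterns(pdf_dir="pdfs"):
    """Return the expected patterns that match no file in pdf_dir."""
    pdf_dir = Path(pdf_dir)
    if not pdf_dir.is_dir():
        return list(EXPECTED_PATTERNS)  # nothing downloaded yet
    return [p for p in EXPECTED_PATTERNS if not any(pdf_dir.glob(p))]


print(missing_pdf_patterns() or "all expected PDFs present")
```

A non-empty result suggests running the download step below.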

**Download PDFs:**
```bash
patent-creator download-mpep
# Or: python install.py (Select "Download MPEP PDFs")
```

**Verify Integrity:**
```bash
python -c "
import fitz  # PyMuPDF
from pathlib import Path
for pdf in Path('pdfs').glob('*.pdf'):
    try:
        doc = fitz.open(pdf)
        print(f'[OK] {pdf.name}: {len(doc)} pages')
        doc.close()
    except Exception as e:
        print(f'[X] {pdf.name}: ERROR - {e}')
"
```

## Phase 2: Index Building

```bash
patent-creator rebuild-index
# Or: python mcp_server/server.py --rebuild-index
```

**Timeline:**
- Load PDFs: 30s
- Extract text: 1-2 min
- Chunk text (500 tokens): 30s
- Generate embeddings: 5-10 min (GPU) or 35-65 min (CPU)
- Build FAISS/BM25: 1 min
- Save to disk: 10s

**Total:** about 8-14 min (GPU) or 38-69 min (CPU), summing the steps above
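
As a sanity check, the end-to-end range can be derived by summing the per-step estimates. The figures below are copied from the timeline; the dictionary keys and `total_minutes` are illustrative names only:

```python
# Per-step estimates in seconds as (low, high), copied from the timeline above.
FIXED_STEPS = {
    "load_pdfs": (30, 30),
    "extract_text": (60, 120),
    "chunk_text": (30, 30),
    "build_faiss_bm25": (60, 60),
    "save_to_disk": (10, 10),
}
EMBEDDING = {"gpu": (300, 600), "cpu": (2100, 3900)}


def total_minutes(device):
    """Return (low, high) total build time in minutes for 'gpu' or 'cpu'."""
    low = sum(lo for lo, _ in FIXED_STEPS.values()) + EMBEDDING[device][0]
    high = sum(hi for _, hi in FIXED_STEPS.values()) + EMBEDDING[device][1]
    return low / 60, high / 60


for device in ("gpu", "cpu"):
    low, high = total_minutes(device)
    print(f"{device}: {low:.0f}-{high:.0f} min")
# prints "gpu: 8-14 min" then "cpu: 38-69 min"
```

Embedding generation dominates in both cases, so the non-embedding steps add only about 3-4 minutes.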

**Custom Build:**
```python
from mcp_server.mpep_search import MPEPIndex
index = MPEPIndex(use_hyde=False)
index.build_index(
    chunk_size=500,
    overlap=50,
    batch_size=32  # Reduce to 16/8 if OOM
)
```
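
To illustrate what `chunk_size` and `overlap` control, here is a sliding-window sketch. How `build_index` actually tokenizes and splits text is not shown in this document, so treat `chunk_tokens` as a hypothetical stand-in that works on a plain list of tokens:

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split tokens into windows of chunk_size; neighbours share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


chunks = chunk_tokens(list(range(1200)))
print([len(c) for c in chunks])  # [500, 500, 300]
```

With the defaults, each window advances 450 tokens, so a larger `overlap` produces more chunks and a proportionally longer embedding step.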

## Phase 3: Verification

```bash
# Check files
ls -lh mcp_server/index/
# Expected: mpep_index.faiss (~150MB), mpep_metadata.json (~80MB), mpep_bm25.pkl (~60MB)

# Verify health
patent-creator health
# Should show: [OK] MPEP Index: Ready (12,543 chunks)

# Manual test
python -c "
from mcp_server.mpep_search import MPEPIndex
index = MPEPIndex()
print(f'Chunks: {len(index.chunks)}')
results = index.search('claim definiteness', top_k=3)
print(f'Search results: {len(results)}')
"
```
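
The file check can also be automated. A sketch that compares on-disk sizes with the approximate sizes listed above; the 50% tolerance and the `check_index_files` helper are assumptions for illustration, since real sizes vary with MPEP revision and chunk settings:

```python
from pathlib import Path

# Approximate sizes in MB, copied from the expected listing above.
EXPECTED_MB = {"mpep_index.faiss": 150, "mpep_metadata.json": 80, "mpep_bm25.pkl": 60}


def check_index_files(index_dir="mcp_server/index", expected=None, tolerance=0.5):
    """Return {file name: 'ok' | 'missing' | 'empty' | 'size-mismatch'}."""
    expected = EXPECTED_MB if expected is None else expected
    report = {}
    for name, expected_mb in expected.items():
        path = Path(index_dir) / name
        if not path.is_file():
            report[name] = "missing"
            continue
        size_mb = path.stat().st_size / 1_000_000
        if size_mb == 0:
            report[name] = "empty"
        elif abs(size_mb - expected_mb) > tolerance * expected_mb:
            report[name] = "size-mismatch"  # hint only; not proof of corruption
        else:
            report[name] = "ok"
    return report


print(check_index_files())
```

A `missing` or `empty` entry is a strong signal to rebuild; `size-mismatch` is only a prompt to run the manual search test.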

## Phase 4: Maintenance

**When to Rebuild:**
- MPEP updates (check uspto.gov quarterly)
- Index corruption
- After adding new PDFs
- Performance degradation
- Machine migration

**Rebuild Process:**
```bash
# Backup (recommended before any rebuild)
BACKUP="mcp_server/index_backup_$(date +%Y%m%d)"
cp -r mcp_server/index "$BACKUP"

# Rebuild
patent-creator rebuild-index

# Verify
patent-creator health

# Remove only this run's backup, and only if health reported OK
rm -rf "$BACKUP"
```
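
If the rebuild fails verification, the backup needs to be put back. A Python sketch of the backup and restore steps, assuming the same directory naming as the shell commands above; `backup_index` and `restore_index` are hypothetical helpers:

```python
import shutil
from datetime import date
from pathlib import Path


def backup_index(index_dir="mcp_server/index"):
    """Copy the index directory to a dated sibling and return the backup path."""
    index_dir = Path(index_dir)
    backup = index_dir.with_name(f"{index_dir.name}_backup_{date.today():%Y%m%d}")
    shutil.copytree(index_dir, backup, dirs_exist_ok=True)
    return backup


def restore_index(backup, index_dir="mcp_server/index"):
    """Replace the index directory with the contents of a backup."""
    index_dir = Path(index_dir)
    if index_dir.exists():
        shutil.rmtree(index_dir)  # drop the failed or corrupted rebuild
    shutil.copytree(backup, index_dir)
```

A typical flow would be `backup = backup_index()`, run the rebuild and health check, then either delete `backup` or call `restore_index(backup)`.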

## Phase 5: Content Updates

```bash
# Download new PDF
wget https://www.uspto.gov/web/offices/pac/mpep/mpep-2900.pdf -O pdfs/mpep-2900.pdf

# Rebuild (includes new section)
patent-creator rebuild-index
```

**Note:** Incremental updates not supported. Full rebuild required.
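
Because every content change needs a full rebuild, it helps to know when one is actually due. A sketch that compares modification times, assuming the `pdfs/` and `mcp_server/index/` layout from this document; `index_is_stale` is a hypothetical helper and mtimes are only a heuristic:

```python
from pathlib import Path


def index_is_stale(pdf_dir="pdfs", index_dir="mcp_server/index"):
    """Return True if the index is absent or older than the newest PDF."""
    pdfs = list(Path(pdf_dir).glob("*.pdf"))
    index_files = [p for p in Path(index_dir).glob("*") if p.is_file()]
    if not index_files:
        return True  # nothing built yet
    if not pdfs:
        return False  # no source PDFs to compare against
    newest_pdf = max(p.stat().st_mtime for p in pdfs)
    oldest_index = min(p.stat().st_mtime for p in index_files)
    return newest_pdf > oldest_index


print("rebuild needed" if index_is_stale() else "index up to date")
```

Copying files between machines can reset modification times, so after a migration a rebuild is the safer choice regardless of this result.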

## Troubleshooting

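- OOM errors during build: reduce `batch_size` from 32 to 16 or 8 (see Custom Build)
- Build taking too long: expected on CPU (35-65 min for embeddings); use a GPU if available
- Corrupted index files: back up, then rebuild (see Phase 4)
- Search returning no results: check index files and chunk count (see Phase 3)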
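
For the OOM case, the batch size can be stepped down automatically. A sketch under the assumption that the build raises `MemoryError` when it runs out of memory; GPU frameworks may raise their own exception type instead, so adjust the `except` clause accordingly. `build_with_fallback` is a hypothetical wrapper:

```python
def build_with_fallback(build, batch_sizes=(32, 16, 8)):
    """Call build(batch_size=...) with smaller batches until one succeeds.

    Returns the batch size that worked; raises RuntimeError if none did.
    """
    last_error = None
    for batch_size in batch_sizes:
        try:
            build(batch_size=batch_size)
            return batch_size
        except MemoryError as error:  # assumption: OOM surfaces as MemoryError
            last_error = error
    raise RuntimeError(f"build failed for all batch sizes {batch_sizes}") from last_error
```

With the Custom Build example, usage would look like `build_with_fallback(lambda batch_size: index.build_index(chunk_size=500, overlap=50, batch_size=batch_size))`.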

## Performance Tuning

- Embedding generation speed (GPU vs CPU)
- Search latency optimization
- Index size reduction
- Batch size tuning
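
Tuning search latency starts with measuring it. A minimal timing sketch that accepts any search callable; `measure_latency` is a hypothetical helper and the choice of median and max as summary statistics is an assumption:

```python
import statistics
import time


def measure_latency(search, queries, repeats=3):
    """Time search(query) for each query and summarize latency in milliseconds."""
    timings = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            search(query)
            timings.append((time.perf_counter() - start) * 1000)
    return {
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
        "runs": len(timings),
    }
```

With a loaded index, a call might look like `measure_latency(lambda q: index.search(q, top_k=3), ["claim definiteness"])`. Comparing results before and after a change shows whether it helped.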

## Quick Reference

| Command | Purpose |
|---------|---------|
| `patent-creator download-mpep` | Download MPEP PDFs |
| `patent-creator rebuild-index` | Build/rebuild search index |
| `patent-creator health` | Check index health |
| `ls -lh mcp_server/index/` | View index files |

**Best Practices:**
1. Backup before rebuild
2. Verify PDFs before building
3. Use a GPU when available: embeddings take 5-10 min instead of 35-65 min (roughly 7x faster)
4. Test after rebuild
5. Keep PDFs until the rebuilt index is verified
6. Weekly health checks