---
name: repo-rag
description: Perform high-recall codebase retrieval using semantic search and symbol indexing. Use when you need to find specific code, understand project structure, or verify architectural patterns before editing.
version: 2.0
model: sonnet
invoked_by: both
user_invocable: true
tools: [Read, Grep, Glob]
best_practices:
- Use clear, specific queries (avoid vague terms)
- Provide context about what you're looking for
- Review multiple results to understand patterns
- Use follow-up queries to refine results
- Verify file paths before proposing edits
error_handling: graceful
streaming: supported
---
Repo RAG (Retrieval-Augmented Generation) provides advanced codebase search capabilities beyond simple grep:

- High-recall codebase retrieval using semantic search
- Symbol indexing for finding classes, functions, and types
- Understanding project structure
- Verifying architectural patterns before editing

## Workflow

1. **Symbol Search First**: Use `symbols` to find classes, functions, and types. This is more accurate than text search for code structures.
2. **Semantic Search**: Use `search` for concepts, comments, or broader patterns.
3. **Verification**: Always verify the file path and context returned before proposing edits.
## Use Cases

- **Architecture Review**: Run symbol searches on key interfaces to understand the dependency graph.
- **Plan Mode**: Use this skill to populate the "Context" section of a Plan Mode artifact.
- **Refactoring**: Identify all usages of a symbol before renaming or modifying it.
## Examples

**Symbol Search**:
```
symbols "UserAuthentication"
```
**Semantic Search**:
```
search "authentication middleware logic"
```
## RAG Evaluation
### Overview
Systematic evaluation of RAG quality using retrieval and end-to-end metrics. Based on Claude Cookbooks patterns.
### Evaluation Metrics
**Retrieval Metrics** (from `.claude/tools/repo-rag/metrics.py`; a minimal sketch of these calculations follows the list):

- **Precision**: Proportion of retrieved chunks that are actually relevant
  - Formula: `Precision = True Positives / Total Retrieved`
  - High precision (0.8-1.0): the system retrieves mostly relevant items
- **Recall**: How many of the relevant items were found (completeness of retrieval)
  - Formula: `Recall = True Positives / Total Correct`
  - High recall (0.8-1.0): the system finds most of the relevant items
- **F1 Score**: Harmonic mean of precision and recall
  - Formula: `F1 = 2 × (Precision × Recall) / (Precision + Recall)`
  - Balanced measure when both precision and recall matter
- **MRR (Mean Reciprocal Rank)**: Measures ranking quality
  - Formula: `MRR = mean(1 / rank of first correct item)`, averaged across all queries
  - High MRR (0.8-1.0): correct items tend to be ranked first
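The authoritative implementations live in `metrics.py`; the following is only a minimal sketch of these formulas, assuming chunks are identified by file path and that per-query reciprocal ranks are averaged over the dataset elsewhere:

```python
def evaluate_retrieval(retrieved: list[str], correct: list[str]) -> dict:
    """Sketch of the retrieval metrics for a single query.

    `retrieved` and `correct` are lists of chunk identifiers (file paths).
    The real implementation lives in .claude/tools/repo-rag/metrics.py.
    """
    true_positives = sum(1 for chunk in retrieved if chunk in correct)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Reciprocal rank of the first correct item (0.0 if none was retrieved);
    # MRR proper is this value averaged over every query in the dataset.
    rr = next((1 / (i + 1) for i, c in enumerate(retrieved) if c in correct), 0.0)
    return {"precision": precision, "recall": recall, "f1": f1, "mrr": rr}
```

For example, `retrieved = ["a.ts", "b.ts"]` against `correct = ["b.ts"]` yields precision 0.5, recall 1.0, and reciprocal rank 0.5.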
**End-to-End Metrics** (from `.claude/tools/repo-rag/evaluation.py`; a judge sketch follows the list):

- **Accuracy (LLM-as-Judge)**: Overall correctness, with Claude as the evaluator
  - Compares the generated answer to the correct answer
  - Focuses on substance and meaning, not exact wording
  - Checks for completeness and absence of contradictions
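A minimal sketch of such a judge, assuming the `anthropic` Python SDK; the model ID and prompt below are illustrative, not necessarily what `evaluation.py` uses:

```python
from anthropic import Anthropic

def judge_answer(query: str, generated: str, correct: str) -> bool:
    """Ask Claude whether the generated answer matches the correct one.

    Sketch only; the real judge lives in .claude/tools/repo-rag/evaluation.py.
    """
    prompt = (
        f"Question: {query}\n\n"
        f"Correct answer: {correct}\n\n"
        f"Candidate answer: {generated}\n\n"
        "Judge on substance and meaning, not exact wording. Is the candidate "
        "complete and free of contradictions with the correct answer? "
        "Reply with exactly CORRECT or INCORRECT."
    )
    response = Anthropic().messages.create(
        model="claude-sonnet-4-5",  # assumed model ID; use whatever the project pins
        max_tokens=16,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip().upper().startswith("CORRECT")
```

Returning a boolean keeps the judge cheap to aggregate into an accuracy percentage.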
### Evaluation Process
1. **Create Evaluation Dataset**:
```json
{
"query": "How is user authentication implemented?",
"correct_chunks": ["src/auth/middleware.ts", "src/auth/types.ts"],
"correct_answer": "User authentication uses JWT tokens...",
"category": "authentication"
}
```
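For larger suites, storing one JSON object per line (JSONL) is convenient; a minimal loader, with `eval_dataset.jsonl` as a hypothetical path:

```python
import json

# Hypothetical dataset path; one JSON example per line (JSONL).
with open("eval_dataset.jsonl") as f:
    dataset = [json.loads(line) for line in f if line.strip()]

for example in dataset:
    print(example["query"], "->", example["correct_chunks"])
```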
2. **Run Retrieval Evaluation**:
```python
import sys
sys.path.insert(0, ".claude/tools/repo-rag")  # dotted/hyphenated dir is not importable directly

from metrics import evaluate_retrieval

metrics = evaluate_retrieval(retrieved_chunks, correct_chunks)
print(f"Precision: {metrics['precision']}, Recall: {metrics['recall']}, F1: {metrics['f1']}, MRR: {metrics['mrr']}")
```
3. **Run End-to-End Evaluation**:
```python
import sys
sys.path.insert(0, ".claude/tools/repo-rag")

from evaluation import evaluate_end_to_end

result = evaluate_end_to_end(query, generated_answer, correct_answer)
print(f"Correct: {result['is_correct']}, Explanation: {result['explanation']}")
```
### Expected Performance
Based on Claude Cookbooks results:
- **Basic RAG**: Precision 0.43, Recall 0.66, F1 0.52, MRR 0.74, Accuracy 71%
- **With Re-ranking**: Precision 0.44, Recall 0.69, F1 0.54, MRR 0.87, Accuracy 81%
### Best Practices
1. **Separate Evaluation**: Evaluate retrieval and end-to-end separately
2. **Create Comprehensive Datasets**: Cover common and edge cases
3. **Evaluate Regularly**: Run evaluations after codebase changes
4. **Track Metrics Over Time**: Monitor improvements across runs (a logging sketch follows this list)
5. **Use Both Metrics**: Precision/Recall for retrieval, Accuracy for end-to-end
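For point 4, a minimal sketch that appends each run's metrics to a log, with `.claude/context/rag_metrics.jsonl` as a hypothetical location:

```python
import json
import time

def log_metrics(metrics: dict, path: str = ".claude/context/rag_metrics.jsonl") -> None:
    """Append a timestamped metrics record so runs can be compared over time."""
    record = {"timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"), **metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_metrics({"precision": 0.44, "recall": 0.69, "f1": 0.54, "mrr": 0.87})
```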
### References
- [RAG Patterns Guide](../docs/RAG_PATTERNS.md) - Implementation patterns
- [Retrieval Metrics](../tools/repo-rag/metrics.py) - Metric calculations
- [End-to-End Evaluation](../tools/repo-rag/evaluation.py) - LLM-as-judge
- [Evaluation Guide](../docs/EVALUATION_GUIDE.md) - Comprehensive evaluation guide
## Memory Protocol (MANDATORY)
**Before starting:**
- Read `.claude/context/memory/learnings.md`
**After completing:**
- New pattern -> `.claude/context/memory/learnings.md`
- Issue found -> `.claude/context/memory/issues.md`
- Decision made -> `.claude/context/memory/decisions.md`
> ASSUME INTERRUPTION: If it's not in memory, it didn't happen.