---
name: repo-rag
description: Perform high-recall codebase retrieval using semantic search and symbol indexing. Use when you need to find specific code, understand project structure, or verify architectural patterns before editing.
version: 2.0
model: sonnet
invoked_by: both
user_invocable: true
tools: [Read, Grep, Glob]
best_practices:
  - Use clear, specific queries (avoid vague terms)
  - Provide context about what you're looking for
  - Review multiple results to understand patterns
  - Use follow-up queries to refine results
  - Verify file paths before proposing edits
error_handling: graceful
streaming: supported
---

# Repo RAG (Retrieval Augmented Generation)

Provides advanced codebase search capabilities beyond simple grep:

- High-recall codebase retrieval using semantic search
- Symbol indexing for finding classes, functions, and types
- Understanding project structure
- Verifying architectural patterns before editing

## Workflow

1. **Symbol Search First**: Use `symbols` to find classes, functions, and types. This is more accurate than text search for code structures.
2. **Semantic Search**: Use `search` for concepts, comments, or broader patterns.
3. **Verification**: Always verify the file path and context returned before proposing edits.

## Use Cases

- **Architecture Review**: Run symbol searches on key interfaces to understand the dependency graph.
- **Plan Mode**: Use this skill to populate the "Context" section of a Plan Mode artifact.
- **Refactoring**: Identify all usages of a symbol before renaming or modifying it.

## Examples

**Symbol Search**:

```
symbols "UserAuthentication"
```

**Semantic Search**:

```
search "authentication middleware logic"
```

## RAG Evaluation

### Overview

Systematic evaluation of RAG quality using retrieval and end-to-end metrics. Based on Claude Cookbooks patterns.
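At its core, retrieval evaluation compares the chunks a query returns against a known-correct set. A minimal, self-contained sketch of how such metrics could be computed, assuming chunks are compared by file path (the function name, signature, and dict keys here are illustrative assumptions, not the actual `metrics.py` API):

```python
def evaluate_retrieval(retrieved: list[str], correct: list[str]) -> dict:
    """Compare retrieved chunk paths (in rank order) against the correct set."""
    correct_set = set(correct)
    hits = [c for c in retrieved if c in correct_set]  # true positives

    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(correct_set) if correct_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Reciprocal rank of the first correct chunk (0.0 if none was retrieved).
    # MRR proper is this value averaged over all queries in the dataset.
    rr = 0.0
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in correct_set:
            rr = 1.0 / rank
            break

    return {"precision": precision, "recall": recall, "f1": f1, "mrr": rr}


metrics = evaluate_retrieval(
    retrieved=["src/auth/middleware.ts", "src/db/pool.ts", "src/auth/types.ts"],
    correct=["src/auth/middleware.ts", "src/auth/types.ts"],
)
print(metrics)  # precision 2/3, recall 1.0, f1 0.8, mrr 1.0
```

Note that this single-query version returns a reciprocal rank; averaging it over every query in an evaluation dataset yields the MRR figure reported below.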
### Evaluation Metrics

**Retrieval Metrics** (from `.claude/tools/repo-rag/metrics.py`):

- **Precision**: Proportion of retrieved chunks that are actually relevant
  - Formula: `Precision = True Positives / Total Retrieved`
  - High precision (0.8-1.0): the system retrieves mostly relevant items
- **Recall**: Completeness of retrieval, i.e. how many of the relevant items were found
  - Formula: `Recall = True Positives / Total Correct`
  - High recall (0.8-1.0): the system finds most relevant items
- **F1 Score**: Harmonic mean of precision and recall
  - Formula: `F1 = 2 × (Precision × Recall) / (Precision + Recall)`
  - A balanced measure when both precision and recall matter
- **MRR (Mean Reciprocal Rank)**: Measures ranking quality
  - Per-query formula: `RR = 1 / rank of first correct item`; MRR is the mean of RR across all queries
  - High MRR (0.8-1.0): correct items are usually ranked first

**End-to-End Metrics** (from `.claude/tools/repo-rag/evaluation.py`):

- **Accuracy (LLM-as-Judge)**: Overall correctness using Claude evaluation
  - Compares the generated answer to the correct answer
  - Focuses on substance and meaning, not exact wording
  - Checks for completeness and absence of contradictions

### Evaluation Process

1. **Create an Evaluation Dataset**:

   ```json
   {
     "query": "How is user authentication implemented?",
     "correct_chunks": ["src/auth/middleware.ts", "src/auth/types.ts"],
     "correct_answer": "User authentication uses JWT tokens...",
     "category": "authentication"
   }
   ```

2. **Run Retrieval Evaluation**:

   ```python
   # Using Python directly, with the tool directory on the import path
   # (the directory and module names contain characters that block a normal import)
   import sys
   sys.path.insert(0, ".claude/tools/repo-rag")
   from metrics import evaluate_retrieval

   metrics = evaluate_retrieval(retrieved_chunks, correct_chunks)
   print(f"Precision: {metrics['precision']}, Recall: {metrics['recall']}, "
         f"F1: {metrics['f1']}, MRR: {metrics['mrr']}")
   ```
3. **Run End-to-End Evaluation**:

   ```python
   # Using Python directly, with the tool directory on the import path
   import sys
   sys.path.insert(0, ".claude/tools/repo-rag")
   from evaluation import evaluate_end_to_end

   result = evaluate_end_to_end(query, generated_answer, correct_answer)
   print(f"Correct: {result['is_correct']}, Explanation: {result['explanation']}")
   ```

### Expected Performance

Based on Claude Cookbooks results:

- **Basic RAG**: Precision 0.43, Recall 0.66, F1 0.52, MRR 0.74, Accuracy 71%
- **With Re-ranking**: Precision 0.44, Recall 0.69, F1 0.54, MRR 0.87, Accuracy 81%

### Best Practices

1. **Separate Evaluation**: Evaluate retrieval and end-to-end quality separately
2. **Create Comprehensive Datasets**: Cover both common cases and edge cases
3. **Evaluate Regularly**: Re-run evaluations after codebase changes
4. **Track Metrics Over Time**: Monitor improvements and regressions
5. **Use Both Metric Families**: Precision/Recall for retrieval, Accuracy for end-to-end

### References

- [RAG Patterns Guide](../docs/RAG_PATTERNS.md) - Implementation patterns
- [Retrieval Metrics](../tools/repo-rag/metrics.py) - Metric calculations
- [End-to-End Evaluation](../tools/repo-rag/evaluation.py) - LLM-as-judge
- [Evaluation Guide](../docs/EVALUATION_GUIDE.md) - Comprehensive evaluation guide

## Memory Protocol (MANDATORY)

**Before starting:** Read `.claude/context/memory/learnings.md`

**After completing:**

- New pattern -> `.claude/context/memory/learnings.md`
- Issue found -> `.claude/context/memory/issues.md`
- Decision made -> `.claude/context/memory/decisions.md`

> ASSUME INTERRUPTION: If it's not in memory, it didn't happen.