---
name: knowledge-base-search
description: Search and locate relevant content within a local knowledge base (files, indices, or PageIndex). Use when you need verifiable citations (file + page/paragraph) to support answers from local sources.
license: MIT
author: aipoch
---

> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)

# Knowledge Base Search

## When to Use

- You need to find specific facts, definitions, or procedures from a local knowledge base and return the exact source location.
- You must provide traceable citations (file path + page/paragraph/section) for audit, compliance, or review.
- You need to verify the original wording of a claim in the source document (quote-level validation).
- You want to compare how multiple local documents discuss the same topic and identify differences.
- You need to assemble supporting snippets for a report, FAQ, or internal knowledge response using only local materials.

## Key Features

- Supports multiple retrieval approaches: direct file search, index-based search, and PageIndex-style location mapping.
- Query strategy guidance: keyword splitting, synonym expansion, and optional filters (time range, file type, tags).
- Relevance-oriented result ranking and filtering to keep the most supportive evidence first.
- Outputs verifiable hit snippets with precise citation locations (file + page/paragraph/section when available).
- Enforces local-only boundaries: searches only within authorized directories and does not modify source content.
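The local-only boundary in the last feature could be enforced with a path check along the following lines. This is a minimal sketch, assuming Python; `is_within_scope` and `authorized_roots` are hypothetical names introduced for illustration and are not part of the skill itself.

```python
from pathlib import Path
from typing import Iterable


def is_within_scope(candidate: str, authorized_roots: Iterable[str]) -> bool:
    """Return True if `candidate` resolves to a path inside an authorized root.

    Hypothetical helper: the skill does not prescribe how the boundary is
    enforced. Paths are resolved first so that `..` segments and symlinks
    cannot be used to step outside the authorized directories.
    """
    resolved = Path(candidate).resolve()
    for root in authorized_roots:
        root_resolved = Path(root).resolve()
        # Accept the root itself or anything nested below it.
        if resolved == root_resolved or root_resolved in resolved.parents:
            return True
    return False


if __name__ == "__main__":
    roots = ["./kb"]
    print(is_within_scope("./kb/runbooks/api_key_rotation.md", roots))
    print(is_within_scope("./kb/../secrets.txt", roots))
```

Resolving both sides before comparing is the design choice that matters here: a plain string-prefix test would accept `./kb/../secrets.txt` and would also confuse `./kb` with a sibling such as `./kb2`.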
## Dependencies

- `glob` (>= 10.0.0): file path pattern matching
- `grep` (>= 3.11): in-file text searching
- Local knowledge base index files (one or more of: filename index, content index, vector index, PageIndex mapping)
- `assets/hit_list_template.csv`: standardized hit list output template
- Optional reference: `references/guide.md` (output formats, checklists, inspection points)

## Example Usage

The following example demonstrates an end-to-end local search workflow and produces a CSV hit list compatible with `assets/hit_list_template.csv`.

### Inputs

- Knowledge base root: `./kb/`
- Query: `How do we rotate API keys?`
- Filters: file types `md,pdf`, time range `2024-01-01..2026-12-31`

### Steps

1. **Confirm index and scope**
   - Ensure the search scope is limited to authorized paths (e.g., `./kb/`).
   - Identify available indices:
     - filename/content index (fast keyword search)
     - vector index (semantic retrieval)
     - PageIndex mapping (page/paragraph location resolution)

2. **Build the query**
   - Keywords: `rotate`, `API key`, `key rotation`
   - Synonyms/variants: `credential rotation`, `token rotation`, `regenerate key`
   - Filters:
     - file type: `*.md`, `*.pdf`
     - time range: `2024-01-01..2026-12-31` (if metadata exists)

3. **Execute search (local-only)**
   - Path discovery (example):
     - `glob("./kb/**/*.md")`
     - `glob("./kb/**/*.pdf")`
   - Content search (example):
     - `grep -RIn "API key\|key rotation\|rotate" ./kb/`

4. **Filter and rank results**
   - Keep hits that directly answer the question (procedure, policy, steps, constraints).
   - Rank by:
     - term proximity (e.g., “rotate” near “API key”)
     - section relevance (e.g., “Security”, “Credentials”, “Operations”)
     - coverage (hits that include prerequisites + steps + verification)

5. **Output citations and hit list**
   - For each hit, output:
     - `file_path`
     - `location` (page number for PDFs; heading/paragraph index for Markdown; PageIndex if available)
     - `snippet` (verbatim excerpt supporting the conclusion)
     - `relevance_score` (the ranking score from step 4)
     - `notes` (why it is relevant; any assumptions)
   - Save as `hit_list.csv` using `assets/hit_list_template.csv` columns.

### Example Output (CSV rows)

```csv
file_path,location,snippet,relevance_score,notes
kb/security/credential_policy.pdf,page 12,"API keys must be rotated every 90 days... Rotation requires...",0.92,"Direct policy + rotation interval + procedure reference."
kb/runbooks/api_key_rotation.md,section 'Procedure' ¶3,"To rotate an API key: (1) create a new key... (2) update services... (3) revoke old key...",0.89,"Step-by-step operational runbook."
kb/audit/controls.md,heading 'Key Management' ¶2,"Evidence of rotation includes change tickets and key revocation logs...",0.81,"Provides verification/evidence requirements."
```

## Implementation Details

### Retrieval Workflow

1. **Index confirmation**
   - Determine knowledge base root paths and last update time (if available).
   - Detect which indices exist:
     - filename index: quick narrowing by file names
     - content index: inverted index / grep-like scanning
     - vector index: semantic similarity retrieval
     - PageIndex: mapping from document offsets to page/paragraph identifiers

2. **Query strategy**
   - Tokenize the question into:
     - core entities (e.g., “API key”)
     - actions (e.g., “rotate”, “revoke”, “regenerate”)
     - constraints (e.g., “every 90 days”, “approval required”)
   - Expand with synonyms and variants.
   - Apply filters when metadata exists:
     - time range
     - file type
     - tags/collections

3. **Result filtering and ranking**
   - Remove low-signal hits (navigation, boilerplate, unrelated mentions).
   - Rank by a weighted score (example):
     - **Keyword match** (exact phrase > partial): 0.45
     - **Proximity** (terms close together): 0.20
     - **Section importance** (titles like “Procedure/Policy”): 0.20
     - **Coverage** (answers include steps + constraints + verification): 0.15
   - Keep the original text snippet verbatim for verification.

4. **Citation and location resolution**
   - Markdown/text:
     - use heading + paragraph index (or line range) as the primary locator
   - PDF:
     - use page number; optionally include bounding text around the hit
   - PageIndex (if present):
     - map internal offsets to stable `page/paragraph` identifiers

### Constraints and Limitations

- Search only within user-authorized local directories.
- Do not modify source documents.
- Do not execute scripts or arbitrary code.
- Do not access network resources or external APIs.
- If indices are missing/corrupted, fall back to direct file scanning; if scanning is not possible, report the limitation and required remediation (re-indexing).
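As an illustration of the example weighted score listed under "Result filtering and ranking", the combination step might look like the sketch below. It is written in Python under stated assumptions: the `Hit` class, the `signals` dictionary, and the function names are hypothetical, and each signal is assumed to be already normalized to the range 0 to 1. How the individual signals are computed is left open, since the skill does not specify it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Example weights taken from the ranking step; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "keyword_match": 0.45,
    "proximity": 0.20,
    "section_importance": 0.20,
    "coverage": 0.15,
}


@dataclass
class Hit:
    """Hypothetical container for one search hit."""
    file_path: str
    location: str
    snippet: str  # kept verbatim for later verification
    signals: Dict[str, float] = field(default_factory=dict)


def relevance_score(signals: Dict[str, float]) -> float:
    """Weighted sum of the signals; missing signals count as 0."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = signals.get(name, 0.0)
        # Clamp to [0, 1] so a malformed signal cannot dominate the score.
        total += weight * min(max(value, 0.0), 1.0)
    return round(total, 4)


def rank_hits(hits: List[Hit], min_score: float = 0.0) -> List[Tuple[float, Hit]]:
    """Drop hits scoring below `min_score` and sort the rest, best first."""
    scored = [(relevance_score(hit.signals), hit) for hit in hits]
    kept = [pair for pair in scored if pair[0] >= min_score]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    hits = [
        Hit("kb/audit/controls.md", "heading 'Key Management' ¶2", "...",
            {"keyword_match": 1.0, "proximity": 0.5}),
        Hit("kb/runbooks/api_key_rotation.md", "section 'Procedure' ¶3", "...",
            {"keyword_match": 1.0, "proximity": 1.0,
             "section_importance": 1.0, "coverage": 1.0}),
    ]
    for score, hit in rank_hits(hits, min_score=0.3):
        print(f"{score:.2f}  {hit.file_path}  ({hit.location})")
```

The `min_score` threshold stands in for the "remove low-signal hits" step; a real deployment would likely tune both the threshold and the weights against its own documents.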