---
name: semantic-grep
description: In-process semantic search over text files or in-memory strings, using Gemini embeddings via the CF AI Gateway. Use when user wants fuzzy/conceptual search where exact-keyword grep would miss — "sessions discussing regulatory constraints", "code about retry logic", "notes mentioning burnout even if the word isn't there". Complements searching-codebases (regex/AST) and extracting-keywords (YAKE). Do NOT use when an exact string/regex match is what's wanted — grep/rg wins on speed and precision there.
metadata:
  version: 0.1.1
---

# Semantic Grep

jina-grep-style semantic search, done in-process via Python rather than as an external CLI. Embeds query + corpus chunks with `gemini-embedding-001`, ranks by cosine similarity, returns grep-format output.

## When Semantic Search Helps

The core trade-off (lifted from `jina-grep-cli`'s own docs and validated in testing):

| Task | Tool |
|------|------|
| Known exact string, filename, or regex | `grep` / `rg` / `searching-codebases` |
| "What files discuss concept X" when X may not appear verbatim | **semantic-grep** |
| Hybrid: prefilter with grep, rerank by concept | grep → `semantic_grep()` over prefiltered `Chunk`s (see Pipe-mode Rerank Pattern) |

**Regression test results (workshop session corpus, 135 docs):**

- *"handling regulatory constraints"* → top hit *"Engineering AI Systems Under Sovereignty Constraints"* (0.67). ✓
- *"sessions about GEPA"* → top hit *"Gemma, DeepMind's Family of Open Models"* (0.69). ✗ — false positive on a phonetic neighbor. GEPA is mentioned verbatim in one session description; grep would find it correctly.

**Rule: when the user query reads like a named entity or keyword, try grep first. Only reach for semantic-grep when paraphrase/concept matching is actually needed.**

## Setup

Credentials via `proxy.env` (Cloudflare AI Gateway w/ BYOK — same pattern as `invoking-gemini`):

```
CF_ACCOUNT_ID=...
CF_GATEWAY_ID=...
CF_API_TOKEN=...
```

Direct-API fallback: `GOOGLE_API_KEY` or `GEMINI_API_KEY` env var. No dependencies beyond `requests` + `numpy`.

## Quick Start

```python
import sys
sys.path.insert(0, '/mnt/skills/user/semantic-grep/scripts')
from semantic_grep import semantic_grep, format_grep

# Directory of .txt files
results = semantic_grep("error handling under load", "/path/to/notes",
                        top_k=5, granularity="paragraph")
print(format_grep(results))
# notes/incidents.txt:42: When the queue depth exceeds... [0.71]
# notes/postmortem.txt:8: Under sustained traffic we saw... [0.68]
```

## Core API

### `semantic_grep(query, corpus, *, top_k=10, threshold=None, ...)`

Main search function.

- `query` *(str)* — the search query (embedded with `RETRIEVAL_QUERY` task type)
- `corpus` *(str | Path | list[Chunk])* — a file, directory, or pre-chunked list
- `top_k` *(int | None)* — max results; `None` = all above threshold
- `threshold` *(float | None)* — cosine similarity cutoff; `None` = no filter (top_k only)
- `granularity` *("paragraph" | "line")* — how to chunk files (default paragraph)
- `include` *(str)* — filename-glob filter when `corpus` is a directory (default `"*.txt"`). Matches against `Path.name` only, not the full path — `"*.md"` works, `"docs/*.md"` does not.
- `model` *(str)* — default `"gemini-embedding-001"`
- `dim` *(int)* — 128 / 768 / 1536 / 3072 (default 768; MRL-truncated + renormalized)
- `task` *("text" | "code")* — selects text vs code task types

Returns `list[Match]` where `Match` has `path`, `line`, `text`, `score`.
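The other knobs compose the same way. A minimal sketch using only the parameters listed above (queries, paths, and globs are placeholders; assumes the module is importable as in Quick Start):

```python
from semantic_grep import semantic_grep, format_grep

# Threshold-only sweep over markdown notes: no top_k cap, keep every
# paragraph scoring above 0.6. `include` matches Path.name, so use a bare glob.
matches = semantic_grep(
    "retry logic with exponential backoff",
    "/path/to/notes",
    top_k=None,
    threshold=0.6,
    include="*.md",
    granularity="paragraph",
)
print(format_grep(matches))

# Natural-language query over source files: code task types, with a smaller
# dim that trades some recall for cheaper/faster embeddings.
code_hits = semantic_grep(
    "where do we debounce user input before saving",
    "/path/to/src",
    include="*.py",
    task="code",
    dim=128,
    top_k=5,
)
print(format_grep(code_hits, max_text_chars=120))
```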
### `load_corpus(path, *, include="*.txt", granularity="paragraph") -> list[Chunk]`

Load and chunk a file or directory without embedding. Useful for inspecting what gets embedded before paying for the API call.

### `embed_batch(texts, task_type, *, model, dim, group_size=100) -> np.ndarray`

Lower-level: embed a list of strings directly via `:batchEmbedContents`. Returns an `(N, dim)` float32 array, rows normalized when `dim < 3072`.

### `format_grep(matches, *, max_text_chars=200, show_score=True) -> str`

Format matches as grep output: `path:line: snippet [score]`.

## Pipe-mode Rerank Pattern

The highest-leverage use isn't naive full-corpus semantic search — it's hybrid retrieval: **fast coarse filter → semantic rerank**.

```python
import subprocess

from semantic_grep import Chunk, semantic_grep, format_grep

# Stage 1: fast exact/regex prefilter with rg
result = subprocess.run(
    ["rg", "-n", "--no-heading", "error|fail|timeout", "logs/"],
    capture_output=True, text=True,
)

# Parse `path:line:text` into Chunks
chunks = []
for raw in result.stdout.splitlines():
    path, line, text = raw.split(":", 2)
    chunks.append(Chunk(path=path, line=int(line), text=text))

# Stage 2: semantic rerank on the prefiltered subset
ranked = semantic_grep("intermittent queue saturation during peak traffic",
                       chunks, top_k=10)
print(format_grep(ranked))
```

This is how you scale past the "embed the whole corpus every call" limit without needing a vector DB. The exact-match stage cheaply cuts millions of lines to thousands; the semantic stage reranks those.

## Task Types (Gemini)

- **text mode** (default): query → `RETRIEVAL_QUERY`, docs → `RETRIEVAL_DOCUMENT`. Asymmetric — documented to outperform symmetric encoding for retrieval.
- **code mode**: query → `CODE_RETRIEVAL_QUERY`, docs → `RETRIEVAL_DOCUMENT`. Use when searching code with natural-language queries.

Use `SEMANTIC_SIMILARITY` (symmetric) only if you're doing pairwise similarity, not retrieval. This module doesn't expose that path yet.

## Model Notes

`gemini-embedding-001` (GA since Feb 2026):

- 2,048 input token limit per text. Longer texts are truncated at ~8K chars (approximation).
- Matryoshka (MRL) — 3072 native dims, safely truncatable to 1536/768/256/128.
- 3072 is auto-normalized; lower dims need client-side renormalization (handled here).
- Pricing: $0.15 / 1M input tokens. 135 medium paragraphs ≈ 15K tokens ≈ $0.002 per query.

`gemini-embedding-2-preview` (March 2026) is multimodal and currently top of MTEB. Set `model="gemini-embedding-2-preview"` to opt in once the preview stabilizes.

## Limitations (v0.1.1)

- **No persistent index.** Every call re-embeds the corpus. Fine for <~1K chunks; prohibitive for real knowledge bases. Phase 2: cache embeddings by content hash (see the sketch after this list).
- **Token budget is approximated by char count (×1.5).** Conservative for mixed-script text; over-truncates English slightly. A real tokenizer would use the Gemini tokenizer endpoint but would cost an extra call per embed.
- **Whole-batch failure.** If one text in a group of 100 overflows or is rejected by safety filters, the whole batch fails and the 99 good ones are lost. No per-index fallback yet.
- **No memory ceiling on corpus size.** `semantic_grep` pre-allocates `(N, dim)` float32; 1M chunks at dim=768 ≈ 3GB. The caller is responsible for sane chunk counts. `load_corpus` also follows symlinks via `rglob` — fine in a trusted single-user container, not for untrusted paths.
- **Sequential batch groups.** `group_size=100` per HTTP call; groups run serially. For >1K chunks, add asyncio — not needed yet.
- **No CLI shim.** Called as a Python module, not a subprocess. Per design: "within an LLM rather than calling out to one."
- **Embedding function lives here, not in `invoking-gemini`.** Should be factored up when invoking-gemini adds embedding support. Tracked as a followup.
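A minimal sketch of the Phase 2 caching idea referenced above, built on `embed_batch` as documented; the cache location, key scheme, and helper name are hypothetical:

```python
# Hypothetical content-hash cache around embed_batch (Phase 2 sketch).
# Only un-cached chunks hit the API; an unchanged corpus is embedded once.
import hashlib
from pathlib import Path

import numpy as np

from semantic_grep import embed_batch  # documented above

CACHE_DIR = Path.home() / ".cache" / "semantic-grep"  # illustrative location

def embed_with_cache(texts, task_type, *, model="gemini-embedding-001", dim=768):
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    # Key on everything that changes the vector: model, dim, task type, text.
    keys = [
        hashlib.sha256(f"{model}:{dim}:{task_type}:{t}".encode()).hexdigest()
        for t in texts
    ]
    vecs = [None] * len(texts)
    missing_idx = []
    for i, key in enumerate(keys):
        cached = CACHE_DIR / f"{key}.npy"
        if cached.exists():
            vecs[i] = np.load(cached)
        else:
            missing_idx.append(i)
    if missing_idx:
        fresh = embed_batch([texts[i] for i in missing_idx], task_type,
                            model=model, dim=dim)
        for i, row in zip(missing_idx, fresh):
            np.save(CACHE_DIR / f"{keys[i]}.npy", row)
            vecs[i] = row
    return np.stack(vecs).astype(np.float32)
```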
## Related Skills

- `invoking-gemini` — sibling; handles Gemini text + image generation through the same CF gateway. Shares credential pattern.
- `searching-codebases` — regex/AST search. Use first when the query is a known pattern.
- `extracting-keywords` — YAKE keyword extraction; orthogonal, but pairs well for building query terms from a long prompt.
- `exploring-codebases` — for understanding repo structure. Semantic-grep doesn't replace AST-based navigation.

## Attribution

Conceptually inspired by [`jina-grep-cli`](https://github.com/jina-ai/jina-grep-cli) — we kept the retrieval shape (grep-compatible output, asymmetric query/doc embeddings, threshold + top-k) but swapped the MLX/Apple-Silicon backend for a portable Gemini API call. The original's pipe-mode rerank pattern is the most generalizable idea it contributes and is preserved here.