# Vector similarity search, embeddings, RAG

The README's **[Vector similarity search](../README.md#vector-similarity-search)** chapter covers the API surface: `f.vector(N, { metric })`, the `near` filter, the `nearTo` orderBy, the per-dialect column types and operators, the dimension-check at IR build time. Read it first.

This doc is the long-form companion. It picks a dialect, walks a complete RAG pipeline (chunk → embed → store → retrieve → rerank → prompt), shows hybrid BM25 + vector search through forge's `.searchable()` and `nearTo`, covers embedding versioning when you swap models, tunes HNSW / IVFFlat indexes, explains halfvec / binary quantization, stores CLIP image vectors next to text, batches embeddings at scale, and bolts on MRR / nDCG regression tests that fail CI when recall drops.

Numbers are from public benchmarks; treat them as orientation, not promises for your workload.

## Contents

* [Picking a dialect](#picking-a-dialect)
* [End-to-end RAG with Postgres + pgvector](#end-to-end-rag-with-postgres--pgvector)
* [Hybrid search — BM25 + vector with RRF](#hybrid-search--bm25--vector-with-rrf)
* [Embedding versioning](#embedding-versioning)
* [Index tuning — HNSW and IVFFlat](#index-tuning--hnsw-and-ivfflat)
* [Quantization — halfvec and binary](#quantization--halfvec-and-binary)
* [Multi-modal vectors — text + CLIP](#multi-modal-vectors--text--clip)
* [Streaming insertion at scale](#streaming-insertion-at-scale)
* [Eval and guardrails — MRR, nDCG, embedding cache](#eval-and-guardrails--mrr-ndcg-embedding-cache)
* [Cost model](#cost-model)

---

## Picking a dialect

The same `f.vector(N)` schema works against six engines, but the runtime behaviour at 100k / 1M / 10M vectors is not the same. Pick by corpus size, recall target, and where the rest of your data already lives.

| Engine | Index | 100k p95 | 1M p95 | 10M p95 | Recall @ 10 | Write throughput | Cost shape |
|---|---|---|---|---|---|---|---|
| pgvector (HNSW) | `USING hnsw (col vector_cosine_ops)` | 3-8 ms | 8-20 ms | 30-80 ms | 0.97-0.99 | 5-10k rows/s | Whatever your PG host charges. Pure storage + CPU. |
| pgvector (IVFFlat) | `USING ivfflat (col vector_cosine_ops) WITH (lists = N)` | 5-15 ms | 15-40 ms | 60-150 ms | 0.90-0.96 (probes=10) | 8-15k rows/s | Same. IVFFlat is half the build time of HNSW. |
| sqlite-vec (vec0) | `CREATE VIRTUAL TABLE … USING vec0(...)` | 2-6 ms | 20-60 ms | not viable | brute-force = 1.0, ANN = 0.95-0.98 | 20-40k rows/s | Local file. Zero infra cost. |
| MongoDB Atlas Vector Search | `createSearchIndex({ type: 'vectorSearch' })` | 10-30 ms | 20-60 ms | 50-150 ms | 0.95-0.98 | 3-8k rows/s | Atlas-only. Search nodes priced per RAM hour. |
| MySQL HeatWave Vector | `SECONDARY_ENGINE = RAPID` | 4-10 ms | 10-25 ms | 40-100 ms | 0.96-0.99 | 5-10k rows/s | OCI HeatWave SKU. Community MySQL = exact only. |
| DuckDB (vss) | `USING HNSW` | 2-5 ms | 8-15 ms | 30-60 ms | 0.96-0.99 | 30-60k rows/s | Local / in-process. Cost = the box it runs on. |
| MSSQL 2025 / Azure SQL | `USING VECTOR WITH (algorithm = 'HNSW')` | 5-12 ms | 15-30 ms | 50-120 ms | 0.96-0.99 | 6-12k rows/s | Azure SQL Hyperscale or 2025 license. |

Decision shortcuts:

- **You already run Postgres.** Stop reading the table. pgvector with HNSW is the default. Supabase, Neon, RDS, Crunchy, Aiven all ship it.
- **App-local search (offline, edge, browser).** sqlite-vec via forge's wasm adapter ([Browser (sqlite-wasm + OPFS)](../README.md#browser-sqlite-wasm--opfs)). Holds up to a few hundred thousand vectors per tab. Beyond that, route to a server.
- **Operational data is already in Mongo.** Atlas Vector Search keeps the vector next to the document, no second store. forge auto-routes the `near` filter to `$vectorSearch`. Off-Atlas Mongo has no native ANN: brute-force or move the embeddings out.
- **Analytical / OLAP workload.** DuckDB with the `vss` extension. Great if the embeddings are part of an analytics pipeline (parquet ingest, windowed aggregates, embedding-as-feature). Not a serving DB.
- **Microsoft stack.** SQL Server 2025 ships HNSW natively. Don't add a separate vector DB if the data is already there.

The rule of thumb for "do I need ANN?" is **brute-force is fine up to roughly 10k vectors at 1536 dim**. Below that, every dialect runs a linear scan at sub-50 ms and recall is exactly 1.0. ANN starts paying off around 100k.

---

## End-to-end RAG with Postgres + pgvector

The pipeline below is the one you'd put behind a `/chat` endpoint. It ingests documents, embeds with OpenAI's `text-embedding-3-small`, stores through forge, retrieves with `nearTo`, re-ranks with a local cross-encoder, and composes the prompt. Every step fits in one file.

```ts
// rag.ts
import { createDb, f, model, postgresDriver } from 'forge-orm';

// 1) Schema. The dims must match the embedding model exactly:
//    f.vector(1536) throws an IR error if you hand it a 1024-vector.
const Chunk = model('chunks', {
  id:          f.id(),
  doc_id:      f.string(),
  ord:         f.int(),
  body:        f.text().searchable(),   // also indexed for BM25 / FTS hybrid
  token_count: f.int(),
  embedding:   f.vector(1536, { metric: 'cosine' }),
  model_id:    f.string().default('openai/text-embedding-3-small'),
  created_at:  f.dateTime().default('now'),
}, {
  indexes: [
    { keys: { embedding: 1 }, method: 'vector',
      // pgvector HNSW knobs; see "Index tuning" below.
      options: { m: 16, ef_construction: 64 } },
    { keys: { doc_id: 1, ord: 1 } },
  ],
});

export const db = createDb({
  driver: postgresDriver(process.env.PG_URL!),
  schema: { chunk: Chunk },
});

// 2) Chunking. Two strategies: token windows for prose, paragraph splits
//    for structured docs. The token-window version is the safer default.
function chunkByTokens(text: string, target = 400, overlap = 50): string[] {
  // Approximate: 1 token ~= 4 chars for English.
  // Replace with tiktoken for production. The overlap stitches context
  // across chunk boundaries so a sentence split mid-thought is still findable.
  const charSize = target * 4;
  const charOverlap = overlap * 4;
  const out: string[] = [];
  for (let i = 0; i < text.length; i += charSize - charOverlap) {
    out.push(text.slice(i, i + charSize));
  }
  return out;
}

function chunkByParagraph(text: string, maxChars = 1600): string[] {
  // Better for markdown / structured prose. Falls back to token splits
  // when a paragraph blows past `maxChars` (long code blocks, tables).
  const paras = text.split(/\n{2,}/).map((p) => p.trim()).filter(Boolean);
  const out: string[] = [];
  for (const p of paras) {
    if (p.length <= maxChars) { out.push(p); continue; }
    out.push(...chunkByTokens(p, 400, 50));
  }
  return out;
}

// 3) Embedding. One function, three providers. Pick at runtime so you can
//    A/B them on the same corpus. The three models emit different dims
//    (1536 / 1024 / 384), so each needs a vector column sized for it.
type EmbeddingProvider = 'openai' | 'voyage' | 'local';

async function embed(texts: string[], provider: EmbeddingProvider): Promise<number[][]> {
  if (provider === 'openai') {
    const res = await fetch('https://api.openai.com/v1/embeddings', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ model: 'text-embedding-3-small', input: texts }),
    });
    if (!res.ok) throw new Error(`openai ${res.status}: ${await res.text()}`);
    const json = await res.json() as { data: { embedding: number[] }[] };
    return json.data.map((d) => d.embedding);
  }
  if (provider === 'voyage') {
    const res = await fetch('https://api.voyageai.com/v1/embeddings', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.VOYAGE_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ model: 'voyage-3', input: texts, input_type: 'document' }),
    });
    if (!res.ok) throw new Error(`voyage ${res.status}: ${await res.text()}`);
    const json = await res.json() as { data: { embedding: number[] }[] };
    return json.data.map((d) => d.embedding);
  }
  // Local: @xenova/transformers running in-process.
  // ~10x slower than hosted, but no per-token bill and no network hop.
  // Use for offline, cost-sensitive, or PII-restricted workloads.
  const { pipeline } = await import('@xenova/transformers');
  const pipe = await pipeline('feature-extraction', 'Xenova/bge-small-en-v1.5');
  const out: number[][] = [];
  for (const t of texts) {
    const tensor = await pipe(t, { pooling: 'mean', normalize: true });
    out.push(Array.from(tensor.data as Float32Array));
  }
  return out;
}

// 4) Ingest. Embed in batches of 100. OpenAI's per-request limit is 2048
//    inputs, but smaller batches are cheaper to retry and easier to track
//    for cost.
export async function ingest(docId: string, text: string) {
  const chunks = chunkByParagraph(text);
  const batchSize = 100;
  for (let i = 0; i < chunks.length; i += batchSize) {
    const slice = chunks.slice(i, i + batchSize);
    const vectors = await embed(slice, 'openai');
    await db.chunk.createMany({
      data: slice.map((body, j) => ({
        doc_id: docId,
        ord: i + j,
        body,
        token_count: Math.ceil(body.length / 4),
        embedding: vectors[j],
        model_id: 'openai/text-embedding-3-small',
      })),
    });
  }
}

// 5) Retrieve. `near` does the filter (only return rows within 0.4 cosine
//    distance), `nearTo` does the sort. forge attaches `_distance` to
//    every row in the result.
export async function retrieve(query: string, k = 20) {
  const [qVec] = await embed([query], 'openai');
  return db.chunk.findMany({
    where: { embedding: { near: { vector: qVec, withinDistance: 0.4 } } },
    orderBy: { embedding: { nearTo: qVec } },
    take: k,
  });
}

// 6) Re-rank with a cross-encoder. Bi-encoders (the embedding model) score
//    query and doc independently; cross-encoders score the pair jointly,
//    so they're typically 10-30 points more accurate at the top of the
//    list. Use them only on the top-k retrieved set; they're 50-100x slower.
async function rerank(query: string, candidates: { id: string; body: string }[]) {
  if (!candidates.length) return [];
  // Feed (query, doc) pairs through the tokenizer's `text_pair` input and
  // rank by the raw relevance logit. Softmax over a single-logit head is
  // constant, so the model is called directly rather than through a
  // classification pipeline.
  const { AutoTokenizer, AutoModelForSequenceClassification } =
    await import('@xenova/transformers');
  const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bge-reranker-base');
  const ce = await AutoModelForSequenceClassification.from_pretrained('Xenova/bge-reranker-base');
  const inputs = tokenizer(candidates.map(() => query), {
    text_pair: candidates.map((c) => c.body),
    padding: true,
    truncation: true,
  });
  const { logits } = await ce(inputs);
  const scores = Array.from(logits.data as Float32Array);
  return candidates
    .map((c, i) => ({ ...c, score: scores[i] }))
    .sort((a, b) => b.score - a.score);
}

// 7) Compose. Retrieve 20, rerank to top-5, stuff a prompt.
export async function answer(query: string) {
  const raw = await retrieve(query, 20);
  const reranked = await rerank(query, raw);
  const context = reranked.slice(0, 5).map((c, i) => `[${i + 1}] ${c.body}`).join('\n\n');
  return {
    prompt: `Answer using the sources below. Cite as [N].\n\n${context}\n\nQ: ${query}\nA:`,
    sources: reranked.slice(0, 5),
  };
}
```

A few things this skips that production needs:

- **Retry / rate-limit.** OpenAI returns 429s under load. Wrap `embed` with an exponential-backoff retry; don't bury that in the ingest loop.
- **Idempotency.** `ingest()` re-embeds every call. The cache pattern in [Eval and guardrails](#eval-and-guardrails--mrr-ndcg-embedding-cache) fixes that: content-hash key, skip if already embedded.
- **Deletion.** Re-ingesting a document needs a `db.chunk.deleteMany({ where: { doc_id } })` upfront, otherwise you stack duplicate chunks every run.

---

## Hybrid search — BM25 + vector with RRF

Pure vector search misses exact-match queries: model numbers, rare tokens, acronyms. Pure BM25 misses paraphrase: "how do I refund a charge" vs "reverse a payment". Reciprocal Rank Fusion (RRF) scores each document by summing `1 / (k + rank)` over every list it appears in (rank is 1-based), then re-sorts by that sum. `k = 60` is the canonical constant from the original Cormack paper; it's robust across corpora.

Postgres version. `f.text().searchable()` already gives a tsvector column and a GIN index; see [Full-text search](../README.md#full-text-search).
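
The fusion step itself does not depend on the dialect. As a minimal sketch, it can be pulled out into a pure function and unit-tested without a database. `rrfFuse` is a hypothetical helper name for illustration, not part of forge, and it assumes each input list is already sorted best-first with no duplicate ids.

```typescript
// Reciprocal Rank Fusion over any number of ranked id lists.
// Hypothetical helper for illustration; not a forge API.
function rrfFuse(lists: string[][], k = 60, limit = 20): string[] {
  const score = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, i) => {
      // i is 0-based, so the 1-based rank is i + 1.
      score.set(id, (score.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...score.entries()]
    .sort((a, b) => b[1] - a[1])   // highest fused score first
    .slice(0, limit)
    .map(([id]) => id);
}

// "b" appears in both lanes, so it outranks either lane's top hit.
const fused = rrfFuse([['a', 'b', 'c'], ['b', 'd']]);
console.log(fused); // [ 'b', 'a', 'd', 'c' ]
```

A document found by both lanes collects two terms, which is why agreement between lanes matters more than a high position in one of them. The Postgres function that follows runs both lanes and inlines the same fusion.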

```ts
async function hybrid(query: string, limit = 20) {
  const [qVec] = await embed([query], 'openai');

  // Lane A: semantic. forge handles this end-to-end.
  const semantic = await db.chunk.findMany({
    orderBy: { embedding: { nearTo: qVec } },
    take: 50,
  });

  // Lane B: lexical. websearch_to_tsquery handles "phrase queries" and
  // -negation naturally. The .searchable() column is `body_tsv` by default.
  const lexical = await db.$queryRaw<{ id: string; rank: number }[]>`
    SELECT id, ts_rank(body_tsv, websearch_to_tsquery('english', ${query})) AS rank
    FROM chunks
    WHERE body_tsv @@ websearch_to_tsquery('english', ${query})
    ORDER BY rank DESC
    LIMIT 50
  `;

  // RRF: collapse two ranked lists into one. k=60 is the standard.
  // Array indexes are 0-based, so the 1-based rank is i + 1.
  const k = 60;
  const score = new Map<string, number>();
  semantic.forEach((r, i) => score.set(r.id, (score.get(r.id) ?? 0) + 1 / (k + i + 1)));
  lexical.forEach((r, i) => score.set(r.id, (score.get(r.id) ?? 0) + 1 / (k + i + 1)));

  const ids = [...score.entries()].sort((a, b) => b[1] - a[1]).slice(0, limit).map(([id]) => id);

  // Hydrate. `in` does not preserve order; re-sort with the ids array.
  const rows = await db.chunk.findMany({ where: { id: { in: ids } } });
  return ids.map((id) => rows.find((r) => r.id === id)!).filter(Boolean);
}
```

SQLite version. Pair fts5 (forge's `.searchable()` on sqlite) with `sqlite-vec`. Same RRF code path on top.

```ts
async function hybridSqlite(query: string, limit = 20) {
  const [qVec] = await embed([query], 'openai');

  // Lane A: sqlite-vec ANN. The `embedding_vec` mirror table is created
  // out-of-band when sqlite-vec is installed; forge writes into it via
  // the trigger pattern described in the README's vector chapter.
  const semantic = await db.$queryRaw<{ id: string; dist: number }[]>`
    SELECT chunks.id AS id, v.distance AS dist
    FROM embedding_vec v
    JOIN chunks ON chunks.id = v.rowid
    WHERE v.embedding MATCH ${JSON.stringify(qVec)}
    ORDER BY v.distance
    LIMIT 50
  `;

  // Lane B: fts5.
  // `chunks_fts` is the contentless mirror forge generates
  // for .searchable() columns on sqlite. bm25() is lower-is-better,
  // hence the ascending sort.
  const lexical = await db.$queryRaw<{ id: string; rank: number }[]>`
    SELECT chunks.id AS id, bm25(chunks_fts) AS rank
    FROM chunks_fts
    JOIN chunks ON chunks.id = chunks_fts.rowid
    WHERE chunks_fts MATCH ${query}
    ORDER BY rank
    LIMIT 50
  `;

  const k = 60;
  const score = new Map<string, number>();
  semantic.forEach((r, i) => score.set(r.id, (score.get(r.id) ?? 0) + 1 / (k + i + 1)));
  lexical.forEach((r, i) => score.set(r.id, (score.get(r.id) ?? 0) + 1 / (k + i + 1)));

  const ids = [...score.entries()].sort((a, b) => b[1] - a[1]).slice(0, limit).map(([id]) => id);
  const rows = await db.chunk.findMany({ where: { id: { in: ids } } });
  return ids.map((id) => rows.find((r) => r.id === id)!).filter(Boolean);
}
```

Typical lift over pure vector: +6 to +12 nDCG@10 on technical corpora (docs, code), +2 to +4 on conversational. The wins are concentrated on queries containing identifiers: SKUs, error codes, function names.

---

## Embedding versioning

You will swap models. `text-embedding-3-small` was the answer in 2024; a year later there's something cheaper, faster, or 4 points more accurate. Two designs for handling that without a flag-day re-embed.

**Single table, model id column.** The pattern the RAG example above already uses. Every row records which model produced its vector. Queries filter by `model_id` before the `nearTo` so you never compare vectors from different embedding spaces.

```ts
const Chunk = model('chunks', {
  id:        f.id(),
  body:      f.text(),
  // Every vector in this column must be exactly 1536 dims (forge rejects
  // a mismatch at IR build time), so sharing one column only works for
  // models with the same dimension. Use a separate column per model
  // when dims differ.
  embedding: f.vector(1536, { metric: 'cosine' }),
  model_id:  f.string(),
}, {
  indexes: [
    // Partial-filter HNSW: one index per model. The `where` clause
    // (forge spells it `partialFilterExpression`) is critical: without
    // it, the index mixes vectors from incompatible spaces.
    { keys: { embedding: 1 }, method: 'vector', name: 'idx_emb_v3small',
      partialFilterExpression: { model_id: 'openai/text-embedding-3-small' } },
    // Second model must also emit 1536 dims to share this column.
    { keys: { embedding: 1 }, method: 'vector', name: 'idx_emb_ada002',
      partialFilterExpression: { model_id: 'openai/text-embedding-ada-002' } },
  ],
});

// Retrieval: pin the model.
await db.chunk.findMany({
  where: {
    model_id: 'openai/text-embedding-3-small',
    embedding: { near: { vector: qVec, withinDistance: 0.4 } },
  },
  orderBy: { embedding: { nearTo: qVec } },
  take: 10,
});
```

**Two columns when dims differ.** Cohere embed-v3 is 1024-dim, text-embedding-3-small is 1536-dim. Storing both in one `f.vector(1536)` column means padding the narrower vectors or having them rejected. Use two columns.

```ts
const Chunk = model('chunks', {
  id:       f.id(),
  body:     f.text(),
  emb_1536: f.vector(1536, { metric: 'cosine' }).optional(),
  emb_1024: f.vector(1024, { metric: 'cosine' }).optional(),
}, {
  indexes: [
    { keys: { emb_1536: 1 }, method: 'vector' },
    { keys: { emb_1024: 1 }, method: 'vector' },
  ],
});
```

**Migration plan when you swap.**

1. Add the new column (`emb_1024`) as `.optional()`. Existing rows are `NULL`, no rewrite.
2. Backfill in batches: a worker reads `where: { emb_1024: null }`, embeds, updates. Cap each pass with `take: 1000` so a batch fits in memory and stays inside the provider's rate-limit window.
3. Dual-read for a week. Compare hit-rates between the two columns on a labelled eval set. If new beats old by your threshold (typically +1 nDCG@10), cut over.
4. Drop the old column with a separate migration. Keep it around for at least one release cycle: rollback is "switch the column name in one place" if you keep both.

Cost of carrying both: storage doubles, write throughput halves (two embed calls per chunk). Cheap insurance for a week.

---

## Index tuning — HNSW and IVFFlat

The two ANN index families on pgvector have different knobs and different tradeoffs. Picking the wrong defaults is a common cause of "my-vector-search-is-slow" tickets.

**HNSW.** Hierarchical Navigable Small World. Graph index.
Build is expensive, query is fast, recall is high. Two build-time parameters and one runtime setting:

- `m`: neighbors per node. Default 16. Larger = better recall, more memory, slower build. Set 24-32 for high-recall regimes (>0.99). Recall degrades quickly below 16.
- `ef_construction`: candidate list size at build. Default 64. Larger = better graph quality, slower build. 128 is a sane "I want recall but not at any cost" setting.
- `ef_search` (runtime): `SET hnsw.ef_search = 100;`. Default 40. Larger = better recall at query time, slower. Tune per query.

```ts
const Chunk = model('chunks', {
  id:        f.id(),
  embedding: f.vector(1536, { metric: 'cosine' }),
}, {
  indexes: [
    { keys: { embedding: 1 }, method: 'vector',
      options: { m: 16, ef_construction: 64 } },
  ],
});

// At query time, push ef_search up for high-recall paths. SET LOCAL only
// lasts until the end of the current transaction, so it and the query
// must run inside the same transaction on the same connection.
await db.$queryRaw`SET LOCAL hnsw.ef_search = 100`;
const top = await db.chunk.findMany({
  orderBy: { embedding: { nearTo: qVec } },
  take: 50,
});
```

Memory budget for HNSW: roughly `(dims * 4 + m * 8) * rows` bytes per index. At 1M rows / 1536 dim / m=16, that's ~6 GB. Plan accordingly.

**IVFFlat.** Inverted-file with flat lists. Cheap to build, smaller in memory, lower recall ceiling, but the `probes` knob lets you trade quality for latency live.

- `lists`: number of clusters. **Rule of thumb (the pgvector starting point): `N / 1000` for up to 1M rows, `sqrt(N)` above 1M.** 100k rows ≈ 100 lists. 10M rows ≈ 3162.
- `probes` (runtime): how many lists to scan per query. Default 1 (recall ~0.7-0.8). Bump to 10 for recall ~0.95, 20 for ~0.98.

```ts
// Schema fragment: goes inside the model options.
indexes: [
  { keys: { embedding: 1 }, method: 'vector', options: { lists: 100 } },   // 100k rows
];

// Like hnsw.ef_search, SET LOCAL needs the surrounding transaction.
await db.$queryRaw`SET LOCAL ivfflat.probes = 10`;
```

**HNSW vs IVFFlat.** HNSW wins on the recall-latency frontier almost always. IVFFlat wins when (a) you ingest faster than you query, since its build takes half as long as HNSW's, or (b) memory is tight, since IVFFlat indexes are ~30% smaller.

**sqlite-vec.** ANN support is recent and partial. `vec0` brute-forces under ~50k rows comfortably.
Above that, partition by metadata (`tenant_id`, `lang`, time window) to keep each shard small. There's no HNSW yet; on sqlite you live with brute-force on the recent slice and offload cold data to a server.

**Brute-force fallback path.** Even with an index, you sometimes want to guarantee recall = 1.0: eval runs, regression suites, "what did we actually retrieve" debugging. Forge gives you the escape hatch through `$queryRaw`:

```ts
const exact = await db.$queryRaw<{ id: string; dist: number }[]>`
  SELECT id, embedding <=> ${qVec}::vector AS dist
  FROM chunks
  ORDER BY dist
  LIMIT 10
`;
// If an ANN index exists on `embedding`, the planner can still choose it
// for this ORDER BY ... LIMIT. To force the exact scan, run the query in
// a transaction after `SET LOCAL enable_indexscan = off`.
```

---

## Quantization — halfvec and binary

pgvector 0.7 added `halfvec` (fp16, 2 bytes per dim) and `bit` (1 bit per dim) variants. Storage drops 2x and 32x respectively. Recall drops too.

**halfvec.** Safe on most text embeddings. Recall loss is in the noise, typically <0.5 points on nDCG@10. Storage halves; index build is 30-40% faster.

```ts
const Chunk = model('chunks', {
  id:        f.id(),
  embedding: f.vector(1536, { metric: 'cosine', precision: 'half' }),
}, {
  indexes: [{ keys: { embedding: 1 }, method: 'vector' }],
});
```

Forge emits `halfvec(1536)` for Postgres when `precision: 'half'` is set and falls back to `vector(N)` on dialects that don't support it. The `near` / `nearTo` API is identical.

**Binary.** On a binary vector, cosine is approximated by Hamming distance, which is a single XOR + popcount per pair. 100x faster scans, 32x smaller. Recall loss is significant: 3-10 nDCG@10 points depending on the embedding model. Use as a coarse filter, not as the final answer:

```ts
// Two-stage: binary recall N=1000 candidates, then fp16 rerank N=10.
// toBinary() is a helper that is not shown here: it produces the bit string
// for the query, one bit per dimension (for example the sign of each value).
const candidates = await db.$queryRaw<{ id: string }[]>`
  SELECT id
  FROM chunks
  ORDER BY embedding_bit <~> ${toBinary(qVec)}::bit(1536)
  LIMIT 1000
`;
const ids = candidates.map((c) => c.id);
const reranked = await db.chunk.findMany({
  where: { id: { in: ids } },
  orderBy: { embedding: { nearTo: qVec } },   // fp16 / fp32 column
  take: 10,
});
```

The binary column is a separate `f.string()` (forge stores `bit(N)` via the raw column escape hatch); there's no first-class `f.binaryVector()` yet. Key it with a content hash so you only recompute the bits when the source vector changes.

Rule: **don't run a halfvec experiment on the production index without an A/B comparison on your eval set.** Most loss numbers in the pgvector docs are MS MARCO; your corpus probably isn't.

---

## Multi-modal vectors — text + CLIP

CLIP image embeddings are 512-dim (`openai/clip-vit-base-patch32`) or 768-dim (`openai/clip-vit-large-patch14`). Different model, different space; you can't `nearTo` a CLIP vector against a text-embedding-3 vector and get sensible results. Two designs.

**Polymorphic table.** Text and image rows in one model, separate columns per modality, a `modality` discriminator column.

```ts
const Asset = model('assets', {
  id:        f.id(),
  modality:  f.string(),     // 'text' | 'image'
  caption:   f.text().searchable(),
  image_url: f.string().optional(),
  text_emb:  f.vector(1536, { metric: 'cosine' }).optional(),
  // 512 dims = clip-vit-base-patch32, the encoder used for queries below.
  // Size this at 768 if you embed with clip-vit-large-patch14 instead.
  image_emb: f.vector(512, { metric: 'cosine' }).optional(),
}, {
  indexes: [
    { keys: { text_emb: 1 }, method: 'vector',
      partialFilterExpression: { modality: 'text' } },
    { keys: { image_emb: 1 }, method: 'vector',
      partialFilterExpression: { modality: 'image' } },
  ],
});
```

The partial-filter pattern keeps the index dense. Without it, every text row carries a NULL slot in the image index and vice versa.

**Cross-modal retrieval.** CLIP places text and image in the same space when both are encoded by CLIP's encoders (not the OpenAI text embedding model).
For "find images matching this query":

```ts
async function imageSearch(query: string) {
  // Use CLIP's text encoder with its projection head, not text-embedding-3.
  // The full CLIP model expects pixel inputs too, so load the text tower on
  // its own rather than through a feature-extraction pipeline.
  const { AutoTokenizer, CLIPTextModelWithProjection } = await import('@xenova/transformers');
  const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clip-vit-base-patch32');
  const txtEnc = await CLIPTextModelWithProjection.from_pretrained('Xenova/clip-vit-base-patch32');
  const { text_embeds } = await txtEnc(tokenizer([query], { padding: true, truncation: true }));
  // The output width must match the dims declared on image_emb.
  const qVec = Array.from(text_embeds.data as Float32Array);

  return db.asset.findMany({
    where: {
      modality: 'image',
      image_emb: { near: { vector: qVec, withinDistance: 0.5 } },
    },
    orderBy: { image_emb: { nearTo: qVec } },
    take: 20,
  });
}
```

Cosine thresholds for CLIP sit on a different scale from text embeddings: 0.5 ≈ "loosely related" where 0.5 on text-embedding-3 would be near-noise. Calibrate per model on a labelled set.

---

## Streaming insertion at scale

A million chunks at 1536 dim is ~6 GB raw, ~12 GB after the HNSW index. You don't insert that with a `for` loop. Two patterns.

**Batched createMany with parallelised embedding.** The embedding API call is the bottleneck, not the database insert. Pipeline them.

```ts
async function bulkIngest(rows: { doc_id: string; body: string }[]) {
  const batchSize = 100;
  const concurrency = 8;
  const queue = [...rows];

  // Each worker pulls the next batch off the shared queue until it is empty.
  const workers = Array.from({ length: concurrency }, async () => {
    while (queue.length) {
      const batch = queue.splice(0, batchSize);
      const vectors = await embed(batch.map((r) => r.body), 'openai');
      await db.chunk.createMany({
        data: batch.map((r, i) => ({
          doc_id: r.doc_id,
          // Position within this batch only; carry a per-document ordinal
          // on `rows` if chunk order matters downstream.
          ord: i,
          body: r.body,
          token_count: Math.ceil(r.body.length / 4),
          embedding: vectors[i],
          model_id: 'openai/text-embedding-3-small',
        })),
      });
    }
  });
  await Promise.all(workers);
}
```

Throughput on a `c7g.xlarge`-shaped box with OpenAI as the embedder: roughly 8-12k rows/min, gated by the embedding API. With a local `bge-small-en-v1.5` on the same box: 30-50k rows/min.

**Deferred index build.** HNSW build at insert time roughly halves throughput.
For one-shot loads, drop the index first, bulk-load, rebuild once at the end. Forge exposes the rebuild through `$queryRaw`:

```ts
// 1. Drop the vector index.
await db.$queryRaw`DROP INDEX IF EXISTS chunks_embedding_idx`;

// 2. Bulk insert. Index-free inserts on pgvector run at ~30-50k rows/s
//    per worker on a beefy host.
await bulkIngest(rows);

// 3. Rebuild the index. pgvector 0.6+ can build HNSW in parallel; raise
//    max_parallel_maintenance_workers and maintenance_work_mem first.
//    As a starting estimate, expect ~5-10 min per million rows at
//    1536 dim with m=16.
await db.$queryRaw`
  CREATE INDEX chunks_embedding_idx ON chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64)
`;

// 4. Vacuum to reclaim and update planner stats.
await db.$queryRaw`VACUUM ANALYZE chunks`;
```

For continuous ingest (a few hundred rows/sec), keep the HNSW index online: the rebuild cost is worse than the per-insert overhead.

---

## Eval and guardrails — MRR, nDCG, embedding cache

Without an eval set, you cannot tell a regression from a feature. The minimum viable harness is 50-200 labelled `{ query, relevant_chunk_ids }` pairs and a script that prints MRR and nDCG@10.

```ts
type Label = { query: string; relevant: string[] };

async function evalSet(labels: Label[]) {
  let mrrSum = 0;
  let ndcgSum = 0;
  for (const { query, relevant } of labels) {
    const [qVec] = await embed([query], 'openai');
    const hits = await db.chunk.findMany({
      orderBy: { embedding: { nearTo: qVec } },
      take: 10,
    });
    const rels = hits.map((h) => relevant.includes(h.id) ? 1 : 0);

    // MRR: 1 / rank of first relevant. 0 if not in top-10.
    const firstHit = rels.indexOf(1);
    mrrSum += firstHit === -1 ? 0 : 1 / (firstHit + 1);

    // nDCG@10: sum of rel / log2(rank + 1) with 1-based rank (i + 2 for
    // the 0-based index), normalised by the ideal ordering.
    const dcg = rels.reduce((s, r, i) => s + r / Math.log2(i + 2), 0);
    const ideal = relevant.slice(0, 10)
      .reduce((s, _, i) => s + 1 / Math.log2(i + 2), 0);
    ndcgSum += ideal ? dcg / ideal : 0;
  }
  return { mrr: mrrSum / labels.length, ndcg: ndcgSum / labels.length };
}

// CI gate.
// Fail the build if metrics drop more than 2 points.
test('retrieval quality holds', async () => {
  const { mrr, ndcg } = await evalSet(LABELS);
  expect(mrr).toBeGreaterThan(0.62);   // baseline measured on main
  expect(ndcg).toBeGreaterThan(0.58);
});
```

Run the suite against a frozen embedding set: copy the labelled rows into a fixture DB so an upstream model change doesn't move the goalposts.

**Embedding cache.** Hashing the chunk body and skipping the API call when nothing changed is the biggest unit-cost win you can take. forge expresses the cache as a small table:

```ts
const EmbeddingCache = model('embedding_cache', {
  id:           f.id(),
  // Unique on the hash alone: one cached model per text unless the
  // model id is folded into the hash.
  content_hash: f.string().unique(),
  model_id:     f.string(),
  embedding:    f.vector(1536, { metric: 'cosine' }),
  created_at:   f.dateTime().default('now'),
}, {
  indexes: [{ keys: { content_hash: 1, model_id: 1 } }],
});

// sha256() is a hex-digest helper, e.g. built on node:crypto.
async function embedCached(texts: string[], modelId: string): Promise<number[][]> {
  const hashes = texts.map((t) => sha256(t));
  // One round trip: select every cached hit for this model.
  const hits = await db.embeddingCache.findMany({
    where: { content_hash: { in: hashes }, model_id: modelId },
  });
  const map = new Map(hits.map((h) => [h.content_hash, h.embedding]));

  // Dedupe misses by hash so a repeated text is embedded and inserted once.
  const missing = new Map<string, string>();
  hashes.forEach((h, i) => { if (!map.has(h)) missing.set(h, texts[i]); });

  if (missing.size) {
    const keys = [...missing.keys()];
    const fresh = await embed([...missing.values()], 'openai');
    await db.embeddingCache.createMany({
      data: keys.map((h, j) => ({
        content_hash: h,
        model_id: modelId,
        embedding: fresh[j],
      })),
    });
    keys.forEach((h, j) => map.set(h, fresh[j]));
  }
  return hashes.map((h) => map.get(h)!);
}
```

Hit rate on a settled corpus is typically 80-95% after the first ingest pass, because most "ingest" calls are re-ingest of unchanged content. The cache pays back its storage in a week.
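
The cache code above calls a `sha256` helper that the snippet does not define. A minimal sketch using Node's built-in `node:crypto` module could look like the following. The `cacheKey` variant is an assumption about how you might key the table, not something forge prescribes: it folds the model id into the hash so that a unique `content_hash` can still hold the same text for several models.

```typescript
import { createHash } from 'node:crypto';

// Hex SHA-256 digest of a UTF-8 string.
function sha256(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// Hypothetical variant: one key per (model, text) pair. The NUL separator
// keeps ('ab', 'c') and ('a', 'bc') from hashing to the same key.
function cacheKey(modelId: string, text: string): string {
  return sha256(`${modelId}\u0000${text}`);
}

console.log(sha256('abc'));
// ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```

Hash the exact string you send to the embedder. If you normalise whitespace or strip markup before embedding, apply the same normalisation before hashing, otherwise cosmetic edits turn into cache misses.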

---

## Cost model

Embedding cost per 1M tokens, current at the time of writing:

| Provider | Model | Dims | Cost / 1M tokens |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 | $0.020 |
| OpenAI | text-embedding-3-large | 3072 | $0.130 |
| Voyage | voyage-3 | 1024 | $0.060 |
| Voyage | voyage-3-lite | 512 | $0.020 |
| Cohere | embed-english-v3 | 1024 | $0.100 |
| Mistral | mistral-embed | 1024 | $0.100 |
| Local (Xenova bge-small) | bge-small-en-v1.5 | 384 | $0 + CPU |

A ballpark: 1M chunks of 400 tokens ≈ 400M tokens ≈ $8 on OpenAI small. The 3-large version is 6.5x the cost for a typical 1-2 nDCG@10 gain; worth it only where the extra precision matters (high-precision search, agents that rarely re-rank).

Storage cost per vector:

| Dim | Precision | Bytes / row | 1M rows | 100M rows |
|---|---|---|---|---|
| 384 | fp32 | 1536 + overhead | 1.7 GB | 170 GB |
| 768 | fp32 | 3072 + overhead | 3.4 GB | 340 GB |
| 1536 | fp32 | 6144 + overhead | 6.5 GB | 650 GB |
| 1536 | fp16 (halfvec) | 3072 + overhead | 3.4 GB | 340 GB |
| 1536 | binary | 192 + overhead | 220 MB | 22 GB |

Add the HNSW index on top. It keeps its own copy of each vector plus the graph links, so by the memory formula under [Index tuning](#index-tuning--hnsw-and-ivfflat) it is roughly the size of the column again at `m=16`, which is where the "~6 GB raw, ~12 GB indexed" figure for 1M rows comes from.

**When to self-host the embedder.** A local `bge-small-en-v1.5` on a single GPU box (one $1500 RTX 4090, or $0.50/hr cloud) processes 5-10k texts/sec. Break-even vs OpenAI small at ~50M texts/month. Below that, hosted wins on operator cost; above, self-hosted wins on unit cost.

The local Ollama path (e.g. `nomic-embed-text` v1.5) is the same shape but cheaper to set up: `ollama pull nomic-embed-text` and an HTTP call. Recall is competitive with OpenAI small; tail latency is higher.
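
The arithmetic behind the cost and storage tables is simple enough to script. The sketch below reproduces the ballpark figures. The function names and the `PRICE_PER_M_TOKENS` map are illustrative, the prices are copied from the table and will drift, and the storage figure is the raw vector payload only (no row or index overhead).

```typescript
// USD per 1M tokens, copied from the table above; these will go stale.
const PRICE_PER_M_TOKENS: Record<string, number> = {
  'text-embedding-3-small': 0.02,
  'text-embedding-3-large': 0.13,
  'voyage-3': 0.06,
};

type Precision = 'fp32' | 'fp16' | 'binary';
const BITS_PER_DIM: Record<Precision, number> = { fp32: 32, fp16: 16, binary: 1 };

// Embedding bill for `chunks` chunks of `tokensPerChunk` tokens each.
function embeddingCostUsd(model: string, chunks: number, tokensPerChunk: number): number {
  const price = PRICE_PER_M_TOKENS[model];
  if (price === undefined) throw new Error(`no price for ${model}`);
  return ((chunks * tokensPerChunk) / 1_000_000) * price;
}

// Raw vector payload per row, excluding row and index overhead.
function vectorBytes(dims: number, precision: Precision): number {
  return Math.ceil((dims * BITS_PER_DIM[precision]) / 8);
}

// 1M chunks of 400 tokens on OpenAI small: about $8.
console.log(embeddingCostUsd('text-embedding-3-small', 1_000_000, 400).toFixed(2));
console.log(vectorBytes(1536, 'fp32'));   // 6144
console.log(vectorBytes(1536, 'fp16'));   // 3072
console.log(vectorBytes(1536, 'binary')); // 192
```

Multiply `vectorBytes` by the row count for the column size, then budget the index separately; the per-row overhead in the table accounts for the gap between, say, 6144 bytes × 1M rows and the 6.5 GB listed.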