# Full-text search — deep dive The README chapter **[Full-text search](../README.md#full-text-search)** covers the surface API: mark a column `.searchable()`, query it with `where: { col: { search: 'query' } }`, and forge emits the right index for whichever of the six databases you're on. This doc is the long-form companion. It walks each dialect's FTS engine, the exact SQL or stage the `search` operator becomes, what's lost in translation, how to read the underlying ranking score, where to drop to `$queryRaw`, how to combine FTS with vector search via RRF, and how to keep an index healthy across millions of writes. If you haven't read the README chapter yet, start there — this doc assumes you know the call shape and the dialect matrix. ## Contents * [The `.searchable()` decoration](#the-searchable-decoration) * [The `search` operator — per-dialect translation](#the-search-operator--per-dialect-translation) * [Query syntax per dialect](#query-syntax-per-dialect) * [Languages and analyzers](#languages-and-analyzers) * [Ranking and scoring](#ranking-and-scoring) * [Multi-column FTS](#multi-column-fts) * [Stop words and customization](#stop-words-and-customization) * [Searchable + soft-delete](#searchable--soft-delete) * [Searchable embeds and JSON](#searchable-embeds-and-json) * [Hybrid search (BM25 + vector)](#hybrid-search-bm25--vector) * [Incremental indexing and rebuilds](#incremental-indexing-and-rebuilds) * [Highlighting](#highlighting) * [Performance and when to switch engines](#performance-and-when-to-switch-engines) * [Six worked patterns](#six-worked-patterns) * [Common bugs](#common-bugs) --- ## The `.searchable()` decoration `f.text().searchable()` (also valid on `f.string()`) marks a field as participating in the dialect's full-text engine. The marker is read at table-create time by `forge push` and at runtime by `db.$migrate()` on the browser adapter. Nothing about the column's logical type changes — it's still a `text` / `varchar` / `TEXT` column on the base table. ```ts const Post = model('posts', { id: f.id(), title: f.string().searchable(), body: f.text().searchable(), org_id: f.string(), }); ``` What forge emits when the table is created: | Dialect | Emitted FTS object | |----------|----------------------------------------------------------------------------------| | Postgres | `CREATE INDEX … ON posts USING gin(to_tsvector('simple', "title"))` (one per searchable column) | | MySQL | `ALTER TABLE posts ADD FULLTEXT "ft_posts_title" (\`title\`)` (one per searchable column) | | SQLite | One shared `posts_fts` virtual table — `CREATE VIRTUAL TABLE posts_fts USING fts5(title, body, content=posts, content_rowid='rowid')` + 3 triggers (`AFTER INSERT`, `AFTER UPDATE`, `AFTER DELETE`) that keep the shadow in sync | | DuckDB | `PRAGMA create_fts_index('posts', 'id', 'title', 'body')` (uses the `fts` extension; one call per table covers every searchable column) | | Mongo | `db.posts.createIndex({ title: 'text', body: 'text' })` — Mongo only allows one text index per collection so all searchable fields share it | | MSSQL | No DDL emitted. SQL Server's full-text engine needs a catalog created out-of-band; the `searchable` marker is preserved on the schema so doctor + introspect can see it. At query time `searchClause` falls back to `LIKE '%q%'` so portability holds — see the MSSQL section below | The Postgres index name follows the pattern `forge__fts_`. MySQL uses `ft_
_`. SQLite fixes the shadow name at `
_fts`. The names are stable so you can introspect, drop, or rewrite outside forge without breaking the runtime. ### Where the SQLite shadow table lives SQLite is the only dialect that physically duplicates the data — the FTS5 virtual table stores its own tokenized index alongside the base table. Forge sets `content=posts` and `content_rowid='rowid'`, which turns FTS5 into a "contentless" index against the base rowid. Storage overhead is one tokenized copy of the searchable columns, not two full copies. The three triggers fire on every base-table write, so the shadow stays consistent without a rebuild step. ```sql -- What forge actually emits on push, with comments stripped: CREATE VIRTUAL TABLE IF NOT EXISTS "posts_fts" USING fts5( "title", "body", content="posts", content_rowid='rowid' ); CREATE TRIGGER IF NOT EXISTS "posts_ai" AFTER INSERT ON "posts" BEGIN INSERT INTO "posts_fts"(rowid, "title", "body") VALUES (new.rowid, new."title", new."body"); END; CREATE TRIGGER IF NOT EXISTS "posts_ad" AFTER DELETE ON "posts" BEGIN INSERT INTO "posts_fts"("posts_fts", rowid, "title", "body") VALUES('delete', old.rowid, old."title", old."body"); END; CREATE TRIGGER IF NOT EXISTS "posts_au" AFTER UPDATE ON "posts" BEGIN INSERT INTO "posts_fts"("posts_fts", rowid, "title", "body") VALUES('delete', old.rowid, old."title", old."body"); INSERT INTO "posts_fts"(rowid, "title", "body") VALUES (new.rowid, new."title", new."body"); END; ``` The trigger pattern is the FTS5 "delete-then-insert" idiom from the SQLite docs. It survives partial updates that don't touch a searchable column because the trigger fires on any UPDATE — small overhead, but the index never drifts. --- ## The `search` operator — per-dialect translation The portable form is `where: { col: { search: 'query' } }`. Forge routes it through `dialect.searchClause()`, which has the contract: > Given a quoted column reference and a bound parameter, return a SQL > boolean expression that is `TRUE` iff the row matches the query > string under the dialect's full-text rules. The dialects implement it like this: | Dialect | Generated expression | |----------|----------------------| | Postgres | `to_tsvector('simple', "title") @@ plainto_tsquery('simple', $1)` | | MySQL | `` MATCH(`title`) AGAINST (? IN NATURAL LANGUAGE MODE) `` | | SQLite | `"posts".rowid IN (SELECT rowid FROM "posts_fts" WHERE "posts_fts" MATCH ?)` | | DuckDB | `"title" ILIKE '%' || ? || '%'` — the DuckDB FTS extension is opt-in; the portable clause stays case-insensitive without it | | Mongo | `{ $text: { $search: 'query' } }` — promoted to top level of the filter, not nested under the field key | | MSSQL | `"title" LIKE '%' + @p1 + '%'` — fallback; rewrite to `CONTAINS("title", @p1)` in raw if you've installed and configured the SQL Server Full-Text Search component | A few things to notice: - **The translation is per column.** Combining multiple searchable columns into one query is dialect-specific — see [Multi-column FTS](#multi-column-fts). - **Postgres uses `plainto_tsquery` by default** (every word ANDed, punctuation stripped). For phrase / boolean / negation syntax you drop to raw with `websearch_to_tsquery`, which is what the [Hybrid search](#hybrid-search-bm25--vector) and ranking examples below do. The default keeps the portable operator safe — strings with stray quotes or backslashes never throw. - **MySQL natural-language mode** silently drops queries that match more than 50% of rows (the "common-word threshold"). On small dev databases your search returns zero hits even though the data obviously matches. Boolean mode bypasses the threshold — pass `IN BOOLEAN MODE` via raw if your dev corpus is too small. - **SQLite's subquery form** means the search clause composes cleanly with `org_id = ?`, `deleted_at IS NULL`, `take`, `orderBy` and all the rest. The query planner picks the FTS5 path when the rowid IN list is selective. - **Mongo's `$text`** is collection-wide, not per-field. Forge promotes the operator to the top of the filter when it sees a `search` leaf; if you put `search` on two different fields in the same query the second wins (Mongo doesn't allow two `$text` operators in one query). --- ## Query syntax per dialect The portable `search: 'query string'` form delivers the raw text to the engine. The engine then parses it according to its own grammar. Forge does not normalize across grammars — that's deliberate: stripping characters to satisfy the weakest engine would silently degrade search on the others. | Feature | Postgres (`plainto_tsquery`) | Postgres (`websearch_to_tsquery`) | MySQL (NL) | MySQL (Boolean) | SQLite FTS5 | Mongo `$text` | |-------------------|------------------------------|------------------------------------|------------|------------------|-------------|---------------| | Word AND | implicit (all words) | implicit | implicit | implicit | implicit | implicit | | Phrase ("a b") | no | yes — `"a b"` | no | yes — `"a b"` | yes — `"a b"`| yes — `"a b"` | | Prefix (`stem*`) | no | no | no | yes — `stem*` | yes — `stem*`| no | | Negation (`-foo`) | no | yes — `-foo` | no | yes — `-foo` | yes — `NOT foo` | yes — `-foo` | | OR (`a OR b`) | no | yes — `or` | no | no | yes — `OR` | yes (whitespace = OR) | The portable `search:` operator uses the safe parser on each engine (`plainto_tsquery` on PG, natural-language on MySQL, FTS5's standard parser on SQLite, `$text` on Mongo). When you want phrase / prefix / boolean behavior, drop to raw on PG and MySQL; SQLite and Mongo already parse those forms through the portable operator. ```ts // PG: phrase + negation. Drop to raw for websearch_to_tsquery. const rows = await db.$queryRaw<{ id: string }[]>` SELECT id FROM posts WHERE to_tsvector('english', body) @@ websearch_to_tsquery('english', ${`"database wrapper" -prisma`}) `; // MySQL: prefix match needs boolean mode. const rows = await db.$queryRaw<{ id: string }[]>` SELECT id FROM posts WHERE MATCH(body) AGAINST(${'data*'} IN BOOLEAN MODE) `; ``` --- ## Languages and analyzers Tokenization, stemming, and stop-word handling are the part of FTS that's least portable. Forge defaults to the most permissive configuration so the operator works without setup; you opt in to a specific language by passing it through raw or by configuring the index out-of-band. ### Postgres The default index is over `to_tsvector('simple', col)`. `simple` splits on whitespace and lowercases, but does no stemming — `databases` and `database` are different tokens. To switch language, declare the searchable field with a custom `expression:` index: ```ts const Post = model('posts', { id: f.id(), body: f.text().searchable(), }, { indexes: [ { // English stemming. Snowball-based; 'databases' → 'databas'. name: 'idx_posts_body_en', method: 'gin', expression: `to_tsvector('english', body)`, }, ], }); ``` Then query with the matching configuration: ```ts const rows = await db.$queryRaw<{ id: string }[]>` SELECT id FROM posts WHERE to_tsvector('english', body) @@ plainto_tsquery('english', ${q}) `; ``` Built-in configs: `simple`, `english`, `french`, `german`, `spanish`, `italian`, `portuguese`, `dutch`, `swedish`, `russian`, plus the rest of the Snowball list. For CJK, install `zhparser` (Postgres extension) and use `'chinese'`. For Arabic and Hebrew, install `pg_trgm` and combine with a trigram index — Snowball doesn't cover them well. ### MySQL InnoDB FULLTEXT defaults to a built-in parser that handles whitespace-delimited languages reasonably. CJK and other scripts that don't whitespace-delimit need the **ngram** parser: ```ts const Post = model('posts', { id: f.id(), body: f.text().searchable(), }, { indexes: [ { name: 'ft_posts_body_ngram', method: 'fulltext', keys: { body: 1 }, parser: 'ngram', // or 'mecab' for Japanese }, ], }); ``` `ngram_token_size` (server variable, default `2`) controls n-gram width. `2` matches CJK convention; `1` makes single-character queries hit, at the cost of much larger indexes. Set it server-wide before loading data — changing it requires a rebuild. ### SQLite FTS5 ships three tokenizers: `unicode61` (default — Unicode-aware case folding + accent stripping), `porter` (English Porter stemmer layered over `unicode61`), and `trigram` (every 3-character slice). Pick at index-create time via the `tokenize=` argument. Forge's default is `unicode61`. Override with a raw index alongside the auto-emitted shadow: ```sql -- Created out of band; forge keeps its shadow and you maintain this one. CREATE VIRTUAL TABLE posts_fts_porter USING fts5( title, body, content='posts', content_rowid='rowid', tokenize='porter unicode61 remove_diacritics 2' ); ``` `trigram` is the answer to "search-as-you-type with prefixes shorter than 3 chars" — see [pattern (a)](#a-search-as-you-type-with-prefix-matching). ### Mongo `text` indexes take a `default_language` (defaults to `english`) and an optional per-document `language` override (Mongo reads a `language` field on each doc). Supported languages cover the Snowball set. CJK is not natively supported — switch to Atlas Search for production CJK workloads. ```ts // Mixed-language storefront: per-doc language override. const Listing = model('listings', { id: f.id(), title: f.string().searchable(), body: f.text().searchable(), language: f.string(), // 'english' | 'french' | ... }); // On push, forge creates the text index. Mongo reads listing.language // on each doc and applies the right stemmer. ``` ### Worked: English + Yoruba in the same column Yoruba uses Latin script with tone-marked vowels (à, á, ẹ, ọ). Stemmers don't exist for it on any of the dialects, so the only real choices are "tokenize whitespace + match exactly" (PG `simple`, SQLite `unicode61 remove_diacritics 0`) or "fold diacritics and match loosely" (`unicode61 remove_diacritics 1`). For a bilingual product catalog where the customer might type `bata` or `bàtà`, the loose form wins: ```ts // SQLite — fold diacritics so `bata` matches `bàtà`. await db.$executeRaw` CREATE VIRTUAL TABLE products_fts USING fts5( title, body, content='products', content_rowid='rowid', tokenize='unicode61 remove_diacritics 1' ) `; // PG equivalent — unaccent + simple. // CREATE EXTENSION unaccent; // CREATE INDEX idx_products_body_unac ON products // USING gin(to_tsvector('simple', unaccent(body))); ``` The trade-off: diacritic-folding loses meaning in French and Vietnamese where accents distinguish words. Pick per-corpus, not globally. --- ## Ranking and scoring The portable `search` operator returns rows matched, not ranked. Ranking is engine-specific enough that forge doesn't try to expose it through the typed API — drop to `$queryRaw` and let the dialect score. ### Postgres — `ts_rank` / `ts_rank_cd` `ts_rank` gives a 0-1 cosine-like score weighted by lexeme frequency; `ts_rank_cd` is the "cover-density" variant that rewards term proximity in the document. ```ts const ranked = await db.$queryRaw<{ id: string; title: string; rank: number }[]>` SELECT id, title, ts_rank_cd(to_tsvector('english', body), websearch_to_tsquery('english', ${q})) AS rank FROM posts WHERE to_tsvector('english', body) @@ websearch_to_tsquery('english', ${q}) AND org_id = ${orgId} ORDER BY rank DESC LIMIT 20 `; ``` Tip: the `to_tsvector` call in `ts_rank_cd` is re-running tokenization on every match. Materialize a `tsvector` generated column to avoid it: ```sql ALTER TABLE posts ADD COLUMN body_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED; CREATE INDEX idx_posts_body_tsv ON posts USING gin(body_tsv); ``` Then `ts_rank_cd(body_tsv, …)` reads the precomputed column. ### SQLite — `bm25(
_fts)` FTS5 exposes BM25 as a built-in ranking function. Lower scores mean better matches in FTS5's convention — sort ascending. ```ts const ranked = await db.$queryRaw<{ id: string; title: string; rank: number }[]>` SELECT p.id, p.title, bm25(posts_fts) AS rank FROM posts_fts JOIN posts p ON p.rowid = posts_fts.rowid WHERE posts_fts MATCH ${q} AND p.org_id = ${orgId} ORDER BY rank LIMIT 20 `; ``` `bm25()` accepts per-column weights as positional args: `bm25(posts_fts, 10.0, 1.0)` weights the first FTS5 column (title) ten times the second (body). ### MySQL — `MATCH … AGAINST` returns the score The same `MATCH(col) AGAINST(?)` expression is a numeric in the SELECT list. Reuse it for ordering — MySQL is smart enough to compute the score once. ```ts const ranked = await db.$queryRaw<{ id: string; title: string; rank: number }[]>` SELECT id, title, MATCH(body) AGAINST(${q}) AS rank FROM posts WHERE MATCH(body) AGAINST(${q}) AND org_id = ${orgId} ORDER BY rank DESC LIMIT 20 `; ``` Boolean mode returns scores too, but the formula differs (it's a sum of operator weights, not BM25). Don't mix modes in the same query. ### Mongo — `$meta: 'textScore'` Mongo computes a TF-IDF-ish score on every `$text` match. Pull it via the `$meta` projection and sort by it. ```ts // $queryRaw for Mongo is an aggregation pipeline — pass it through // the adapter's runCommand for now (db.$queryRaw on Mongo is the SQL // shim and won't accept this shape). const ranked = await db.posts.collection!.find( { $text: { $search: q }, org_id: orgId }, { projection: { score: { $meta: 'textScore' }, title: 1 } }, ).sort({ score: { $meta: 'textScore' } }).limit(20).toArray(); ``` The `score` value is non-portable; it's only ever useful for sorting within one query, not for thresholding ("> 0.5 is relevant"). ### DuckDB and MSSQL DuckDB's `fts` extension exposes `match_bm25(score)` after you call `PRAGMA create_fts_index(...)`. MSSQL with the FTS component exposes `CONTAINSTABLE(table, col, q)` which returns a `RANK` column — same pattern as the others. --- ## Multi-column FTS The decoration is per-field. How forge composes a query that searches across more than one searchable column is dialect-specific. | Dialect | Multi-column behaviour | |----------|------------------------| | Postgres | Each searchable column gets its own GIN index. The portable `search:` only hits one column at a time. For composite search use a generated `tsvector` over `title \|\| ' ' \|\| body`. | | MySQL | The portable `search:` hits the per-column FULLTEXT. For composite, declare a composite FULLTEXT: `method: 'fulltext'`, `keys: { title: 1, body: 1 }` — then `MATCH(title, body) AGAINST(?)`. | | SQLite | All `.searchable()` columns on the same table share one FTS5 shadow. `MATCH ?` against the shadow searches every column. To target one column: `MATCH 'title: alice'`. | | Mongo | All `.searchable()` columns share one text index. `$text` searches every indexed field. Per-field weights via `weights: { title: 10, body: 1 }` on the index. | The composite-`tsvector` PG pattern, expanded: ```ts const Post = model('posts', { id: f.id(), title: f.string(), // not .searchable() — covered by composite body: f.text(), tags: f.json(), }, { indexes: [ { name: 'idx_posts_search', method: 'gin', expression: `to_tsvector( 'english', coalesce(title, '') || ' ' || coalesce(body, '') || ' ' || coalesce((SELECT string_agg(value, ' ') FROM jsonb_array_elements_text(tags)), '') )`, }, ], }); ``` Then query via raw — there's no way to express the composite vector through the portable `search:` operator on PG: ```ts const hits = await db.$queryRaw<{ id: string }[]>` SELECT id FROM posts WHERE to_tsvector( 'english', coalesce(title, '') || ' ' || coalesce(body, '') || ' ' || coalesce((SELECT string_agg(value, ' ') FROM jsonb_array_elements_text(tags)), '') ) @@ plainto_tsquery('english', ${q}) `; ``` The MySQL composite-FULLTEXT lets you keep the typed query path: ```ts const Post = model('posts', { id: f.id(), title: f.string().searchable(), body: f.text().searchable(), }, { indexes: [ { name: 'ft_posts_title_body', method: 'fulltext', keys: { title: 1, body: 1 } }, ], }); ``` SQLite is the cleanest: declare `.searchable()` on every column you care about, and the auto-emitted shadow covers all of them in one index. The query `where: { title: { search: q } }` is equivalent to `where: { body: { search: q } }` because both translate to a `MATCH` against the shared shadow. --- ## Stop words and customization Stop-word handling defines what counts as a "word worth indexing" versus noise. The defaults are language-aware and almost always too aggressive for technical content. | Dialect | Default stop-word handling | How to change it | |----------|----------------------------|------------------| | Postgres | Per-language dictionary; `english` strips `a`, `the`, `is`, ~127 words. `simple` strips none. | `CREATE TEXT SEARCH DICTIONARY` + `ALTER TEXT SEARCH CONFIGURATION`. Out-of-band. | | MySQL | Built-in 36-word list (InnoDB). | Set `innodb_ft_server_stopword_table` to a table you populate; restart needed. | | SQLite | `unicode61` strips none; `porter` strips none. FTS5 has no built-in stop list. | Strip at write time, or use a custom tokenizer. | | Mongo | Per-language list, hidden. | Use `none` as language to disable, or strip at write. | The "drop stop words" hurts when: - **Technical search** — terms like "I/O", "if", "as", "do" are meaningful in code search. Use `simple` (PG), an empty stopword table (MySQL), or `unicode61` without a stemmer (SQLite). - **Short queries** — "the doors" parsed under `english` becomes `door`, losing distinction from "doors". For exact-product-name search, switch to `simple`. - **Single-character languages** — Chinese, Japanese, Korean. The default tokenizers won't split usefully. See `ngram` (MySQL) and the language section above. When in doubt, switch to `simple` (PG) / no stopwords (MySQL) / `unicode61` (SQLite). The cost of recall lost to over-aggressive stopping is much worse than the cost of slightly larger indexes. --- ## Searchable + soft-delete `f.text().searchable()` on a model with `softDelete: 'deleted_at'` creates a problem: the FTS index keeps tokenizing rows that should be hidden. The fix is a partial-filter index excluding the soft-deleted rows. ### Postgres ```ts const Post = model('posts', { id: f.id(), body: f.text(), deleted_at: f.dateTime().nullable(), }, { softDelete: { field: 'deleted_at' }, indexes: [ { name: 'idx_posts_body_live', method: 'gin', expression: `to_tsvector('english', body)`, where: `deleted_at IS NULL`, }, ], }); ``` Forge's read path adds `deleted_at IS NULL` automatically when `softDelete` is set, so the partial-filter and the runtime filter line up. The index is roughly half the size of an unfiltered one on a typical 30%-deleted archive table. ### MySQL MySQL's FULLTEXT doesn't accept a `WHERE` clause on the index. The workaround is to maintain a `live_body` generated column that is `NULL` when deleted and a copy of `body` when live, then index `live_body`: ```sql ALTER TABLE posts ADD COLUMN live_body text GENERATED ALWAYS AS (CASE WHEN deleted_at IS NULL THEN body END) VIRTUAL; ALTER TABLE posts ADD FULLTEXT (live_body); ``` ### SQLite The auto-trigger emits writes for every row. The cleanest fix is a soft-delete-aware trigger — replace the default `AFTER UPDATE` trigger with one that deletes from the shadow when `deleted_at` becomes non-null: ```sql DROP TRIGGER posts_au; CREATE TRIGGER posts_au AFTER UPDATE ON posts BEGIN INSERT INTO posts_fts(posts_fts, rowid, body) VALUES('delete', old.rowid, old.body); INSERT INTO posts_fts(rowid, body) SELECT new.rowid, new.body WHERE new.deleted_at IS NULL; END; ``` ### Mongo Text indexes don't take partial-filter expressions directly the way b-tree indexes do — but `createIndex` accepts `partialFilterExpression` on text indexes since 4.4: ```ts await db.posts.collection!.createIndex( { body: 'text' }, { partialFilterExpression: { deleted_at: null } }, ); ``` Forge doesn't emit this from the schema yet — add it once at push time outside the wrapper. --- ## Searchable embeds and JSON Searchable fields on `f.embed()` / `f.embedMany()` / `f.json()` columns are limited by what each dialect can tokenize through a JSON extract. ### Postgres `to_tsvector` accepts text — extract first, vectorize after. ```sql CREATE INDEX idx_orders_addr_search ON orders USING gin( to_tsvector('english', coalesce(addr->>'street', '') || ' ' || coalesce(addr->>'city', '') || ' ' || coalesce(addr->>'country', '')) ); ``` For arrays, use `jsonb_path_query_array` to collect every match into one text blob: ```sql CREATE INDEX idx_posts_tags_search ON posts USING gin( to_tsvector('simple', (SELECT string_agg(value, ' ') FROM jsonb_array_elements_text(tags))) ); ``` ### MySQL JSON + FULLTEXT only works through a generated column. The generated column is a scalar `text`, so you can stack a `FULLTEXT` index on it. ```sql ALTER TABLE orders ADD COLUMN addr_search text GENERATED ALWAYS AS ( concat_ws(' ', JSON_UNQUOTE(JSON_EXTRACT(addr, '$.street')), JSON_UNQUOTE(JSON_EXTRACT(addr, '$.city')), JSON_UNQUOTE(JSON_EXTRACT(addr, '$.country'))) ) STORED; ALTER TABLE orders ADD FULLTEXT (addr_search); ``` ### SQLite FTS5 stores text as-is. Extract at write time into a regular text column you mark `.searchable()`: ```ts const Order = model('orders', { id: f.id(), addr: f.embed({ street: f.string(), city: f.string(), country: f.string() }), addr_search: f.text().searchable(), // derived; populate in beforeWrite hook }); db.$events.on('beforeWrite', ({ model, data }) => { if (model !== 'orders' || !data.addr) return; data.addr_search = [data.addr.street, data.addr.city, data.addr.country] .filter(Boolean).join(' '); }); ``` ### Mongo Mongo's text index accepts a wildcard form that searches every string field in the document: ```ts await db.orders.collection!.createIndex({ '$**': 'text' }); ``` Slower and indexes more than you want — useful for low-volume admin search, less so for hot paths. --- ## Hybrid search (BM25 + vector) Pure BM25 misses paraphrase ("reverse a payment" vs "refund a charge"). Pure vector misses exact-match queries (model numbers, acronyms, error codes). Hybrid search runs both, then merges with Reciprocal Rank Fusion. The full pipeline — both lanes, RRF merge, dedup, hydration — is in [VECTOR.md → Hybrid search](./VECTOR.md#hybrid-search--bm25--vector-with-rrf). What follows is the FTS half in isolation, ready to feed into RRF. ```ts // PG lexical lane. Returns ranked ids only — keep the payload light // so the RRF merge is fast. async function lexical(query: string, orgId: string, take = 50) { return db.$queryRaw<{ id: string; rank: number }[]>` SELECT id, ts_rank_cd(to_tsvector('english', body), websearch_to_tsquery('english', ${query})) AS rank FROM posts WHERE to_tsvector('english', body) @@ websearch_to_tsquery('english', ${query}) AND org_id = ${orgId} AND deleted_at IS NULL ORDER BY rank DESC LIMIT ${take} `; } ``` `websearch_to_tsquery` is the right parser here — it handles phrases in quotes, `-negation`, and `or` naturally, which is what real users type into search boxes. SQLite + sqlite-vec: ```ts async function lexicalSqlite(query: string, orgId: string, take = 50) { return db.$queryRaw<{ id: string; rank: number }[]>` SELECT p.id, bm25(posts_fts) AS rank FROM posts_fts JOIN posts p ON p.rowid = posts_fts.rowid WHERE posts_fts MATCH ${query} AND p.org_id = ${orgId} AND p.deleted_at IS NULL ORDER BY rank LIMIT ${take} `; } ``` The semantic lane and the RRF merge are in [VECTOR.md → Hybrid search](./VECTOR.md#hybrid-search--bm25--vector-with-rrf). --- ## Incremental indexing and rebuilds Every dialect maintains its FTS index automatically as rows change. The rebuild patterns are for when configuration changes (new language, new stopword list, new tokenizer) and the existing index is stale. | Dialect | Auto-maintained on write? | Rebuild command | |----------|---------------------------|-----------------| | Postgres | Yes — the GIN index updates inline on INSERT / UPDATE / DELETE. With a generated `tsvector` column, the column itself recomputes. | `REINDEX INDEX idx_posts_body_en;` (online with `CONCURRENTLY`) | | MySQL | Yes — FULLTEXT updates on commit. Bulk writes can batch into the "FTS auxiliary tables" and flush via `OPTIMIZE TABLE`. | `OPTIMIZE TABLE posts;` (rebuilds the FT auxiliary tables) | | SQLite | Yes — the three triggers fire on every write. | `INSERT INTO posts_fts(posts_fts) VALUES('rebuild');` — the FTS5 'rebuild' command. | | DuckDB | No — `PRAGMA create_fts_index` is a snapshot. Re-run after bulk loads. | `PRAGMA drop_fts_index('posts'); PRAGMA create_fts_index('posts', 'id', 'title', 'body');` | | Mongo | Yes — text index updates on write. | `db.posts.reIndex()` (locks the collection — use `dropIndex` + `createIndex` for hot rebuilds). | | MSSQL | No (manual catalog). | Per FTS catalog: `ALTER FULLTEXT CATALOG REBUILD`. | The MySQL OPTIMIZE pattern matters more than it looks: heavy bulk inserts accumulate in the auxiliary "INSERT" table and don't show up in queries until either a `OPTIMIZE TABLE` runs or the auxiliary table fills past `innodb_ft_total_cache_size` and flushes. Schedule an OPTIMIZE after large imports. --- ## Highlighting Returning marked-up snippets ("...the **database wrapper** was...") is a per-dialect feature. Forge doesn't wrap it — drop to raw. ### Postgres — `ts_headline` ```ts const rows = await db.$queryRaw<{ id: string; snippet: string }[]>` SELECT id, ts_headline( 'english', body, websearch_to_tsquery('english', ${q}), 'StartSel=, StopSel=, MaxFragments=2, MinWords=15, MaxWords=35' ) AS snippet FROM posts WHERE to_tsvector('english', body) @@ websearch_to_tsquery('english', ${q}) LIMIT 20 `; ``` `ts_headline` is expensive — it re-tokenizes the document for every result. Cap with `LIMIT` before, never after. ### SQLite — `snippet()` / `highlight()` ```ts const rows = await db.$queryRaw<{ id: string; snippet: string }[]>` SELECT p.id, snippet(posts_fts, 1, '', '', '…', 32) AS snippet FROM posts_fts JOIN posts p ON p.rowid = posts_fts.rowid WHERE posts_fts MATCH ${q} LIMIT 20 `; ``` `snippet(fts_table, col_index, open_tag, close_tag, ellipsis, max_tokens)` — the second arg is the FTS5 column index (0-based), or `-1` to pick the best match across columns. ### MySQL — no native; post-process MySQL has no built-in highlight function. Pull the body and wrap on the client: ```ts function highlight(body: string, terms: string[]) { const re = new RegExp(`(${terms.map(escapeRe).join('|')})`, 'gi'); return body.replace(re, '$1'); } ``` ### Mongo — manual Same shape as MySQL: pull the body, post-process. Mongo's `$text` exposes no highlight in the wire protocol. --- ## Performance and when to switch engines Rough numbers from public benchmarks against a 1M-row text corpus with ~300 tokens per row. Treat as orientation, not promises. | Engine | Index size vs base | Write tput overhead | Median query latency | Best fit | |--------|-------------------|---------------------|----------------------|----------| | PG GIN tsvector | 20-40% | 15-30% slower writes | 5-30 ms | Default for most apps. Already on Postgres → free. | | PG GIN + generated tsvector column | 40-60% | 20-40% slower writes | 2-10 ms | Apps doing heavy ranking + highlight. Ranking gets ~3× cheaper. | | MySQL FULLTEXT | 15-30% | 10-20% slower writes | 8-40 ms | Default for InnoDB apps. Auxiliary-table flush behavior bites bulk loaders. | | SQLite FTS5 (unicode61) | 50-80% | 20-40% slower writes | 1-5 ms (under 1M rows) | The default for embedded / browser / single-tenant apps. | | SQLite FTS5 (trigram) | 200-400% | 40-60% slower writes | 3-10 ms | Search-as-you-type with 1-2 char prefixes. | | Mongo text | 10-25% | 10-20% slower writes | 10-50 ms | Mongo-native apps under ~10M docs. Above that, Atlas Search. | | DuckDB fts | 25-45% | re-index on bulk | 3-15 ms | Analytics / OLAP corpora. Not great for live writes. | The "switch to a dedicated search engine" decision points: - **Corpus over ~10M documents** — every dialect's FTS slows down, but the dedicated engines (Meilisearch, Typesense, Elasticsearch) have inverted-index optimizations the general-purpose ones don't. - **Multi-tenant with per-tenant ranking models** — none of the built-in FTS engines support tenant-specific stopword lists or ranking weights. Dedicated engines do. - **Faceted search at speed** — Postgres can do facets with aggregations, but a dedicated engine returns 1M-row facet counts in single-digit ms. - **Typo tolerance** — Meilisearch / Typesense have it built in. Postgres needs `pg_trgm` + raw SQL; SQLite needs `trigram` tokenizer + custom logic. If you don't hit those, stay in-database. The data-locality win (joining FTS results to other tables in one query) is large. --- ## Six worked patterns ### (a) Search-as-you-type with prefix matching The user types `dat`. You want every row whose searchable text contains a word starting with `dat`. **SQLite — FTS5 prefix syntax** works for prefixes ≥ 3 chars: ```ts const q = userInput; const rows = await db.$queryRaw<{ id: string; title: string }[]>` SELECT p.id, p.title FROM posts_fts JOIN posts p ON p.rowid = posts_fts.rowid WHERE posts_fts MATCH ${q + '*'} -- the * is FTS5's prefix marker LIMIT 10 `; ``` For 1-2 char prefixes, FTS5 falls over (the prefix is below the token size). Two options: - Create a trigram-tokenized FTS5 alongside the default: ```sql CREATE VIRTUAL TABLE posts_fts_tri USING fts5( title, body, content='posts', content_rowid='rowid', tokenize='trigram remove_diacritics 1' ); ``` Then route prefixes < 3 chars to `posts_fts_tri` with a `LIKE 'da%'` fallback for the 1-char case. - Or short-circuit in app code: for prefixes < 3 chars, use forge's `contains` operator. It compiles to `LIKE '%dat%'` which is cheap-enough on the same column for low-cardinality inputs. ```ts const rows = q.length >= 3 ? await db.posts.findMany({ where: { title: { search: `${q}*` } }, take: 10 }) : await db.posts.findMany({ where: { title: { startsWith: q } }, take: 10 }); ``` ### (b) Multi-language product catalog (PG tsvector per language + fallback chain) Store a `language` column per row. Build one searchable index per language plus a `simple`-language fallback for queries that don't specify. ```ts const Product = model('products', { id: f.id(), language: f.string(), // 'english' | 'french' | 'yoruba' title: f.string(), body: f.text(), }, { indexes: [ { name: 'idx_prod_en', method: 'gin', expression: `to_tsvector('english', title || ' ' || body)`, where: `language = 'english'` }, { name: 'idx_prod_fr', method: 'gin', expression: `to_tsvector('french', title || ' ' || body)`, where: `language = 'french'` }, { name: 'idx_prod_simple', method: 'gin', expression: `to_tsvector('simple', unaccent(title || ' ' || body))` }, ], }); ``` The fallback `simple` index handles Yoruba (no Snowball stemmer) and mixed-language queries. The partial-filter language indexes give proper stemming for the languages you know. ### (c) Email search across subject + body + sender.name + sender.email ```ts const Email = model('emails', { id: f.id(), org_id: f.string(), subject: f.string(), body: f.text(), sender: f.embed({ name: f.string(), email: f.string() }), }); ``` PG: composite generated tsvector over all four sources. ```sql ALTER TABLE emails ADD COLUMN search_tsv tsvector GENERATED ALWAYS AS ( setweight(to_tsvector('english', coalesce(subject, '')), 'A') || setweight(to_tsvector('english', coalesce(body, '')), 'B') || setweight(to_tsvector('simple', coalesce(sender->>'name', '')), 'C') || setweight(to_tsvector('simple', coalesce(sender->>'email', '')), 'D') ) STORED; CREATE INDEX idx_emails_search ON emails USING gin(search_tsv); ``` `setweight` lets ranking weight subject hits above body hits above sender hits — `ts_rank_cd(search_tsv, q, 32)` applies the weights. ```ts const hits = await db.$queryRaw<{ id: string; rank: number }[]>` SELECT id, ts_rank_cd(search_tsv, websearch_to_tsquery('english', ${q}), 32) AS rank FROM emails WHERE search_tsv @@ websearch_to_tsquery('english', ${q}) AND org_id = ${orgId} ORDER BY rank DESC LIMIT 50 `; ``` ### (d) Search within a tenant — combining FTS with `org_id = ?` The query planner needs to pick "filter by tenant first, then FTS" or "FTS first, then filter by tenant" depending on which is more selective. On PG the planner usually gets it right. On SQLite, the FTS5 subquery is opaque to the planner — wrap it carefully: ```ts // Good — FTS5 first (highly selective), then the org filter. const rows = await db.posts.findMany({ where: { org_id: orgId, body: { search: q }, }, take: 20, }); // Compiles to: // SELECT … FROM "posts" WHERE // "posts".rowid IN (SELECT rowid FROM "posts_fts" WHERE "posts_fts" MATCH ?) // AND "org_id" = ? // LIMIT 20 ``` The order in the WHERE clause is `(FTS5 rowid IN ...) AND org_id = ?` — SQLite evaluates the FTS5 subquery once and intersects with the b-tree lookup for `org_id`. If you instead route the query as `org_id = ? AND search`, SQLite has to scan the org-filtered rows against the FTS shadow row-by-row, which is much slower at scale. Forge's planner picks the right order automatically. ### (e) Hybrid: FTS top-K + vector top-K → RRF merge → final top-10 The complete worked example lives at [VECTOR.md → Hybrid search](./VECTOR.md#hybrid-search--bm25--vector-with-rrf). This pattern is short because the FTS half is one of the snippets already shown — what's interesting is the RRF merge. ```ts async function search(q: string, orgId: string) { const lex = await lexical(q, orgId, 50); // FTS top-50 const sem = await semantic(q, orgId, 50); // pgvector top-50 const k = 60; const score = new Map(); lex.forEach((r, i) => score.set(r.id, (score.get(r.id) ?? 0) + 1 / (k + i))); sem.forEach((r, i) => score.set(r.id, (score.get(r.id) ?? 0) + 1 / (k + i))); const ids = [...score.entries()].sort((a, b) => b[1] - a[1]).slice(0, 10).map(([id]) => id); const rows = await db.posts.findMany({ where: { id: { in: ids } } }); return ids.map((id) => rows.find((r) => r.id === id)!).filter(Boolean); } ``` ### (f) Highlight matched terms in results UI Use `ts_headline` (PG) / `snippet` (SQLite) at query time; render the returned HTML in the client. The two functions emit safe markup, but treat output as untrusted and run it through DOMPurify before inserting — the source `body` column is whatever user-submitted text landed in the row. ```tsx function ResultCard({ row }: { row: { id: string; snippet: string } }) { const safe = useMemo(() => DOMPurify.sanitize(row.snippet), [row.snippet]); return
; } ``` For Mongo and MySQL where there's no native highlight, the same client-side highlight regex is fine — split the query into terms, wrap each match in ``. Skip terms shorter than 3 chars to avoid matching common letter combinations. --- ## Common bugs **Case sensitivity differs across dialects.** PG `to_tsvector` is case-insensitive in every config. SQLite `unicode61` is too. MySQL FULLTEXT is by default. But your `LIKE` fallback on MSSQL / DuckDB is collation-dependent — `LIKE '%Foo%'` may or may not match `foo` depending on the column collation. Set collation explicitly on the table. **Accent folding differs.** SQLite `unicode61 remove_diacritics 0` treats `café` and `cafe` as different tokens; `remove_diacritics 1` folds them. PG `simple` doesn't fold; `english` does (after unaccent). MySQL FULLTEXT depends on collation. If your users expect "naive" to match "naïve", set `remove_diacritics 1` on SQLite and install `unaccent` on PG. **`*` vs `:` in FTS5 prefix syntax.** SQLite FTS5 takes `data*` (prefix) and `title: data` (column-scoped). `:` without a known column name is a parse error, not a match — `where: { body: { search: 'a:b' } }` throws on SQLite. Strip `:` from user input, or escape it. **MySQL FULLTEXT 50% threshold.** Natural-language mode returns zero rows if the term appears in more than half the rows. On dev databases with a handful of rows, *everything* matches more than 50% of rows, so search "returns nothing". Switch to boolean mode for dev seeds, or seed enough rows that real queries fall under the threshold. **Mongo `$text` is collection-wide.** Two `search` operators in the same query don't AND — the second silently overwrites the first. The SQL dialects compose `AND`-style. Test cross-dialect queries against real data, not just type-check. **`websearch_to_tsquery` vs `plainto_tsquery`.** The first parses quotes and `-`; the second strips them. If you switch parsers without re-indexing, scoring shifts — `ts_rank_cd` reads off the `tsvector`, which doesn't care, but ordering between two runs of the same query under the two parsers won't match. **SQLite FTS5 `MATCH` against the wrong column name.** The portable `search` operator routes via `
.rowid IN (...
_fts MATCH ...)`. If you rename a `.searchable()` column without re-running push, the shadow keeps the old column name and the trigger writes into a column FTS5 has never heard of. Drop and recreate the shadow: ```sql DROP TABLE posts_fts; -- Re-run forge push, or: INSERT INTO posts_fts(posts_fts) VALUES('rebuild'); ``` **Generated `tsvector` columns and `pg_dump`.** A `STORED` generated column exports its value; a `VIRTUAL` one does not. After a restore, the index is empty until the next write touches each row. Use `STORED` for production tsvector columns, or run `REINDEX` after restore. --- The README's **[Full-text search](../README.md#full-text-search)** chapter remains the surface map; this file is the long form. The companion docs are **[VECTOR.md](./VECTOR.md)** for vector and hybrid search, **[JSON-PATH.md](./JSON-PATH.md)** for indexing JSON columns under the same operators, and **[GEO.md](./GEO.md)** for the geo operators that compose with `search` in location-aware queries.