# DuckDB DuckDB is forge-orm's analytical option — embedded, columnar, and natively able to read Parquet/CSV files or ATTACH another database as a catalog. This page documents what DuckDB brings to a forge stack: the type round-trip, the parquet/S3 surface, native geo + JSON + FTS extensions, and the patterns that work for OLAP-next-to-OLTP setups. The two implementation files to read alongside this doc are `src/adapters/duckdb/dialect.ts` (column types, placeholder style, geo/vector/JSON emit) and `src/adapters/duckdb/adapter.ts` (capability flags, extension autoload, the `$queryRaw` / `$executeRaw` / `$transaction` plumbing). Both are short and worth a skim before you reach for the heavier patterns below. ## Contents * [What DuckDB is for in forge](#what-duckdb-is-for-in-forge) * [Driver — `@duckdb/node-api`](#driver--duckdbnode-api) * [URL shape and `:memory:` vs file](#url-shape-and-memory-vs-file) * [Capability flags](#capability-flags) * [DDL emit table](#ddl-emit-table) * [Transactional model](#transactional-model) * [Reading Parquet and CSV inside a typed query](#reading-parquet-and-csv-inside-a-typed-query) * [S3 and HTTPFS](#s3-and-httpfs) * [Vector — `vss` and the `FLOAT[N]` column](#vector--vss-and-the-floatn-column) * [Geo — the `spatial` extension](#geo--the-spatial-extension) * [JSON — native type and JSON-path](#json--native-type-and-json-path) * [Full-text search — `fts` and the ILIKE fallback](#full-text-search--fts-and-the-ilike-fallback) * [`EXPORT DATABASE` — backup and round-trip](#export-database--backup-and-round-trip) * [`ATTACH` — Postgres / SQLite / MySQL as a catalog](#attach--postgres--sqlite--mysql-as-a-catalog) * [Persistent vs in-memory DuckDB](#persistent-vs-in-memory-duckdb) * [Window functions, aggregations, CTEs](#window-functions-aggregations-ctes) * [Common errors and limitations](#common-errors-and-limitations) * [Three worked examples](#three-worked-examples) * [Cross-links](#cross-links) --- ## What DuckDB is for in forge DuckDB is an embedded analytical engine — the OLAP counterpart to SQLite. It runs in-process, stores columnar, vectorises every operator, and pushes predicates into Parquet / CSV scans. The forge adapter treats it as a first-class dialect: same IR, same typed call shape, same `$transaction` semantics. What it adds over Postgres / SQLite is a short list, and that list is the reason you'd pick DuckDB at all. - **Columnar analytics next to your OLTP store.** Run a `groupBy({ by: ['day', 'segment'], _sum: { revenue: true } })` over a hundred million rows without taking the OLTP database down. Aggregations, windowed metrics, and `GROUP BY ROLLUP` are first-class — DuckDB plans them with vectorised hash aggregation, not the row-at-a-time path that Postgres takes for the same query without an aggregate index. - **Parquet / CSV ingest with no copy step.** `read_parquet('s3://…')` inside a `$queryRaw` template lets you `SELECT` directly from object storage. Predicate pushdown means you only scan the columns and row groups the `WHERE` clause touches. This is the workflow for analytics dashboards, ad-hoc reports, and ML feature pipelines that ingest from data-lake formats. - **OLAP next to OLTP.** `ATTACH 'host=… dbname=app' AS app (TYPE postgres)` exposes a live Postgres database as a DuckDB catalog. Now a single `SELECT` joins the OLTP tables against a Parquet file in S3 and writes the result to a local DuckDB table. This is the pattern that replaces a 3-system ETL (Postgres → Airflow → Snowflake) with two SQL statements inside a Node process. - **An app-local analytics cache.** A `.duckdb` file on the API node is effectively a typed materialised view. Refresh on a cron, query through the forge wrapper. Sub-millisecond reads, no network round-trip, and you get the typed surface that pgAnalytics-via-views does not give you. What DuckDB is **not** good at, and what forge cannot paper over: it's a single-writer, single-process store. Concurrent writers serialise on the file lock. Foreign keys are accepted at DDL time but **not enforced** at write time (see [Capability flags](#capability-flags)). There is no PITR. There is no replication. Treat DuckDB as a derived read store downstream of your OLTP system, not as the system of record for production transactions. For the picture-shaped picker between the six dialects, see the [adapter capability matrix in DRIVERS.md](./DRIVERS.md#adapter-capabilities). For the backup story see [BACKUP-RESTORE.md → DuckDB](./BACKUP-RESTORE.md#duckdb). --- ## Driver — `@duckdb/node-api` The peer dependency is **`@duckdb/node-api`** — DuckDB's official Node bindings. Install it alongside `forge-orm`: ```bash npm i forge-orm @duckdb/node-api ``` The adapter loads the driver lazily; `@duckdb/node-api` only has to be present at runtime, not at typecheck or build time. The `loadDriver` helper throws a clear "package not installed" error if you forget, with the install string baked into the message. The two call shapes — URL-driven and driver-injected — both work: ```ts import { createDb } from 'forge-orm'; import { DuckDBInstance } from '@duckdb/node-api'; import { duckdbDriver } from 'forge-orm/adapters/duckdb'; import { schema } from './schema'; // URL-driven — adapter opens its own DuckDBInstance. const db = await createDb({ schema, type: 'duckdb', url: 'duckdb:./analytics.duckdb', }); // Driver-injected — useful for MotherDuck, custom config, or shared instances. const instance = await DuckDBInstance.create('./analytics.duckdb', { threads: '8', memory_limit: '4GB', }); const connection = await instance.connect(); const db2 = await createDb({ schema, type: 'duckdb', driver: duckdbDriver(connection), }); ``` The injected form is the one you reach for when you need to pass DuckDB config options (`threads`, `memory_limit`, `temp_directory`) that the URL parser cannot express, or when you want to share a single DuckDB instance across multiple forge clients (one per tenant, one per worker thread). For the underlying driver-port contract — `query`, `transaction`, `close`, and the param-coercion rules — see [DRIVERS.md → DuckdbDriver](./DRIVERS.md#duckdbdriver). --- ## URL shape and `:memory:` vs file ```ts 'duckdb://./reports.duckdb' // file at ./reports.duckdb 'duckdb:./reports.duckdb' // same — both prefixes are accepted 'duckdb::memory:' // in-process, no persistence 'duckdb:' // empty body → in-process 'duckdb:/var/lib/forge/db' // absolute path 'duckdb://md:my_db?token=…' // MotherDuck (cloud DuckDB) — passes through ``` The adapter strips `duckdb:` (with or without the `//`) and hands the remainder straight to `DuckDBInstance.create()`. Anything DuckDB itself accepts as a path will work, including the `md:` prefix that the `@duckdb/node-api` binding routes to MotherDuck. For an in-process DB, prefer `:memory:` explicitly — an empty string also works but is less obvious in logs and stack traces. --- ## Capability flags `DuckdbAdapter.capabilities` is the source of truth for what the wrapper expects the driver to do versus what forge does in JS. The DuckDB values: | Capability | Value | Meaning | |---|---|---| | `nativeCascades` | `false` | DuckDB accepts `ON DELETE CASCADE` syntactically but does not enforce FK constraints at write time. Forge's app-side cascade walker takes over for `Cascade` and `SetNull` rules. | | `nativeUpsert` | `true` | `ON CONFLICT (cols) DO UPDATE SET …` works since DuckDB 0.8; the typed `upsert()` compiles to that shape directly. | | `nullsOrdering` | `true` | `ORDER BY … NULLS FIRST / NULLS LAST` is honoured. The wrapper does not have to emulate ordering with `CASE`. | | `jsonPath` | `true` | `json_extract(col, '$.a.b')` is the JSON path operator, same as SQLite. See [JSON-PATH.md → DuckDB](./JSON-PATH.md#duckdb). | | `transactionsRequireReplicaSet` | `false` | The Mongo flag is meaningless here; it's `false` for every SQL dialect. | The `nativeCascades: false` flag is the one that changes runtime behaviour you might not expect. A `delete()` on a parent row triggers the wrapper's cascade walker — see [RELATIONS.md → On-delete behaviour](./RELATIONS.md#on-delete) — which issues one `DELETE` per dependent collection. This is correct but slower than a native cascade. If you cascade-delete millions of rows on DuckDB, batch the IDs and run the deletes yourself, or treat the parent table as the system of record (delete upstream, re-derive DuckDB). --- ## DDL emit table The DuckDB dialect borrows the Postgres DDL generator and overrides the `columnType()` method. The full table: | Schema field | DuckDB column type | Notes | |---|---|---| | `f.id('cuid' \| 'uuid' \| 'bigserial')` | `VARCHAR` / `UUID` / `BIGINT` | `bigserial` emits a sequence + DEFAULT (no native serial type) | | `f.objectId()` | `VARCHAR` | 24-char hex, same shape as PG | | `f.string()` | `VARCHAR` | DuckDB has one variable-length string type | | `f.text()` | `VARCHAR` | no separate `TEXT` | | `f.int()` | `INTEGER` | 32-bit signed | | `f.bigint()` | `BIGINT` | 64-bit signed | | `f.float()` | `DOUBLE` | 64-bit IEEE | | `f.decimal({ precision, scale })` | `DECIMAL(p,s)` | fixed-point; DuckDB has `HUGEINT` (128-bit) for arbitrary-precision but forge does not map to it | | `f.uuid()` | `UUID` | native type; `uuid()` generates one at the SQL layer | | `f.bool()` | `BOOLEAN` | | | `f.dateTime()` | `TIMESTAMPTZ` | stored UTC, surfaced as JS `Date` | | `f.json()` | `JSON` | native logical type — dictionary-encoded on disk, columnar | | `f.enum(values)` | `VARCHAR` + `CHECK` | DuckDB has a native `ENUM` type but the forge dialect emits a `CHECK` constraint to stay parameter-only | | `f.embed(shape)` | `JSON` | embedded sub-document; see [EMBED.md](./EMBED.md) | | `f.embedMany(shape)` | `JSON` | array of embedded sub-documents | | `f.stringArray()` | `VARCHAR[]` | DuckDB has native list types — no JSON marshalling at read time | | `f.intArray()` | `INTEGER[]` | same | | `f.geoPoint()` | `GEOMETRY` | from the `spatial` extension; `JSON` in fallback mode | | `f.vector({ dims: N })` | `FLOAT[N]` | fixed-size list; `vss` extension uses these | Three points worth flagging: - **`VARCHAR` covers strings, text, IDs, enums, objectIds.** DuckDB has no length limit on `VARCHAR` and no penalty for not specifying one — internally it's dictionary-encoded when the cardinality is low. Don't bother declaring `VARCHAR(255)` even where Postgres habits push you to. - **`HUGEINT`, `MAP`, and `STRUCT` are not mapped by forge's schema builder.** You can use them via raw SQL (`$queryRaw`) on materialised views or staging tables; the typed surface only covers what the forge IR can build. The `f.embed()` shape gets close to `STRUCT` — it emits to JSON, but you can extract individual fields with the dot operator (`location.lat`) inside raw SQL because DuckDB's JSON type accepts struct-style access on JSON values. - **`UUID` is native — no extension required.** The `uuid()` SQL function generates a new v4. Forge's `f.id('uuid')` uses the application-side `crypto.randomUUID()` by default; if you want the DB to generate, drop to raw SQL for the insert. For the full type list against all six dialects see [TYPES.md](./TYPES.md). For the embed / embedMany emit see [EMBED.md → DuckDB](./EMBED.md#duckdb). --- ## Transactional model DuckDB has BEGIN / COMMIT / ROLLBACK, no SAVEPOINT, single-writer serialisation. The shipped driver wraps the connection's `run()` method and pins the queryable inside the transaction callback: ```ts await db.$transaction(async (tx) => { const order = await tx.order.create({ data: { customer_id, total } }); await tx.orderLine.createMany({ data: lines.map((l) => ({ ...l, order_id: order.id })) }); return order; }); ``` A throw inside the callback triggers ROLLBACK; otherwise the driver emits COMMIT. The forge factory's per-tx proxy threads the session through every wrapper call — every `create` / `findMany` / `update` inside the callback runs against the transaction-pinned queryable, not the pool. What DuckDB does **not** support: - **No SAVEPOINT.** A nested `db.$transaction(async (inner) => …)` inside the outer callback degrades to a single transaction — the inner `BEGIN` / `COMMIT` collapses into the outer one, and a throw from inside the inner block rolls the whole thing back. See [TRANSACTIONS.md → Savepoints and nested `$transaction`](./TRANSACTIONS.md#savepoints-and-nested-transaction) for the dialect-by-dialect table. - **No advisory locks.** Concurrency in DuckDB is "one writer at a time on the file". The migration runner does not bother with advisory locks because the file lock already serialises pushes. - **No isolation level knobs.** DuckDB runs at snapshot isolation internally. There's no `SET TRANSACTION ISOLATION LEVEL` to twiddle. The autocommit behaviour matches Postgres: a single statement runs inside an implicit transaction that commits when the statement completes. The explicit `BEGIN` / `COMMIT` you get through `$transaction` is the way to batch multiple statements atomically. For the deadlock / serialisation-failure retry story (mostly N/A on DuckDB because of single-writer), see [TRANSACTIONS.md → Deadlock and serialization-failure retry](./TRANSACTIONS.md#deadlock-and-serialization-failure-retry). --- ## Reading Parquet and CSV inside a typed query This is where DuckDB stops looking like a normal relational store. The `read_parquet()` and `read_csv()` table functions let any SELECT treat a file path (or an S3 URL) as a table. Inside a forge wrapper the natural seam is `$queryRaw`: ```ts const rows = await db.$queryRaw<{ org_id: string; revenue: number }>` SELECT org_id, SUM(revenue) AS revenue FROM read_parquet(${'s3://reports/2026-06-*.parquet'}) WHERE day >= ${cutoff} GROUP BY org_id `; ``` A few notes on the shape: - **The path is a value, not an identifier.** Pass it through the template interpolation — forge binds it as a parameter. Don't string-concat the URL into the SQL. - **Glob patterns work.** `2026-06-*.parquet` expands at scan time; DuckDB lists the bucket and reads matching keys in parallel. - **Predicate pushdown is automatic.** DuckDB will only read the row groups whose Parquet statistics overlap your `WHERE` clause. A query that filters by `day >= '2026-06-01'` against a year of daily Parquet files reads ~1/12 of the bytes. - **The row type is yours to declare.** `$queryRaw` doesn't introspect Parquet — forge has no schema for the file. Type the generic and `select` the columns explicitly. The mirror operation — write a query result to disk — is `COPY TO`: ```ts await db.$executeRaw` COPY ( SELECT * FROM events WHERE org_id = ${orgId} ) TO ${`/exports/events-${orgId}.parquet`} (FORMAT PARQUET, COMPRESSION ZSTD) `; ``` This writes a Parquet file on the DuckDB process's host. If forge is running on the API node, that's the API node's disk — keep an eye on free space. For cloud exports use `s3://` paths in the `TO` target; DuckDB's HTTPFS extension handles the upload. To ingest a Parquet file into a forge-managed table, the `INSERT … SELECT` form works: ```ts await db.$executeRaw` INSERT INTO events (org_id, kind, at) SELECT org_id, kind, at FROM read_parquet(${'s3://my-bucket/events/2026-06-*.parquet'}) WHERE at >= ${cutoff} `; ``` DuckDB casts column types from the Parquet schema to the destination column types at ingest time. If the source has a `BIGINT` for `org_id` and the destination is `VARCHAR`, DuckDB applies the cast (or fails with a `Conversion Error` — forge translates this to `P2007`). CSV is the same shape with `read_csv`: ```ts const rows = await db.$queryRaw<{ sku: string; qty: number }>` SELECT sku, qty FROM read_csv(${`/tmp/inventory.csv`}, header = true) `; ``` `read_csv_auto` is the variant that infers types and header presence. For one-shot reads of unknown CSVs that's the right call; for known schemas the explicit `read_csv(..., columns = { sku: 'VARCHAR', qty: 'INTEGER' })` is faster and gives a clear error if a file drifts off-spec. The full Parquet / CSV reference is the DuckDB docs; what matters for forge is that these are just functions inside a SELECT, and `$queryRaw` is the seam. For the full `$queryRaw` / `$executeRaw` reference see [RAW-SQL.md → DuckDB patterns](./RAW-SQL.md#duckdb-patterns). --- ## S3 and HTTPFS The `httpfs` extension is the thing that turns `s3://…` URLs into something DuckDB can read. Install and configure it once per process: ```ts await db.$executeRaw`INSTALL httpfs`; await db.$executeRaw`LOAD httpfs`; ``` `INSTALL` is a no-op if the extension is cached locally. `LOAD` makes the table functions (`read_parquet` / `read_csv`) accept `s3://`, `https://`, and `gcs://` URLs. The adapter does **not** auto-load `httpfs` (unlike `spatial`) because S3 credentials are out-of-band — loading without credentials would defer the error to first read. Once loaded, set credentials via a SECRET (the modern DuckDB API, ≥ 0.10) or via session settings (the older API): ```ts // Modern — DuckDB ≥ 0.10. Persists in the database file. await db.$executeRaw` CREATE OR REPLACE SECRET s3_main ( TYPE S3, KEY_ID ${process.env.AWS_ACCESS_KEY_ID}, SECRET ${process.env.AWS_SECRET_ACCESS_KEY}, REGION ${process.env.AWS_REGION ?? 'us-east-1'} ) `; // Legacy — session-only settings. Don't persist across reconnect. await db.$executeRaw`SET s3_region = ${process.env.AWS_REGION}`; await db.$executeRaw`SET s3_access_key_id = ${process.env.AWS_ACCESS_KEY_ID}`; await db.$executeRaw`SET s3_secret_access_key = ${process.env.AWS_SECRET_ACCESS_KEY}`; ``` The SECRET form survives a reconnect because DuckDB stores it in the catalog. The SET form has to be re-issued every time the process restarts; wrap it in a one-shot init function and call it before the first S3 read. `httpfs` also covers plain HTTPS — `read_parquet('https://example.com/data.parquet')` works without any credentials. For partner data drops over signed URLs this is the friction-free path. The signature lives in the URL string which is passed as a parameter through the template, so it's bound not spliced. For private S3-compatible stores (Cloudflare R2, Backblaze B2, MinIO) override the endpoint: ```ts await db.$executeRaw` CREATE OR REPLACE SECRET r2_main ( TYPE S3, KEY_ID ${R2_KEY}, SECRET ${R2_SECRET}, ENDPOINT 'r2.cloudflarestorage.com', REGION 'auto', URL_STYLE 'path' ) `; ``` `URL_STYLE path` is the change for non-AWS providers; the default `vhost` style only works with `bucket.s3.region.amazonaws.com`. --- ## Vector — `vss` and the `FLOAT[N]` column DuckDB has native fixed-size arrays — `FLOAT[1536]` is a real column type, not a JSON blob. The `vss` extension adds an HNSW index over those arrays. Forge's `f.vector(N, { metric })` maps directly: ```ts const Document = m({ id: f.id('cuid'), text: f.text(), embedding: f.vector({ dims: 1536, metric: 'cosine' }), }); await db.document.create({ data: { text, embedding } }); const hits = await db.document.findMany({ where: { embedding: { near: { vector: q, withinDistance: 0.3 } } }, orderBy: { embedding: { nearTo: { vector: q } } }, take: 10, }); ``` The dialect emits: - **Column type** — `FLOAT[N]` where `N` is the declared dim count. - **Inline literals** — `[0.1, 0.2, …]::FLOAT[N]` (cast required for DuckDB's type inference). - **Distance functions** — `array_cosine_distance`, `array_distance` (L2), or `array_inner_product` (dot) chosen by the field's `metric`. - **Index** — `CREATE INDEX … USING HNSW (col)`. The `vss` extension has to be installed; it doesn't auto-load. To use HNSW: ```ts await db.$executeRaw`INSTALL vss`; await db.$executeRaw`LOAD vss`; ``` Then push the schema. The forge DDL emits `USING HNSW` for any `indexes: [{ on: 'embedding', method: 'vector' }]` declaration; DuckDB takes it from there. A working-set caveat: HNSW indexes in `vss` are in-memory. The `SET hnsw_enable_experimental_persistence = true` flag persists them to disk, but as of 1.0 the persistence story is opt-in and slower at build time. For analytics workloads where the index rebuilds on process start, leave persistence off and rebuild from the column on boot. DuckDB-vss handles the 100k-vector / 1M-vector tier comfortably and matches sqlite-vec brute-force at the small end. For 10M+ vectors with a heavy query rate, pgvector + a real Postgres host is the better fit. The full per-dialect benchmark sits in [VECTOR.md → Picking a dialect](./VECTOR.md#picking-a-dialect). --- ## Geo — the `spatial` extension The `spatial` extension ships with DuckDB ≥ 0.9 and is **auto-loaded by the adapter** on connect: ```ts await this._driver.query('INSTALL spatial', []); await this._driver.query('LOAD spatial', []); ``` Failures are non-fatal — non-geo schemas keep working. If you declared `f.geoPoint()` and the extension is missing at runtime, the first geo query throws a clear DuckDB error (the adapter does not rewrite this — see [DOCTOR.md](./DOCTOR.md) for the live probe). The dialect emits: | Operation | Emit | |---|---| | Column | `GEOMETRY` (native type from the spatial extension) | | Insert | `ST_Point($lng, $lat)` — 2D; `ST_Point3D($lng, $lat, $alt)` when `dims: 3` | | `near` filter | `ST_Distance_Sphere(col, ST_Point($lng, $lat)) < $meters` | | `nearTo` orderBy | `ORDER BY ST_Distance_Sphere(col, ST_Point($lng, $lat))` | | `withinPolygon` | `ST_Within(col, ST_GeomFromText($wkt::VARCHAR))` | | Spatial index | `CREATE INDEX … USING RTREE (col)` | A few DuckDB-specific notes: - **`ST_Distance_Sphere` is the only distance function the dialect uses.** This is a great-circle approximation, accurate to ~0.2% for terrestrial distances. For ellipsoid precision (PostGIS-quality) there's no DuckDB primitive — drop to raw SQL with `ST_Distance(ST_Transform(col, …))` and the relevant SRID. - **`ST_GeomFromText` rejects untyped parameters.** The dialect casts the WKT param to `VARCHAR` explicitly — DuckDB's prepared-statement inference defaults to `ANY` and the geometry parser won't accept that. If you write raw geo SQL, cast your WKT params. - **The R-Tree index is the spatial-index family.** Forge's `method: 'spatial'` resolves to `USING RTREE`. Range queries against bounding boxes use the index; point-in-polygon and distance queries use the index for the candidate-set prefilter and then refine in the post-filter. - **Fallback mode is supported.** `f.geoPoint({ fallback: true })` stores the point as JSON (`{ lng, lat }`) and runs the filter in application JS via the shared Haversine helper. Useful for tests, edge runtimes without the spatial extension, or schemas that need to portably target a DuckDB instance that hasn't loaded `spatial`. The 3D coordinate support (`dims: 3`) emits `ST_Point3D` for inserts but distance calculation is still 2D-on-sphere — DuckDB has no `Z`-aware sphere distance. If you need true 3D distance (rare outside drone / aerospace), compute it in JS. The full geo reference, including the per-dialect feature matrix (R-Tree vs GiST vs 2dsphere) sits in [GEO.md → Dialect feature matrix](./GEO.md#dialect-feature-matrix). --- ## JSON — native type and JSON-path DuckDB has a native `JSON` logical type — it stores as text internally but presents as a typed column with dedicated functions. Forge's `f.json()`, `f.embed()`, and `f.embedMany()` all map to it. The path operator the dialect emits is `json_extract`: ```ts const adults = await db.user.findMany({ where: { profile: { path: ['age'], gte: 18 }, }, }); // Compiles to: // SELECT … FROM "user" WHERE json_extract("profile", '$.age') >= $1 ``` The `$.a.b[0].c` path string is built by the dialect's `jsonPathExpr` — numeric segments become `[N]`, named segments become `.name`. Same shape as SQLite, which makes the dialect-mux easier when you target both. JSON-specific notes for DuckDB: - **The JSON extension auto-loads.** Since DuckDB 0.9 the `json` extension is bundled and loaded on first use — no `INSTALL json` call needed. - **Expression indexes on JSON paths work.** A declaration like `indexes: [{ expr: { duckdb: `(json_extract(meta, '$.tier'))` } }]` builds a real index, and the planner uses it for equality matches. See [JSON-PATH.md → DuckDB — expression indexes](./JSON-PATH.md#duckdb--expression-indexes-analytics-workloads). - **`STRUCT`-style dot access works on JSON.** `meta.profile.age` inside raw SQL is shorthand for `json_extract(meta, '$.profile.age')`. Forge's typed query API still emits the function form for clarity and parameter binding; if you write raw SQL you can use either. - **`json_each` for unnesting.** `SELECT … FROM users, json_each(users.tags)` flattens a JSON array into rows. Useful for "users with any tag in this set" queries that don't fit the typed `embedMany` operator vocabulary. The full JSON-path reference, including the per-dialect operator table and the index-time tradeoffs, sits in [JSON-PATH.md → DuckDB](./JSON-PATH.md#duckdb). For the embed / embedMany API surface and how it maps to JSON on DuckDB see [EMBED.md → DuckDB](./EMBED.md#duckdb). --- ## Full-text search — `fts` and the ILIKE fallback DuckDB has an `fts` extension (`INSTALL fts; LOAD fts;`) but the API shape is opt-in per-table via `PRAGMA create_fts_index`, which doesn't map cleanly onto forge's `.searchable()` column flag. The DuckDB dialect's `searchClause` therefore falls back to `ILIKE`: ```ts const hits = await db.post.findMany({ where: { title: { contains: q, mode: 'insensitive' } }, }); // Emit: // SELECT … FROM "post" WHERE "title" ILIKE '%' || $1 || '%' ``` This works without any extensions and is portable to schemas that have not enabled `fts`. The downside is no index — it's a sequential scan of every row's `title`. For a thousand-row table that's fine; for a million-row corpus it isn't. If you want real FTS on DuckDB, the pattern is to build the index yourself via `PRAGMA create_fts_index` and query through `$queryRaw`: ```ts // One-time setup — after `LOAD fts`. await db.$executeRaw`PRAGMA create_fts_index('post', 'id', 'title', 'body')`; // Query — uses the match_bm25() scoring function. const hits = await db.$queryRaw<{ id: string; score: number }>` SELECT p.id, fts_main_post.match_bm25(p.id, ${q}) AS score FROM post p WHERE fts_main_post.match_bm25(p.id, ${q}) IS NOT NULL ORDER BY score DESC LIMIT 20 `; ``` The `fts_main_post` schema and the `match_bm25` function names come from `create_fts_index`. Treat this as opt-in per table, re-build after bulk loads (DuckDB's FTS index is a snapshot — it does not auto-update on insert). The full FTS reference, including the per-dialect index-mode table and the BM25 / TF-IDF scoring story, is in [FTS.md → DuckDB and MSSQL](./FTS.md#duckdb-and-mssql). --- ## `EXPORT DATABASE` — backup and round-trip DuckDB's backup format is `EXPORT DATABASE`. It writes every table to Parquet (or CSV, your choice), plus a `schema.sql` and a `load.sql`: ```ts await db.$executeRaw`EXPORT DATABASE ${'/backups/2026-06-24'} (FORMAT PARQUET)`; ``` The output directory holds: ``` /backups/2026-06-24/ schema.sql # CREATE TABLE / CREATE INDEX statements load.sql # COPY FROM '' (FORMAT PARQUET)
.parquet # one Parquet per table ``` To restore into a fresh DuckDB: ```ts const fresh = await createDb({ schema, type: 'duckdb', url: 'duckdb:./restored.duckdb' }); await fresh.$executeRaw`IMPORT DATABASE ${'/backups/2026-06-24'}`; ``` The export is portable — you can move it between machines, store it in S3 (Parquet files compress well with ZSTD), and reimport into any DuckDB version that understands the Parquet schema. There is no PITR (it's a snapshot) and no incremental form (it's a full dump every time). For backup-cadence guidance and the full per-dialect strategy see [BACKUP-RESTORE.md → DuckDB](./BACKUP-RESTORE.md#duckdb). A simpler alternative is to copy the `.duckdb` file: ```bash # Only safe when the DB is closed — call db.$disconnect() first. cp ./analytics.duckdb ./backups/analytics-$(date +%F).duckdb ``` While the DB is open the file is mid-write and a copy may be inconsistent. `EXPORT DATABASE` runs inside an implicit transaction so the snapshot is consistent even if the DB is hot. --- ## `ATTACH` — Postgres / SQLite / MySQL as a catalog The headline feature: DuckDB can ATTACH another database — another DuckDB file, a Postgres server, a SQLite file, a MySQL server, a DuckDB-over-HTTPS endpoint — as a catalog, and then query it through the same SQL surface. This is the mechanism for "join the analytics warehouse against the OLTP store in one query" without moving the data. ### Attach another DuckDB file ```ts await db.$executeRaw`ATTACH ${'/data/archive.duckdb'} AS archive (READ_ONLY)`; const merged = await db.$queryRaw` SELECT u.id, u.email, a.tier FROM main."user" u JOIN archive."tier_history" a ON a.user_id = u.id `; ``` The `READ_ONLY` flag is the safe default for a secondary DuckDB — attaching read-write requires that no other process holds the file. ### Attach Postgres ```ts await db.$executeRaw`INSTALL postgres`; await db.$executeRaw`LOAD postgres`; await db.$executeRaw` ATTACH ${process.env.OLTP_DATABASE_URL} AS app (TYPE postgres, READ_ONLY) `; const dashboard = await db.$queryRaw<{ day: string; revenue: number }>` SELECT date_trunc('day', o.created_at) AS day, SUM(o.total) AS revenue FROM app.public."order" o WHERE o.created_at >= ${cutoff} GROUP BY 1 ORDER BY 1 `; ``` The Postgres extension pushes predicates and projections down to Postgres — the `WHERE o.created_at >= …` runs server-side, not in DuckDB. DuckDB receives only the matching rows and computes the aggregate locally. For dashboard queries that touch a slice of a large OLTP table, this is dramatically cheaper than dumping the table to Parquet first. What it does **not** do: DuckDB's plan visibility into Postgres is limited. A complex join that crosses the attach boundary may run slower than the same query running directly inside Postgres. Use `EXPLAIN` and check what's pushed down before relying on this in a hot path. ### Attach SQLite ```ts await db.$executeRaw`INSTALL sqlite`; await db.$executeRaw`LOAD sqlite`; await db.$executeRaw`ATTACH ${'./local.sqlite'} AS local (TYPE sqlite, READ_ONLY)`; const rows = await db.$queryRaw` SELECT * FROM local.users WHERE active `; ``` Same shape. The SQLite extension reads SQLite files directly — useful for joining an app-local SQLite cache against a Parquet archive. ### Attach MySQL ```ts await db.$executeRaw`INSTALL mysql`; await db.$executeRaw`LOAD mysql`; await db.$executeRaw` ATTACH ${process.env.MYSQL_URL} AS mysql_app (TYPE mysql, READ_ONLY) `; ``` ATTACH is the feature that turns DuckDB from "another database" into "the join layer between every database you already have". The forge typed surface doesn't expose ATTACH directly (it's an out-of-IR operation), but every typed read works against the attached catalogs through `$queryRaw`. ### Cross-DB joins through forge A common shape: forge's typed reads against the local DuckDB tables, mixed with raw cross-catalog joins for the analytical query: ```ts // Typed: list the local analytics segments. const segments = await db.segment.findMany({ where: { tenant_id } }); // Raw: join against the attached Postgres OLTP to enrich. const enriched = await db.$queryRaw<{ id: string; segment_id: string; user_count: number }>` SELECT s.id AS segment_id, s.tenant_id AS id, COUNT(u.id) AS user_count FROM main."segment" s JOIN app.public."user" u ON u.tenant_id = s.tenant_id WHERE s.tenant_id = ${tenant_id} GROUP BY 1, 2 `; ``` The typed query reads from the DuckDB-native `segment` table. The raw query reads the same DuckDB-native table and joins it against the ATTACHed Postgres `user` table. Both calls go through the same forge client. --- ## Persistent vs in-memory DuckDB `duckdb::memory:` opens an in-process DB that vanishes on `close()`. `duckdb:./foo.duckdb` opens a persistent file. When to use each: - **In-memory** — tests, ephemeral ETL jobs, request-scoped staging, a "query Parquet from S3 and return a result" handler. Fast to open (~50 ms), no disk I/O, no cleanup. The whole DB lives in RAM so it's bounded by your `memory_limit` config; default is 80% of system RAM. - **Persistent file** — the analytics cache pattern, the materialised view pattern, anything where the DB survives the process. A `.duckdb` file is one artefact (no WAL, no journal, no segment files) and is safe to back up via `EXPORT DATABASE` or — while the process is closed — a plain `cp`. A useful middle ground: open in-memory, ATTACH a persistent file as a secondary catalog, and write the persistent state to that catalog. ```ts const db = await createDb({ schema, type: 'duckdb', url: 'duckdb::memory:' }); await db.$executeRaw`ATTACH ${'./cache.duckdb'} AS cache`; await db.$executeRaw` CREATE OR REPLACE TABLE cache.daily_revenue AS SELECT day, SUM(revenue) AS revenue FROM read_parquet(${'s3://reports/2026-06-*.parquet'}) GROUP BY 1 `; ``` The in-memory DB is the workspace; the persistent file holds the cached result. Process restart drops the workspace and keeps the cache. --- ## Window functions, aggregations, CTEs DuckDB's planner is built for analytical queries. Three categories where it pays off and forge's typed surface doesn't fully cover: ### Window functions Forge's `findMany` does not emit `OVER (…)`. For rank, lag, percentile, moving-window queries reach for `$queryRaw`: ```ts const ranked = await db.$queryRaw<{ user_id: string; revenue: number; rank: number }>` SELECT user_id, SUM(amount) AS revenue, ROW_NUMBER() OVER (PARTITION BY tenant_id ORDER BY SUM(amount) DESC) AS rank FROM "order" WHERE created_at >= ${cutoff} GROUP BY tenant_id, user_id `; ``` DuckDB vectorises window operators. Even an unindexed `ORDER BY …` window over a million-row table runs in tens of milliseconds. ### Aggregations beyond `_sum` / `_avg` The typed `groupBy` aggregator vocabulary covers `_count` / `_avg` / `_sum` / `_min` / `_max`. DuckDB ships much more — `quantile_cont`, `approx_count_distinct`, `histogram`, `corr`, `regr_slope`, the full statistical surface. All of those live in raw SQL. ```ts const distribution = await db.$queryRaw<{ bucket: number; n: number }>` SELECT quantile_cont(amount, [0.5, 0.9, 0.95, 0.99]) AS quantiles, approx_count_distinct(user_id) AS distinct_users FROM "order" WHERE created_at >= ${cutoff} `; ``` For analytical dashboards `approx_count_distinct` (HyperLogLog) is the right call for cardinality — `COUNT(DISTINCT)` is exact and slow, the approx variant is one pass over the data. ### CTEs and recursive CTEs CTEs work via raw SQL. DuckDB supports `WITH RECURSIVE` for graph / tree traversal: ```ts const tree = await db.$queryRaw<{ id: string; depth: number }>` WITH RECURSIVE tree(id, parent_id, depth) AS ( SELECT id, parent_id, 0 FROM "category" WHERE parent_id IS NULL UNION ALL SELECT c.id, c.parent_id, t.depth + 1 FROM "category" c JOIN tree t ON c.parent_id = t.id ) SELECT id, depth FROM tree ORDER BY depth `; ``` The forge typed surface does not generate recursive CTEs (it's an explicit out-of-scope feature — see [RELATIONS.md](./RELATIONS.md) for the parent/child pattern that forge does cover). Raw SQL is the escape hatch. --- ## Common errors and limitations The DuckDB adapter's `errors.ts` maps DuckDB's `errorType` and message patterns to Prisma-style P-codes. The high-frequency cases: | DuckDB error | Forge code | Cause | |---|---|---| | `Constraint Error: duplicate key … UNIQUE constraint` | `P2002` | `nativeUpsert` not used; insert into a unique column. Convert to `upsert()` or catch and handle. | | `Constraint Error: NOT NULL` | `P2011` | Insert/update with `null` against a `NOT NULL` column. Schema fields without `.nullable()` emit NOT NULL. | | `Constraint Error: CHECK constraint` | `P2004` | Likely an enum mismatch — forge enums emit `CHECK ("col" IN ('a','b','c'))`. Add the value to the schema's enum and push. | | `Constraint Error: foreign key` | `P2003` | Rare — DuckDB doesn't enforce FKs at write time. This fires if you wrote a check via raw SQL. | | `Catalog Error: Table … does not exist` | `P2021` | Schema not pushed, wrong DB attached, or you're querying an `ATTACH`ed catalog without the qualifier. | | `Catalog Error: column … does not exist` | `P2022` | Column was renamed in schema but the migration didn't run. `npx forge diff` to see the drift. | | `Conversion Error` | `P2007` | Type mismatch — e.g. inserting a string into an INTEGER column. Check the offending field's schema kind. | DuckDB-specific limitations the forge adapter handles silently: - **Partial indexes are not supported.** The DDL emitter strips the `WHERE` clause off a `CREATE INDEX` and warns. The unique constraint degrades to a plain unique — a row-uniqueness check rebroadcasts at the wrapper layer. See [INDEXES.md](./INDEXES.md) for the partial-unique rebroadcast detail. - **Covering indexes (`INCLUDE`) are not supported.** Stripped at DDL emit time with a warning. Use a normal index — DuckDB's columnar storage means an index lookup followed by row fetch is already fast. - **Foreign keys are not enforced at write time.** Accepted at DDL time so introspection sees them. Cascades go through forge's app-side walker. - **No `SAVEPOINT`.** Nested `$transaction` calls collapse into the outer transaction. - **`fts` is opt-in.** `.searchable()` falls back to `ILIKE`. A foot-gun the adapter cannot prevent: DuckDB is **single-writer**. If two Node processes open the same `.duckdb` file with write permissions, the second one's `query()` blocks on the file lock or fails outright. The typical production layout is one DuckDB-owning process (a worker, a cron, a long-running API node) and any other processes read via `ATTACH ' DATABASE … (READ_ONLY)'`. Plan for this up front — it's the difference between "embedded analytics that just works" and "the second deploy collides with the first". --- ## Three worked examples ### (a) Ingest a CSV → typed query Local CSV file dropped by a vendor, want it queryable through the typed surface for the rest of the request. ```ts // schema.ts export const Inventory = m({ sku: f.string().unique(), name: f.string(), qty: f.int(), cost: f.decimal({ precision: 12, scale: 2 }), }); // load.ts import { db } from './client'; export async function loadInventory(csvPath: string): Promise { // Stage into a forge-managed table via INSERT … SELECT. const affected = await db.$executeRaw` INSERT INTO "inventory" (sku, name, qty, cost) SELECT sku, name, qty, cost FROM read_csv(${csvPath}, header = true, columns = { sku: 'VARCHAR', name: 'VARCHAR', qty: 'INTEGER', cost: 'DECIMAL(12,2)' } ) ON CONFLICT (sku) DO UPDATE SET name = EXCLUDED.name, qty = EXCLUDED.qty, cost = EXCLUDED.cost `; return affected; } // query.ts — typed surface from here on. const lowStock = await db.inventory.findMany({ where: { qty: { lt: 10 } }, orderBy: { sku: 'asc' }, }); ``` The `read_csv` declares the column types explicitly so DuckDB doesn't have to infer. The `ON CONFLICT (sku) DO UPDATE` makes the ingest idempotent — a re-run with the same CSV returns the same row count and overwrites changed values. ### (b) Parquet on S3 → join with local Postgres via ATTACH Analytics dashboard: revenue rollup from a Parquet archive in S3, joined with the live customer table in the OLTP Postgres. ```ts import { createDb } from 'forge-orm'; import { schema } from './schema'; const db = await createDb({ schema, type: 'duckdb', url: 'duckdb::memory:', // in-process, no persistence }); // One-time per process: load extensions and credentials. await db.$executeRaw`INSTALL httpfs; LOAD httpfs`; await db.$executeRaw`INSTALL postgres; LOAD postgres`; await db.$executeRaw` CREATE OR REPLACE SECRET s3 ( TYPE S3, KEY_ID ${process.env.AWS_ACCESS_KEY_ID}, SECRET ${process.env.AWS_SECRET_ACCESS_KEY}, REGION ${process.env.AWS_REGION} ) `; await db.$executeRaw` ATTACH ${process.env.OLTP_DATABASE_URL} AS app (TYPE postgres, READ_ONLY) `; // The query — one statement that hits S3 and Postgres in parallel. const rollup = await db.$queryRaw<{ customer_id: string; customer_name: string; revenue: number; }>` WITH revenue AS ( SELECT customer_id, SUM(amount) AS revenue FROM read_parquet(${'s3://archive/orders/year=2026/*.parquet'}) WHERE day >= ${cutoff} GROUP BY 1 ) SELECT c.id AS customer_id, c.name AS customer_name, r.revenue FROM revenue r JOIN app.public."customer" c ON c.id = r.customer_id ORDER BY r.revenue DESC LIMIT 100 `; ``` Three databases participated: the S3 Parquet archive (read via HTTPFS), the OLTP Postgres (via ATTACH, predicate-pushed), and the in-memory DuckDB (the workspace where the join lands). The Node process holds a single forge client. ### (c) Analytical dashboard backed by an embedded DuckDB cache API endpoint that serves a dashboard. The cache is a persistent DuckDB file refreshed nightly; reads are sub-millisecond. ```ts // nightly-refresh.ts — cron job that rebuilds the cache. import { createDb } from 'forge-orm'; import { schema } from './schema'; export async function refreshCache() { const db = await createDb({ schema, type: 'duckdb', url: 'duckdb:./cache.duckdb', }); // Pull yesterday's events from OLTP (via ATTACH) and replace the daily table. await db.$executeRaw`INSTALL postgres; LOAD postgres`; await db.$executeRaw` ATTACH ${process.env.OLTP_DATABASE_URL} AS app (TYPE postgres, READ_ONLY) `; await db.$transaction(async (tx) => { await tx.$executeRaw`DELETE FROM "daily_revenue" WHERE day >= ${dayBefore}`; await tx.$executeRaw` INSERT INTO "daily_revenue" (day, tenant_id, revenue, orders) SELECT date_trunc('day', created_at) AS day, tenant_id, SUM(total) AS revenue, COUNT(*) AS orders FROM app.public."order" WHERE created_at >= ${dayBefore} GROUP BY 1, 2 `; }); await db.$disconnect(); } // api.ts — request handler reads the cache through the typed surface. import { createDb } from 'forge-orm'; import { schema } from './schema'; const dashboardDb = await createDb({ schema, type: 'duckdb', url: 'duckdb:./cache.duckdb', }); export async function dashboardHandler(tenantId: string) { return dashboardDb.dailyRevenue.findMany({ where: { tenant_id: tenantId }, orderBy: { day: 'desc' }, take: 30, }); } ``` The cache is a typed forge model (`f.json()` or any other type, defined in `schema`). Reads use the typed surface — `findMany`, `where`, `orderBy` — and benefit from the columnar layout for the rollup operations the dashboard runs. The refresh job rebuilds the rollup once per day; the API node serves it. The OLTP Postgres never sees a dashboard query. The DuckDB cache lives on the API node's local disk. Total cost: zero additional infra. --- ## Cross-links - [DRIVERS.md](./DRIVERS.md) — the `DuckdbDriver` port shape, param coercion, and the bring-your-own-driver story. - [QUERIES.md](./QUERIES.md) — the typed `findMany` / `where` operator surface that the DuckDB executor implements. - [RAW-SQL.md → DuckDB patterns](./RAW-SQL.md#duckdb-patterns) — the `$queryRaw` / `$executeRaw` examples for Parquet, COPY TO, EXPORT DATABASE, STRUCT extraction. - [VECTOR.md](./VECTOR.md) — `f.vector(N)` and the `vss` extension; the per-dialect benchmark table. - [GEO.md](./GEO.md) — `f.geoPoint()`, the spatial extension, R-Tree indexes, and the per-dialect feature matrix. - [JSON-PATH.md → DuckDB](./JSON-PATH.md#duckdb) — `json_extract`, expression indexes, the typed `path` operator. - [FTS.md → DuckDB and MSSQL](./FTS.md#duckdb-and-mssql) — the `fts` extension, `match_bm25`, and the ILIKE fallback. - [EMBED.md → DuckDB](./EMBED.md#duckdb) — `f.embed()` / `f.embedMany()` emit to native JSON. - [TYPES.md](./TYPES.md) — the full type table across all six dialects. - [TRANSACTIONS.md](./TRANSACTIONS.md) — savepoint, isolation, and retry semantics (DuckDB row in the per-dialect tables). - [MIGRATIONS.md](./MIGRATIONS.md) — DDL push, `$migrate()` runtime apply, the per-dialect rollback fidelity story. - [DIFF.md → DuckDB](./DIFF.md#duckdb) — drift detection, introspection via `duckdb_*` metadata functions, and the `diff apply` shape. - [BACKUP-RESTORE.md → DuckDB](./BACKUP-RESTORE.md#duckdb) — `EXPORT DATABASE` cadence and the no-PITR story. - [INDEXES.md](./INDEXES.md) — index method support, the partial-index degrade-to-plain rebroadcast. - [DOCTOR.md](./DOCTOR.md) — the live probe that surfaces missing extensions (`spatial`, `vss`, `fts`, `httpfs`).