# IMPLEMENT.md — adding SQLite-over-HTTP search to a GitHub Pages site > **You are reading the right file.** This is the "how do I build what memex > builds" guide for LLM/AI assistants and engineers integrating SQLite + > HTTP byte-range loading into another GitHub Pages project. It captures > non-obvious lessons learned the hard way during memex's development. If > you skip the *Critical pitfalls* section you will almost certainly hit > "database disk image is malformed" and waste hours. ## Contents 1. [What memex actually does](#what-memex-actually-does-one-paragraph) 2. [Two integration paths](#two-integration-paths-vendor-vs-build-from-source) — pick one 3. [Critical pitfalls](#critical-pitfalls-read-this-before-writing-code) (1–9, incl. 4b + 8b) 4. [Architecture](#architecture) 5. [Step-by-step recipe](#step-by-step-adding-this-to-a-new-project) (build path) 6. [Schema design rules under byte-range loading](#schema-design-rules-under-byte-range-loading) 7. [Performance / cost model](#performance--cost-model) 8. [Cross-origin DB access (CORS)](#cross-origin-db-access-cors) 9. [Cache invalidation when the DB updates mid-session](#cache-invalidation-when-the-db-updates-mid-session) 10. [Upgrading memex / bumping the vendored SHA](#upgrading-memex--bumping-the-vendored-sha) 11. [Continuous-deploy smoke test (Playwright CI)](#continuous-deploy-smoke-test-playwright-ci) 12. [Common debugging mistakes](#common-debugging-mistakes) 13. [Alternatives we considered and rejected](#alternatives-we-considered-and-rejected) 14. [Files to read in this repo](#files-to-read-in-this-repo) 15. [TL;DR for an LLM in a hurry](#tldr-for-an-llm-in-a-hurry) --- ## What memex actually does (one paragraph) memex builds a SQLite database (with FTS5 indexes) at CI time, deploys it to GitHub Pages as a single static `.db` file, and the browser queries it via HTTP `Range:` requests — fetching only the pages each query touches, typically **< 1% of the database per query**. No backend, no API, no download-the-whole-file step. The DB can grow to hundreds of MB without penalizing page load. The trick is entirely client-side: `sqlite-wasm-http` exposes a custom SQLite VFS that intercepts page reads and turns them into ranged `fetch()` calls against a static URL. --- ## Two integration paths (vendor vs build-from-source) There are exactly two supported ways to add this to a new project. Pick deliberately — they have different upgrade costs and dependency profiles. ### Path A — Vendor memex's pre-built `dist/wasm/` (recommended for most) memex publishes a webpack-bundled copy of `sqlite-wasm-http` + `sqlite3.wasm` + the `openMemexDb()` wrapper at [`dist/wasm/`](dist/wasm/). Copy those files into your bundle root and load `memex.js` as an ES module: ```html ``` - **Pros**: no npm, no webpack, no `node_modules`. The library patches are already baked in. Works in any static-site pipeline that can `curl` files into a bundle directory. - **Cons**: pinned to memex's release cadence; you cannot customize the library or shrink the WASM binary. - **Real-world example**: FastLED/boards uses this path — its `builders/site.py` downloads `memex.js` + `memex-*.js` + `sqlite3.wasm` from `raw.githubusercontent.com/zackees/memex//dist/wasm/` at build time. See [pitfall #6](#6-two-ways-to-consume-memex-rebuild-from-source-vs-vendor-distwasm) for the SHA-pinning warning. ### Path B — `npm install sqlite-wasm-http` + the patch script If you need to control the library (custom backend type, smaller WASM, extra patches), install the library directly and copy memex's `patch-sqlite-wasm-http.mjs` into your `scripts/`: - **Pros**: full control of the SQLite WASM build (memex's `prepare-wasm.mjs` shrinks `sqlite3.wasm` from ~1.5 MB to ~540 KB gzipped via `wasm-strip` + `wasm-opt -Oz`). Direct access to the promiser API for custom worker pools. - **Cons**: requires Node.js, webpack/bundler of your choice, and ongoing maintenance of the patch script when bumping `sqlite-wasm-http`. - **Recipe**: see [step-by-step](#step-by-step-adding-this-to-a-new-project) below. **If unsure, start with Path A.** The vendored bundle is what memex's own demo page (`pages/`) uses — by definition it works end-to-end on GH Pages. Upgrade by re-downloading at a new SHA. --- ## Critical pitfalls (read this before writing code) These are the things that took the longest to discover. Internalize them. ### 1. GitHub Pages gzips `application/octet-stream` and **ignores Range when it does** When a browser requests `your-db.db` with the default headers (`Accept-Encoding: gzip, deflate, br, zstd`) **and** a `Range:` header, Fastly (the CDN behind GH Pages) does this: 1. Sees `Content-Type: application/octet-stream` → decides to gzip. 2. Sees `Accept-Encoding: gzip` → returns the gzipped representation. 3. **Silently ignores the `Range:` header** → responds with HTTP `200 OK` (not `206 Partial Content`) containing the entire gzipped file. 4. Sets `Content-Length:` to the **gzipped** size and `Content-Range:` is absent. Reproduce it: ```bash curl -fsS -I -X GET -H "Accept-Encoding: gzip" -r 0-4095 \ https://your-site.github.io/your-db.db # HTTP/1.1 200 OK # Content-Length: 664016 ← gzipped FULL size # Content-Encoding: gzip ← compressed # Vary: Accept-Encoding # (no Content-Range — Range was ignored) ``` The library then reads what it thinks is "bytes 0–4095" of the DB but is actually the first 4 KB of a gzip stream. SQLite sees garbage and reports `database disk image is malformed`. **This is not a memex problem, a `sqlite-wasm-http` problem, or a `sql.js-httpvfs` problem.** It is a GitHub Pages + Fastly + range + compression interaction. Any client library that issues ranged requests without the workaround will fail the same way. ### 2. The fix is `Accept-Encoding: identity` Tell the server you do not accept compression. Fastly then serves the file uncompressed and honors `Range`: ```bash curl -fsS -I -X GET -H "Accept-Encoding: identity" -r 0-4095 \ https://your-site.github.io/your-db.db # HTTP/1.1 206 Partial Content # Content-Length: 4096 ← the actual slice # Content-Range: bytes 0-4095/9650176 ← the TRUE uncompressed size # (no Content-Encoding) ``` ### 3. `Accept-Encoding` is a "forbidden" request header — but works anyway Per the WHATWG Fetch spec, JavaScript is not allowed to set `Accept-Encoding`. Both `XMLHttpRequest.setRequestHeader()` and `fetch()` are supposed to silently strip it. **Chrome ignores the spec here.** In worker contexts, both Chrome and Firefox forward `Accept-Encoding: identity` when set via `fetch()` (and in many cases via XHR too). You can verify this in Chrome DevTools → Network → click a request → scroll to "Request Headers" → expand "Provisional headers are shown" or use CDP's `Network.requestWillBeSentExtraInfo`. memex relies on this behavior. If you ever need to support a stricter browser, fall back to chunked mode (see *Alternatives* below). ### 4. Use `sqlite-wasm-http`, **not** `sql.js-httpvfs` Both libraries do the same job in principle, but only one is workable on GH Pages: | | `sqlite-wasm-http` (memex) | `sql.js-httpvfs` (phiresky) | |---|---|---| | Transport | `fetch()` | `XMLHttpRequest` | | Custom headers via config | ✅ `createHttpBackend({ headers: {…} })` | ❌ no public API | | GH Pages out of the box | ✅ with `Accept-Encoding: identity` | ❌ requires monkey-patching the worker | | Worker pool / sync mode | ✅ both | sync only | | Maintenance | Stale (~2023) | Stale (~2023) | Both libraries are unmaintained. memex uses `sqlite-wasm-http` because its headers config is the only natively-supported escape hatch for the gzip problem. ### 4b. `sql.js-httpvfs` cannot be "fixed in place" with an XHR monkey-patch If you already deployed `sql.js-httpvfs` and want to dodge the rewrite, the obvious idea is: prepend a shim to `sqlite.worker.js` that wraps `XMLHttpRequest.prototype.open` and injects `setRequestHeader('Accept-Encoding', 'identity')` on every request. This does **not** work. Verified empirically on Chrome 140 (Headless), Firefox 121, Safari 17: - `XMLHttpRequest.setRequestHeader('Accept-Encoding', anything)` is **silently no-op'd** by all major browsers — including in worker contexts. There is no error thrown; the call simply has no effect. - Wrapping `fetch()` in the worker scope does propagate the header successfully, but the failure mode is "all-or-nothing" — if the library uses XHR (as `sql.js-httpvfs` does), the fetch wrapper does nothing. The wire-level proof (CDP capture from a real Chrome page that loaded a patched `sql.js-httpvfs` worker): ``` REQ accept-encoding='gzip, deflate, br, zstd' range='(no range)' RESP content-encoding='gzip' content-range='(no cr)' content-length='664016' Page status: 'failed: Length of the file not known. It must either be supplied in the config or given by the HTTP server.' ``` That error message — "Length of the file not known" — is the `sql.js-httpvfs` signature for "I got a gzipped 200 instead of a ranged 206 and have no way to know the real file size." If you see it on GH Pages, you are hitting this exact problem. **The real fix is to switch to `sqlite-wasm-http`.** XHR forbids the header in spec AND in browser implementation; `fetch()` only forbids it in spec. memex's library uses `fetch()` and threads the header through `createHttpBackend({ headers })`. That's the supported path. ### 5. The library still needs patching even with custom headers The default size-detection logic in `sqlite-wasm-http` does a HEAD-ish probe whose response cannot be trusted under range/gzip interactions. memex's `pages-src/scripts/patch-sqlite-wasm-http.mjs` rewrites two files in `node_modules/` after install: - `dist/vfs-sync-http.js` — switches the synchronous XHR probe to `Range: bytes=0-0` and reads the true file size from the `Content-Range` header (format `bytes 0-0/`). - `dist/vfs-http-worker.js` — same fix for the async fetch path. Both patches also propagate the user-supplied `headers` option into the probe request, so `Accept-Encoding: identity` is sent during the very first byte too — otherwise the probe itself returns the gzipped full file and the library records the wrong file size. You can read the exact regex replacements at [`pages-src/scripts/patch-sqlite-wasm-http.mjs`](pages-src/scripts/patch-sqlite-wasm-http.mjs). If you ever bump `sqlite-wasm-http`, re-test the regexes — the upstream dist file structure occasionally changes. ### 6. Two ways to consume memex: rebuild from source vs vendor `dist/wasm/` (See [Two integration paths](#two-integration-paths-vendor-vs-build-from-source) at the top for the path-selection discussion.) This pitfall is about the **vendor** path specifically. **Pin to a commit SHA, not `main`.** The chunk filenames (`memex-141.js`, `memex-901.js`, …) are webpack-generated IDs that change every time memex is rebuilt. If you reference `main` and memex republishes, your site breaks because `memex.js` tries to dynamic-import a chunk ID that no longer exists in the published `dist/wasm/`. Example download URL with pinned SHA: ``` https://raw.githubusercontent.com/zackees/memex/03fe8df…/dist/wasm/memex.js ``` When you bump the SHA, re-download **the full set** of `memex-*.js` files and verify the new chunk list at the same path. See [Upgrading memex](#upgrading-memex--bumping-the-vendored-sha) below for the precise procedure. ### 7. memex's `query()` returns columns + rows arrays, not row objects ```js const { columns, rows } = await query(db, 'SELECT a, b FROM t LIMIT 1'); // columns: ['a', 'b'] // rows: [[1, 2]] ``` If your existing code accesses `row.a` / `row.b` (row objects, e.g. from sql.js's `getAsObject()`), wrap `query()` and zip the result into per-column objects: ```js async function rowsAsObjects(db, sql, params) { const res = await query(db, sql, params); return res.rows.map(r => Object.fromEntries(res.columns.map((c, i) => [c, r[i]]))); } ``` Positional `?` placeholders work — pass an array as the third argument. Named `$name` placeholders also work — pass an object instead. ### 8. Local development needs explicit MIME types for module scripts When testing locally with Python's stdlib `http.server` (or `RangeHTTPServer`), `.js` files are served as `Content-Type: text/plain` on Windows because the system mime registry doesn't map `.js` → `application/javascript`. Browsers strictly enforce MIME type for ``. 4. Build `index.db` with FTS5 indexes + `PRAGMA page_size=4096` + `VACUUM`. Never `LIKE '%x%'` over big tables. See [schema rules](#schema-design-rules-under-byte-range-loading). 5. Deploy via `actions/deploy-pages@v4`. No `gh-pages` branch needed. 6. Verify with `curl -r 0-4095 -H "Accept-Encoding: identity" https://owner.github.io/repo/index.db` — must return `206 Partial Content` with `Content-Range: bytes 0-4095/`. If you get `200 OK` instead, the deploy is broken — diagnose **before** opening the browser. 7. Add the [Playwright smoke test](#continuous-deploy-smoke-test-playwright-ci) as a nightly CI job. **Path B — build from source** 1. `npm install sqlite-wasm-http`. 2. Copy memex's `patch-sqlite-wasm-http.mjs` and run it as `postinstall`. 3. Open the DB with `headers: { 'Accept-Encoding': 'identity' }` in `createHttpBackend`. 4. Steps 4–7 from Path A apply identically. **The one critical line in both paths**: `Accept-Encoding: identity`. Without it, GH Pages returns the full gzipped file with HTTP 200 every time you ask for a range — and SQLite reports `database disk image is malformed`. If any step is unclear, re-read [Critical pitfalls](#critical-pitfalls-read-this-before-writing-code). The cost of getting them wrong is hours of "why does SQLite say the DB is malformed when I can `sqlite3 index.db` locally just fine".