# IMPLEMENT.md — adding SQLite-over-HTTP search to a GitHub Pages site
> **You are reading the right file.** This is the "how do I build what memex
> builds" guide for LLM/AI assistants and engineers integrating SQLite +
> HTTP byte-range loading into another GitHub Pages project. It captures
> non-obvious lessons learned the hard way during memex's development. If
> you skip the *Critical pitfalls* section you will almost certainly hit
> "database disk image is malformed" and waste hours.
## Contents
1. [What memex actually does](#what-memex-actually-does-one-paragraph)
2. [Two integration paths](#two-integration-paths-vendor-vs-build-from-source) — pick one
3. [Critical pitfalls](#critical-pitfalls-read-this-before-writing-code) (1–9, incl. 4b + 8b)
4. [Architecture](#architecture)
5. [Step-by-step recipe](#step-by-step-adding-this-to-a-new-project) (build path)
6. [Schema design rules under byte-range loading](#schema-design-rules-under-byte-range-loading)
7. [Performance / cost model](#performance--cost-model)
8. [Cross-origin DB access (CORS)](#cross-origin-db-access-cors)
9. [Cache invalidation when the DB updates mid-session](#cache-invalidation-when-the-db-updates-mid-session)
10. [Upgrading memex / bumping the vendored SHA](#upgrading-memex--bumping-the-vendored-sha)
11. [Continuous-deploy smoke test (Playwright CI)](#continuous-deploy-smoke-test-playwright-ci)
12. [Common debugging mistakes](#common-debugging-mistakes)
13. [Alternatives we considered and rejected](#alternatives-we-considered-and-rejected)
14. [Files to read in this repo](#files-to-read-in-this-repo)
15. [TL;DR for an LLM in a hurry](#tldr-for-an-llm-in-a-hurry)
---
## What memex actually does (one paragraph)
memex builds a SQLite database (with FTS5 indexes) at CI time, deploys it
to GitHub Pages as a single static `.db` file, and the browser queries it
via HTTP `Range:` requests — fetching only the pages each query touches,
typically **< 1% of the database per query**. No backend, no API, no
download-the-whole-file step. The DB can grow to hundreds of MB without
penalizing page load.
The trick is entirely client-side: `sqlite-wasm-http` exposes a custom
SQLite VFS that intercepts page reads and turns them into ranged `fetch()`
calls against a static URL.
---
## Two integration paths (vendor vs build-from-source)
There are exactly two supported ways to add this to a new project. Pick
deliberately — they have different upgrade costs and dependency profiles.
### Path A — Vendor memex's pre-built `dist/wasm/` (recommended for most)
memex publishes a webpack-bundled copy of `sqlite-wasm-http` +
`sqlite3.wasm` + the `openMemexDb()` wrapper at
[`dist/wasm/`](dist/wasm/). Copy those files into your bundle root and
load `memex.js` as an ES module:
```html
```
- **Pros**: no npm, no webpack, no `node_modules`. The library patches
are already baked in. Works in any static-site pipeline that can
`curl` files into a bundle directory.
- **Cons**: pinned to memex's release cadence; you cannot customize the
library or shrink the WASM binary.
- **Real-world example**: FastLED/boards uses this path — its
`builders/site.py` downloads `memex.js` + `memex-*.js` + `sqlite3.wasm`
from `raw.githubusercontent.com/zackees/memex//dist/wasm/` at
build time. See [pitfall #6](#6-two-ways-to-consume-memex-rebuild-from-source-vs-vendor-distwasm)
for the SHA-pinning warning.
### Path B — `npm install sqlite-wasm-http` + the patch script
If you need to control the library (custom backend type, smaller WASM,
extra patches), install the library directly and copy memex's
`patch-sqlite-wasm-http.mjs` into your `scripts/`:
- **Pros**: full control of the SQLite WASM build (memex's
`prepare-wasm.mjs` shrinks `sqlite3.wasm` from ~1.5 MB to ~540 KB
gzipped via `wasm-strip` + `wasm-opt -Oz`). Direct access to the
promiser API for custom worker pools.
- **Cons**: requires Node.js, webpack/bundler of your choice, and
ongoing maintenance of the patch script when bumping
`sqlite-wasm-http`.
- **Recipe**: see [step-by-step](#step-by-step-adding-this-to-a-new-project) below.
**If unsure, start with Path A.** The vendored bundle is what memex's
own demo page (`pages/`) uses — by definition it works end-to-end on
GH Pages. Upgrade by re-downloading at a new SHA.
---
## Critical pitfalls (read this before writing code)
These are the things that took the longest to discover. Internalize them.
### 1. GitHub Pages gzips `application/octet-stream` and **ignores Range when it does**
When a browser requests `your-db.db` with the default headers
(`Accept-Encoding: gzip, deflate, br, zstd`) **and** a `Range:` header,
Fastly (the CDN behind GH Pages) does this:
1. Sees `Content-Type: application/octet-stream` → decides to gzip.
2. Sees `Accept-Encoding: gzip` → returns the gzipped representation.
3. **Silently ignores the `Range:` header** → responds with HTTP `200 OK`
(not `206 Partial Content`) containing the entire gzipped file.
4. Sets `Content-Length:` to the **gzipped** size and `Content-Range:` is
absent.
Reproduce it:
```bash
curl -fsS -I -X GET -H "Accept-Encoding: gzip" -r 0-4095 \
https://your-site.github.io/your-db.db
# HTTP/1.1 200 OK
# Content-Length: 664016 ← gzipped FULL size
# Content-Encoding: gzip ← compressed
# Vary: Accept-Encoding
# (no Content-Range — Range was ignored)
```
The library then reads what it thinks is "bytes 0–4095" of the DB but is
actually the first 4 KB of a gzip stream. SQLite sees garbage and reports
`database disk image is malformed`.
**This is not a memex problem, a `sqlite-wasm-http` problem, or a
`sql.js-httpvfs` problem.** It is a GitHub Pages + Fastly + range +
compression interaction. Any client library that issues ranged requests
without the workaround will fail the same way.
### 2. The fix is `Accept-Encoding: identity`
Tell the server you do not accept compression. Fastly then serves the
file uncompressed and honors `Range`:
```bash
curl -fsS -I -X GET -H "Accept-Encoding: identity" -r 0-4095 \
https://your-site.github.io/your-db.db
# HTTP/1.1 206 Partial Content
# Content-Length: 4096 ← the actual slice
# Content-Range: bytes 0-4095/9650176 ← the TRUE uncompressed size
# (no Content-Encoding)
```
### 3. `Accept-Encoding` is a "forbidden" request header — but works anyway
Per the WHATWG Fetch spec, JavaScript is not allowed to set
`Accept-Encoding`. Both `XMLHttpRequest.setRequestHeader()` and
`fetch()` are supposed to silently strip it.
**Chrome ignores the spec here.** In worker contexts, both Chrome and
Firefox forward `Accept-Encoding: identity` when set via `fetch()` (and
in many cases via XHR too). You can verify this in Chrome DevTools →
Network → click a request → scroll to "Request Headers" → expand
"Provisional headers are shown" or use CDP's `Network.requestWillBeSentExtraInfo`.
memex relies on this behavior. If you ever need to support a stricter
browser, fall back to chunked mode (see *Alternatives* below).
### 4. Use `sqlite-wasm-http`, **not** `sql.js-httpvfs`
Both libraries do the same job in principle, but only one is workable on
GH Pages:
| | `sqlite-wasm-http` (memex) | `sql.js-httpvfs` (phiresky) |
|---|---|---|
| Transport | `fetch()` | `XMLHttpRequest` |
| Custom headers via config | ✅ `createHttpBackend({ headers: {…} })` | ❌ no public API |
| GH Pages out of the box | ✅ with `Accept-Encoding: identity` | ❌ requires monkey-patching the worker |
| Worker pool / sync mode | ✅ both | sync only |
| Maintenance | Stale (~2023) | Stale (~2023) |
Both libraries are unmaintained. memex uses `sqlite-wasm-http` because
its headers config is the only natively-supported escape hatch for the
gzip problem.
### 4b. `sql.js-httpvfs` cannot be "fixed in place" with an XHR monkey-patch
If you already deployed `sql.js-httpvfs` and want to dodge the rewrite,
the obvious idea is: prepend a shim to `sqlite.worker.js` that wraps
`XMLHttpRequest.prototype.open` and injects `setRequestHeader('Accept-Encoding', 'identity')`
on every request. This does **not** work. Verified empirically on
Chrome 140 (Headless), Firefox 121, Safari 17:
- `XMLHttpRequest.setRequestHeader('Accept-Encoding', anything)` is
**silently no-op'd** by all major browsers — including in worker
contexts. There is no error thrown; the call simply has no effect.
- Wrapping `fetch()` in the worker scope does propagate the header
successfully, but the failure mode is "all-or-nothing" — if the
library uses XHR (as `sql.js-httpvfs` does), the fetch wrapper does
nothing.
The wire-level proof (CDP capture from a real Chrome page that loaded
a patched `sql.js-httpvfs` worker):
```
REQ accept-encoding='gzip, deflate, br, zstd' range='(no range)'
RESP content-encoding='gzip' content-range='(no cr)' content-length='664016'
Page status:
'failed: Length of the file not known.
It must either be supplied in the config or given by the HTTP server.'
```
That error message — "Length of the file not known" — is the
`sql.js-httpvfs` signature for "I got a gzipped 200 instead of a
ranged 206 and have no way to know the real file size." If you see it
on GH Pages, you are hitting this exact problem.
**The real fix is to switch to `sqlite-wasm-http`.** XHR forbids the
header in spec AND in browser implementation; `fetch()` only forbids
it in spec. memex's library uses `fetch()` and threads the header
through `createHttpBackend({ headers })`. That's the supported path.
### 5. The library still needs patching even with custom headers
The default size-detection logic in `sqlite-wasm-http` does a HEAD-ish
probe whose response cannot be trusted under range/gzip interactions.
memex's `pages-src/scripts/patch-sqlite-wasm-http.mjs` rewrites two
files in `node_modules/` after install:
- `dist/vfs-sync-http.js` — switches the synchronous XHR probe to
`Range: bytes=0-0` and reads the true file size from the
`Content-Range` header (format `bytes 0-0/`).
- `dist/vfs-http-worker.js` — same fix for the async fetch path.
Both patches also propagate the user-supplied `headers` option into the
probe request, so `Accept-Encoding: identity` is sent during the very
first byte too — otherwise the probe itself returns the gzipped full
file and the library records the wrong file size.
You can read the exact regex replacements at
[`pages-src/scripts/patch-sqlite-wasm-http.mjs`](pages-src/scripts/patch-sqlite-wasm-http.mjs).
If you ever bump `sqlite-wasm-http`, re-test the regexes — the upstream
dist file structure occasionally changes.
### 6. Two ways to consume memex: rebuild from source vs vendor `dist/wasm/`
(See [Two integration paths](#two-integration-paths-vendor-vs-build-from-source)
at the top for the path-selection discussion.) This pitfall is about the
**vendor** path specifically.
**Pin to a commit SHA, not `main`.** The chunk filenames
(`memex-141.js`, `memex-901.js`, …) are webpack-generated IDs that
change every time memex is rebuilt. If you reference `main` and memex
republishes, your site breaks because `memex.js` tries to dynamic-import
a chunk ID that no longer exists in the published `dist/wasm/`. Example
download URL with pinned SHA:
```
https://raw.githubusercontent.com/zackees/memex/03fe8df…/dist/wasm/memex.js
```
When you bump the SHA, re-download **the full set** of `memex-*.js`
files and verify the new chunk list at the same path. See
[Upgrading memex](#upgrading-memex--bumping-the-vendored-sha) below for
the precise procedure.
### 7. memex's `query()` returns columns + rows arrays, not row objects
```js
const { columns, rows } = await query(db, 'SELECT a, b FROM t LIMIT 1');
// columns: ['a', 'b']
// rows: [[1, 2]]
```
If your existing code accesses `row.a` / `row.b` (row objects, e.g.
from sql.js's `getAsObject()`), wrap `query()` and zip the result into
per-column objects:
```js
async function rowsAsObjects(db, sql, params) {
const res = await query(db, sql, params);
return res.rows.map(r => Object.fromEntries(res.columns.map((c, i) => [c, r[i]])));
}
```
Positional `?` placeholders work — pass an array as the third argument.
Named `$name` placeholders also work — pass an object instead.
### 8. Local development needs explicit MIME types for module scripts
When testing locally with Python's stdlib `http.server` (or
`RangeHTTPServer`), `.js` files are served as `Content-Type: text/plain`
on Windows because the system mime registry doesn't map `.js` →
`application/javascript`. Browsers strictly enforce MIME type for
``.
4. Build `index.db` with FTS5 indexes + `PRAGMA page_size=4096` + `VACUUM`. Never `LIKE '%x%'` over big tables. See [schema rules](#schema-design-rules-under-byte-range-loading).
5. Deploy via `actions/deploy-pages@v4`. No `gh-pages` branch needed.
6. Verify with `curl -r 0-4095 -H "Accept-Encoding: identity" https://owner.github.io/repo/index.db` — must return `206 Partial Content` with `Content-Range: bytes 0-4095/`. If you get `200 OK` instead, the deploy is broken — diagnose **before** opening the browser.
7. Add the [Playwright smoke test](#continuous-deploy-smoke-test-playwright-ci) as a nightly CI job.
**Path B — build from source**
1. `npm install sqlite-wasm-http`.
2. Copy memex's `patch-sqlite-wasm-http.mjs` and run it as `postinstall`.
3. Open the DB with `headers: { 'Accept-Encoding': 'identity' }` in `createHttpBackend`.
4. Steps 4–7 from Path A apply identically.
**The one critical line in both paths**: `Accept-Encoding: identity`.
Without it, GH Pages returns the full gzipped file with HTTP 200 every
time you ask for a range — and SQLite reports `database disk image is
malformed`. If any step is unclear, re-read [Critical pitfalls](#critical-pitfalls-read-this-before-writing-code).
The cost of getting them wrong is hours of "why does SQLite say the DB
is malformed when I can `sqlite3 index.db` locally just fine".