# Backup and restore forge-orm doesn't ship backup tools — you use the native facility of whichever database you connect to. This page collects the patterns that work for each dialect, plus the operational rules (restore drills, schema-mismatch handling, encrypted off-site copies) that turn raw backup files into a working disaster-recovery story. Companion to **[MIGRATIONS.md](./MIGRATIONS.md)** (how the schema gets to the DB in the first place) and **[ROLLBACK.md](./ROLLBACK.md)** (how to undo a recent schema apply). Backup is the floor under both of those — when a rollback can't recover the data, when a schema apply corrupted a table you didn't notice for a week, the restore path is what's left. If you're reading this in an incident, jump to [Restore drill — a script you can lift](#restore-drill--a-script-you-can-lift) and come back for the rest. ## Contents * [Why forge stays out of backup](#why-forge-stays-out-of-backup) * [Backup formats — logical vs physical](#backup-formats--logical-vs-physical) * [Per-dialect backup primitives](#per-dialect-backup-primitives) * [Postgres](#postgres) * [MySQL / MariaDB](#mysql--mariadb) * [SQLite](#sqlite) * [DuckDB](#duckdb) * [MSSQL](#mssql) * [MongoDB](#mongodb) * [PITR — point-in-time recovery per dialect](#pitr--point-in-time-recovery-per-dialect) * [Schema-only vs data-only vs full](#schema-only-vs-data-only-vs-full) * [Restore + schema mismatch](#restore--schema-mismatch) * [Encrypted backups](#encrypted-backups) * [Cloud-managed backup — what it covers and what it doesn't](#cloud-managed-backup--what-it-covers-and-what-it-doesnt) * [Backup in CI — nightly cron and retention](#backup-in-ci--nightly-cron-and-retention) * [SQLite streaming with litestream](#sqlite-streaming-with-litestream) * [Multi-region disaster recovery](#multi-region-disaster-recovery) * [Browser and mobile — there is no OPFS backup](#browser-and-mobile--there-is-no-opfs-backup) * [Worked examples](#worked-examples) * [Restore drill — a script you can lift](#restore-drill--a-script-you-can-lift) * [Cross-references](#cross-references) --- ## Why forge stays out of backup Backup is a database-engine concern, not an ORM concern. Postgres has `pg_dump`, `pg_basebackup`, and write-ahead-log archiving. SQLite has `.backup`, `VACUUM INTO`, and the file copy with the `-wal` and `-shm` siblings. Mongo has `mongodump` and replica-set snapshots. Each of those is tuned to its engine's storage layout, locking model, and on-disk format. An ORM that wrapped them would be one of two things — a thin shell that just shells out (no value over running the tool directly), or a thick re-implementation that gets the corner cases wrong (worse than the native tool). So forge does the same thing here that it does for connection pooling, query planning, and replication topology: it gets out of the way. The schema you push with `forge push` lives in the database; the database's own backup primitives are what brings it back. forge's job is to make sure the *shape* of the data is reproducible from your TypeScript schema, which means a restored backup can always be brought forward to the current schema with one more `forge push`. That property — restore the data, then push the shape — is the only forge-specific thing on this page. Everything else is dialect-native. What this means in practice: * **You don't run a forge subcommand to take a backup.** You run `pg_dump`, `mysqldump`, `mongodump`, `sqlite3 .backup`, `EXPORT DATABASE`, `BACKUP DATABASE`. Those tools ship with the database client and know how to coordinate with the live server. * **You run `forge push` after restore**, against the restored database, to align its schema with HEAD. If the backup is from a week ago and the schema's moved on, the restore brings the old shape back and `forge push` brings it forward. This is the only piece of the workflow that's forge-shaped. * **You can use `forge diff` as a verification step** after restore to confirm the restored DB matches HEAD. `forge diff --check` exit 0 means the restore landed clean and there's nothing left to apply; exit 3 means the restore is older than HEAD and `forge push` will bring it forward. See [Restore + schema mismatch](#restore--schema-mismatch). --- ## Backup formats — logical vs physical Every dialect supports at least one of two backup shapes. The choice drives restore time, portability, and what you can recover from. | Format | What it is | When it wins | What it loses | |---|---|---|---| | **Logical** | A SQL or JSON stream that re-creates the schema and inserts the rows. `pg_dump` (default), `mysqldump`, `mongodump`, DuckDB's `EXPORT DATABASE`. | Portable across engine versions and sometimes across engines. Easy to grep, diff, partial-restore one table. Encrypts well as a stream. | Slow to restore on large DBs — each `INSERT` is parsed, planned, and written. No PITR (you get the moment the dump started, no later). | | **Physical** | A byte-level copy of the data files. `pg_basebackup`, MariaBackup, SQLite file copy, MSSQL native backup, Mongo replica-set snapshot. | Fast to restore (just put the files in place and start the engine). Carries indexes, statistics, and bloat exactly as they were. Pairs with WAL/binlog for PITR. | Engine-version-specific (Postgres 14 base backup won't open under PG 16 without `pg_upgrade`). Encryption requires the engine's at-rest support or filesystem-level. | The default operating posture for most teams is **logical backup nightly + WAL/binlog continuous archiving for PITR**. The nightly logical dump is your portable, encrypted-off-site, "I can untar this on a laptop" artefact. The WAL archive is what lets you replay forward from the nightly to within seconds of an incident. The two together are what every cloud-managed backup product is also doing under the hood. --- ## Per-dialect backup primitives ### Postgres * **`pg_dump`** — logical, per-database. Streams SQL (or `-Fc` custom format) to stdout. Doesn't block writers; uses a serializable snapshot. The default for nightly backups. * **`pg_dumpall`** — logical, cluster-wide. Use when you need roles, tablespaces, and every database in one file. Most app deployments use `pg_dump` of the one app database instead. * **`pg_basebackup`** — physical, cluster-wide. Streams the data directory and the WAL needed to be consistent. Pair with continuous WAL archiving for PITR. * **WAL archiving** — `archive_mode = on`, `archive_command = 'cp %p /archive/%f'` (or `pgbackrest`, `wal-g`, `barman`). Each WAL segment is shipped as it fills; restore replays them in order. * **PITR** — base backup + WAL segments + a `recovery_target_time = '2026-06-24 14:32:00'` setting in `recovery.signal`. Postgres replays WAL up to that timestamp, then stops. Choose by RPO: | RPO target | What to run | |---|---| | 24 hours | `pg_dump` nightly to S3, no WAL archive. | | 1 hour | `pg_dump` nightly + WAL archive every 5 minutes. | | 1 minute | `pg_basebackup` weekly + continuous WAL archive (`pgbackrest --archive-async`). | | Seconds | A streaming standby replica with synchronous commit, plus base backup + WAL. | ### MySQL / MariaDB * **`mysqldump`** — logical, single-threaded. The historical workhorse. Use `--single-transaction --quick --routines --triggers --events` for a consistent snapshot of an InnoDB-only DB without locking writers. * **`mysqlpump`** — logical, parallel. Faster than `mysqldump` on dumps with many tables. Lacks the same "single-transaction across the whole dump" guarantee — schedule against a read replica or during quiet hours. * **MariaBackup / Percona XtraBackup** — physical, hot. Copies InnoDB tablespaces while writers continue. Pair with binlog for PITR. * **Binary log (binlog)** — `log_bin = ON`, `binlog_format = ROW`. Each committed transaction writes a row-based event. `mysqlbinlog --start-datetime=… --stop-datetime=…` replays a range against a restored backup. * **PITR** — restore the most recent physical backup, then `mysqlbinlog` the binlogs from the backup's binlog position to the target time, piped into `mysql`. MySQL's gotcha is `--single-transaction`. Without it, `mysqldump` takes a `FLUSH TABLES WITH READ LOCK` that blocks writers. With it, MyISAM tables (still common in legacy schemas) aren't consistent because MyISAM doesn't honour the transaction. Audit for MyISAM before relying on `--single-transaction` — `SELECT table_name FROM information_schema.tables WHERE engine = 'MyISAM'`. The second MySQL gotcha is `gtid_mode`. Without GTIDs, PITR via binlog requires you to remember the binlog file + position the backup ended at (`mysqldump --master-data=2` writes it as a comment in the dump). With GTIDs, `mysqlbinlog` knows where the dump's GTID set ends and replays the rest cleanly. Turn GTIDs on (`gtid_mode=ON`, `enforce_gtid_consistency=ON`) before you need them; flipping it requires a rolling restart of replicas. ### SQLite SQLite is a single file (plus `-wal` and `-shm` siblings when WAL mode is on). The backup story has three shapes. * **`sqlite3 source.db ".backup target.db"`** — the canonical hot backup. Uses the SQLite online backup API: copies pages while writers continue, retrying as needed. The target is a fresh DB file. Restore is a file copy back. * **`VACUUM INTO 'target.db'`** — runs in a single transaction, produces a defragmented copy. Slightly different from `.backup` — it rebuilds rather than copying pages — so the output is smaller but the source experiences a longer-held read lock. Use for periodic compaction-plus-backup. * **Cold file copy** — `cp source.db backup.db && cp source.db-wal backup.db-wal && cp source.db-shm backup.db-shm`. **Only safe when no process is writing.** A live copy of a WAL-mode database without the `-wal` and `-shm` files is a partially-written database; opening it loses every transaction since the last checkpoint. If you cold-copy, run `PRAGMA wal_checkpoint(TRUNCATE);` first. * **`litestream`** — continuous streaming of WAL segments to S3 / B2 / GCS. Sub-second RPO. See [SQLite streaming with litestream](#sqlite-streaming-with-litestream). For a forge-orm Tauri / Capacitor app with a local SQLite, the realistic backup story is "the data lives on the device, the server is the backup" — see [Browser and mobile — there is no OPFS backup](#browser-and-mobile--there-is-no-opfs-backup). For a server-side SQLite (a small SaaS, an edge function with a sidecar DB), litestream is the answer. ### DuckDB * **`EXPORT DATABASE 'dir' (FORMAT PARQUET);`** — logical. Writes per-table Parquet files plus a `schema.sql` and a `load.sql`. Portable, columnar-compressed, restorable into a fresh DuckDB with `IMPORT DATABASE 'dir';`. The default for DuckDB backup. * **`COPY table TO 'file.parquet' (FORMAT PARQUET);`** — per-table. Useful for partial backups or feeding downstream analytics. * **File copy** — DuckDB's `.duckdb` file is also a single artefact. When the DB is closed (the process has called `db.$disconnect()`), `cp source.duckdb backup.duckdb` is correct. While the DB is open, the file may be mid-write; use `EXPORT DATABASE` instead. * **No WAL archive, no PITR** — DuckDB is positioned as an analytical engine, not a transactional one. Backups are snapshots. If you need point-in-time, your DuckDB instance is downstream of an OLTP source (Postgres, MySQL) — back the source up, treat the DuckDB as a derived cache. ### MSSQL * **`BACKUP DATABASE … TO DISK = '…';`** — physical, native. Full, differential, or transaction-log. The full backup is the base, differentials are deltas since the last full, log backups capture every transaction. * **`RESTORE DATABASE … FROM DISK = '…' WITH NORECOVERY;`** then `RESTORE LOG … WITH STOPAT = '2026-06-24T14:32:00';` — PITR by replaying log backups onto a non-recovered full restore. * **Log shipping** — automated periodic log backups copied to a standby and restored there. RPO matches the log backup interval (commonly 5 minutes). * **Always On Availability Groups** — synchronous replication for HA, asynchronous for DR. Not a substitute for backup — covers hardware loss, not "I dropped the wrong table". The MSSQL backup model is the most complete of any dialect — full + differential + log gives you tight control over restore time and storage cost. The cost is operational complexity: you maintain the chain (a missing differential or log breaks PITR). Use Ola Hallengren's maintenance solution for the cron side; it's the standard. Sizing for a typical MSSQL deployment: | Backup type | Cadence | Restore time on 100 GB | |---|---|---| | Full | Weekly | 10–30 min depending on storage | | Differential | Daily | Full + the latest differential, 15–40 min | | Log | Every 15 min | Full + differential + 0–95 logs, 20 min–2 hr | The differential cadence is the lever most teams under-tune. Daily differentials keep restore short; less-frequent differentials make the log chain longer and the restore slower. Match cadence to your RTO target, not to disk savings — log storage is much cheaper than recovery time during an incident. ### MongoDB * **`mongodump`** — logical, per-database or cluster-wide. BSON output, restorable with `mongorestore`. The standard for sub-100GB dumps. Slow for larger; restore takes hours. * **Replica-set filesystem snapshot** — `db.fsyncLock()`, take an LVM / EBS / ZFS snapshot of the data directory on a secondary, `db.fsyncUnlock()`. Physical, consistent, restorable as a fresh mongod data directory. The standard for >100GB. * **Oplog tailing** — capture the replica set's oplog continuously (`mongodump --oplog`), then `mongorestore --oplogReplay --oplogLimit ''` for PITR. * **Atlas / MongoDB Cloud Manager** — managed continuous backup with PITR. The hosted version of "snapshot + oplog". forge's Mongo adapter doesn't generate DDL (Mongo is schemaless); backups are about preserving collections and indexes. `forge push` after restore re-establishes indexes from the schema, which is faster and more reliable than restoring indexes from the dump. `mongodump --oplog` is the option most teams miss. Without it, the dump is a snapshot from the moment each collection's dump started — collections dumped later don't see writes that happened during earlier collections' dumps, leading to subtle cross-collection inconsistency on restore. With `--oplog`, the tool also captures the oplog window of the dump, and `mongorestore --oplogReplay` brings every collection forward to a single consistent point. --- ## PITR — point-in-time recovery per dialect Point-in-time recovery answers "give me the database as it was at 14:32:07 last Tuesday". Some dialects support it natively; some need third-party tooling; some don't support it at all. | Dialect | Native PITR | RPO floor | What you need | |---|---|---|---| | Postgres | Yes | Seconds | Base backup + WAL archive (`pgbackrest`, `wal-g`, `barman`). | | MySQL | Yes | Seconds | Physical backup (MariaBackup / XtraBackup) + binlog. | | MSSQL | Yes | Seconds | Full + log backups, replayed with `STOPAT`. | | MongoDB | Yes | Seconds | Snapshot + oplog. Atlas does this transparently. | | SQLite | Via litestream | Sub-second | WAL streaming to S3; replay generation by generation. | | DuckDB | No | N/A | Take periodic exports; PITR belongs upstream. | PITR isn't free. The WAL / binlog / oplog stream is continuous I/O — sized for steady-state write throughput, which on a busy DB is hundreds of MB/hour. Plan storage: 30 days of WAL at 100 MB/hour is 70 GB. S3 storage is cheap, but the retrieval bandwidth matters in a restore — see [Backup in CI — nightly cron and retention](#backup-in-ci--nightly-cron-and-retention). The other PITR gotcha is **clock alignment**. A `recovery_target_time` only works if the DB server's clock matched UTC at the moment of the incident. NTP-skewed servers produce PITR targets that miss by the skew. If you're chasing seconds, log every backup's wall-clock-vs-DB-clock delta and use that to correct the target. A practical PITR drill: 1. Take a base backup at `T0`. 2. Run a known mutation at `T0 + 30m` (insert a row with `id = 'pitr-canary'`). 3. Run a destructive mutation at `T0 + 1h` (drop the canary row). 4. Restore the base backup, replay WAL / binlog / oplog to `T0 + 45m`. 5. Verify `pitr-canary` exists in the restored DB. If the canary's there, PITR works. If it isn't, either replay didn't reach `T0 + 45m` (check the recovery target settings) or the canary fell outside the replay window (check the archive contents). Run this drill once per quarter against a copy of prod data; it's the cheapest insurance against PITR-shaped incidents. --- ## Schema-only vs data-only vs full Most logical backup tools support splitting the dump into schema and data. The split matters in three scenarios: * **Clone for staging** — `--schema-only` (Postgres `-s`, MySQL `--no-data`, Mongo doesn't really do this — use `mongoexport` with empty queries). Recreates an empty DB shaped like prod, ideal for running tests against. forge's alternative is `forge push` against an empty database from the same schema file; the dump version captures vendor-specific objects (custom types, stored procs, materialised views) that the schema declaration doesn't. * **Refresh data without disturbing schema** — `--data-only`. Used to swap in fresh data from prod into a staging DB that's already been schema-migrated to a newer version. Brittle — if the column set drifted, `--data-only` rows won't load. Prefer the [restore-then-push pattern](#restore--schema-mismatch). * **Pre-rollback backup** — full, before running `forge diff apply` on a destructive migration. The safety net for "the down block didn't quite undo what the up block did". See ROLLBACK.md's emergency procedures section. | Flag set | Postgres | MySQL | DuckDB | Mongo | |---|---|---|---|---| | Schema only | `pg_dump -s` | `mysqldump --no-data` | `EXPORT DATABASE … (FORMAT PARQUET)` then keep only `schema.sql` | n/a — re-run `forge push` against an empty DB | | Data only | `pg_dump -a` | `mysqldump --no-create-info` | Per-table `COPY TO` | `mongodump` (collections are data + indexes; no separation) | | Full | `pg_dump` (default) | `mysqldump` (default) | `EXPORT DATABASE` | `mongodump` | The forge-specific recommendation: store the schema *as code* (`src/schema.ts` at a Git SHA), not as a SQL dump. The dump version becomes stale the moment the schema changes; the code version is the source of truth `forge push` reads from. Schema-only dumps are useful for debugging or for non-forge tools (BI, schema-comparison) — they aren't your restore path. --- ## Restore + schema mismatch The single most useful forge-specific property in this whole document: **restore the data, then `forge push` to bring forward**. The scenario: nightly backup from three days ago. Today's schema has two new columns and a new index. You restore the backup into a fresh DB. The restored DB's tables don't have the new columns. ```sh # 1. Provision a fresh DB. createdb -h restored-host restored_db # 2. Restore the logical backup. gunzip -c backup-2026-06-21.sql.gz | psql -h restored-host restored_db # 3. Verify the restore landed clean against the OLD schema. git checkout 2026-06-21-tag -- src/schema.ts DATABASE_URL=postgres://…/restored_db npx forge diff --check # Exit 0 — restore matches the snapshot's schema. # 4. Bring the schema forward to HEAD. git checkout HEAD -- src/schema.ts DATABASE_URL=postgres://…/restored_db npx forge push # Adds the two new columns and the new index. # 5. Verify again — should be clean against HEAD now. DATABASE_URL=postgres://…/restored_db npx forge diff --check # Exit 0. ``` The same pattern works in any direction. Restoring an old backup into a fresh DB and then pushing HEAD lands the current schema with the old data. New columns are NULL (or take their declared defaults). New non-null columns without defaults will fail the push — fix by adding the default in the schema, or by writing a one-off backfill before pushing. Two failure modes worth knowing: * **Schema removed a column the backup had data in** — the restore loads the column; `forge push` doesn't drop it. The schema declares HEAD's shape; the column is "extra in DB" drift. Run `forge diff` to see it. If you want it gone, `forge diff apply` will generate the destructive `ALTER TABLE … DROP COLUMN`. If you want to keep the data for migration, write a one-off transform script first. * **Schema renamed a column** — push sees it as drop + add (renames look like that to the differ; see MIGRATIONS.md). The restored data sits in the old column name; the schema declares the new one. Push will add the new column empty. You write the rename data step explicitly: `UPDATE table SET new_col = old_col; ALTER TABLE table DROP COLUMN old_col;`. This is one of the cases where push-style ORMs need a hand — and where a Prisma-style numbered migration would have shipped the rename atomically. * **Schema changed a column type** — `push` doesn't `ALTER TYPE`; that's a `forge diff apply` operation. After restore, the old type sits in the restored DB; `forge diff --check` reports a `columnType` mismatch. You either run `forge diff apply` (which generates the type-change DDL with whatever cast logic the differ can express) or you write the explicit `ALTER TABLE … ALTER COLUMN … TYPE … USING …` yourself for cases the differ can't handle (a string → uuid that needs validation, a timestamp → date that loses precision). * **Schema gained a non-nullable column without a default** — `forge push` against the restored DB fails. Restored rows can't all gain a value out of thin air. Fix by adding a default in the schema declaration, by writing a one-off backfill script run between restore and push, or by making the column nullable, pushing, backfilling, then tightening to non-null in a follow-up. --- ## Encrypted backups A backup that isn't encrypted at rest in your off-site store is a one-zero-day-away data breach. Three layers, any one of which is sufficient, all three of which is the recommendation: * **Engine-level encryption** — Postgres `pg_basebackup` with TDE (12+ with the right build), MySQL with InnoDB encryption keyring, MSSQL TDE. The backup inherits encryption from the data files. Restore needs the keyring. * **Pipe through a streaming cipher before upload** — `pg_dump … | gpg --encrypt -r ops@team.example | aws s3 cp - s3://backups/db-$(date +%F).sql.gpg`. Independent of engine; one extra step in the restore (`aws s3 cp … | gpg --decrypt | psql …`). * **Object-store SSE** — S3 / B2 / GCS server-side encryption with a KMS-managed key. Cheap to enable, mandatory baseline. Doesn't protect against an attacker with the same IAM role you use for backups — pair with one of the above. Key management is the part teams get wrong. The decryption key cannot live in the same blast radius as the backup. If a compromised cloud account holds both the encrypted backup and the KMS key that unlocks it, you have plaintext. Patterns that work: * **Hardware-backed KMS** with policies that prevent the backup IAM role from also calling `kms:Decrypt`. Restores require a separate role check. * **Off-cloud keystore** — backup encryption key in 1Password / Vault / a hardware token, the backup script `gpg --encrypt`s with a public key whose private half is offline. Cold but bulletproof for disaster scenarios. * **Shamir's secret sharing** for the master key. Recovery requires N-of-M operators to combine shards. Operationally heavy; used by compliance-regulated teams. The litestream story (see [SQLite streaming with litestream](#sqlite-streaming-with-litestream)) bakes in age-encrypted object-store writes. For dump-style backups, the GPG pipe is the simplest credible posture. A pattern worth knowing: **per-customer encryption for multi-tenant apps**. If a SaaS holds tenant data and one customer demands "delete all my data", the GDPR-friendly answer is encrypted-per-tenant backups where the deletion is "destroy this tenant's key". The backup file lives forever in S3; without the key it's noise. The forge-orm side of this is a hook on the dump pipeline — split rows by tenant, encrypt each tenant's stream with their key, write per-tenant objects. Heavier to implement than per-DB encryption, but the only honest answer when the deletion-of-record requirement runs against the keep-backups-for-7-years requirement. --- ## Cloud-managed backup — what it covers and what it doesn't Most teams use a managed database (RDS, Aurora, Cloud SQL, Azure SQL, Atlas, Neon, PlanetScale). Each ships an automated backup product. What they cover: | Service | Default backup | PITR window | What you get | |---|---|---|---| | AWS RDS (Postgres/MySQL) | Daily snapshot | 1–35 days | Snapshot + WAL/binlog; restore creates a new instance. | | AWS Aurora | Continuous | 1–35 days | Sub-second PITR; restore creates a new cluster from a clone. | | GCP Cloud SQL | Daily | 7 days default | Snapshot + WAL; restore creates a new instance. | | Azure SQL Database | Continuous | 7–35 days | Native log backup chain; geo-restore to a paired region. | | MongoDB Atlas | Continuous | 1–7 days (free), longer paid | Snapshot + oplog; restore creates a new cluster. | | Neon | Branching (CoW) | 7–30 days | Branch from any point in retention; near-instant. | | PlanetScale | Daily | 30 days | Snapshot per branch; restore via support. | What they cover well: **disk loss, region loss, accidental table drop within the retention window.** What they don't cover: * **Logical corruption that took a week to notice.** If retention is 7 days and the bad write happened 8 days ago, the managed backup is gone. Hold your own out-of-cloud copies for incidents you might catch late. * **Account takeover.** A compromised AWS root can delete every snapshot. Cross-account backup vaults (AWS Backup → separate account) defeat this; many teams skip the step. * **Provider outage longer than your business tolerates.** RDS in `us-east-1` going down is an industry event, not a recoverable-from-snapshot event, until it comes back. Cross-cloud copies (RDS → S3 → R2 → Backblaze) cost cents per GB and avoid this entirely. * **Schema correctness across restore.** A managed restore brings the data back at its snapshot time. If the schema has moved on, you still need `forge push` to align. The managed product doesn't know about your TypeScript schema. The realistic posture for a production team on a managed DB: **lean on the managed backup as the primary recovery path, plus weekly logical dumps to a separate cloud account or provider as the "I'm getting fired if this fails too" backstop.** The weekly dump is small (a few minutes of cron + S3 cents), and it covers every cloud-managed backup failure mode listed above. --- ## Backup in CI — nightly cron and retention A backup that nobody runs isn't a backup. Wire it to a scheduler — GitHub Actions cron, Vercel Cron, a Kubernetes CronJob, an EventBridge schedule — and verify it runs. ```yaml # .github/workflows/backup-nightly.yml name: Nightly Postgres backup on: schedule: - cron: '17 3 * * *' # 03:17 UTC, off-peak, off-the-hour to avoid the herd workflow_dispatch: jobs: backup: runs-on: ubuntu-latest timeout-minutes: 60 steps: - uses: actions/checkout@v4 - name: Install pg_dump matching prod major run: | sudo apt-get update sudo apt-get install -y postgresql-client-16 - name: Dump, encrypt, upload env: DATABASE_URL: ${{ secrets.DATABASE_URL }} BACKUP_GPG_PUBKEY: ${{ secrets.BACKUP_GPG_PUBKEY }} AWS_ACCESS_KEY_ID: ${{ secrets.BACKUP_AWS_KEY }} AWS_SECRET_ACCESS_KEY: ${{ secrets.BACKUP_AWS_SECRET }} run: | set -euo pipefail STAMP=$(date -u +%Y%m%dT%H%M%SZ) echo "$BACKUP_GPG_PUBKEY" | gpg --import pg_dump --format=custom --compress=9 "$DATABASE_URL" \ | gpg --encrypt --recipient backup@team.example \ | aws s3 cp - "s3://app-backups/postgres/$STAMP.dump.gpg" - name: Prune retention env: AWS_ACCESS_KEY_ID: ${{ secrets.BACKUP_AWS_KEY }} AWS_SECRET_ACCESS_KEY: ${{ secrets.BACKUP_AWS_SECRET }} run: | # Keep 7 daily, 4 weekly, 12 monthly — handled by S3 lifecycle, not here. aws s3 ls s3://app-backups/postgres/ | head -5 ``` Retention is best expressed as an object-store lifecycle rule, not as a script-with-`rm`. A misconfigured retention script can delete every backup in a single bad run; an S3 lifecycle policy applies per-object, can't accidentally delete future objects, and is auditable in the bucket config. Retention defaults that work for most teams: * 7 daily backups. * 4 weekly (last day of the week, kept 4 weeks). * 12 monthly (first day of the month, kept 1 year). * 7 yearly (first day of the year, kept 7 years — adjust for compliance). That's 30 objects steady-state per database, regardless of how long the team has been running. Cheap. The S3 lifecycle rule expresses it directly via `Transition` to Glacier for older tiers and `Expiration` for the oldest. The other half of CI backup is **verifying the cron actually ran**. A backup-job-never-fires alert is the silent killer. Pattern: every successful upload writes a heartbeat key (`s3://app-backups/heartbeat/postgres`); a separate scheduled probe alerts if the heartbeat is older than 26 hours. Same shape for every dialect. The third half (the one nobody talks about) is **monitoring the size of the dump**. A backup that suddenly halves in size between two runs is a near-certain signal of partial dump, broken connection, or accidental TRUNCATE in prod. Track size as a metric; alert on day-over-day delta over 20% in either direction. The 20% threshold is loose enough to avoid noise on small DBs and tight enough to catch real problems on growing DBs. --- ## SQLite streaming with litestream [litestream.io](https://litestream.io) replicates a SQLite database to S3-compatible object storage continuously, by tailing the WAL. It's the backup story for any single-node server-side SQLite — Fly.io apps, edge functions with a sidecar DB, small SaaS that picked SQLite for its operational simplicity. The forge-orm integration is zero-config — litestream runs as a sidecar process, the forge `db` handle opens the same SQLite file, and litestream's WAL tail is invisible to the application. ```yaml # litestream.yml dbs: - path: /var/lib/app/app.sqlite replicas: - type: s3 bucket: app-litestream path: app region: us-east-1 retention: 720h # 30 days of generation history retention-check-interval: 24h snapshot-interval: 24h # full snapshot daily; PITR replays WAL between sync-interval: 1s # WAL frames pushed every second encryption-key: ${LITESTREAM_AGE_KEY} # age encryption at rest ``` ```sh # Restore — straightforward, no app code involved. litestream restore -o /var/lib/app/app.sqlite \ -timestamp 2026-06-24T14:32:00Z \ s3://app-litestream/app ``` The restored file is dropped at the path forge's DB URL points at; restart the app and forge opens it as if nothing happened. Run `forge push` against the restored DB if the schema has moved on (the same pattern as every other dialect — see [Restore + schema mismatch](#restore--schema-mismatch)). litestream gotchas: * **It owns the WAL.** Other tools that checkpoint the WAL (a third-party backup script, an over-eager `PRAGMA wal_checkpoint(TRUNCATE)`) can break litestream's generation chain. Don't mix. * **One writer.** SQLite is already single-writer; litestream assumes the same. Multi-process writers via WAL mode still work, but the litestream sidecar must be on the same host as the writers. * **Sub-second RPO is sub-second loss potential.** A crash between two `sync-interval` ticks loses up to that interval's worth of writes. Set the interval to your RPO. The default 1s is appropriate for SaaS-grade durability. * **Restoration is slow on large generations.** A 100 GB DB with 30 days of WAL replays in tens of minutes, not seconds. For databases over ~50 GB consider periodic snapshot rotation (lower `snapshot-interval`) or a different storage engine. --- ## Multi-region disaster recovery Single-region backups recover from disk loss and human error. Multi-region recovers from "the region is on fire". The shape depends on RPO and RTO targets. | Strategy | RPO | RTO | Cost | Forge fit | |---|---|---|---|---| | Cold backup, cross-region S3 replication | Hours | Hours | $ | Any dialect; `pg_dump | aws s3 cp` to a bucket with cross-region replication. | | Warm standby, async replica in DR region | Seconds–minutes | Minutes | $$ | Postgres logical replication, MySQL async replication, Mongo replica set across regions. Forge connects via `DATABASE_URL`; failover is a URL change. | | Hot standby, sync replica in DR region | Zero | Seconds | $$$ | Postgres synchronous_commit, MSSQL Always On synchronous AG, Aurora Global Database. Forge sees one URL behind a managed failover. | | Active-active, multi-master | Zero | Zero | $$$$ | CockroachDB, YugabyteDB, Spanner. Forge connects via the standard adapter (Postgres-wire for the first two); writes resolve under the engine's consistency model. | The forge-orm-specific notes: * **Forge doesn't manage failover.** When the primary URL becomes the standby URL, the application reads the new URL from the environment, restarts, and reconnects. This is the same shape as every other database client. * **Replicas in `DATABASE_REPLICA_URL`** for read-traffic splitting is documented in [BACKEND.md](./BACKEND.md#read-replicas-and-split-routing). The DR replica is the same shape — point it at a different URL when the primary's gone. * **Schema convergence in DR** — if the DR region is async, there's a window where the standby is behind the primary. A failover during a deploy (the primary saw the new schema, the standby didn't yet) means the failed-over DR runs against an older schema. The first action on the new primary is `forge push` to align. * **Backups are still required** alongside replicas. A replica that faithfully copies "I dropped the wrong table" is not a backup; it's a faster copy of the disaster. Replicas cover hardware and region failure; backups cover human and software failure. You need both. A failover runbook worth keeping next to this doc: 1. Confirm the primary is actually gone — not a flap, not a network partition that resolves in 30s. Spurious failovers cost more than they save. 2. Promote the standby to primary in the engine (`pg_promote()` for Postgres streaming replication; the managed service's failover button for RDS/Aurora/Cloud SQL). 3. Update `DATABASE_URL` in the application environment to point at the new primary. 4. Restart the application; verify it connects. 5. Run `forge diff --check` against the new primary. Exit 0 means schema's clean. Exit 3 means the standby was behind a schema apply — run `forge push` to bring it forward. 6. Repoint replicas / DR standbys at the new primary, or rebuild them if their replication chain broke. 7. Update DNS / load-balancer pointers so external clients reach the new primary. 8. Post-incident: write the timeline. The next failover should be faster. --- ## Browser and mobile — there is no OPFS backup A forge-orm browser app (sqlite-wasm + OPFS via the worker driver — see [BROWSER.md](./BROWSER.md)) keeps the database in the user's origin storage. A mobile app via Capacitor or tauri-plugin-sql keeps it in the app's sandbox. Neither is backed up by the OS in any way you can rely on. | Platform | What's there | What backs it up | What doesn't | |---|---|---|---| | Browser, OPFS | `/storage/.../OPFS//…` | Nothing. Cleared on "clear site data", evicted under storage pressure, gone on profile reset. | iCloud, Chrome sync, browser settings sync. | | iOS, Capacitor | App container `Documents/` | iCloud Backup *if* the file isn't excluded *and* the user enabled it. | Cellular data, low-storage devices. | | Android, Capacitor | App private storage | Google's auto-backup *if* `android:allowBackup="true"` and the file's under 25 MB. | Anything larger. | | Tauri desktop | App data dir | Time Machine, File History, restic, whatever the user runs. | Anything they haven't set up. | The realistic pattern for any forge-orm device-side DB: **the device is the cache, the server is the source of truth.** Sync writes to a server API; let the server's backup story (which is one of the patterns above) be the data's safety net. If the device's DB is corrupted or wiped, the app re-syncs from the server on next launch. This is documented in [MOBILE.md](./MOBILE.md#sync-strategies). The backup-side note here is just the inverse: **if you're treating the device DB as authoritative, you have no backup**. Sync is the backup. --- ## Worked examples ### (a) Postgres nightly to S3, encrypted, with restore drill The CI workflow above ships the dump. The restore drill — which validates the dump is actually restorable — runs weekly. ```yaml # .github/workflows/backup-restore-drill.yml name: Restore drill on: schedule: - cron: '0 6 * * 1' # Monday 06:00 UTC workflow_dispatch: jobs: drill: runs-on: ubuntu-latest services: postgres: image: postgres:16 env: { POSTGRES_PASSWORD: postgres } ports: ['5432:5432'] options: --health-cmd pg_isready steps: - uses: actions/checkout@v4 - uses: actions/setup-node@v4 with: { node-version: 20 } - run: npm ci - name: Pull the most recent backup env: AWS_ACCESS_KEY_ID: ${{ secrets.BACKUP_AWS_KEY }} AWS_SECRET_ACCESS_KEY: ${{ secrets.BACKUP_AWS_SECRET }} BACKUP_GPG_PRIVKEY: ${{ secrets.BACKUP_GPG_PRIVKEY }} run: | set -euo pipefail LATEST=$(aws s3 ls s3://app-backups/postgres/ | sort | tail -1 | awk '{print $4}') aws s3 cp "s3://app-backups/postgres/$LATEST" "$LATEST" echo "$BACKUP_GPG_PRIVKEY" | gpg --import gpg --decrypt "$LATEST" > backup.dump - name: Restore into the test instance env: PGPASSWORD: postgres run: pg_restore --clean --if-exists -h localhost -U postgres -d postgres backup.dump - name: Verify schema matches the SHA the backup was taken at env: DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres run: | # The backup's STAMP encodes the time; the deploy that produced it # tagged the schema. Verify the restore is clean against that tag. TAG=$(aws s3api head-object --bucket app-backups --key "postgres/$LATEST" --query 'Metadata.git_sha' --output text) git checkout "$TAG" -- src/schema.ts npx forge diff --check - name: Bring forward to HEAD env: DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres run: | git checkout HEAD -- src/schema.ts npx forge push npx forge diff --check - name: Run a smoke query env: DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres run: npx tsx scripts/smoke-after-restore.ts ``` `scripts/smoke-after-restore.ts` exercises a small read-and-write workload against the restored DB. The point isn't extensive coverage — it's catching "the restore landed but every row's `created_at` is NULL because the column got added after the backup". One small write, one read-back, one consistency check is enough. ### (b) SQLite via litestream, restored on a fresh node ```sh # Provision a fresh node, install litestream, restore, start app. curl -sL https://github.com/benbjohnson/litestream/releases/download/v0.3.13/litestream-v0.3.13-linux-amd64.tar.gz \ | tar -xzC /usr/local/bin cat > /etc/litestream.yml <<'EOF' dbs: - path: /var/lib/app/app.sqlite replicas: - type: s3 bucket: app-litestream path: app encryption-key: ${LITESTREAM_AGE_KEY} EOF # Restore the latest generation. mkdir -p /var/lib/app LITESTREAM_AGE_KEY=… litestream restore -o /var/lib/app/app.sqlite \ s3://app-litestream/app # Run forge push to align schema with HEAD (the restore is from whenever). DATABASE_URL=sqlite:///var/lib/app/app.sqlite npx forge push # Verify. DATABASE_URL=sqlite:///var/lib/app/app.sqlite npx forge diff --check # Start litestream + app. litestream replicate -config /etc/litestream.yml & node dist/server.js ``` The whole flow is six commands. The forge-shaped piece is two of them: `forge push` after restore, `forge diff --check` to verify. The rest is litestream and the OS. ### (c) Mongo restore drill ```sh # Pull the most recent dump from Atlas snapshot export, restore, push, verify. set -euo pipefail LATEST=$(aws s3 ls s3://app-backups/mongo/ | sort | tail -1 | awk '{print $4}') aws s3 cp "s3://app-backups/mongo/$LATEST" "$LATEST" tar -xzf "$LATEST" # Spin up a throwaway mongod for the restore. docker run -d --name drill-mongo -p 27017:27017 mongo:7 sleep 5 mongorestore --uri=mongodb://localhost:27017 --drop ./dump/ # Bring indexes forward — forge push is idempotent on Mongo. DATABASE_URL=mongodb://localhost:27017/app npx forge push # Verify by listing indexes the schema declares. DATABASE_URL=mongodb://localhost:27017/app npx forge diff --check docker rm -f drill-mongo ``` Mongo's variant is the most forge-friendly — there's no DDL to replay, just the data and the indexes. `forge push` is the index-management primitive; it's idempotent against a freshly-restored DB. --- ## Restore drill — a script you can lift If you take one thing from this page, take this: **a backup you've never restored is a backup you don't have**. The most common failure mode of disaster recovery isn't the backup not existing — it's the restore process not working when it's needed. The corruption was outside the dump's snapshot, the encryption key was rotated, the network egress is rate-limited, the on-call doesn't remember the bucket name. Every quarter at minimum, every month if you can, run the drill. ```sh #!/usr/bin/env bash # scripts/restore-drill.sh — usage: ./restore-drill.sh set -euo pipefail KEY="${1:?usage: restore-drill.sh }" DRILL_DB="restore_drill_$(date +%Y%m%d_%H%M%S)" echo "[1/5] Pull backup $KEY" aws s3 cp "s3://app-backups/$KEY" /tmp/drill.dump.gpg gpg --decrypt /tmp/drill.dump.gpg > /tmp/drill.dump echo "[2/5] Provision target $DRILL_DB" createdb -h "$DRILL_HOST" "$DRILL_DB" trap "dropdb -h $DRILL_HOST $DRILL_DB" EXIT echo "[3/5] Restore" pg_restore --clean --if-exists -h "$DRILL_HOST" -d "$DRILL_DB" /tmp/drill.dump echo "[4/5] Bring forward to HEAD" DATABASE_URL="postgres://$DRILL_HOST/$DRILL_DB" npx forge push DATABASE_URL="postgres://$DRILL_HOST/$DRILL_DB" npx forge diff --check echo "[5/5] Smoke test" DATABASE_URL="postgres://$DRILL_HOST/$DRILL_DB" npx tsx scripts/smoke-after-restore.ts echo "OK — restore validated for $KEY" ``` Cadence: * **Weekly** — automated, against the most recent backup, in CI. The workflow above. * **Monthly** — manual, against a random backup from the prior month. Confirms older backups in the rotation still restore. * **Quarterly** — a full game-day. Simulate a region outage; restore into a different region or cloud; bring the app up against the restored DB; check observability for a full request flow. This is the drill that catches "the on-call doesn't know what to do", which is the failure mode all the others miss. The drill output should be an alert-able artefact — a metric (`drill_seconds`, `drill_success`), a heartbeat key, a Slack message. If the drill silently breaks, you're back to "untested backup". --- ## Cross-references * **[MIGRATIONS.md](./MIGRATIONS.md)** — `forge push`, `forge diff`, drift. The schema-side of the restore + push pattern. * **[ROLLBACK.md](./ROLLBACK.md)** — `forge rollback`, the snapshot store, blue/green. The fast-undo path that, when it doesn't recover, hands off to this page. * **[BACKEND.md](./BACKEND.md)** — connection pooling, replicas, health checks. The DR replica URL switch and the health-probe story tie into the multi-region patterns above. * **[MOBILE.md](./MOBILE.md)** — device-side SQLite, sync strategies. The reason device backups aren't a real thing and what to do instead. * **[BROWSER.md](./BROWSER.md)** — sqlite-wasm + OPFS, `db.$migrate()`. The browser-side equivalent of the same device-DB constraint. * **[DRIVERS.md](./DRIVERS.md)** — adapter-level connection details. Useful when wiring a restore-target DB into the same connection shape as prod. * **[DOCTOR.md](./DOCTOR.md)** — `forge doctor` for post-restore verification beyond `forge diff` (probes extensions, vendor-specific objects).