# DDR-102: Hub-sync cold-start divergence resolution (journal-gated fast-forward, dual snapshot, newest-wins)

- **Date:** 2026-06-11
- **Status:** Accepted (implemented: `feature-hub-sync-cold-start-safety.md`)
- **Tags:** sync, hub, data-safety, conflict-resolution, rate-limit, multiplexing, incident
- **Related:** [DDR-054](./DDR-054-linked-mode-trust-model-and-task-4-hardening.md) (trust model, UNCHANGED), [DDR-056](./DDR-056-linked-mode-gitignore-strategy.md) (`_state/` gitignore lane the journal rides), [DDR-064](./DDR-064-single-shared-collab-doc.md) (shared-doc convergence + the empty-hub guard, now a named decision row), [DDR-079](./DDR-079-tsx-sync-default-on.md) (TSX default-on, unchanged), [DDR-053](./DDR-053-hub-admin-auth-architecture.md) (scope-bound tokens, unchanged). Plan: [`feature-hub-sync-cold-start-safety.md`](../plans/feature-hub-sync-cold-start-safety.md).

## Context

The 2026-06-11 incident on `test.studyfi.com/hub` (two AI-StudyMate checkouts linked to one hub): peer B's 6 kB `ui/maskot.tsx`, a day of mascot work, was silently overwritten at boot by peer A's stale 2551 B version that had seeded the hub earlier. The only trace was a `conflicts: [{"kind":"cold-start-hub-wins"}]` entry in `_sync.json`.

Simultaneously, ~65 of B's 83 canvases never synced at all: each dev server opened 83 WebSockets, the two boot bursts (166 auths) tripped the hub's 100/min per-label rate bucket (both machines shared one label because the second `maude design link` had overwritten the first token in `hubs.json`), and Hocuspocus' infinite per-socket retries kept the bucket pinned forever. Boot printed `83/83 canvas(es) syncing` before any handshake; the peer saw only the generic `permission-denied` with a hint pointing at token scopes.

The v1.1 scoping ("resolution is always hub-wins; the interactive 3-way prompt is deferred", per `agent.ts` and the Phase 9 plan) treated cold-start divergence as a notification problem. The incident proved it is a data-loss problem.

## Decision

**This DDR supersedes the v1.1 "cold-start resolution is always hub-wins" clause** (Phase 9 plan + the old `agent.ts` / `migrate-seed.ts` comments). The DDR-054 trust model and the DDR-064 convergence architecture are explicitly unchanged.

1. **Per-machine content-hash journal, not timestamps, detects divergence.** `/_state/sync-journal.json` (`sync/journal.ts`) records, per slug, the `hashBytes` of the last body/css this machine successfully reconciled disk↔doc; every traversal of the boundary checkpoints it (agent + projection writes and applies, adopt, migrate-seed). Hub-wins may overwrite local disk **only** when `hash(local) == journal hash`: a clean fast-forward, everything local was already synced. Timestamps lie across machines; "did THIS machine sync these exact bytes" is exact. Absent/corrupt journal → conservative (divergence path). The journal is per-hub: relinking to a different URL wipes it.
2. **Cold-start decision table is a pure module** (`sync/cold-start.ts` `decideColdStart`), consumed by BOTH sync paths (agent `reconcile()`, shared-doc `migrateSeed`):
   - local empty → materialize-hub
   - doc empty → seed-local-up (the DDR-064 empty-hub guard, bit-identical, now a named row)
   - identical → noop + checkpoint
   - journal match → fast-forward (silent)
   - anything else → **conflict**
3. **Conflict protocol: dual snapshot, then newest-wins, fail-closed.** Both versions are snapshotted to `_history//` via `history.ts` `writeSnapshot` (reasons `pre-sync-local` / `pre-sync-hub`) BEFORE any write; then the newer side wins: doc-side `syncMeta.bodyEditAt` stamp vs local file mtime; unknown/tie → hub-wins (the v1.1 default, now recoverable). `/design:rollback` is the recovery UX: even a wrong pick costs one command, not a day of work.

   **The pre-overwrite snapshot is FAIL-CLOSED** (closes the `/flow:done` attacker finding F1): the production snapshot writer is best-effort and swallows a `writeSnapshot` error into `null`, so a hub-wins resolution checks that the LOSER's snapshot actually landed. If the local snapshot is missing (full disk, read-only `_history/`, a `Bun.write` error), the destructive overwrite is **refused**: local is kept and seeded UP instead (nothing lost on either side), the conflict carries `snapshotFailed: true`, and CLI/banner surface it as an error. Without this guard the best-effort posture (correct for status writes, copied from `status.ts`) would silently re-open the v1.1 incident on any peer with a wedged `_history/`. Conflict entries (`_sync.json`) carry `kind: 'cold-start-diverged'`, `winner`, the snapshot timestamps, and the optional `snapshotFailed` flag (the legacy `cold-start-hub-wins` kind remains in the union for old readers).
4. **`syncMeta` is a dedicated doc lane** (Y.Map, `codec.ts`): `bodyEditAt` + `by`, stamped in the SAME transaction as every local→doc body apply. It cannot piggyback on the meta codec: `last_modified` is in `META_LOCAL_KEYS` (per-machine, never syncs). Older peers never stamp → null → hub-wins fallback (interop-safe).
5. **Comments cold-start is an id-union, not a winner.** Union by stable comment `id` (doc order first, local-only appended; same-id keeps the doc's version), rebuilt wholesale through the delete-then-insert codec, so the DDR-064 duplication trap stays closed and nothing is lost. Annotations/css follow the body winner (visually coupled); meta keeps the shared-subset merge.
6. **One `HocuspocusProviderWebsocket` per hub URL** (`createDefaultProviderFactory`), every canvas's provider attached to it (`websocketProvider:` + explicit `attach()`; with an injected socket the 4.x provider does NOT auto-attach). 83 sockets → 1; the boot burst and the per-socket retry storm collapse to one reconnect loop.

   **Verified from the 4.1.0 source: the hub authenticates once per DOCUMENT even on a multiplexed socket** (each provider sends its own Auth message on socket open), so multiplexing alone does NOT fix the rate bucket; the hub-side resize below is the companion fix. Per-provider `status` events keep firing (the socket fans them out to attached providers); awareness stays per-provider.
7. **Rate-limit philosophy: brute force is about INVALID attempts.** The hub's single 100/min per-label bucket throttled *valid* tokens (and only valid ones; invalid attempts were never counted). Split: valid-token auths get 600/min per label (`HUB_CONN_RATE_LIMIT` env override); invalid-token attempts get a tight 100/min **per IP**. Rate-limit rejections carry a retry hint.
8. **Rejection reasons cross the wire.** Hocuspocus propagates `error.reason ?? 'permission-denied'`, NOT `error.message`, to the peer's `onAuthenticationFailed`. The hub now throws `authError(reason)` (an Error carrying `.reason`), so peers receive `'token not authorized for this documentName'` / `'invalid token'` / `'rate limit exceeded…'` verbatim (pinned end-to-end in `apps/hub/test/auth-reasons.test.mjs`). Client-side, the runtime classifies (`rate-limit` / `not-authorized` / `invalid-token` / `generic`), aggregates the burst into ONE debounced warn with a reason-correct hint, marks slugs `auth-rejected` in the monitor, **destroys providers for permanent classes** (scope/invalid; retrying only spams the bucket) and re-probes them on a 5-minute timer with the same doc (agent wiring survives the provider swap).
9. **Honest status.** Per-doc states (`pending`/`connected`/`auth-rejected`) roll up into `_sync.json` (`docs: {synced,pending,rejected}` + `rejectedSlugs` ≤20, additive fields); `lastSyncAt` updates on real sync activity (reconcile completions), not just offline→online transitions; boot prints a short "linking…" line and the real summary AFTER handshakes settle (15 s ceiling, because rejected providers never resolve `onceSynced`).

## Consequences

- Booting two divergent peers in either order **never loses bytes**: the loser is one `/design:rollback` away. Clean catch-up boots stay silent (no snapshot/conflict spam) thanks to the journal gate.
- The journal is a NEW per-machine state file; losing it is safe (it degrades to the conservative conflict path: extra snapshots, never data loss).
- Hub redeploy (`ghcr.io/1agh/maude-hub` on the next `v*` tag) is needed for the limiter + reason fixes; peers get the data-safety core (journal/conflict protocol/multiplexing) regardless. Old peers (v0.29, 83 sockets) keep working against the new hub; new peers against an old hub degrade to `generic` classification.
- `history.ts` `fileSlug` now strips `.tsx`/`.jsx` like `bin/slug.sh`. Previously a `.tsx` canvas's server-side snapshots landed in a `ui-foo.tsx` dir that `/design:rollback` could never find (latent bug, fixed here because the conflict snapshots depend on the dirs matching). `writeSnapshot` also bumps timestamps to avoid same-millisecond collisions (the dual pre-sync pair).
- `maude design link` now warns when replacing a stored token (the same-machine overwrite that put both peers on one label was silent).
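
The cold-start decision table (Decision item 2) and the fail-closed newest-wins rule (item 3) can be sketched as below. This is a minimal illustration under stated assumptions, not the contents of `sync/cold-start.ts`: the input and result shapes, the `pickWinner` helper, the use of `null` for "empty/unknown", and epoch-millisecond timestamps are all hypothetical. Only the row names, the unknown/tie → hub-wins fallback, and the "refuse the overwrite when the local snapshot did not land" rule are taken from the text.

```typescript
// Hypothetical shapes: the real decideColdStart signature is not shown in this DDR.
type ColdStartAction =
  | "materialize-hub"
  | "seed-local-up"
  | "noop-checkpoint"
  | "fast-forward"
  | "conflict";

interface ColdStartInput {
  localHash: string | null;   // null = nothing on local disk (assumption)
  docHash: string | null;     // null = hub doc is empty (assumption)
  journalHash: string | null; // null = journal absent or corrupt
}

// Pure function: rows are checked in the order the decision table lists them.
function decideColdStart(input: ColdStartInput): ColdStartAction {
  const { localHash, docHash, journalHash } = input;
  if (localHash === null) return "materialize-hub";
  if (docHash === null) return "seed-local-up";
  if (localHash === docHash) return "noop-checkpoint";
  // Fast-forward only when this machine already synced these exact local bytes.
  if (journalHash !== null && localHash === journalHash) return "fast-forward";
  // Absent/corrupt journal or a hash mismatch takes the conservative path.
  return "conflict";
}

interface ConflictInput {
  docBodyEditAt: number | null; // syncMeta.bodyEditAt; null when a peer never stamped
  localMtimeMs: number | null;  // local file mtime; null when unknown
  localSnapshotOk: boolean;     // did the pre-sync-local snapshot actually land?
}

interface ConflictResult {
  winner: "hub" | "local";
  snapshotFailed?: true;
}

// Newest-wins with the hub as the fallback for unknown or tied stamps.
// A hub win is refused if the local (losing) snapshot is missing.
// The local-wins case with a failed hub snapshot is not covered by the text
// and is deliberately not modelled here.
function pickWinner(c: ConflictInput): ConflictResult {
  const localIsNewer =
    c.docBodyEditAt !== null &&
    c.localMtimeMs !== null &&
    c.localMtimeMs > c.docBodyEditAt;
  if (localIsNewer) return { winner: "local" };
  if (!c.localSnapshotOk) return { winner: "local", snapshotFailed: true };
  return { winner: "hub" };
}

// Example: local bytes match the journal but not the hub, so the hub may overwrite.
const action = decideColdStart({ localHash: "abc", docHash: "def", journalHash: "abc" });
// action === "fast-forward"
```

Keeping both functions free of I/O mirrors the "pure module" wording: the callers (agent `reconcile()` and shared-doc `migrateSeed`) would compute hashes, write snapshots, and then act on the returned value. How the real `winner` field is reported when the overwrite is refused is not stated in this DDR; the sketch reports `local` together with `snapshotFailed`.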