# Peritext in Sal vs. the paper

Cross-reference between the Lean formalization in
`Sal/{CRDTs,MRDTs}/Peritext/` and the CSCW 2022 paper:

> Geoffrey Litt, Sarah Lim, Martin Kleppmann, and Peter van Hardenberg.
> *Peritext: A CRDT for Collaborative Rich Text Editing*. Proc. ACM
> Hum.-Comput. Interact. 6, CSCW2, Article 531 (November 2022),
> 36 pages. <https://doi.org/10.1145/3555644>

This document is a living map: where the Lean matches the paper,
where it deliberately departs, and where follow-up work is flagged.

## At a glance — convergence, causality, intent preservation

| Paper claim | Sal artifact | Status / notes |
|---|---|---|
| **§A.1 Convergence** | 24 RA-linearizability VCs in `Peritext_CRDT.lean` and `Peritext_MRDT.lean` (`merge_comm`, `merge_idem`, `base_2op`, `ind_lca_2op`, `inter_*`, `ind_*`, `lem_0op`) | ✅ All closed via `by sal`, kernel-checked. Universally quantified over op interleavings. |
| **§3.1 Causality preservation** | Framework assumption (causal delivery); `afters` map structurally encodes insert causality; `causal_order_visible_lt` (RGA readside) lifts it to visible-order | ✅ Same convention as paper §3.1. |
| **§A.2 Intent preservation — convergence at the read** | `readRichText_visible_convergent` | ✅ Read-side analogue of merge convergence. |
| §A.2 Ex 1 — insertion within a span | `insert_within_span_in_span_visible` + propagation lemmas | ✅ |
| §A.2 Ex 2 — overlapping same-type Adds | `partial_overlap_all_adds_formatted_visible` | ✅ |
| §A.2 Ex 3 — different mark types coexist | `different_type_adds_coexist_visible` | ✅ |
| §A.2 Ex 4 — same-type, different values (colors) | not formalised | ⚠️ Requires `markValue` field on `MarkOp` — state-shape change. |
| §A.2 Ex 5 — bold vs non-bold conflict | `add_wins_over_concurrent_remove_visible` + `no_add_cover_implies_unformatted_visible` | ✅ LWW by opId (paper §4.4). |
| §A.2 Ex 6 — overlapping comments via distinct markType | follows from `different_type_adds_coexist_visible` | ✅ |
| §A.2 Ex 7 — bold-boundary insertion expands | `ex7_bold_older_sibling_in_span` + `bold_expand_in_span_visible` (via `bold_expand_reach`) | ✅ |
| §A.2 Ex 8 — link-boundary insertion does not expand | `ex8_link_descendant_visible_lt_endId` + `_not_in_span_visible_of_wf` | ✅ Uses `wf_afters` acyclicity. |
| (Beyond paper) Anchors survive tombstones | `anchors_survive_tombstones_visible` | ✅ Implicit in §4.4 paper discussion. |

## Formalization target

The paper describes an **op-based** CRDT; our Lean is a **state-based**
reformulation. The two are equivalent in expressive power (paper §2.3),
but the correctness criteria land in different places:

| Paper | Our Lean | Notes |
|---|---|---|
| Convergence (Thm A.1, §A.3) | 24 RA-linearizability VCs (per variant) | Pointwise equality on each state component. CRDT has four components (`chars : map`, `afters : map`, `deleted : map`, `marks : set`); MRDT has three (`chars : set CharRec`, `removed : set OpId`, `marks : set`). Per-replica snapshot rather than op-log consistency. Both variants are fully sorry-free. |
| Causality preservation (§A.1) | Framework-level assumption | Sal's state-based model assumes causal delivery; the VCs verify the local reconciliation rule. |
| Intention preservation (§A.2, 8 examples) | Characterization theorems in `Peritext_ReadSide.lean` (mirrored on both CRDT and MRDT sides) | See the table below. |

## Intent-preservation: paper examples ↔ Lean theorems

| Paper | Section | Lean theorem | Status |
|---|---|---|---|
| Ex 1 — insertion within a span | §3.1, §A.2 | `insert_within_span_in_span_visible` (both sides) + `in_span_visible_propagate` / `_of_reach` | ✅ Paper-faithful via visible-order; caller provides the right-side bound from RGA geometry in their specific scenario. |
| Ex 2 — overlapping same-type Adds | §3.2, §A.2 | `partial_overlap_all_adds_formatted_visible` | ✅ |
| Ex 3 — different mark types coexist | §3.2, §A.2 | `different_type_adds_coexist_visible` | ✅ |
| Ex 4 — overlapping same-type different-values (colors) | §3.2.1, §A.2 | **Not expressible in our model** | Our `MarkOp` has `markType : ℕ` but no per-mark `value` field. Ex 4's resolution (LWW among distinct color values of the same markType) requires representing both "red" and "blue" as instances of the same markType, which we can't. Encoding each color as a distinct markType would make `different_type_adds_coexist_visible` apply — but that's the opposite of Ex 4's intent. |
| Ex 5 — conflicting bold vs non-bold | §3.2.1, §A.2 | `add_wins_over_concurrent_remove_visible` (positive) + `no_add_cover_implies_unformatted_visible` (negative) | ✅ LWW by opId. |
| Ex 6 — overlapping comments via distinct markType | §3.2.2, §A.2 | Follows from `different_type_adds_coexist_visible` | ✅ (Comments encode each instance with a unique `markType`; the theorem then applies.) |
| Ex 7 — bold-boundary insertion expands | §3.3, §A.2 | `ex7_bold_older_sibling_in_span` (cross-sibling case) + `bold_expand_in_span_visible` (post-endId region) | ✅ Two complementary results: the cross-sibling case (insert lands before `endId` in visible order) uses the standard visible-order bound; the post-`endId` bold-expand region uses `bold_expand_reach`, which identifies afters-descendants of `endId` reachable via chains of post-mark-opId characters (paper's §3.3 "grab new inserts at the after-side" encoded via opId comparison). |
| Ex 8 — link/comment-boundary insertion doesn't expand | §3.3, §A.2 | `ex8_link_descendant_visible_lt_endId` (positive) + `ex8_link_descendant_not_in_span_visible` / `_of_wf` (full negation, link case) | ✅ Under link semantics (`endSide = false`), afters-descendants of `endId` are correctly excluded. Full negation uses the `wf_afters` acyclicity invariant. |
| Table 1 — "Can marks overlap?" | §3.4 | Emergent from `formatted_visible` being per-`markType` | ✅ |
| Table 1 — "Do marks expand?" | §3.4 | Captured by Ex 7 / Ex 8 visible-order demos above | ✅ |

Plus one additional theorem not tied to a specific paper example:

- `anchors_survive_tombstones_visible` — tombstoning any character
  leaves the formatting of the other visible characters unchanged.
  Implicit in the paper's §4.4 discussion of tombstoned anchors; we
  state it as a parameterized theorem over all states.

## Deliberate departures

**List vs. per-char `readRichText_visible`.** The paper (§4.4) presents the
rendered document as a list of `{text, format}` spans. Our
`readRichText_visible` is instead a per-character function whose
exact signature depends on the variant:

- **CRDT:** `OpId → Option (ℕ × (ℕ → Bool))` — `(codepoint,
  markType → is-formatted)` per visible char.
- **MRDT:** `OpId → Option (ℕ → Bool)` — just the formatting
  function; the codepoint lookup is skipped because in the MRDT's
  set-of-triples `chars` representation, payload lookup at an `OpId`
  requires `Classical.choose` on an existential.

The list form is available as `readRichText_list` (both sides), a
Prop-valued spec via `is_rga_traversal`: callers construct the
traversal externally and prove it satisfies the spec. See
*Still pending* below for the existence-theorem follow-up.

## Infrastructure summary

The read-side is built on the RGA visible-order relation `visible_lt`
(four-rule inductive: parent-child, sibling, left-descendant-of-sibling,
transitive closure) and its reflexive closure `visible_le`. On top of
those:

- `in_span_visible` — paper-faithful covering predicate with side-bit
  refinements on both boundaries.
- `mark_wins_visible` / `formatted_visible` / `readRichText_visible`
  — read-side projection built from `in_span_visible` plus
  `mark_beats` (LWW by opId, paper §4.4).
- `is_rga_traversal` + `readRichText_list` — list-form presentation
  as a Prop-valued spec; callers construct a traversal externally
  and prove it satisfies the spec (the framework's `set α := α → Bool`
  doesn't natively enumerate; see `docs/list-form-readrichtext-design.md`).
- Congruence lemmas (`_eq_of_afters_eq` family on CRDT,
  `_eq_of_chars_eq` on MRDT) lift pointwise state equality through
  the inductive relations.
- Insert-monotonicity chain (`after_of_preserved_under_insert`,
  `afters_reach_preserved_under_insert`, `visible_lt_preserved_under_insert`,
  `visible_le_preserved_under_insert`) powers the Ex 1 insert-within-span
  closure.
- `wf_afters` — state-level acyclicity invariant on `afters`, with
  `visible_lt_asymm_of_wf` and `visible_le_antisymm_of_wf` as
  corollaries.
- `bold_expand_reach` — paper §3.3 bold-expand region past `endId`,
  characterized by afters-chain reachability through post-mark-opId
  characters only. Embedded as the second disjunct of `in_span_visible`'s
  `endSide = true` right bound.

## Still pending

- **Existence of `is_rga_traversal`** for every state — requires
  either a finite-carrier extension to `Set_Extended` (framework-wide
  refactor) or per-theorem `Finset` hypotheses threaded through the
  read-side. The current spec is usable without existence.
- **Preservation of `wf_afters` under `do_` / `merge`.** Holds
  structurally for states reached from `init_st` but needs an
  inductive trace proof.

### How Ex 7 / bold-expand is captured: `bold_expand_reach`

Paper §3.3's bold-expand rule says: under `endSide = after`, the end
anchor grabs inserts that arrive at or past `endId`. Op-based Peritext
recovers "arrives after the mark" from opId comparisons — every op
carries a `(replica, counter)` opId, and comparing `c.opId` against
`mark.opId` tells you whether `c` was inserted after the mark was
created. This information is state-recoverable (opIds are stored in
state), so bold-expand can be formalized in a state-based model.

We capture this via the `bold_expand_reach` inductive predicate:

```lean
inductive bold_expand_reach (s : concrete_st) (m : MarkOp) : OpId → Prop
  | at_endId : bold_expand_reach s m (mark_endId m)
  | step {c c_parent} :
      after_of s c c_parent = true →
      bold_expand_reach s m c_parent →
      opid_max (mark_opId m) c = c →    -- c.opId > mark.opId
      bold_expand_reach s m c
```

A character `c` is in the bold-expand region iff there's an
afters-chain from `c` back to `endId` where every step's character
post-dates the mark (`c.opId > mark.opId`). Pre-mark siblings break
the chain, so older characters at the boundary are correctly
excluded. New post-mark inserts are correctly included.

`in_span_visible`'s right bound under `endSide = true` is the
disjunction `visible_le c endId ∨ bold_expand_reach s m c` — the
standard link-contract bound OR the bold-expand region.

**Caveat on truly-concurrent inserts.** Two ops that are HB-incomparable
to the mark may have `opId > mark.opId` or `opId < mark.opId` depending
on replica-counter interleaving. Our rule commits to the total order
given by `opid_max`, which matches the RGA sibling tiebreak and is
therefore consistent with how visible order is computed. Replicas
agree on both, so convergence is preserved.

## Paper implementation details not modelled

- **Tombstone-scanning on insert (§4.2.2).** The paper's `Insert`
  scans the marks set when placing a character after tombstones
  that are span endpoints (so "frolicked" lands outside the link
  in Fig 6). Our `do_` has no mark awareness. Adding it breaks the
  four-component independence of our state and would require
  re-verifying the 24 VCs. See `docs/tombstone-scanning-design.md`.
- **Op-set compression at anchor boundaries (§4.3, Algorithm 1).**
  We store a flat `set AnchorAttachment` rather than per-position
  op-sets. Observationally equivalent for convergence; an
  implementation that refactors to op-set form would need to
  re-prove the VCs.
- **Incremental patch emission (§4.5).** UI-layer concern; not part
  of the state-based spec.

## Out of scope

These are acknowledged as future work in the paper itself (§5):

- **Block-level structure** (paragraphs, headings, lists, nested
  blocks).
- **Embedded non-text objects** (images, horizontal rules). Our
  `Insert` carries `ch : ℕ` — a single codepoint — with no block
  or media support.

## Related files

- `Sal/CRDTs/Peritext/Peritext_CRDT.lean` — spec + 24 VCs.
- `Sal/CRDTs/Peritext/Peritext_ReadSide.lean` — read-side projection
  + paper-semantic theorems.
- `Sal/MRDTs/Peritext/{Peritext_MRDT,Peritext_ReadSide}.lean` — MRDT
  counterparts. The MRDT's `eq` is pointwise `==` on each set
  component, which lifts via `funext` + `Prod.ext` to full
  functional equality on the state — making
  `readRichText_visible_convergent` a tidy chain of component-wise
  `funext`s then `rw`, rather than the CRDT's simp-heavy rewrite
  through the `mysel_c` / `contains` / ... abstractions.
- `docs/porting-op-based-crdts.md` — general recipe; Peritext is
  the worked example throughout.