# Research Log

Everything we tried, measured, and learned while building this library.

For the current compact browser-accuracy / benchmark snapshot, see `STATUS.md`.
For the current compact corpus / sweep snapshot, see `corpora/STATUS.md`.
For the shared mismatch vocabulary, see `corpora/TAXONOMY.md`.

## Current steering summary

This log is historical. The current practical steering picture is:

- Japanese has two real canaries (`羅生門`, `蜘蛛の糸`), both clean at anchor widths and both still exposing a small positive one-line field on broader Chrome sweeps.
- Chinese has two long-form canaries (`祝福`, `故鄉`) showing the same broad Chrome-positive / Safari-clean split, with real font sensitivity between `Songti SC` and `PingFang SC`.
- Myanmar still has two real canaries with residual Chrome/Safari disagreement around quote/follower-style classes, so it remains the main unresolved Southeast Asian frontier.
- Urdu has a real Nastaliq/Naskh canary (`چغد`) with the same narrow-width negative field in Chrome and Safari, so it is clearly a shaping/context class rather than dirty data or a browser-only quirk. It remains parked rather than actively tuned.
- Arabic coarse corpora are clean; the remaining work there is mostly a fine-width edge-fit class, not the old preprocessing/corpus-hygiene problems.
- Mixed app text still matters because it catches product-shaped classes that books miss, especially soft-hyphen and extractor-sensitive cases.

## The problem: DOM measurement interleaving

When UI components independently measure text heights with DOM reads like `getBoundingClientRect()`, each read can force synchronous layout. If those reads interleave with writes, the browser can end up recomputing layout for the whole document repeatedly.

The goal here was always the same:
- do the expensive text work once in `prepare()`
- keep `layout()` arithmetic-only
- make resize-driven relayout cheap and coordination-free

## Approach 1: Canvas measureText + word-width caching

Canvas `measureText()` avoids DOM layout. It goes straight to the browser's font engine.

That led to the basic two-phase model:
- `prepare(text, font)` — segment text, measure segments, cache widths
- `layout(prepared, maxWidth, lineHeight)` — walk cached widths with pure arithmetic

That architecture held up. The broad browser sweeps are now clean in Chrome, Safari, and Firefox, and the hot `layout()` path is still the core product win.
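
As a rough illustration of the two-phase split, here is a minimal sketch. It is not the library's implementation: `prepareSketch`, `layoutSketch`, `MeasureFn`, and the whitespace-only segmentation are assumptions made for this example, and the measurer is injected so the code runs outside a browser. In a browser, the measurer could wrap canvas `measureText()`.

```typescript
// Hypothetical measurer signature. In a browser this could wrap
// CanvasRenderingContext2D.measureText(); here it is injected so the sketch runs anywhere.
type MeasureFn = (segment: string, font: string) => number;

interface PreparedText {
  wordWidths: number[]; // one cached width per whitespace-separated word
  spaceWidth: number; // width of a single collapsed space
}

const widthCache = new Map<string, number>();

function cachedWidth(segment: string, font: string, measure: MeasureFn): number {
  const key = font + "\u0000" + segment;
  let width = widthCache.get(key);
  if (width === undefined) {
    width = measure(segment, font);
    widthCache.set(key, width);
  }
  return width;
}

// Phase 1: all measurement happens here, once per distinct (font, segment) pair.
function prepareSketch(text: string, font: string, measure: MeasureFn): PreparedText {
  const words = text.split(/\s+/).filter((word) => word.length > 0);
  return {
    wordWidths: words.map((word) => cachedWidth(word, font, measure)),
    spaceWidth: cachedWidth(" ", font, measure),
  };
}

// Phase 2: greedy line breaking using only additions and comparisons.
function layoutSketch(
  prepared: PreparedText,
  maxWidth: number,
  lineHeight: number
): { lineCount: number; height: number } {
  let lineCount = 0;
  let lineWidth = 0;
  for (const wordWidth of prepared.wordWidths) {
    if (lineCount === 0) {
      lineCount = 1;
      lineWidth = wordWidth;
    } else if (lineWidth + prepared.spaceWidth + wordWidth <= maxWidth) {
      lineWidth += prepared.spaceWidth + wordWidth;
    } else {
      // The space before this word is never counted against the previous line.
      lineCount += 1;
      lineWidth = wordWidth;
    }
  }
  return { lineCount, height: lineCount * lineHeight };
}
```

A resize only needs to call `layoutSketch` again with the new `maxWidth`; nothing is re-measured.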

## Rejected: DOM-based or string-reconstruction measurement in the hot path

Several alternatives were tried and rejected:

- measuring full candidate lines as strings during `layout()`
- moving measurement into hidden DOM elements during `prepare()`
- using SVG `getComputedTextLength()`

The pattern was consistent:
- they either reintroduced DOM reads
- or they were slower than the current two-phase model
- or they looked cleaner locally but regressed the actual benchmark path

The important keep was architectural, not algorithmic:
- `layout()` stayed arithmetic-only on cached widths

## Discovery: system-ui font resolution mismatch

Canvas and DOM resolve `system-ui` to different font variants on macOS at certain sizes.

Machine-readable scan:
- [research-data/system-ui-size-scan.json](research-data/system-ui-size-scan.json)

In the recorded scan, mismatches clustered at `10-12px`, `14px`, and `26px`.
`13px`, `15-25px`, and `27-28px` were exact.

macOS uses SF Pro Text at smaller sizes and SF Pro Display at larger sizes. Canvas and DOM switch between them at different thresholds.

Practical conclusion:
- use a named font if accuracy matters
- keep `system-ui` documented as unsafe
- if we ever support it properly, the believable path is a narrow prepare-time DOM fallback for detected bad tuples

What did **not** look trustworthy enough:
- lookup tables
- naive scaling
- guessed resolved-font substitution

## Discovery: word-by-word sum accuracy

Canvas is internally consistent enough that summing measured segments works very well, but not perfectly. Over a full paragraph, tiny adjacency differences can accumulate into a line-edge error.

The keeps were small and semantic:
- merge punctuation into the preceding word before measuring
- let trailing collapsible spaces hang instead of forcing a break

What did **not** survive:
- full-string verification in `layout()`
- uniform rescaling
- generic pair-level correction models

The broad lesson was that local semantic preprocessing paid off more than clever runtime correction.
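
The punctuation keep can be sketched as a small pass over segments before measurement. This is an illustration only: the punctuation set and the name `mergeTrailingPunctuation` are assumptions for this example, not the rules the library actually uses.

```typescript
// Assumption: a small illustrative set of trailing punctuation.
// The real set is not given in this log.
const TRAILING_PUNCTUATION = /^[.,;:!?)\]}"']+$/;

// Glue punctuation-only segments onto the preceding word so the pair is
// measured as one unit, which lets the font engine apply any adjacency effects.
function mergeTrailingPunctuation(segments: string[]): string[] {
  const merged: string[] = [];
  for (const seg of segments) {
    const prev = merged.length > 0 ? merged[merged.length - 1] : undefined;
    if (prev !== undefined && prev.trim().length > 0 && TRAILING_PUNCTUATION.test(seg)) {
      merged[merged.length - 1] = prev + seg;
    } else {
      merged.push(seg);
    }
  }
  return merged;
}
```

Punctuation that follows whitespace, or that starts the text, is left alone because there is no preceding word to attach it to.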

## Discovery: text-shaper is a useful reference, not a runtime replacement

`text-shaper` was useful reference material, especially for Unicode coverage and bidi ideas, but not a replacement for the current browser-facing model.

What was worth taking:
- broader Unicode coverage, e.g. missing CJK extension blocks

What was not worth taking:
- its segmentation as a runtime replacement for `Intl.Segmenter`
- its paragraph breaker as a substitute for browser-parity layout

Bottom line:
- good reference material
- wrong runtime center of gravity for this repo

## Discovery: preserving ordinary spaces, hard breaks, and numeric tab stops is viable

The smallest honest second whitespace mode turned out to be:
- preserve ordinary spaces
- preserve `\n` hard breaks
- preserve tabs with default browser-style tab stops
- leave the other wrapping defaults alone

That became:
- `{ whiteSpace: 'pre-wrap' }`

What mattered:
- preserved spaces still hang at line end
- consecutive hard breaks keep empty lines
- a trailing final hard break does **not** invent an extra empty line
- tabs advance to the next default browser tab stop from the current line start

The mode now covers the textarea-like cases we cared about, and the broad browser sweeps plus the dedicated `pre-wrap` oracle are green.
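
The tab rule is plain arithmetic, which is why it fits the `layout()` constraint. A minimal sketch, assuming tab stops fall at multiples of `tabSize` space widths from the line start (CSS `tab-size` defaults to 8). The function names are hypothetical, and the CSS refinement that skips a stop when it is less than `0.5ch` away is deliberately left out.

```typescript
// Advance from pen position x to the next tab stop, measured from the line start.
function advanceToNextTabStop(x: number, spaceWidth: number, tabSize: number = 8): number {
  const interval = spaceWidth * tabSize;
  if (interval <= 0) return x;
  return (Math.floor(x / interval) + 1) * interval;
}

// Walk one preserved line, treating "\t" segments as jumps to the next stop.
function widthWithTabs(
  segments: string[],
  measure: (segment: string) => number,
  spaceWidth: number
): number {
  let x = 0;
  for (const seg of segments) {
    x = seg === "\t" ? advanceToNextTabStop(x, spaceWidth) : x + measure(seg);
  }
  return x;
}
```

A tab that starts exactly on a stop still advances a full interval, so a tab is never zero-width.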

One important tooling lesson also came out of this:
- keep a small permanent oracle suite
- justify it once with a broader brute-force validation pass
- do not keep the brute-force pass forever once it has done its job

## Discovery: emoji canvas/DOM width discrepancy

Chrome and Firefox on macOS can measure emoji wider in canvas than in DOM at small sizes. Safari does not share the same discrepancy.

What held up:
- detect the discrepancy by comparing canvas emoji width against actual DOM emoji width per font
- cache that correction
- keep it outside the hot layout path

This is now one of the small browser-profile shims that is actually justified.
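
A sketch of how such a per-font correction could be cached. The `EmojiProbe` interface and both function names are hypothetical; in a browser the probe would compare a canvas measurement against a real DOM element, and that comparison happens once per font rather than per layout.

```typescript
// Hypothetical probe: both methods measure the same sample emoji at the given font.
interface EmojiProbe {
  canvasWidth(font: string): number;
  domWidth(font: string): number;
}

const emojiCorrectionByFont = new Map<string, number>();

// Returns how much wider canvas reports one emoji than the DOM does (0 if they agree).
function emojiCorrection(font: string, probe: EmojiProbe): number {
  const cached = emojiCorrectionByFont.get(font);
  if (cached !== undefined) return cached;
  const correction = probe.canvasWidth(font) - probe.domWidth(font);
  emojiCorrectionByFont.set(font, correction);
  return correction;
}

// Applied during preparation, so the hot layout path only ever sees corrected widths.
function correctedSegmentWidth(canvasWidth: number, emojiCount: number, correction: number): number {
  return canvasWidth - emojiCount * correction;
}
```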

## Retired HarfBuzz probe path

We briefly kept a headless HarfBuzz backend in the repo for server-side measurement probes.

What it taught us:
- it was useful for research and algorithm probes
- it was not close enough to our active browser-grounded path to justify keeping it in the main repo
- isolated Arabic words in that probe path needed explicit LTR direction to avoid misleading widths

So if HarfBuzz comes up again later, treat it as explored territory:
- useful as a research reference
- not the runtime direction for Pretext
- not a substitute for browser-oracle or browser-canvas validation

## Final browser sweep closure

The last browser mismatches were not fixed by moving more work into `layout()`. That regressed the hot path and was reverted.

What actually held up:
- better preprocessing in `prepare()`
- better browser diagnostics pages and scripts
- a tiny browser-specific line-fit tolerance

What did **not** change:
- `layout()` stayed arithmetic-only

That remains the right center of gravity for the project.
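
The tolerance shim is small enough to show in full. This is a sketch under assumptions: `BrowserProfile`, `lineFitEpsilon`, and the example values in the usage note are hypothetical, since the log does not record the actual tolerance.

```typescript
// Hypothetical per-browser profile, chosen once at startup.
interface BrowserProfile {
  lineFitEpsilon: number; // extra width, in px, a line may exceed maxWidth and still fit
}

// The fit check stays a single comparison, so layout remains arithmetic-only.
function fitsOnLine(candidateWidth: number, maxWidth: number, profile: BrowserProfile): boolean {
  return candidateWidth <= maxWidth + profile.lineFitEpsilon;
}
```

For example, with a made-up epsilon of `0.005`, a candidate line of `100.004px` fits in `100px`, while with an epsilon of `0` it wraps.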

## Arabic frontier

Arabic took several passes, but the pattern is clearer now.

What survived:
- merge no-space Arabic punctuation clusters during `prepare()`
  - e.g. `فيقول:وعليك`, `همزةٌ،ما`
- treat Arabic punctuation-plus-mark clusters like `،ٍ` as left-sticky too
- split `" " + combining marks` into plain space plus marks attached to the following word
- use normalized slices and the exact corpus font during probe work
- trust the better RTL diagnostics path instead of reconstructing offsets from rendered line text
- clean obvious corpus/source artifacts instead of inventing new engine rules for them
- allow a tiny non-Safari line-fit tolerance bump for the remaining positive fine-width field

What did **not** survive:
- pair correction models at segment boundaries
- larger Arabic run-slice width models
- broad phrase-level heuristics derived from one good-looking probe

Those failed for the same reason at different scales:
- pair corrections were too local to move the real misses
- run-slice widths were much heavier and still did not move the hard widths enough
- both made `prepare()` or `layout()` materially worse without buying a clean Arabic field

So the useful guardrail is:
- if an Arabic idea starts by adding more shaping-aware width caches inside the current segment-sum architecture, be skeptical early
- the Arabic keeps so far have been preprocessing, corpus cleanup, diagnostics, and tiny tolerance shims, not richer width-cache models

Current read:
- Arabic coarse corpora are healthy
- the remaining work is much narrower now
- the unresolved class looks like a mix of fine-width edge-fit and shaping/context, not another obvious preprocessing hole
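
One of the surviving preprocessing keeps, splitting `" " + combining marks`, can be sketched as below. It assumes the upstream segmenter hands over a space and its following marks as a single segment, and it only recognizes a partial range of Arabic combining marks (U+064B to U+065F, plus U+0670). The name `splitSpaceMarkSegments` is hypothetical.

```typescript
// Assumption: a partial set of Arabic combining marks, enough for illustration.
const SPACE_THEN_MARKS = /^( +)([\u064B-\u065F\u0670]+)$/;

// Turn " " + marks into a plain space, and carry the marks onto the next segment.
function splitSpaceMarkSegments(segments: string[]): string[] {
  const out: string[] = [];
  let carriedMarks = "";
  for (const seg of segments) {
    const match = SPACE_THEN_MARKS.exec(seg);
    if (match !== null) {
      out.push(match[1] || "");
      carriedMarks += match[2] || "";
    } else {
      out.push(carriedMarks + seg);
      carriedMarks = "";
    }
  }
  // Marks with no following word are kept rather than dropped.
  if (carriedMarks.length > 0) out.push(carriedMarks);
  return out;
}
```

The space then measures and hangs like any ordinary space, and the marks are measured together with the word they attach to.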

## Long-form corpus canaries

Once the main browser sweep became a regression gate, the long-form corpora became the real steering canaries.

### Mixed app text

This is the most product-shaped canary.

What it has been good for:
- URL/query-string handling
- escaped quote clusters
- numeric expressions like `२४×७`
- time ranges like `7:00-9:00`
- emoji ZWJ runs
- manual soft hyphens

Important keep:
- model URL/query strings as narrow structured units, not one giant breakable blob

Current status:
- almost entirely clean
- one remaining extractor-sensitive soft-hyphen miss around `710px` still looks paragraph-scale or accumulation-sensitive rather than like a neat local bug

### Thai

Thai exposed a product-shaped ASCII quote issue more than a dictionary-segmentation failure.

The keep:
- contextual ASCII quote glue during preprocessing

Result:
- two Thai prose corpora are healthy at anchor widths
- maintained step10 sweeps stayed clean enough that Thai now looks broader than one lucky story

### Khmer

Khmer broadened the Southeast Asian class without immediately demanding new engine work.

The keep:
- preserve explicit zero-width separators from the source text

Result:
- anchor widths and the maintained step10 sweep were clean enough to keep Khmer as a real canary
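
A minimal sketch of what preserving explicit separators can mean in practice: each U+200B in the source becomes a break opportunity instead of being stripped. The `BreakUnit` shape and the function name are assumptions for this example, not the library's data model.

```typescript
const ZWSP = "\u200B";

// Hypothetical unit: a piece of text plus whether a line may break right after it.
interface BreakUnit {
  content: string;
  breakAfter: boolean;
}

// Split a run on zero-width spaces, recording each one as a break opportunity.
function splitOnZeroWidthSpace(run: string): BreakUnit[] {
  const parts = run.split(ZWSP);
  const units: BreakUnit[] = [];
  parts.forEach((part: string, index: number) => {
    if (part.length === 0) return; // consecutive or leading separators add no unit
    units.push({ content: part, breakAfter: index < parts.length - 1 });
  });
  return units;
}
```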

### Lao (rejected)

The Lao corpus attempt was a source problem, not an engine problem.

The raw text was wrapped print/legal text, which made it a dirty `white-space: normal` canary. We rejected it instead of normalizing nonsense into the repo.

### Myanmar

Myanmar is still the main unresolved Southeast Asian frontier.

What survived:
- treat `၊` / `။` / `၍` / `၌` / `၏` as left-sticky during preprocessing
- treat `၏` as medial glue in clusters like `ကျွန်ုပ်၏လက်မ`

What did **not** survive:
- broad Myanmar grapheme breaking in ordinary wrapping
- quote-follower glue like closing-quote + `ဟု`

Current read:
- there are real recurring classes here
- but the obvious tempting heuristics improved one browser and hurt another
- that makes Myanmar a canary, not a license for more instinctive glue rules

### Japanese

Japanese gave us one real semantic keep:
- kana iteration marks like `ゝ` / `ゞ` / `ヽ` / `ヾ` should be treated as CJK line-start-prohibited

What remains:
- a small context-width class around punctuation/quote compression
- good evidence for the exactness ceiling of a width-independent grapheme-sum model in proportional Japanese fonts

So Japanese stays as a canary, not as a place to keep stacking narrow punctuation rules.
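
One simple way to express the iteration-mark keep is to glue each mark onto the unit before it, so no break opportunity exists in front of the mark. This is a sketch, not the library's rule table; `glueLineStartProhibited` is a hypothetical name and the list covers only the four marks named above.

```typescript
// ゝ ゞ ヽ ヾ (U+309D, U+309E, U+30FD, U+30FE)
const KANA_ITERATION_MARKS = ["\u309D", "\u309E", "\u30FD", "\u30FE"];

// Attach each line-start-prohibited mark to the preceding unit, if there is one.
function glueLineStartProhibited(units: string[]): string[] {
  const out: string[] = [];
  for (const unit of units) {
    const prev = out.length > 0 ? out[out.length - 1] : undefined;
    if (prev !== undefined && KANA_ITERATION_MARKS.indexOf(unit) !== -1) {
      out[out.length - 1] = prev + unit;
    } else {
      out.push(unit);
    }
  }
  return out;
}
```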

### Chinese

Chinese is now the clearest active CJK canary.

What we learned:
- Safari is clean on the maintained step10 sweep
- Chrome keeps a broader narrow-width positive field
- the field changes with font choice (`Songti SC` vs `PingFang SC`)

What did **not** survive:
- carrying closing punctuation forward
- coalescing repeated punctuation runs like `——` or `……`

Current read:
- the remaining Chinese field is real
- it is not another obvious punctuation bug
- it is best treated as a canary for the model’s current exactness ceiling

### Sampled cross-font corpus matrix

The first cross-font pass was reassuring:
- Korean, Thai, Khmer, Hindi, Arabic, and Hebrew all stayed exact across the sampled Chrome matrix on this machine

That does **not** mean font fragility is gone. It just means the next likely surprises are:
- new scripts
- finer width sweeps
- or product-shaped mixed text

## Segment metrics cache

The cache used to store just widths. It now stores richer per-segment metrics and computes the more expensive derived facts lazily.

Current useful cached facts include:
- width
- `containsCJK`
- lazily computed emoji count
- lazily computed grapheme widths

That improved repeated `prepare()` work without moving any live measurement back into `layout()`.
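
The eager-versus-lazy split can be sketched as follows. Everything here is illustrative: the class shape, the CJK range test, and the use of `Array.from` (which splits by code point, a rough stand-in for real grapheme segmentation) are assumptions, not the library's cache.

```typescript
class SegmentMetrics {
  readonly width: number; // eager: always needed
  readonly containsCJK: boolean; // eager: cheap to compute
  private graphemeWidthsCache: number[] | undefined; // lazy: only some segments need it

  constructor(
    private readonly segText: string,
    private readonly measure: (s: string) => number
  ) {
    this.width = measure(segText);
    // Assumption: a rough kana + CJK Unified Ideographs range check.
    this.containsCJK = /[\u3040-\u30FF\u4E00-\u9FFF]/.test(segText);
  }

  // Computed on first access, then reused.
  get graphemeWidths(): number[] {
    if (this.graphemeWidthsCache === undefined) {
      this.graphemeWidthsCache = Array.from(this.segText).map((g) => this.measure(g));
    }
    return this.graphemeWidthsCache;
  }
}

const metricsCache = new Map<string, SegmentMetrics>();

function getSegmentMetrics(segText: string, font: string, measure: (s: string) => number): SegmentMetrics {
  const key = font + "\u0000" + segText;
  let metrics = metricsCache.get(key);
  if (metrics === undefined) {
    metrics = new SegmentMetrics(segText, measure);
    metricsCache.set(key, metrics);
  }
  return metrics;
}
```

Segments that never need per-grapheme breaking never pay for the per-grapheme measurements.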

## Soft hyphen support

Soft hyphen became a real internal break kind instead of ordinary text.

What that bought us:
- unbroken lines keep it invisible
- broken lines can expose a visible trailing `-`
- rich APIs stay aligned with the actual break choice

This was a genuine model improvement, not just a cosmetic API change.
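
To make the behavior concrete, here is a sketch that breaks a single word at its soft hyphens. It is an assumption-laden illustration, not the library's break model: `layoutSoftHyphenWord` is a hypothetical name, the measurer is injected, and a piece that cannot fit on its own is simply allowed to overflow.

```typescript
const SOFT_HYPHEN = "\u00AD";

// Greedily take the longest run of pieces that fits. A line that ends at a soft
// hyphen must also have room for the visible "-"; the final line does not.
function layoutSoftHyphenWord(
  word: string,
  maxWidth: number,
  measure: (s: string) => number
): string[] {
  const pieces = word.split(SOFT_HYPHEN).filter((piece) => piece.length > 0);
  const hyphenWidth = measure("-");
  const lines: string[] = [];
  let start = 0;
  while (start < pieces.length) {
    let best = start + 1; // always take at least one piece, even if it overflows
    for (let end = start + 1; end <= pieces.length; end++) {
      const textWidth = measure(pieces.slice(start, end).join(""));
      const isLastPiece = end === pieces.length;
      if (textWidth + (isLastPiece ? 0 : hyphenWidth) <= maxWidth) best = end;
      if (textWidth > maxWidth) break;
    }
    const lineText = pieces.slice(start, best).join("");
    lines.push(best < pieces.length ? lineText + "-" : lineText);
    start = best;
  }
  return lines;
}
```

When the whole word fits, the result is one line with no hyphen and no U+00AD; a `-` appears only on lines that actually end at a chosen break.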

## What Sebastian already knew

Sebastian’s original prototype already had the right overall instinct:
- words/runs as the unit of caching
- browser-grounded measurement
- streamed greedy line breaking

What changed here was mostly engineering discipline:
- caching
- a clean `prepare()` / `layout()` split
- preprocessing
- browser diagnostics
- and a willingness to keep the hot path simple