# fable mode (`xonovex-skill-fable`) A user-invocable (`/fable`) behavioral overlay that nudges Opus 4.8 toward the observable working/writing voice of the Claude Fable 5 model. It restyles **how** the assistant communicates and paces work — it does **not** change correctness, safety, project-instruction adherence, or the underlying model's reasoning. **Honest scope:** a prompt can imitate Fable's surface voice and some working habits. It cannot reproduce Fable's reasoning, judgment, knowledge, or the model itself. Everything below was kept only where it _measurably_ moved Opus toward real Fable output; everything that didn't was dropped. ## Evidence base Derived by comparing a corpus of real Claude Fable 5 output against Claude Opus 4.8 on equivalent tasks, then characterizing the differences. Caveat carried through all conclusions: that corpus is **fixed**, **narrow** (a single technical domain, captured over a short window), and cannot grow. Coding style within it was governed by an opinionated C coding-style guide, not by the model — so code output is not a place the overlay can or should differ. Measured Fable fingerprint (vs Opus 4.8 on the same material): - **Narration** — terser per turn (median roughly ¾ the length); ~half the section headers; near-zero emoji; near-zero hedging; somewhat higher parallel tool-batching on mixed work. - **Documents** — plan/doc artifacts run as dense prose bullets: few headers, few tables, heavy inline code-span citations, occasional em-dash asides. - **Prep** — more investigation before the first code edit; later shown to be largely task-scope-driven, not a model trait. - **Framing** — denser "honest / caveat / failure-mode" language per turn, though no more _likely_ to caveat in a given turn. ## Method Blind A/B tournaments. The model is held **constant** (all arms are Opus 4.8 subagents), so the only variable is the prompt overlay. Each candidate "dimension" is a rule layered on the validated core; a baseline arm gets the core alone. Scoring, per output: 1. **Blind voice-judge** — given real Fable excerpts ("gold") and an anonymized candidate, rates same-author voice similarity 0–10. Judges score voice only, never topic/correctness, and never know which arm is which. 2. **Deterministic markers** — length, headers, hedging, em-dash, bullets, code-span density, verdict-first opener, and parallel-batch rate. **Keep rule (anti-overfit):** a dimension is added only if it beats baseline by a margin on the blind judge (judged against _real_ Fable text, not the surface rules) **and** passes a human gut-check. Margin + blind-vs-real-text guards against Goodharting the metric. 2 reps per arm to damp single-run noise. ## Results by round Scores are blind voice-judge same-author similarity (0–10). ### Round 1 — controlled coding A/B The C coding-style guide active in **both** arms; task = add a small C helper function + tests. - **Code** — the implementation file was byte-**identical** across arms; header and test near-identical. Coding style is the guide, not the overlay. _(inert)_ - **Report** — baseline opener _"Done. I added the helper…"_ (3 headers, ~1540 chars); fable opener _"Pass — tests green, 1/1…"_ (0 headers, ~1120 chars, −27%, verdict-first). _(voice win)_ ### Round 2 — prep front-loading Investigation calls before the first code edit: control 4 vs treatment 4. Task too small to exercise it; the signal is largely task-scope-confounded. **Inconclusive, dropped.** ### Narration tournament Task = read-only investigation report. | Dimension | Similarity | Same-author | vs baseline | | | ----------------------------- | ---------- | ----------- | ----------- | -------- | | D0 baseline (core) | 3.5 | 0/2 | — | | | D1 result+next in one breath | 6.0 | 2/2 | +2.5 | **KEPT** | | D2 exact-evidence verdicts | 4.5 | 1/2 | +1.0 | drop | | D4 pick-one decisive scope | 4.5 | 1/2 | +1.0 | drop | | D5 name+dismiss tooling noise | 4.5 | 1/2 | +1.0 | drop | | D3 name risk before verifying | 3.5 | 0/2 | +0.0 | drop | Baseline scored low here only because final **reports** were judged against Fable's clipped **in-flight** narration (register mismatch — a ceiling no prompt closes). Only D1 cleared the margin _and_ was same-author 2/2. ### Document / plan tournament Task = write an implementation plan. | Dimension | Similarity | Same-author | | ---------------------------------- | ---------- | ----------- | | DD0 baseline (core) | 9.0 | 2/2 | | DD1 dense what+files+rationale | 9.0 | 2/2 | | DD3 inline evidence asides | 9.0 | 2/2 | | DD5 dense bullets over scaffolding | 9.0 | 2/2 | | DD2 decision-first framing | 8.5 | 2/2 | | DD4 citation density | 8.5 | 2/2 | **Recommend none.** The core already produces the doc voice at 9/10 same-author; candidates were redundant or mildly regressive. (Markers showed generated plans run denser/longer than real Fable artifacts — a register/economy gap, not a missing trait; not actioned, would risk overfit.) ### Round 3 — judgment dimensions **Text** (judged vs gold): baseline 9.0; D1 honest-framing 9.0 (already in core, no lift, drop); D4 contract-referencing 8.5 (mildly regressive, drop). **Fork** (intentionally under-specified design task; judged "surfaced the fork?"): | Arm | Surfacing | Surfaced | vs baseline | | | ------------------ | --------- | -------- | ----------- | --------------------------- | | baseline | 4.5 | 1/2 | — | "a well-argued silent pick" | | D3 decision-gating | 7.5 | 2/2 | +3.0 | **KEPT** | **Batching** (transcript metric, request-grouped): control (no instruction) 0.71 batch rate; treatment (instruction) 1.0; Fable reference ~0.30. Opus already batches ≥ Fable on controlled work — not a gap. _(drop)_ ## What the skill encodes All validated against real Fable text: - Verdict-first openers; terse 1–3 sentence turns; no preamble/recap - No hedging; explicit noise-discounting; confident on verified facts - Exact naming (files/symbols/error-kinds in backticks) - Minimal in-flow formatting (prose, not headers/tables/emoji; structure reserved for genuine deliverables) - Parallel tool batching - One-breath result+next cadence — _narration tournament: +2.5_ - Decision-gating: surface genuine design forks for the user (options + recommendation) while staying decisive on the unambiguous — _Round 3: +3.0_ ### Tested and dropped Not guessed — measured inert, regressive, or not-a-gap: - prep front-loading (inconclusive / task-scope-confounded) - anticipatory risk-naming (+0.0) - exact-evidence / decisive-scope / noise-discount narration emphases (within n=2 noise) - honest-framing emphasis (already in the core) - contract-referencing emphasis (mildly regressive, −0.5) - parallel-batching dimension (Opus already batches ≥ Fable) ## Honest limitations - **Model identity is not promptable.** Reasoning quality, judgment, knowledge, and the model's own texture come from the weights; this skill cannot reach them. It restyles the seams. - The residual gap the judges kept citing is **register/length** (clipped in-flight narration vs full deliverables), which is task-bound, not a voice rule. - The reference corpus is fixed, short, and from a single technical domain. Conclusions may not generalize, and it cannot be grown. - Scores are proxies; the blind-judge-vs-real-text + margin design mitigates but does not eliminate overfitting risk. ## Use `/plugin` → install + enable `xonovex-skill-fable`, then `/fable` to toggle on.