# slop-cop — Calibration

slop-cop scores prose on **two parallel axes**:

- **AI-Slop** — does this read like AI wrote it? (texture, rhythm, vocabulary tells, formatting)
- **Comprehension** — can a fresh reader follow this? (acronyms, named-entity bombing, telegraphic compression, missing thesis, structure, readability)

Each axis has its own catalog, its own density score (same formula, different catalog), and its own verdict. A piece can fail one and pass the other:

- Dense academic prose → may PASS AI-Slop (no `delve`, no em-dash clusters) but CRITICAL Comprehension (jargon-bombed, no thesis)
- Sycophantic ChatGPT marketing → CRITICAL AI-Slop, MEDIUM Comprehension
- Hand-written cover letter → PASS both
- Twitter-thread summary written by a human in a hurry → LOW AI-Slop, HIGH Comprehension (telegraphic, named-entity bombing)

The audit reports both verdicts. The combined recommendation is driven by whichever is worse.

---

## Table of contents

### AI-Slop axis (sections 1–8)
1. [AI-Slop density scoring](#1-density-based-scoring)
2. [Severity tiers (AI-Slop)](#2-severity-tiers-explicit)
3. [Genre adjustments](#3-genre-adjustments)
4. [Model fingerprints](#4-model-fingerprints)
5. [Contested tells](#5-contested-tells)
6. [The sanding-off problem](#6-the-sanding-off-problem)
7. [The uncanny-valley rule](#7-the-uncanny-valley-rule)
8. [Burstiness approximation](#8-burstiness-approximation)

### Comprehension axis (sections 9–11)
9. [Comprehension density scoring](#9-comprehension-density-scoring)
10. [Audience calibration](#10-audience-calibration)
11. [Cross-axis recommendations](#11-cross-axis-recommendations)

---

## 1. Density-based scoring

A single tell is not a signal. Real writers use individual tells all the time. The signal is **how many show up per 500 words**, weighted by severity.

### The formula

For a draft of N words, normalize to 500-word units (`U = N / 500`). Count violations by severity:

- `H` = high-severity tells (always-cut items)
- `M` = medium-severity tells
- `L` = low-severity tells (informational only)

Compute the **density score**:

```
density = (H × 3) + (M × 1) + (L × 0.25)          if H, M, L are counts per 500 words
        = ((H × 3) + (M × 1) + (L × 0.25)) / U    if H, M, L are counts over the full draft
```

### Verdict thresholds

| Density score | Verdict | Action |
|---|---|---|
| 0–2 | PASS | Polish-pass at most |
| 2–5 | LOW | Spot-fix the listed items |
| 5–10 | MEDIUM | Spot-fixes still suffice, but expect significant cleanup |
| 10–18 | HIGH | Substantial revision required |
| 18+ | CRITICAL | Recommend rewrite from scratch |
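
As a rough illustration of the formula and thresholds, the sketch below computes the density from whole-draft counts and maps it to a tier. The function names are hypothetical. Sending a score that lands exactly on a shared edge (2, 5, 10, 18) to the higher tier is an assumption; the table does not say which side owns the edge.

```python
# Hypothetical helper names; the weights and tier edges come from the tables above.
WEIGHTS = {"H": 3.0, "M": 1.0, "L": 0.25}
# (lower edge, verdict), checked from the most severe tier down.
TIERS = [(18, "CRITICAL"), (10, "HIGH"), (5, "MEDIUM"), (2, "LOW"), (0, "PASS")]


def density_score(h: int, m: int, l: int, word_count: int) -> float:
    """Weighted tell count per 500 words, computed from whole-draft counts."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    units = word_count / 500  # U = N / 500
    return (h * WEIGHTS["H"] + m * WEIGHTS["M"] + l * WEIGHTS["L"]) / units


def verdict(density: float) -> str:
    """Map a density score to a tier; edge values go to the higher tier (assumption)."""
    for lower_edge, name in TIERS:
        if density >= lower_edge:
            return name
    return "PASS"


# A 1,200-word draft with 2 H, 6 M and 4 L tells:
# (2*3 + 6*1 + 4*0.25) / (1200 / 500) = 13 / 2.4
score = density_score(2, 6, 4, 1200)
print(round(score, 2), verdict(score))  # prints: 5.42 MEDIUM
```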

### Compound triggers (escalate one tier)

- A high-severity rhetorical pattern + 5+ vocabulary hits + 1+ formatting tell within the same 500 words → escalate one tier
- Three or more H-severity tells in a single paragraph → escalate one tier
- The "uncanny valley" condition (see §7) → escalate one tier even when no individual tell is high-severity
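
A minimal sketch of the escalation step, under two assumptions that the list above does not settle: several triggers firing together still move the verdict by only one tier, and CRITICAL is the ceiling. The helper names are hypothetical.

```python
TIER_ORDER = ["PASS", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def escalate(verdict: str, triggers_fired: int) -> str:
    """Move one tier up the scale if at least one compound trigger fired."""
    index = TIER_ORDER.index(verdict)
    if triggers_fired > 0:
        # Assumption: multiple triggers do not stack, and CRITICAL is the cap.
        index = min(index + 1, len(TIER_ORDER) - 1)
    return TIER_ORDER[index]


def paragraph_trigger(h_tells_per_paragraph: list[int]) -> bool:
    """True when any single paragraph holds three or more H-severity tells."""
    return any(count >= 3 for count in h_tells_per_paragraph)


# Second paragraph carries three H tells, so LOW becomes MEDIUM.
fired = int(paragraph_trigger([0, 3, 1]))
print(escalate("LOW", fired))  # prints: MEDIUM
```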

### What density does and doesn't tell you

It tells you whether the prose reads as AI-shaped. It does not tell you whether AI wrote it — humans imitate AI, and AI imitates humans. Treat the verdict as "this prose has the shape of AI writing," not "this prose was generated by AI."

---

## 2. Severity tiers explicit

The patterns, vocabulary, and formatting-tells files all tag every item H/M/L. The definitions:

### High (H) — always cut

The phrase or pattern is essentially never the right choice. Even one instance in casual prose pushes the verdict toward a worse tier. Examples:

- Em dashes in clusters (3+ per 500 words)
- Bold-first bullets in any short prose piece
- Sycophancy openers/closers ("Great question!", "I hope this helps!")
- "Delve" / "tapestry" / "showcasing"
- Grandiose framing ("stands as a testament to")
- Copula avoidance ("serves as", "boasts")
- Knowledge-cutoff disclaimer leakage ("As of my last update...")
- Vague-authority weasels with no citation

Cut without exception unless the phrase is being used in scare quotes or ironically.

### Medium (M) — usually cut

The phrase or pattern survives in narrow contexts. Default is to cut; keep only if the word/structure is doing specific work that nothing else can. Examples:

- "While X, Y" sentence opener (one is fine; three is a pattern)
- "actually" (survives only contrasting concrete reality with theory)
- Hedged superlatives (sometimes warranted in genuinely uncertain claims)
- Symmetrical sentence pairs (one is rhetoric; three is a tic)
- Two-word punchlines (once per piece is forgivable)

In an audit, M-severity items are listed with the question: is this doing specific work? If not, cut.

### Low (L) — context-dependent

Weak tell on its own. Note in the audit report but don't down-score the verdict. Examples:

- Em dashes alone (1–2 in a long piece, post-GPT-5.1)
- Absent contractions (formal register may justify it)
- Universal Oxford comma + American spelling
- "Actually" used to contrast theory and reality

L items inform the diagnosis ("this prose has these formal-register tells") but do not tip the verdict.

---

## 3. Genre adjustments

The same tell carries different weight depending on genre. The audit infers genre from the draft (or accepts a `--genre` flag for the scanner) and adjusts thresholds.

### Casual / first-person / blog / Reddit / email

Default thresholds. Every tell carries full weight. This is the strictest mode and the most common case: users invoking the skill on their own writing usually want this.

### Marketing / sales / landing copy

Marketing copy legitimately uses some intensifiers ("transformative", "groundbreaking") and some structure (TL;DRs, bulleted benefits). Adjust:

- Reduce buzzword-density penalty by 30%
- "Comprehensive", "robust", "seamless" allowed at 1 instance per 500 words before flagging
- Sycophancy still always-cut (no genre justifies "I hope this helps!")
- Performative openers still always-cut

### Academic / research / formal

Academic prose legitimately hedges ("studies show" with citations is fine), uses some passive constructions, and follows section conventions ("Methods", "Results"). Adjust:

- Vague-authority phrases: only flag when uncited. "Studies show [Smith 2024]" is not a tell.
- Hedge stacking: only flag when the hedging exceeds the genuine uncertainty (research catalog notes that calibrated hedging is fine; saturation hedging is the tell)
- "Comprehensive review", "novel approach" allowed in title position
- "Challenges and Future Directions" section is normal here — only flag in non-academic prose

### Encyclopedic / reference / Wikipedia-style

LLMs were trained heavily on Wikipedia, so encyclopedic prose commonly triggers false positives in AI detectors (GPTZero documents this). Adjust:

- Reduce all severity tiers by one for the duration of encyclopedic passages
- Copula avoidance ("serves as") still flagged — Wikipedia editors actually use "is" most of the time
- Synonym cycling more tolerated
- Burstiness threshold relaxed (encyclopedic prose runs uniform)

### Fiction / dialogue / character voice

Voice-aware judgment. The character's voice may legitimately use any of these patterns. Apply the rules to *narration* but not to *dialogue or interior monologue*. The tells catalog is tuned to catch the AI model's house voice; a strong character voice can override it.

If unsure of genre, default to **casual** (strictest). Users can override.

---

## 4. Model fingerprints

When density indicates AI shape, identify the likely model. This serves diagnosis ("this looks like Gemini, not Claude — adjust your prompts") and prompt engineering.

### GPT-4 / 4o / 5

- **Verb signature:** delve, underscore, navigate, leverage, harness, showcase
- **Adjective signature:** noteworthy, commendable, intricate, meticulous, comprehensive
- **Power words:** supercharge, unleash, dive in, game-changing
- **Trigrams:** "individuals with diabetes", "characterized by elevated", "ranging from", "play a significant role"
- **Format signature:** heavy bullets and headers, em dashes (pre-5.1; opt-out exists since Nov 2025)
- **Register:** formal/clinical; reads like a slick consultancy deck

### Claude

- **Verb signature:** examine, consider, distinguish, illuminate (lighter touch than GPT)
- **Adjective signature:** meaningful, careful, specific, worth examining
- **Trigrams:** "the distinction is worth", "meaningfully reduces", "I notice that", "it's worth examining"
- **Format signature:** clean paragraphs over heavy formatting; less bullet-heavy than GPT
- **Sycophancy style:** softer — "I notice…", "I should be careful here…", "it's worth examining…" rather than "Great question!"
- **Register:** academic-but-approachable

### Gemini

- **Verb signature:** explore, navigate, understand (tutorial verbs)
- **Adjective signature:** simpler vocabulary than GPT (uses "high blood sugar" where GPT uses "elevated blood glucose levels")
- **Trigrams:** "the way for", "the cascade of", "is not a", "in the world of"
- **Format signature:** verbose; over-explains; longer paragraphs than necessary
- **Register:** "Google search result that learned to write paragraphs"

### Reporting

The scanner uses these clusters as a heuristic. The audit report includes a "Likely model fingerprint" line: none / GPT / Claude / Gemini, with 2–3 specific markers as evidence. When two clusters are equally likely (mixed-model edits, or human polish on top of AI output), report "mixed" rather than picking.

Per Scientific American (cross-stylometry across thousands of outputs): all three models cluster tightly in stylometric space, while humans spread broadly. So the fingerprint signal is strong when present — but only when the prose is unedited or lightly edited.

---

## 5. Contested tells

Some tells are contested in the research. The audit acknowledges contestation rather than pretending unanimity.

### Em dashes

The most-cited AI tell of 2024–2025. Rolling Stone, TechRadar, NYT all covered it. But OpenAI added an em-dash opt-out in GPT-5.1 (Nov 2025), and many human writers (Cory Doctorow, Cormac McCarthy estate, half of literary fiction) use them constantly.

**This skill's default:** em dashes in clusters (3+ per 500 words) = H severity. Em dashes alone (1–2 in a long piece) = L severity. Single em dashes are noted but don't down-score.

**User override:** if the user is Mahmoud or any writer with an explicit no-em-dash voice rule, treat ALL em dashes as H. Pass `--strict-em-dash` to the scanner.

### "Actually" and decorative adverbs

Most decorative adverbs ("genuinely", "truly", "honestly", "frankly", "ultimately") are H. But "actually" survives when contrasting concrete reality with theory ("the model actually works under load"). The rule: if removing the adverb leaves the sentence unchanged or stronger, it was filler. The scanner flags every instance; the audit uses judgment.

### Em-dashed asides vs comma asides

When a writer has been told "no em dashes" and they convert em dashes to comma asides, the prose can read awkwardly punctuated. The audit notes when comma-aside density spikes — sometimes that's a signal of em-dash conversion rather than natural rhythm.

### Tricolons in formal prose

Three-beat structures ("life, liberty, and the pursuit of happiness") are a literary tradition. They survive in speeches, formal essays, and explicitly rhetorical contexts. The pattern flag is for *unintended* tricolon abuse — three adjectives strung together because the model defaulted to it.

### "From X to Y" ranges

Sometimes a real range. Sometimes false. The audit asks: are X and Y genuinely the endpoints of a spectrum, or just two illustrative examples? If the latter, the construction is a tell.

When a tell is contested, the audit notes the contestation in the calibration section of the report.

---

## 6. The sanding-off problem

Sophisticated authors prompt-engineer around famous tells. After "delve" went viral in early 2024, arXiv frequency dropped sharply within months. The flagship vocabulary list is now less reliable than it was.

### Implication for the scanner

Newer / less-famous tells are weighted higher than the v1 vocabulary list. Specifically:

- **Boost by 1.5x:** copula avoidance ("serves as", "boasts"), present-participle "-ing" tails, anaphora abuse, false ranges, hedge stacking, "while X, Y" openers
- **Standard weight:** vocabulary tells from category 2A (verbs), 2B (metaphors), 2C (intensifiers) — the famous list
- **Standard weight, but flagged when present:** sycophancy openers/closers (RLHF artifact, hard to sand off)
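
One way the boost could be applied, sketched under the assumption that the 1.5x multiplier scales a tell's severity weight before it enters the density sum. The category labels and the `Tell` type are hypothetical stand-ins for whatever the scanner actually emits.

```python
from dataclasses import dataclass

SEVERITY_WEIGHT = {"H": 3.0, "M": 1.0, "L": 0.25}
# Hypothetical category labels for the boosted, less-famous tells.
BOOSTED = {
    "copula_avoidance", "ing_tail", "anaphora_abuse",
    "false_range", "hedge_stacking", "while_opener",
}
BOOST = 1.5


@dataclass
class Tell:
    category: str
    severity: str  # "H", "M" or "L"


def weighted_points(tells: list[Tell]) -> float:
    """Sum severity weights, multiplying boosted categories by 1.5."""
    total = 0.0
    for tell in tells:
        points = SEVERITY_WEIGHT[tell.severity]
        if tell.category in BOOSTED:
            points *= BOOST
        total += points
    return total


found = [
    Tell("copula_avoidance", "H"),  # 3 * 1.5 = 4.5
    Tell("famous_vocabulary", "H"),  # 3 (standard weight)
    Tell("while_opener", "M"),  # 1 * 1.5 = 1.5
]
print(weighted_points(found))  # prints: 9.0
```

Dividing the result by `U` gives a boosted density on the same scale as §1.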

### The rough heuristic

If a draft is *clean* on category 2A famous-vocabulary tells but *dirty* on category B sentence-level tells, that's a strong signal of sanded prose: the writer (or the prompt) removed the easy vocabulary tells but didn't catch the structural ones. The audit report flags this as "sanded-prose signature" when present.

Conversely, a draft heavy on famous vocabulary but clean on structural tells is more likely human imitation of AI than actual AI output.

---

## 7. The uncanny-valley rule

Multiple weak tells stacking causes "subliminal discomfort" — readers feel something is off before identifying why. Many sources describe this effect (LitHub, Pangram, The Ignorance Field Guide).

### The trigger

When all three of the following are true:

1. Zero high-severity tells
2. Eight or more medium-or-low-severity tells per 500 words
3. Burstiness ratio below 0.5 (sentence lengths too uniform)

…escalate the verdict by one tier even though no individual violation is severe.

The diagnosis line in the audit reads: "Uncanny-valley pattern — multiple weak tells stacking. No single phrase reads as AI; the cumulative texture does."

This catches sanded prose (see §6) and well-prompted output where the writer removed the famous tells but didn't fix the underlying rhythm.
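
The three-part trigger can be sketched as a single predicate. The function name is hypothetical, and the counts are assumed to be whole-draft totals that get normalized to 500-word units here.

```python
def uncanny_valley(h: int, m: int, l: int, word_count: int, burstiness: float) -> bool:
    """True when all three uncanny-valley conditions hold."""
    units = word_count / 500
    weak_per_500 = (m + l) / units  # medium-or-low tells per 500 words
    return h == 0 and weak_per_500 >= 8 and burstiness < 0.5


print(uncanny_valley(h=0, m=6, l=4, word_count=500, burstiness=0.35))  # prints: True
print(uncanny_valley(h=1, m=6, l=4, word_count=500, burstiness=0.35))  # prints: False
```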

---

## 8. Burstiness approximation

Burstiness measures sentence-level variation. The scanner approximates it with sentence length alone: standard deviation of sentence length divided by mean sentence length. Humans cluster around 0.6–1.2; LLMs cluster around 0.2–0.4.

The scanner reports the burstiness ratio. The audit uses it as a hint, not a verdict — once a human edits AI output, burstiness rises and the signal weakens.

### How the scanner computes it

```
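from math import sqrt

# `sentences` is the scanner's list of sentence strings; words are split on whitespace.
sentence_lengths = [len(s.split()) for s in sentences]
mean = sum(sentence_lengths) / len(sentence_lengths)
# Population standard deviation (divide by n, not n - 1).
std = sqrt(sum((x - mean) ** 2 for x in sentence_lengths) / len(sentence_lengths))
burstiness = std / mean  # coefficient of variation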
```

### Interpretation

- Below 0.3: strong AI rhythm signal
- 0.3 to 0.5: AI-leaning, but light editing can push prose into this range
- 0.5 to 0.8: ambiguous; light human polish on AI output, or naturally rhythmic AI prompting
- Above 0.8: strong human rhythm signal

### Caveats

- Encyclopedic / reference prose runs uniform regardless of source. Burstiness is unreliable in that genre.
- Very short drafts (<100 words) don't have enough sentences to compute burstiness reliably. The scanner reports `n/a` below 5 sentences.
- Burstiness alone is never the verdict. It's one signal among many.
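
For illustration, here is a self-contained sketch that ties the computation, the interpretation bands, and the five-sentence floor together. The sentence splitter is a deliberately naive stand-in (the real scanner's splitter is not described here), the function names are hypothetical, and placing the exact values 0.3, 0.5 and 0.8 in the bands shown is an assumption.

```python
import re
from math import sqrt
from typing import Optional

MIN_SENTENCES = 5  # below this the ratio is reported as n/a


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def burstiness(text: str) -> Optional[float]:
    """Std / mean of sentence length in words, or None for very short drafts."""
    lengths = [len(s.split()) for s in split_sentences(text)]
    if len(lengths) < MIN_SENTENCES:
        return None
    mean = sum(lengths) / len(lengths)
    std = sqrt(sum((x - mean) ** 2 for x in lengths) / len(lengths))
    return std / mean


def band(ratio: Optional[float]) -> str:
    """Map a ratio to the interpretation bands listed above."""
    if ratio is None:
        return "n/a"
    if ratio < 0.3:
        return "strong AI rhythm signal"
    if ratio < 0.5:
        return "AI-leaning"
    if ratio <= 0.8:
        return "ambiguous"
    return "strong human rhythm signal"


uniform = " ".join(["One two three four."] * 5)
print(band(burstiness(uniform)))  # prints: strong AI rhythm signal
```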

GPTZero, Pangram, and Quillbot all document burstiness as a metric and its limitations; the metric is widely used but increasingly bypassed by prompt engineering. See `sources.md` for citations.

---

## How to use this file during an audit

1. Run the scanner. It produces raw counts of H / M / L tells per axis, plus burstiness, readability metrics, and density signals.
2. Compute both density scores (AI-Slop §1, Comprehension §9).
3. Apply genre adjustments (§3) and audience calibration (§10).
4. Check for compound triggers and the uncanny-valley condition (§7).
5. Check for sanded-prose signature (§6).
6. Identify the model fingerprint if present (§4).
7. Note contested tells in the calibration section of the report (§5).
8. Output **both verdicts** plus the cross-axis recommendation (§11).

The audit report template (`audit-report-template.md`) defines the exact output format, including the dual-verdict header and the Calibration Notes section that surfaces all of the above.

---

## 9. Comprehension density scoring

The comprehension axis uses the same density formula as the AI-Slop axis but counts a different catalog (the patterns in `comprehension.md`).

### The formula

```
comp_density = (compH × 3) + (compM × 1) + (compL × 0.25)          if counts are per 500 words
            = ((compH × 3) + (compM × 1) + (compL × 0.25)) / U    if counts are over the full draft
```

Where:
- `compH` = high-severity comprehension violations (acronym stacking, named-entity bombing, stat bombing, telegraphic colon-labeling, density-without-headings, long sentences, run-on sentences, coined terms used as known, curse of knowledge, buried lede, missing thesis, no topic sentence, first sentence doesn't hook)
- `compM` = medium-severity (wall of text, list-pretending-to-be-prose, definition-by-synonym, mixed audience, forward-reference, missing transitions, hierarchy collapse, no concrete examples, nut-graf missing, no skim layer, old-to-new inversion, parallelism failure, passive voice excess, nominalization, abstract noun stacking, hedge stacking, ambiguous pronoun, dangling modifier)
- `compL` = low-severity (glue-word bloat, prose-pretending-to-be-list, decorative qualifiers, negative construction)

### Verdict thresholds

Same scale as AI-Slop:

| comp_density | Verdict | Action |
|---|---|---|
| 0–2 | PASS | Reader can follow it; polish at most |
| 2–5 | LOW | Spot-fix listed items |
| 5–10 | MEDIUM | Significant cleanup; reader will struggle in places |
| 10–18 | HIGH | Substantial revision; reader will lose the thread |
| 18+ | CRITICAL | Recommend rewrite; cold reader has no chance |

### Compound triggers (escalate one tier)

- 3+ undefined acronyms in any 100-word window → escalate
- 5+ named entities introduced without context in any 100-word window → escalate
- 3+ numeric claims in a single sentence (no comparative anchor) → escalate
- 3+ telegraphic colon-labels in one paragraph → escalate
- Any sentence over 40 words → escalate
- Any paragraph over 150 words with no subheading → escalate
- Combined: any 100-word window with H-density > 5 → escalate
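
The window-based triggers all reduce to one question: how many flagged items fall inside any run of 100 consecutive words? A minimal sketch follows, assuming the scanner can supply the word index of each flagged item (for example, each undefined acronym). The function names are hypothetical, and deciding what counts as "undefined" or "without context" is left out.

```python
def max_in_window(positions: list[int], window: int = 100) -> int:
    """Largest number of flagged items inside any `window` consecutive words.

    `positions` holds the word index of each flagged item.
    """
    ordered = sorted(positions)
    best = 0
    left = 0
    for right, pos in enumerate(ordered):
        # Advance the left edge until both ends fit inside one window.
        while pos - ordered[left] >= window:
            left += 1
        best = max(best, right - left + 1)
    return best


def long_sentences(sentences: list[str], limit: int = 40) -> list[str]:
    """Sentences longer than `limit` words."""
    return [s for s in sentences if len(s.split()) > limit]


acronym_positions = [10, 50, 109, 400]
print(max_in_window(acronym_positions) >= 3)  # prints: True
```

Words 10 through 109 span exactly 100 words, so the first three acronyms share a window and the 3+ acronym trigger fires.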

### Readability metric panel

The scanner also computes 8 readability metrics (Flesch RE, FK Grade, SMOG, Coleman-Liau, Dale-Chall, lexical density, avg sentence length, passive voice %) — see `readability-metrics.md`. They appear in the audit report as a diagnostic panel under the comprehension verdict, but they don't directly feed the verdict score. The catalog patterns drive the score; the metrics calibrate.

When a piece scores PASS on patterns but the metrics show grade 16 / lexical density 68% / avg sentence 35 words, the audit notes the disconnect — typically academic prose where every individual sentence is fine but the cumulative texture is opaque.

### Audience-aware scoring

The verdict is then adjusted by audience (see §10). A grade-12 score for technical docs is fine; for marketing copy it's HIGH.

---

## 10. Audience calibration

The same prose hits different verdicts depending on who's supposed to read it. Audience is the most important calibration input.

### Threshold table by audience

| Audience | Flesch RE target | FK Grade target | Avg sentence | Passive % | Acronyms |
|---|---|---|---|---|---|
| **General web / blog** | 60–70 | 7–9 | 15–18 | <10% | Define all on first use |
| **Marketing copy** | 65–80 | 6–8 | 12–16 | <5% | Avoid; spell out every term |
| **GOV.UK / civic / accessibility (WCAG AAA)** | 70+ | 4–6 | 12–15 | <5% | Spell out always |
| **Healthcare patient-facing** | 70–80 | 6–8 | 12–15 | <5% | Spell out always |
| **Tech blog (developer audience)** | 50–65 | 9–12 | 18–22 | <10% | Define non-obvious only; standard ones (API, JSON, HTML, CSS) OK |
| **Internal technical docs** | 40–55 | 11–14 | 18–25 | <15% | Industry-standard OK |
| **Academic / scientific** | 30–50 | 12–16 | 20–28 | 10–20% | Field-standard OK |

### How to apply

1. **Detect audience.** The scanner infers from cues (citation patterns, code blocks, marketing CTAs, persona pronouns). User can override with `--audience` flag.
2. **Map metrics to targets.** For each readability metric, compute distance from the audience-specific target.
3. **Adjust verdict.** A piece that scores HIGH on the catalog but sits well within the audience's metric band may be eased to MEDIUM. A piece that PASSes the catalog but overshoots the metric band by 50% may be escalated to a worse tier.
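
Step 2 can be sketched as a distance-from-band calculation. Only the FK Grade column is shown; the dictionary keys and function name are hypothetical, and measuring the overshoot as a fraction of the nearer band edge is one assumed reading of "by 50%".

```python
# FK Grade targets copied from the threshold table; keys are hypothetical identifiers.
FK_GRADE_BAND = {
    "general_web": (7, 9),
    "marketing": (6, 8),
    "tech_blog": (9, 12),
    "internal_docs": (11, 14),
    "academic": (12, 16),
}


def band_overshoot(value: float, low: float, high: float) -> float:
    """0.0 inside the band; otherwise distance past the nearer edge, as a fraction of that edge."""
    if value < low:
        return (low - value) / low
    if value > high:
        return (value - high) / high
    return 0.0


grade = 12
print(band_overshoot(grade, *FK_GRADE_BAND["internal_docs"]))  # prints: 0.0
print(band_overshoot(grade, *FK_GRADE_BAND["marketing"]))  # prints: 0.5
```

The same grade-12 draft sits inside the band for internal technical docs and overshoots the marketing band by half, which matches the example in §9.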

### Reader-test simulations (qualitative)

When the scanner can't tell, fall back on these:

- **Smart 12-year-old test (Feynman):** Could a smart 12-year-old or someone outside the field follow this? If you can't explain it simply, you don't truly understand it.
- **Cold-reader test (Pinker):** Show the draft to someone who hasn't been working on it. Ask: *What's the main point? Where did you get confused? What terms did you not know?* This is the prescription for exorcising the curse of knowledge.
- **5-second skim test:** Show the page for 5 seconds. Ask "what did you see?" Tests whether the H1, first sentence, bolded keywords, and TL;DR convey the gist. (55% of web visitors leave within 15 seconds.)
- **Cloze test (empirical):** Delete every 6th word; have the reader fill in the blanks. Higher restoration rate = more comprehensible. >57% exact restoration = mastery.
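
The cloze test is mechanical enough to sketch. The helper names below are hypothetical, and treating a guess as "exact" after stripping case and punctuation is an assumption.

```python
import re


def make_cloze(text: str, n: int = 6, blank: str = "_____") -> tuple[str, list[str]]:
    """Blank out every n-th word; return the gapped text and the removed words."""
    words = text.split()
    answers = []
    for i in range(n - 1, len(words), n):
        answers.append(words[i])
        words[i] = blank
    return " ".join(words), answers


def _normalize(word: str) -> str:
    # Drop punctuation (keeping apostrophes) and ignore case.
    return re.sub(r"[^\w']", "", word).lower()


def restoration_rate(answers: list[str], guesses: list[str]) -> float:
    """Share of blanks the reader restored exactly."""
    if not answers:
        return 0.0
    hits = sum(_normalize(a) == _normalize(g) for a, g in zip(answers, guesses))
    return hits / len(answers)


gapped, answers = make_cloze("a b c d e f g h i j k l m")
print(gapped)  # prints: a b c d e _____ g h i j k _____ m
print(restoration_rate(answers, ["f", "x"]))  # prints: 0.5
```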

### When the audience is unknown

Default to **casual** (general web/blog). It's the strictest practical baseline and produces the most actionable verdict for unspecified contexts.

---

## 11. Cross-axis recommendations

When both verdicts are computed, the audit produces a single combined recommendation based on whichever axis is worse. The matrix:

| AI-Slop | Comprehension | Combined recommendation |
|---|---|---|
| PASS | PASS | Ship it. Polish-pass at most. |
| PASS / LOW | LOW | Spot-fix the comprehension items. Reader will follow with minor friction. |
| PASS / LOW | MEDIUM | Significant comprehension cleanup. Define acronyms, break up paragraphs, add a thesis. |
| PASS / LOW | HIGH / CRITICAL | Comprehension rewrite. The texture is fine but the reader can't follow. |
| MEDIUM | PASS / LOW | Slop spot-fix. Replace `delve`/em-dashes/sycophancy. Reader can follow already. |
| MEDIUM | MEDIUM | Both cleanup. Often the same fixes (telegraphic em-dashes hurt both axes). |
| HIGH / CRITICAL | PASS / LOW | Slop rewrite. Replace AI texture; reader-friendly structure already exists. |
| HIGH / CRITICAL | HIGH / CRITICAL | Full rewrite. Both axes failing = the prose isn't salvageable through editing. |
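
The "whichever axis is worse" rule can be sketched as a comparison on the tier order, which also covers pairs that have no row of their own in the matrix (MEDIUM with HIGH, for instance). The function names are hypothetical, and reporting a tie as "both" is an assumption.

```python
TIER_ORDER = ["PASS", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def combined_tier(ai_slop: str, comprehension: str) -> str:
    """The worse of the two verdicts."""
    return max(ai_slop, comprehension, key=TIER_ORDER.index)


def driving_axis(ai_slop: str, comprehension: str) -> str:
    """Which axis sets the recommendation; a tie is reported as 'both' (assumption)."""
    a = TIER_ORDER.index(ai_slop)
    c = TIER_ORDER.index(comprehension)
    if a == c:
        return "both"
    return "ai_slop" if a > c else "comprehension"


print(combined_tier("PASS", "HIGH"), driving_axis("PASS", "HIGH"))
# prints: HIGH comprehension
```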

### Top-fix combination

The audit's "Top 3 fixes" list pulls from both axes, ordered by impact:

1. The single highest-impact item from whichever axis scored worse
2. The highest-impact item from the other axis
3. The next highest-impact item from whichever axis scored worse

This way the reader gets the most leverage in the smallest read.
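
A small sketch of that interleaving, assuming each axis already supplies its fixes sorted by impact, highest first. The function name and the example fix strings are hypothetical.

```python
def top_three(worse_axis_fixes: list[str], other_axis_fixes: list[str]) -> list[str]:
    """Interleave fixes: worse axis, other axis, worse axis again."""
    picks = []
    picks.extend(worse_axis_fixes[:1])  # 1. top item from the worse axis
    picks.extend(other_axis_fixes[:1])  # 2. top item from the other axis
    picks.extend(worse_axis_fixes[1:2])  # 3. next item from the worse axis
    return picks


comprehension = ["add a thesis", "define acronyms", "split long sentences"]
ai_slop = ["cut sycophancy opener", "thin out em dashes"]
print(top_three(comprehension, ai_slop))
# prints: ['add a thesis', 'cut sycophancy opener', 'define acronyms']
```

Slicing rather than indexing keeps the function safe when an axis has fewer fixes than slots; the list simply comes back shorter.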

### What "passes" means in this dual-axis world

A piece "passes slop-cop" when **both** verdicts are PASS or LOW. A piece can technically pass the AI-Slop axis with HIGH Comprehension and still be unshippable for any audience that isn't already initiated.

This is the gap that drove v2: the tool used to say "PASS" on prose that no fresh reader could follow. Two axes fix the gap.

---

## Sources for comprehension calibration

- [CDC Clear Communication Index](https://www.cdc.gov/ccindex/tool/index.html) — reading-level benchmarks for healthcare
- [GOV.UK style guide](https://www.gov.uk/guidance/style-guide) — civic-content readability targets
- [Microsoft style guide](https://learn.microsoft.com/en-us/style-guide/) — technical-content guidance
- [WCAG 3.1.5 Reading Level (AAA)](https://www.w3.org/WAI/WCAG21/Understanding/reading-level.html) — accessibility threshold
- [Pinker on the curse of knowledge — Harvard](https://news.harvard.edu/gazette/story/2012/11/exorcising-the-curse-of-knowledge/)
- [Cloze test — NN/g](https://www.nngroup.com/articles/cloze-test-reading-comprehension/)
- [F-pattern reading — NN/g](https://www.nngroup.com/articles/f-shaped-pattern-reading-web-content/)

The full bibliography is in `sources.md`.