# slop-cop — Calibration

slop-cop scores prose on **two parallel axes**:

- **AI-Slop** — does this read like AI wrote it? (texture, rhythm, vocabulary tells, formatting)
- **Comprehension** — can a fresh reader follow this? (acronyms, named-entity bombing, telegraphic compression, missing thesis, structure, readability)

Each axis has its own catalog, its own density formula, and its own verdict. A piece can fail one and pass the other:

- Dense academic prose → may PASS AI-Slop (no `delve`, no em-dash clusters) but CRITICAL Comprehension (jargon-bombed, no thesis)
- Sycophantic ChatGPT marketing → CRITICAL AI-Slop, MEDIUM Comprehension
- Hand-written cover letter → PASS both
- Twitter-thread summary written by a human in a hurry → LOW AI-Slop, HIGH Comprehension (telegraphic, named-entity bombing)

The audit reports both verdicts. The combined recommendation is driven by whichever is worse.

---

## Table of contents

### AI-Slop axis (sections 1–8)

1. [AI-Slop density scoring](#1-density-based-scoring)
2. [Severity tiers (AI-Slop)](#2-severity-tiers-explicit)
3. [Genre adjustments](#3-genre-adjustments)
4. [Model fingerprints](#4-model-fingerprints)
5. [Contested tells](#5-contested-tells)
6. [The sanding-off problem](#6-the-sanding-off-problem)
7. [The uncanny-valley rule](#7-the-uncanny-valley-rule)
8. [Burstiness approximation](#8-burstiness-approximation)

### Comprehension axis (sections 9–11)

9. [Comprehension density scoring](#9-comprehension-density-scoring)
10. [Audience calibration](#10-audience-calibration)
11. [Cross-axis recommendations](#11-cross-axis-recommendations)

---

## 1. Density-based scoring

A single tell is not a signal. Real writers use individual tells all the time. The signal is **how many show up per 500 words**, weighted by severity.

### The formula

For a draft of N words, normalize to 500-word units (`U = N / 500`).
Count violations by severity:

- `H` = high-severity tells (always-cut items)
- `M` = medium-severity tells
- `L` = low-severity tells (informational only)

Compute the **density score**:

```
density = (H × 3) + (M × 1) + (L × 0.25)          per 500 words
        = ((H × 3) + (M × 1) + (L × 0.25)) / U    for the full draft
```

### Verdict thresholds

| Density score | Verdict | Action |
|---|---|---|
| 0–2 | PASS | Polish-pass at most |
| 2–5 | LOW | Spot-fix the listed items |
| 5–10 | MEDIUM | Spot-fix sufficient; significant cleanup needed |
| 10–18 | HIGH | Substantial revision required |
| 18+ | CRITICAL | Recommend rewrite from scratch |

### Compound triggers (escalate one tier)

- A high-severity rhetorical pattern + 5+ vocabulary hits + 1+ formatting tell within the same 500 words → escalate one tier
- Three or more H-severity tells in a single paragraph → escalate one tier
- The "uncanny valley" condition (see §7) → escalate one tier even when no individual tell is high-severity

### What density does and doesn't tell you

It tells you whether the prose reads as AI-shaped. It does not tell you whether AI wrote it — humans imitate AI, and AI imitates humans. Treat the verdict as "this prose has the shape of AI writing," not "this prose was generated by AI."

---

## 2. Severity tiers explicit

The patterns, vocabulary, and formatting-tells files all tag every item H/M/L. The definitions:

### High (H) — always cut

The phrase or pattern is essentially never the right choice. Even one instance in casual prose lowers the verdict tier.
Examples:

- Em dashes in clusters (3+ per 500 words)
- Bold-first bullets in any short prose piece
- Sycophancy openers/closers ("Great question!", "I hope this helps!")
- "Delve" / "tapestry" / "showcasing"
- Grandiose framing ("stands as a testament to")
- Copula avoidance ("serves as", "boasts")
- Knowledge-cutoff disclaimer leakage ("As of my last update...")
- Vague-authority weasels with no citation

Cut without exception unless the phrase is being used in scare quotes or ironically.

### Medium (M) — usually cut

The phrase or pattern survives in narrow contexts. Default is to cut; keep only if the word/structure is doing specific work that nothing else can.

Examples:

- "While X, Y" sentence opener (one is fine; three is a pattern)
- "actually" (survives only contrasting concrete reality with theory)
- Hedged superlatives (sometimes warranted in genuinely uncertain claims)
- Symmetrical sentence pairs (one is rhetoric; three is a tic)
- Two-word punchlines (once per piece is forgivable)

In an audit, M-severity items are listed with the question: is this doing specific work? If not, cut.

### Low (L) — context-dependent

Weak tell on its own. Note in the audit report but don't down-score the verdict.

Examples:

- Em dashes alone (1–2 in a long piece, post-GPT-5.1)
- Absent contractions (formal register may justify it)
- Universal Oxford comma + American spelling
- "Actually" used to contrast theory and reality

L items inform the diagnosis ("this prose has these formal-register tells") but do not tip the verdict.

---

## 3. Genre adjustments

The same tell carries different weight depending on genre. The audit infers genre from the draft (or accepts a `--genre` flag for the scanner) and adjusts thresholds.

### Casual / first-person / blog / Reddit / email

Default thresholds. Every tell weighted at full. This is the strictest mode and the most common case — users invoking the skill on their own writing usually want this.
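
In this default mode the §1 formula and verdict table apply unmodified. The following is a minimal sketch of that baseline, not the scanner's actual code. The function names (`density`, `verdict`, `escalate`) are hypothetical, and treating each range's upper bound as exclusive is an assumption, since the table's ranges share their endpoints.

```python
def density(h: int, m: int, l: int, n_words: int) -> float:
    """Weighted tell count per 500 words, following the formula in section 1."""
    units = n_words / 500  # U = N / 500
    return (h * 3 + m * 1 + l * 0.25) / units


# Upper bound of each verdict range. The table lists 0-2, 2-5, 5-10, 10-18;
# making each upper bound exclusive is an assumption of this sketch.
THRESHOLDS = [(2, "PASS"), (5, "LOW"), (10, "MEDIUM"), (18, "HIGH")]
TIERS = ["PASS", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def verdict(score: float) -> str:
    """Map a density score to its verdict tier."""
    for upper, name in THRESHOLDS:
        if score < upper:
            return name
    return "CRITICAL"


def escalate(name: str) -> str:
    """Move one tier toward CRITICAL (compound triggers); CRITICAL stays put."""
    return TIERS[min(TIERS.index(name) + 1, len(TIERS) - 1)]


# Example: a 1,000-word draft with 2 H, 6 M, and 4 L tells.
score = density(2, 6, 4, 1000)        # (6 + 6 + 1) / 2
print(score, verdict(score))           # 6.5 MEDIUM
print(escalate(verdict(score)))        # HIGH
```

Genre adjustments would then change the counts or thresholds fed into these functions rather than the formula itself.
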
### Marketing / sales / landing copy

Marketing copy legitimately uses some intensifiers ("transformative", "groundbreaking") and some structure (TL;DRs, bulleted benefits). Adjust:

- Reduce buzzword-density penalty by 30%
- "Comprehensive", "robust", "seamless" allowed at 1 instance per 500 words before flagging
- Sycophancy still always-cut (no genre justifies "I hope this helps!")
- Performative openers still always-cut

### Academic / research / formal

Academic prose legitimately hedges ("studies show" with citations is fine), uses some passive constructions, and follows section conventions ("Methods", "Results"). Adjust:

- Vague-authority phrases: only flag when uncited. "Studies show [Smith 2024]" is not a tell.
- Hedge stacking: only flag when the hedging exceeds the genuine uncertainty (research catalog notes that calibrated hedging is fine; saturation hedging is the tell)
- "Comprehensive review", "novel approach" allowed in title position
- "Challenges and Future Directions" section is normal here — only flag in non-academic prose

### Encyclopedic / reference / Wikipedia-style

LLMs were trained heavily on Wikipedia. Encyclopedic prose triggers false positives across all detectors (GPTZero documents this). Adjust:

- Reduce all severity tiers by one for the duration of encyclopedic passages
- Copula avoidance ("serves as") still flagged — Wikipedia editors actually use "is" most of the time
- Synonym cycling more tolerated
- Burstiness threshold relaxed (encyclopedic prose runs uniform)

### Fiction / dialogue / character voice

Voice-aware judgment. The character's voice may legitimately use any of these patterns. Apply the rules to *narration* but not *dialogue or interior monologue*. The "tells" rule is a bias toward the AI model's house voice; a strong character voice can override it.

If unsure of genre, default to **casual** (strictest). Users can override.

---

## 4. Model fingerprints

When density indicates AI shape, identify the likely model.
This serves diagnosis ("this looks like Gemini, not Claude — adjust your prompts") and prompt engineering.

### GPT-4 / 4o / 5

- **Verb signature:** delve, underscore, navigate, leverage, harness, showcase
- **Adjective signature:** noteworthy, commendable, intricate, meticulous, comprehensive
- **Power words:** supercharge, unleash, dive in, game-changing
- **Trigrams:** "individuals with diabetes", "characterized by elevated", "ranging from", "play a significant role"
- **Format signature:** heavy bullets and headers, em dashes (pre-5.1; opt-out exists since Nov 2025)
- **Register:** formal/clinical; reads like a slick consultancy deck

### Claude

- **Verb signature:** examine, consider, distinguish, illuminate (lighter touch than GPT)
- **Adjective signature:** meaningful, careful, specific, worth examining
- **Trigrams:** "the distinction is worth", "meaningfully reduces", "I notice that", "it's worth examining"
- **Format signature:** clean paragraphs over heavy formatting; less bullet-heavy than GPT
- **Sycophancy style:** softer — "I notice…", "I should be careful here…", "it's worth examining…" rather than "Great question!"
- **Register:** academic-but-approachable

### Gemini

- **Verb signature:** explore, navigate, understand (tutorial verbs)
- **Adjective signature:** simpler vocabulary than GPT (uses "high blood sugar" where GPT uses "elevated blood glucose levels")
- **Trigrams:** "the way for", "the cascade of", "is not a", "in the world of"
- **Format signature:** verbose; over-explains; longer paragraphs than necessary
- **Register:** "Google search result that learned to write paragraphs"

### Reporting

The scanner uses these clusters as a heuristic. The audit report includes a "Likely model fingerprint" line: none / GPT / Claude / Gemini, with 2-3 specific markers as evidence. When two clusters are equally likely (mixed-model edits, or human polish on top of AI output), report "mixed" rather than picking.
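
The reporting rule above (none / one model / mixed) can be sketched as a simple marker count. This is an illustrative sketch only, not the scanner's implementation: the `fingerprint` function, the `MARKERS` subset, and the `min_hits=2` evidence threshold are all assumptions, and the matching is exact-phrase only, so inflected forms such as "delving" are not counted.

```python
import re

# A small subset of the markers listed above; the real catalogs are longer.
MARKERS = {
    "GPT": ["delve", "underscore", "leverage", "harness", "showcase",
            "play a significant role"],
    "Claude": ["illuminate", "the distinction is worth",
               "meaningfully reduces", "it's worth examining"],
    "Gemini": ["the cascade of", "in the world of", "the way for"],
}


def fingerprint(text: str, min_hits: int = 2) -> str:
    """Return 'none', 'mixed', or the model whose markers appear most often."""
    lowered = text.lower()
    hits = {
        model: sum(
            len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
            for phrase in phrases
        )
        for model, phrases in MARKERS.items()
    }
    best = max(hits.values())
    if best < min_hits:  # too little evidence (threshold is an assumption)
        return "none"
    leaders = [model for model, count in hits.items() if count == best]
    # A tie between clusters is reported as "mixed" rather than picking one.
    return leaders[0] if len(leaders) == 1 else "mixed"


print(fingerprint("We delve into the data and leverage it to showcase results."))  # GPT
```
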
Per Scientific American (cross-stylometry across thousands of outputs): all three models cluster tightly in stylometric space, while humans spread broadly. So the fingerprint signal is strong when present — but only when the prose is unedited or lightly edited.

---

## 5. Contested tells

Some tells are contested in the research. The audit acknowledges contestation rather than pretending unanimity.

### Em dashes

The most-cited AI tell of 2024-2025. Rolling Stone, TechRadar, NYT all covered it. But OpenAI added an em-dash opt-out in GPT-5.1 (Nov 2025), and many human writers (Cory Doctorow, Cormac McCarthy estate, half of literary fiction) use them constantly.

**This skill's default:** em dashes in clusters (3+ per 500 words) = H severity. Em dashes alone (1–2 in a long piece) = L severity. Single em dashes are noted but don't down-score.

**User override:** if the user is Mahmoud or any writer with an explicit no-em-dash voice rule, treat ALL em dashes as H. Pass `--strict-em-dash` to the scanner.

### "Actually" and decorative adverbs

Most decorative adverbs ("genuinely", "truly", "honestly", "frankly", "ultimately") are H. But "actually" survives when contrasting concrete reality with theory ("the model actually works under load"). The rule: if removing the adverb leaves the sentence unchanged or stronger, it was filler. The scanner flags every instance; the audit uses judgment.

### Em-dashed asides vs comma asides

When a writer has been told "no em dashes" and they convert em dashes to comma asides, the prose can read awkwardly punctuated. The audit notes when comma-aside density spikes — sometimes that's a signal of em-dash conversion rather than natural rhythm.

### Tricolons in formal prose

Three-beat structures ("life, liberty, and the pursuit of happiness") are a literary tradition. They survive in speeches, formal essays, and explicitly rhetorical contexts.
The pattern flag is for *unintended* tricolon abuse — three adjectives strung together because the model defaulted to it.

### "From X to Y" ranges

Sometimes a real range. Sometimes false. The audit asks: are X and Y genuinely the endpoints of a spectrum, or just two illustrative examples? If the latter, the construction is a tell.

When a tell is contested, the audit notes the contestation in the calibration section of the report.

---

## 6. The sanding-off problem

Sophisticated authors prompt-engineer around famous tells. After "delve" went viral in early 2024, arXiv frequency dropped sharply within months. The flagship vocabulary list is now less reliable than it was.

### Implication for the scanner

Newer / less-famous tells are weighted higher than the v1 vocabulary list. Specifically:

- **Boost by 1.5x:** copula avoidance ("serves as", "boasts"), present-participle "-ing" tails, anaphora abuse, false ranges, hedge stacking, "while X, Y" openers
- **Standard weight:** vocabulary tells from category 2A (verbs), 2B (metaphors), 2C (intensifiers) — the famous list
- **Standard weight, but flagged when present:** sycophancy openers/closers (RLHF artifact, hard to sand off)

### The rough heuristic

If a draft is *clean* on category 2A famous-vocabulary tells but *dirty* on category B sentence-level tells, that's a strong signal of sanded prose: the writer (or the prompt) removed the easy vocabulary tells but didn't catch the structural ones. The audit report flags this as "sanded-prose signature" when present.

Conversely, a draft heavy on famous vocabulary but clean on structural tells is more likely human imitation of AI than actual AI output.

---

## 7. The uncanny-valley rule

Multiple weak tells stacking causes "subliminal discomfort" — readers feel something is off before identifying why. Many sources describe this effect (LitHub, Pangram, The Ignorance Field Guide).

### The trigger

When all three of the following are true:

1. Zero high-severity tells
2. Eight or more medium-or-low-severity tells per 500 words
3. Burstiness ratio below 0.5 (sentence-length variance too uniform)

…escalate the verdict by one tier even though no individual violation is severe.

The diagnosis line in the audit reads: "Uncanny-valley pattern — multiple weak tells stacking. No single phrase reads as AI; the cumulative texture does."

This catches sanded prose (see §6) and well-prompted output where the writer removed the famous tells but didn't fix the underlying rhythm.

---

## 8. Burstiness approximation

Burstiness measures sentence-level variance — variation in length and structure. Humans cluster around 0.6-1.2 (standard deviation of sentence length / mean sentence length). LLMs cluster 0.2-0.4.

The scanner reports the burstiness ratio. The audit uses it as a hint, not a verdict — once a human edits AI output, burstiness rises and the signal weakens.

### How the scanner computes it

```
from math import sqrt

# sentences: the draft split into sentences; word_count: words in one sentence
sentence_lengths = [word_count(s) for s in sentences]
mean = sum(sentence_lengths) / len(sentence_lengths)
# population standard deviation (divides by n, not n - 1)
std = sqrt(sum((x - mean)**2 for x in sentence_lengths) / len(sentence_lengths))
burstiness = std / mean
```

### Interpretation

- Below 0.3: strong AI rhythm signal
- 0.3 to 0.5: AI-leaning, but light editing can push prose into this range
- 0.5 to 0.8: ambiguous; light human polish on AI output, or naturally rhythmic AI prompting
- Above 0.8: strong human rhythm signal

### Caveats

- Encyclopedic / reference prose runs uniform regardless of source. Burstiness is unreliable in that genre.
- Very short drafts (<100 words) don't have enough sentences to compute burstiness reliably. The scanner reports `n/a` below 5 sentences.
- Burstiness alone is never the verdict. It's one signal among many.

GPTZero, Pangram, and Quillbot all document burstiness as a metric and its limitations; the metric is widely used but increasingly bypassed by prompt engineering. See `sources.md` for citations.

---

## How to use this file during an audit

1. Run the scanner.
It produces raw counts of H / M / L tells per axis, plus burstiness, readability metrics, and density signals.
2. Compute both density scores (AI-Slop §1, Comprehension §9).
3. Apply genre adjustments (§3) and audience calibration (§10).
4. Check for compound triggers and the uncanny-valley condition (§7).
5. Check for sanded-prose signature (§6).
6. Identify the model fingerprint if present (§4).
7. Note contested tells in the calibration section of the report (§5).
8. Output **both verdicts** plus the cross-axis recommendation (§11).

The audit report template (`audit-report-template.md`) defines the exact output format, including the dual-verdict header and the Calibration Notes section that surfaces all of the above.

---

## 9. Comprehension density scoring

The comprehension axis uses the same density formula as the AI-Slop axis but counts a different catalog (the patterns in `comprehension.md`).

### The formula

```
comp_density = (compH × 3) + (compM × 1) + (compL × 0.25)          per 500 words
             = ((compH × 3) + (compM × 1) + (compL × 0.25)) / U    for the full draft
```

Where:

- `compH` = high-severity comprehension violations (acronym stacking, named-entity bombing, stat bombing, telegraphic colon-labeling, density-without-headings, long sentences, run-on sentences, coined terms used as known, curse of knowledge, buried lede, missing thesis, no topic sentence, first sentence doesn't hook)
- `compM` = medium-severity (wall of text, list-pretending-to-be-prose, definition-by-synonym, mixed audience, forward-reference, missing transitions, hierarchy collapse, no concrete examples, nut-graf missing, no skim layer, old-to-new inversion, parallelism failure, passive voice excess, nominalization, abstract noun stacking, hedge stacking, ambiguous pronoun, dangling modifier)
- `compL` = low-severity (glue-word bloat, prose-pretending-to-be-list, decorative qualifiers, negative construction)

### Verdict thresholds

Same scale as AI-Slop:

| comp_density | Verdict | Action |
|---|---|---|
| 0–2 | PASS | Reader can follow it; polish at most |
| 2–5 | LOW | Spot-fix listed items |
| 5–10 | MEDIUM | Significant cleanup; reader will struggle in places |
| 10–18 | HIGH | Substantial revision; reader will lose the thread |
| 18+ | CRITICAL | Recommend rewrite; cold reader has no chance |

### Compound triggers (escalate one tier)

- 3+ undefined acronyms in any 100-word window → escalate
- 5+ named entities introduced without context in any 100-word window → escalate
- 3+ numeric claims in a single sentence (no comparative anchor) → escalate
- 3+ telegraphic colon-labels in one paragraph → escalate
- Any sentence over 40 words → escalate
- Any paragraph over 150 words with no subheading → escalate
- Combined: any 100-word window with H-density > 5 → escalate

### Readability metric panel

The scanner also computes 8 readability metrics (Flesch RE, FK Grade, SMOG, Coleman-Liau, Dale-Chall, lexical density, avg sentence length, passive voice %) — see `readability-metrics.md`. They appear in the audit report as a diagnostic panel under the comprehension verdict, but they don't directly feed the verdict score. The catalog patterns drive the score; the metrics calibrate.

When a piece scores PASS on patterns but the metrics show grade 16 / lexical density 68% / avg sentence 35 words, the audit notes the disconnect — typically academic prose where every individual sentence is fine but the cumulative texture is opaque.

### Audience-aware scoring

The verdict is then adjusted by audience (see §10). A grade-12 score for technical docs is fine; for marketing copy it's HIGH.

---

## 10. Audience calibration

The same prose hits different verdicts depending on who's supposed to read it. Audience is the most important calibration input.
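
To make the audience dependence concrete, here is a minimal sketch that checks one FK Grade value against two of the target bands from the threshold table in this section. The audience keys (`"marketing"`, `"internal-docs"`), the `band_overshoot` helper, and the choice to measure a miss as a fraction of the nearest band edge are assumptions for illustration; the document does not specify how the scanner measures distance from a target.

```python
# FK Grade target bands copied from the threshold table in this section.
# The dictionary keys are hypothetical labels, not scanner flag values.
FK_GRADE_BANDS = {
    "marketing": (6, 8),
    "internal-docs": (11, 14),
}


def band_overshoot(value: float, low: float, high: float) -> float:
    """0.0 inside the band; otherwise the miss as a fraction of the nearest edge."""
    if value < low:
        return (low - value) / low
    if value > high:
        return (value - high) / high
    return 0.0


grade = 12  # the same draft, scored once
for audience, (low, high) in FK_GRADE_BANDS.items():
    print(audience, band_overshoot(grade, low, high))
# marketing 0.5
# internal-docs 0.0
```

Under these assumptions the same grade-12 draft sits inside the internal-docs band and misses the marketing band by half.
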
### Threshold table by audience

| Audience | Flesch RE target | FK Grade target | Avg sentence | Passive % | Acronyms |
|---|---|---|---|---|---|
| **General web / blog** | 60–70 | 7–9 | 15–18 | <10% | Define all on first use |
| **Marketing copy** | 65–80 | 6–8 | 12–16 | <5% | Avoid; spell out every term |
| **GOV.UK / civic / accessibility (WCAG AAA)** | 70+ | 4–6 | 12–15 | <5% | Spell out always |
| **Healthcare patient-facing** | 70–80 | 6–8 | 12–15 | <5% | Spell out always |
| **Tech blog (developer audience)** | 50–65 | 9–12 | 18–22 | <10% | Define non-obvious only; standard ones (API, JSON, HTML, CSS) OK |
| **Internal technical docs** | 40–55 | 11–14 | 18–25 | <15% | Industry-standard OK |
| **Academic / scientific** | 30–50 | 12–16 | 20–28 | 10–20% | Field-standard OK |

### How to apply

1. **Detect audience.** The scanner infers from cues (citation patterns, code blocks, marketing CTAs, persona pronouns). User can override with `--audience` flag.
2. **Map metrics to targets.** For each readability metric, compute distance from the audience-specific target.
3. **Adjust verdict.** A piece that scores HIGH on the catalog but well within the audience's metric band may be downgraded to MEDIUM. A piece that PASSes the catalog but blows the metric band by 50% may be escalated to a worse verdict.

### Reader-test simulations (qualitative)

When the scanner can't tell, fall back on these:

- **Smart 12-year-old test (Feynman):** Could a smart 12-year-old or someone outside the field follow this? If you can't explain it simply, you don't truly understand it.
- **Cold-reader test (Pinker):** Show the draft to someone who hasn't been working on it. Ask: *What's the main point? Where did you get confused? What terms did you not know?* This is the prescription for exorcising the curse of knowledge.
- **5-second skim test:** Show the page for 5 seconds. Ask "what did you see?" Tests whether the H1, first sentence, bolded keywords, and TL;DR convey the gist.
(55% of web visitors leave within 15 seconds.)
- **Cloze test (empirical):** Delete every 6th word; have the reader fill in the blanks. Higher restoration rate = more comprehensible. >57% exact restoration = mastery.

### When the audience is unknown

Default to **casual** (general web/blog). It's the strictest practical baseline and produces the most actionable verdict for unspecified contexts.

---

## 11. Cross-axis recommendations

When both verdicts are computed, the audit produces a single combined recommendation based on whichever axis is worse. The matrix:

| AI-Slop | Comprehension | Combined recommendation |
|---|---|---|
| PASS | PASS | Ship it. Polish-pass at most. |
| PASS / LOW | LOW | Spot-fix the comprehension items. Reader will follow with minor friction. |
| PASS / LOW | MEDIUM | Significant comprehension cleanup. Define acronyms, break up paragraphs, add a thesis. |
| PASS / LOW | HIGH / CRITICAL | Comprehension rewrite. The texture is fine but the reader can't follow. |
| MEDIUM | PASS / LOW | Slop spot-fix. Replace `delve`/em-dashes/sycophancy. Reader can follow already. |
| MEDIUM | MEDIUM | Both cleanup. Often the same fixes (telegraphic em-dashes hurt both axes). |
| HIGH / CRITICAL | PASS / LOW | Slop rewrite. Replace AI texture; reader-friendly structure already exists. |
| HIGH / CRITICAL | HIGH / CRITICAL | Full rewrite. Both axes failing = the prose isn't salvageable through editing. |

### Top-fix combination

The audit's "Top 3 fixes" list pulls from both axes, ordered by impact:

1. The single highest-impact item from whichever axis scored worse
2. The highest-impact item from the other axis
3. The next highest-impact item from whichever axis scored worse

This way the reader gets the most leverage in the smallest read.

### What "passes" means in this dual-axis world

A piece "passes slop-cop" when **both** verdicts are PASS or LOW.
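
The pass rule and the "whichever axis is worse" rule can be expressed in a few lines. This is a sketch under the assumption that the five verdicts are strictly ordered from PASS to CRITICAL; the names `worse` and `passes` are hypothetical, not part of the scanner.

```python
TIERS = ["PASS", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def worse(slop: str, comp: str) -> str:
    """The verdict that drives the combined recommendation."""
    return max(slop, comp, key=TIERS.index)


def passes(slop: str, comp: str) -> bool:
    """True only when both axes are PASS or LOW."""
    return worse(slop, comp) in ("PASS", "LOW")


print(passes("PASS", "LOW"))    # True
print(passes("PASS", "HIGH"))   # False: clean texture, but the reader can't follow
print(worse("MEDIUM", "LOW"))   # MEDIUM
```
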
A piece can technically pass the AI-Slop axis with HIGH Comprehension and still be unshippable for any audience that isn't already initiated. This is the gap that drove v2: the tool used to say "PASS" on prose that no fresh reader could follow. Two axes fix the gap.

---

## Sources for comprehension calibration

- [CDC Clear Communication Index](https://www.cdc.gov/ccindex/tool/index.html) — reading-level benchmarks for healthcare
- [GOV.UK style guide](https://www.gov.uk/guidance/style-guide) — civic-content readability targets
- [Microsoft style guide](https://learn.microsoft.com/en-us/style-guide/) — technical-content guidance
- [WCAG 3.1.5 Reading Level (AAA)](https://www.w3.org/WAI/WCAG21/Understanding/reading-level.html) — accessibility threshold
- [Pinker on the curse of knowledge — Harvard](https://news.harvard.edu/gazette/story/2012/11/exorcising-the-curse-of-knowledge/)
- [Cloze test — NN/g](https://www.nngroup.com/articles/cloze-test-reading-comprehension/)
- [F-pattern reading — NN/g](https://www.nngroup.com/articles/f-shaped-pattern-reading-web-content/)

The full bibliography is in `sources.md`.