---
name: manuscript-provenance
description: >
  Computational provenance audit verifying that every number, table, figure,
  ordering, and terminology choice in a manuscript is derived from code and
  scripts — not manually entered. Cross-references LaTeX source against the
  codebase to detect hardcoded values, stale outputs, broken pipelines, and
  manual data entry. Companion to manuscript-review: that skill audits the
  document as prose; this skill audits whether the document is faithfully
  generated from code. Use when the user says "check provenance", "verify
  reproducibility", "audit my pipeline", "are my numbers from code", "check
  manuscript against scripts", "provenance audit", or any request to verify
  that manuscript content traces back to computational outputs.
metadata:
  version: 1.0.0
---

# Manuscript Provenance Audit

## Purpose

Verify that a manuscript is a faithful rendering of computational outputs. Every number, table, figure, category label, ordering, and threshold in the document must trace to a specific script, config file, or pipeline output. Manual data entry in a manuscript is a reproducibility defect. This skill produces a provenance map — a structured report linking each manuscript artifact to its generating code — and flags every break in the chain.

Companion skill: `manuscript-review` audits the document as prose (structure, argumentation, citations). This skill audits whether the document content is computationally grounded. Run both for complete pre-publication coverage.

## Boundary Agreement with manuscript-review

| Concern                   | manuscript-review                                           | This skill (manuscript-provenance)                             |
| ------------------------- | ----------------------------------------------------------- | -------------------------------------------------------------- |
| Reproducibility           | Does the paper describe enough to reproduce? (§6)           | Does the code actually produce what the paper claims? (§1, §7) |
| Figures/Tables            | Legible, accessible, well-formatted? (§12)                  | Generated by scripts, not manual entry? (§2, §3)               |
| Rendered visuals          | Readable at print scale? Floats near references? (§23)      | Figure generation script produces correct format? (§3)         |
| Hyperparameters           | Listed in the paper with rationale? (§6)                    | Values trace to config files, not hardcoded? (§1, §8)          |
| Code availability         | Statement exists in the paper? (§17)                        | Repo URL valid, README accurate, pipeline works? (§11)         |
| Terminology               | Abbreviations consistent within document? (§14)             | Terms match code identifiers? (§5)                             |
| Significant figures       | Consistent precision within document? (§12)                 | Precision matches script output? (§2)                          |
| Figure format             | Appropriate format for document quality? (§12)              | Format generated by script, not manually exported? (§3)        |
| Computational cost        | Reported in the paper? (§7)                                 | Values trace to benchmarking scripts? (§1)                     |
| Macro-prose coherence     | Prose framing appropriate for injected value? (§24)         | Value traced to code, macro manifest produced? (§4)            |
| Cross-element consistency | Prose, captions, figures, tables mutually consistent? (§24) | All elements from same run/pipeline output? (§9)               |

**Rule:** This skill never judges prose quality. manuscript-review never opens the codebase. Each reads the other's report when available.

**Integration point — Macro Manifest:** This skill produces a **macro manifest** as part of the §4 audit: a structured list of every macro-injected value with:

- Macro name (e.g., `\bestf`)
- Resolved value (e.g., `0.847`)
- Source (script + output file that generates it)
- Location(s) in manuscript text (file, line number, surrounding sentence)
- Classification (TRACED / MACRO-TRACED / CONFIG-TRACED / UNTRACED / STALE)

manuscript-review's Pass 13 (Cross-Element Coherence, §24) consumes this manifest to check whether the prose surrounding each injected value is appropriate for the actual numeric value. Provenance owns "is this value computationally grounded?"
Review owns "does the text wrapping this value make sense given what the value is?"

## Scope

**In scope:**

- Numbers, metrics, percentages in manuscript text
- Tables (content, ordering, formatting)
- Figures (generation scripts, data sources)
- LaTeX macros (`\newcommand`, `\def`, `\pgfmathsetmacro`)
- Terminology, mode names, mechanism labels, category names
- Ordering of items in enumerations, tables, discussion
- Config values (thresholds, hyperparameters, model names)
- Pipeline completeness (raw data → final PDF)
- Timestamp consistency (scripts vs outputs)

**Out of scope:**

- Prose quality (→ manuscript-review)
- Citation hygiene (→ manuscript-review)
- Argumentation structure (→ manuscript-review)
- Code quality/style (separate concern)

## Inputs

This audit requires TWO artifacts:

1. **Manuscript source** — LaTeX `.tex` files (preferred), or PDF/DOCX as fallback
2. **Codebase** — the scripts, configs, and pipeline that generate manuscript content

If the user provides only one, ask for the other. LaTeX source is strongly preferred over compiled PDF — provenance auditing requires seeing the raw markup, macros, and input commands.

## Workflow

### Phase 1 — Inventory

**1a. Manuscript Artifact Extraction**

Read all `.tex` files (main + included via `\input`/`\include`).
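Resolving the include graph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not this skill's prescribed implementation: the function name `collect_tex_files` and its comment-stripping behavior are hypothetical, and a real audit would also handle `\subfile` and search paths.

```python
import re
from pathlib import Path

# Matches the argument of \input{...} and \include{...}
INPUT_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")

def collect_tex_files(root, seen=None):
    """Recursively follow \\input/\\include to enumerate every .tex source."""
    seen = set() if seen is None else seen
    root = Path(root)
    if root.suffix == "":
        root = root.with_suffix(".tex")  # LaTeX allows extensionless arguments
    if root in seen or not root.exists():
        return seen
    seen.add(root)
    text = root.read_text(encoding="utf-8", errors="replace")
    # Drop LaTeX comments so commented-out \input lines are not followed
    text = re.sub(r"(?<!\\)%.*", "", text)
    for target in INPUT_RE.findall(text):
        collect_tex_files(root.parent / target, seen)
    return seen
```

The returned set is the file list Phase 1a iterates over; de-duplication via `seen` also guards against accidental include cycles.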
Extract:

- **Inline values**: bare numbers in running text (percentages, counts, metrics, p-values, confidence intervals, thresholds, sizes)
- **LaTeX macros**: all `\newcommand`, `\def`, `\pgfmathsetmacro`, and custom command definitions that carry data values
- **Tables**: full content of every `tabular`/`table` environment — cell values, row/column ordering, headers
- **Figures**: `\includegraphics` paths, caption content, referenced data
- **Input files**: any `\input{generated/*.tex}` patterns that pull from script-generated LaTeX fragments
- **Labels and references**: `\label`/`\ref` pairs for cross-referencing
- **Terminology**: named modes, mechanisms, strategies, categories, method names used in prose
- **Ordered lists**: any enumerated or ranked items (methods compared, features listed, results ordered)

Build an **artifact registry** — a flat list of every data-carrying element in the manuscript with its location (file, line number).

**1b. Codebase Mapping**

Scan the project directory. Identify:

- **Pipeline entry points**: `Makefile`, `snakemake`, `dvc.yaml`, `run.sh`, `main.py`, or equivalent orchestration
- **Analysis scripts**: files that produce numbers, tables, figures
- **Config files**: `config.toml`, `config.yaml`, `.env`, `params.yaml`, hyperparameter files
- **Output directories**: where scripts write results (`results/`, `output/`, `figures/`, `tables/`, `generated/`)
- **Generated LaTeX fragments**: `.tex` files in output directories that scripts produce for `\input` inclusion
- **Data files**: CSVs, JSON, HDF5, pickles that intermediate results flow through

Build a **source registry** — a flat list of every code artifact that produces or configures manuscript content.

### Phase 2 — Provenance Tracing

For each entry in the artifact registry, attempt to establish a **provenance chain**: manuscript value → generated output → script → input data/config.

**2a. Value Provenance**

For every number in the manuscript:

1. Search for the value in script outputs (logs, result files, generated LaTeX)
2. Trace the output back to the script that produces it
3. Verify the script reads from data/config (not hardcoded)
4. Record the full chain or flag as **UNTRACED**

Classification:

- **TRACED** — full chain from manuscript value to generating code
- **MACRO-TRACED** — value defined in a LaTeX macro that is generated by a script
- **CONFIG-TRACED** — value comes from a config file read by scripts
- **UNTRACED** — no provenance chain found; manually entered
- **STALE** — provenance chain exists but output is older than generating script

**2b. Table Provenance**

For each table:

1. Is the table content generated by a script (CSV → LaTeX, or direct LaTeX generation)?
2. Is the row/column ordering determined by code (sorted by metric, alphabetical, grouped by category) or manually arranged?
3. Do header labels match code-defined names?
4. Are formatting choices (bold for best, significant figures) applied by code?

Classification:

- **GENERATED** — entire table produced by script
- **PARTIAL** — some cells generated, some manual
- **MANUAL** — no generation script found
- **ORDER-MANUAL** — content generated but ordering is manually set

**2c. Figure Provenance**

For each figure:

1. Does a script produce the exact file referenced by `\includegraphics`?
2. Does the script use a deterministic seed for reproducibility?
3. Is the figure output path in the script consistent with the LaTeX reference?
4. Are figure parameters (colors, labels, axis ranges) set in code or manually edited post-generation?

Classification:

- **GENERATED** — script produces the exact file
- **POST-EDITED** — script generates base figure, but manual edits detected (e.g., Illustrator metadata, different checksum than script output)
- **MANUAL** — no generating script found
- **STALE** — generating script modified after figure file

**2d. Terminology Provenance**

For each named mode, mechanism, category, or method label:

1. Is the term defined in code (enum, constant, config key, class name)?
2. Does the manuscript term match the code term exactly?
3. If the manuscript uses a display-friendly name, is there an explicit mapping in code or config?

Classification:

- **CODE-DEFINED** — term matches code definition
- **MAPPED** — explicit code→display mapping exists
- **UNMAPPED** — term appears in manuscript but not in code
- **INCONSISTENT** — term appears in both but differs (e.g., code says `greedy_search`, manuscript says "Greedy Search" in some places and "greedy approach" in others)

**2e. Ordering Provenance**

For each ordered list, ranked comparison, or sequenced enumeration:

1. Does code determine the ordering (sort by metric, alphabetical, enum order)?
2. Does the manuscript ordering match the code-determined order?
3. Are there items in the manuscript list not present in code output, or vice versa?

Classification:

- **CODE-ORDERED** — ordering matches code output
- **MANUAL-ORDER** — ordering differs from code output or no ordering logic in code
- **SUBSET-MISMATCH** — manuscript lists different items than code produces

### Phase 3 — Infrastructure Audit

**3a. LaTeX Macro Hygiene**

- Every data-carrying macro should be generated by a script, not hand-typed in the preamble
- Pattern to detect: `\newcommand{\someMetric}{42.7}` defined directly in `.tex` files (bad) vs `\input{generated/metrics.tex}` where that file is script output (good)
- Flag macros whose values appear nowhere in script outputs
- Flag macros defined in main `.tex` files that carry numeric/data values

**3b. Pipeline Completeness**

- Does a single command reproduce all manuscript artifacts from raw data?
- Is the pipeline documented (Makefile, README, CI config)?
- Are intermediate steps cached, or do they require full re-execution?
- Are random seeds fixed for reproducibility?
- Are software versions pinned (requirements.txt, environment.yml, lock files)?

**3c. Config/Code Separation**

- Are hyperparameters, thresholds, and model names in config files?
- Are file paths relative (portable) or absolute (fragile)?
- Are credentials, API keys, or machine-specific paths absent from committed code?
- Is there a single config entry point, or are settings scattered across scripts?

**3d. Stale Output Detection**

- Compare modification timestamps: script vs its output files
- Flag outputs that are older than their generating scripts (stale)
- Flag outputs with no corresponding script (orphaned)
- Flag scripts with no corresponding output (dead code or unrun)

**3e. Version Pinning**

- Are dependencies locked (requirements.txt with versions, conda environment.yml, poetry.lock, package-lock.json)?
- Are data versions tracked (DVC, git-lfs, data checksums)?
- Is the manuscript itself versioned alongside code (same repo, tagged releases)?

### Phase 4 — Cross-Reference and Manifest Generation

**4a. Macro Manifest Generation**

Produce the **macro manifest** — the primary handoff artifact to manuscript-review. For every data-carrying macro identified in Phase 1a and traced in Phase 2a:

```
Macro: \bestf
Value: 0.847
Source: results/metrics.json → scripts/generate_latex_macros.py → generated/metrics.tex
Locations:
  - paper.tex:142 — "achieving an F1 score of \bestf{}"
  - paper.tex:287 — "The \bestf{} result represents a substantial improvement"
  - abstract.tex:8 — "...with \bestf{} F1 score"
Classification: MACRO-TRACED
```

Also include every **bare number** (not a macro) found in Phase 1a that carries data (metrics, counts, parameters) — these are values that SHOULD be macros but aren't:

```
Bare value: 50
Location: paper.tex:198 — "convergence after 50 epochs"
Should-be-macro: YES — this is a training parameter, should trace to config
Classification: UNTRACED (no macro, no provenance)
```

Save the manifest as `[manuscript-name]-macro-manifest.json` alongside the provenance report.
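The bare-number extraction that feeds these manifest entries can be sketched as a regex pass over the LaTeX source. A minimal illustration, not a prescribed implementation: the `NUMBER_RE` pattern and the comment/macro-definition filtering are assumptions, and a real audit would also skip math environments, labels, and citation keys.

```python
import re

# A digit run, optionally with a decimal part and percent sign,
# not glued to a word character or a backslash (macro name)
NUMBER_RE = re.compile(r"(?<![\w\\])\d+(?:\.\d+)?%?")

def find_bare_numbers(tex_source):
    """Flag numeric literals in running text as provenance-audit candidates."""
    hits = []
    for lineno, line in enumerate(tex_source.splitlines(), start=1):
        line = re.sub(r"(?<!\\)%.*", "", line)  # drop LaTeX comments
        if line.lstrip().startswith(("\\newcommand", "\\def")):
            continue  # macro definitions are audited separately in 3a
        for m in NUMBER_RE.finditer(line):
            hits.append({"value": m.group(), "line": lineno,
                         "context": line.strip()})
    return hits
```

Each hit then needs a provenance search against script outputs before it can be classified; a hit with no match becomes an UNTRACED entry in the `bare_numbers` list.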
The manifest is consumed by manuscript-review Pass 13 (Cross-Element Coherence) to verify prose-value appropriateness.

**4b. Cross-Reference with manuscript-review**

If a manuscript-review report exists for this manuscript, load it and:

- Map UNTRACED values to manuscript-review §6 (Methodology) and §7 (Results) findings — provenance gaps often co-occur with reproducibility concerns
- Flag terminology inconsistencies as potential §14 (Abbreviations) or §15 (Notation) issues in the manuscript-review framework
- Feed HIGH-priority provenance issues as §6/§7 failures
- Feed the macro manifest into manuscript-review §24 (Cross-Element Coherence) findings — macro values whose surrounding prose uses inappropriate qualitative language ("marginal" for 14.3%, "dramatic" for 0.3%) are §24 failures

If no manuscript-review report exists, recommend running it as a companion audit and note that the macro manifest is available for its Pass 13.

### Phase 5 — Report Generation

Load `references/checklist.md` and `references/report-template.md`.

```
Read references/checklist.md
Read references/report-template.md
```

Generate the provenance report following the template structure:

1. **Provenance Summary** — overall score, breakdown by category
2. **Provenance Map** — each manuscript artifact linked to its source
3. **Defect Registry** — every UNTRACED, STALE, MANUAL, INCONSISTENT finding
4. **Infrastructure Assessment** — pipeline, config, versioning status
5. **Remediation Queue** — prioritized fixes
6. **Checklist Status** — full checklist with pass/fail per checkpoint

### Phase 6 — Output

Save two files in the manuscript directory:

1. `[manuscript-name]-provenance-report.md` — the full provenance report
2. `[manuscript-name]-macro-manifest.json` — the structured macro manifest for consumption by manuscript-review Pass 13

The macro manifest JSON structure:

```json
{
  "macros": [
    {
      "name": "\\bestf",
      "value": "0.847",
      "source_chain": "results/metrics.json → scripts/gen_macros.py → generated/metrics.tex",
      "locations": [
        { "file": "paper.tex", "line": 142, "context": "achieving an F1 score of \\bestf{}" },
        { "file": "paper.tex", "line": 287, "context": "The \\bestf{} result represents a substantial improvement" }
      ],
      "classification": "MACRO-TRACED"
    }
  ],
  "bare_numbers": [
    {
      "value": "50",
      "location": { "file": "paper.tex", "line": 198, "context": "convergence after 50 epochs" },
      "section": "methodology",
      "should_be_macro": true,
      "rationale": "Training parameter — should trace to config",
      "classification": "UNTRACED"
    }
  ]
}
```

Present to the user:

- Provenance coverage percentage (TRACED / total artifacts)
- Count of UNTRACED / STALE / MANUAL findings by severity
- Count of bare numbers that should be macros
- Top 5 remediation actions
- Pipeline completeness verdict
- Note that the macro manifest is available for manuscript-review Pass 13

## Severity Classification

- **CRITICAL** — Value in manuscript has no provenance chain AND is a key result (main finding, abstract metric, table headline number). This means the paper's core claims cannot be verified from code.
- **HIGH** — Value/table/figure is untraced or stale and appears in the results or methodology sections. Reproducibility gap.
- **MEDIUM** — Terminology mismatch, manual ordering, partial table generation, config values hardcoded in scripts. Maintenance and consistency risk.
- **LOW** — Minor issues: display-name mapping missing but terms are close, non-critical figures without generation scripts, cosmetic post-editing of generated figures.

## Core Principles

- **Binary provenance.** Every artifact is either traced or not. No "partially reproducible" — partial means broken.
- **Code is truth.** When manuscript and code disagree, the manuscript is wrong until proven otherwise. Flag the disagreement; do not assume the manuscript author "meant to" override code output.
- **Macros over magic numbers.** Every data value in LaTeX should be a macro. Every macro should be generated. No exceptions for "obvious" values.
- **Pipeline as proof.** If `make` (or equivalent) does not produce the PDF from raw data, the manuscript is not reproducible. Partial pipelines get partial credit, not a pass.
- **Config is not code.** Hyperparameters, thresholds, model names, file paths — all belong in config files, not scattered through script bodies.
- **Ordering is data.** The sequence of items in a table or enumeration is an assertion. It must come from code (sort order, enum definition), not from the author's sense of what "looks right."
- **Timestamps matter.** A figure generated last month from a script modified yesterday is suspect. Stale outputs are provenance failures.
- **Companion, not replacement.** This audit checks computational grounding. manuscript-review checks document quality. Both are needed. Neither subsumes the other.

## Example Invocation Patterns

User says any of:

- "Check provenance"
- "Are my numbers from code"
- "Audit my pipeline"
- "Verify reproducibility"
- "Check manuscript against scripts"
- "Provenance audit"
- "Are my tables generated"
- "Do my figures come from scripts"
- "/manuscript-provenance"

All trigger this skill.