# Ingest Pipeline

This is the part `LLM-wiki` was missing for too long: raw intake that does not feel like spreadsheet punishment.

## Goal

Turn this:

- a pile of PDFs
- a folder full of Excel files
- screenshots, archives, docs, random client junk

into this:

- a current `manifests/raw_sources.csv`
- a structural lock file with hashes and cheap metadata
- structured diff hints for table-shaped raw
- a report that shows what is new, changed, archived, or duplicated
- a stale report that tells you which wiki pages are now suspect
- optional draft stubs for manual recompilation

Without burning LLM tokens on clerical work.

## The three scripts

### 1. `python3 scripts/ingest_raw.py`

What it does:

1. scans the local raw root
2. computes a SHA-256 prefix for each tracked file
3. detects duplicates by content hash
4. guesses file kind locally
5. updates `manifests/raw_sources.csv`
6. writes `manifests/raw_index.json`
7. writes `manifests/intake_report.md`
8. records compact change summaries for changed CSV/XLSX/XLSM sources

It is local, deterministic, and cheap. It does **not** call an LLM.

### 2. `python3 scripts/stale_report.py`

What it does:

1. reads wiki frontmatter
2. reads manifest rows
3. reads the raw lock / current raw files
4. compares `source_hash`
5. reports:
   - fresh pages
   - stale pages
   - missing hashes
   - unresolved sources
   - archived source references
   - manifest rows still stuck at `status=new`

This is the default freshness layer.

### 3. `python3 scripts/delta_compile.py --write-drafts`

What it does:

1. reads stale pages and uncompiled raw rows
2. chooses a target page, or keeps the existing stale page as the target
3. writes a manual draft stub under `docs/wiki/drafts/`
4. pre-fills `source`, `source_hash`, `compiled_at`, and optional `compiled_from`
5. emits `manifests/delta_compile_report.md`

It does **not** overwrite live wiki pages.

Minimal sketches of all three scripts appear under "Illustrative sketches" at the end of this page.

## Why this matters

Before this, most projects did one of two dumb things:

- hand-edit the manifest forever
- ignore raw freshness and pretend the wiki was still current

Now the boring parts are local automation:

- raw registration
- hash tracking
- duplicate detection
- stale reporting

LLM tokens can go to synthesis, not janitorial work.

## Supported local parsers

Current local parsing is intentionally cheap:

- `csv/tsv` → row count + headers
- `csv/tsv` changed files → row count, headers, key/value column hints, tracked row-change summary
- `xlsx/xlsm` → workbook sheet names, dimensions, header hints, and per-sheet structural change summary
- `xls` → legacy workbook marker (no fake precision)
- `docx` → paragraph blocks
- `pptx` → slide count
- `pdf` → rough page count
- `image` → dimensions when detectable
- `zip/tar/gz` → archive entry count
- plaintext → first non-empty line

This is not “full semantic understanding”. Good. It should not be. The goal is to make raw intake cheaper and more reliable before any LLM touches it. (A sketch of the CSV and workbook probes is also included at the end of this page.)

## Default workflow

When a batch of files lands:

```bash
python3 scripts/ingest_raw.py
python3 scripts/raw_manifest_check.py
python3 scripts/stale_report.py
python3 scripts/delta_compile.py --write-drafts
```

Then:

- compile the new or changed sources into the wiki
- add verified examples if the material confirms exact mappings
- rerun checks

## Design stance

- intake is local-first
- parsing is structural-first
- writeback still matters
- stale detection should be routine, not heroics
- LLM work starts **after** the raw surface is cleaned up

If the wiki is the brain, `ingest_raw.py` and `stale_report.py` are the eyes and pulse check.
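## Illustrative sketches

The sketches below are illustrative, not the scripts' actual code: helper names, constants, and layouts are assumptions unless they were named above.

First, the hashing and duplicate-detection core of `ingest_raw.py`. The SHA-256 prefix and grouping by content hash come from the script's description; `RAW_ROOT`, the prefix length, and the helper names are assumed.

```python
# Minimal sketch of the hash/duplicate step in ingest_raw.py.
# RAW_ROOT and HASH_PREFIX_LEN are assumptions, not the script's real config.
import hashlib
from pathlib import Path

RAW_ROOT = Path("raw")    # assumed location of the local raw root
HASH_PREFIX_LEN = 16      # assumed prefix length

def hash_prefix(path: Path) -> str:
    """SHA-256 of the file contents, truncated to a short prefix."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()[:HASH_PREFIX_LEN]

def scan(root: Path) -> dict[str, list[Path]]:
    """Group files by content hash so duplicates land in the same bucket."""
    by_hash: dict[str, list[Path]] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_hash.setdefault(hash_prefix(path), []).append(path)
    return by_hash

if __name__ == "__main__":
    for digest, paths in scan(RAW_ROOT).items():
        if len(paths) > 1:
            print(f"duplicate content {digest}: {[str(p) for p in paths]}")
```

Hashing content rather than names is what keeps the duplicate check cheap and deterministic: renames and copies collapse into the same bucket.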
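Next, the freshness comparison at the heart of `stale_report.py`. The frontmatter keys (`source`, `source_hash`) and the report categories are from the description above; the `---`-fenced `key: value` frontmatter format and the `classify` helper are assumptions.

```python
# Minimal sketch of the freshness check in stale_report.py.
# The frontmatter format and category names are assumptions.
from pathlib import Path

def read_frontmatter(page: Path) -> dict[str, str]:
    """Parse a simple key: value frontmatter block from a wiki page."""
    meta: dict[str, str] = {}
    lines = page.read_text(encoding="utf-8").splitlines()
    if lines and lines[0].strip() == "---":
        for line in lines[1:]:
            if line.strip() == "---":
                break
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

def classify(meta: dict[str, str], current_hashes: dict[str, str]) -> str:
    """Bucket one page: fresh, stale, missing-hash, or unresolved."""
    source = meta.get("source")
    recorded = meta.get("source_hash")
    if not source or source not in current_hashes:
        return "unresolved"    # page points at no known raw source
    if not recorded:
        return "missing-hash"  # page never recorded what it was compiled from
    return "fresh" if recorded == current_hashes[source] else "stale"
```

A page is stale exactly when its recorded `source_hash` no longer matches the current hash of its source; most of the rest of the report is bookkeeping around that one comparison.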
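The draft stubs from `delta_compile.py --write-drafts` are pre-filled frontmatter under `docs/wiki/drafts/`. The frontmatter keys, the drafts path, and the no-overwrite rule are from the description above; the timestamp format and stub body are assumed.

```python
# Minimal sketch of the stub writer behind --write-drafts.
# The timestamp format and stub body are assumptions.
from datetime import datetime, timezone
from pathlib import Path

DRAFTS = Path("docs/wiki/drafts")

def write_stub(page_name: str, source: str, source_hash: str,
               compiled_from: str | None = None) -> Path:
    """Write a draft stub; live wiki pages are never touched."""
    lines = [
        "---",
        f"source: {source}",
        f"source_hash: {source_hash}",
        f"compiled_at: {datetime.now(timezone.utc).date().isoformat()}",
    ]
    if compiled_from:
        lines.append(f"compiled_from: {compiled_from}")
    lines += ["---", "", "<!-- manual recompilation goes here -->", ""]
    DRAFTS.mkdir(parents=True, exist_ok=True)
    out = DRAFTS / f"{page_name}.md"
    out.write_text("\n".join(lines), encoding="utf-8")
    return out
```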
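Finally, the structural-first parser stance, shown for the two table-shaped kinds. Row count plus headers for CSV and sheet names plus dimensions for workbooks are from the parser list above; the use of `openpyxl` and the return shapes are assumptions.

```python
# Minimal sketch of the structural-first probes for csv and xlsx/xlsm.
# Using openpyxl for workbooks is an assumption.
import csv
from pathlib import Path

def csv_structure(path: Path) -> dict[str, object]:
    """Headers and row count only; no cell-level inspection."""
    with path.open(newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f)
        headers = next(reader, [])
        rows = sum(1 for _ in reader)
    return {"headers": headers, "row_count": rows}

def workbook_structure(path: Path) -> dict[str, tuple[int, int]]:
    """Sheet name -> (rows, cols); structure only, no values."""
    from openpyxl import load_workbook  # assumed dependency
    wb = load_workbook(path, read_only=True)
    try:
        return {ws.title: (ws.max_row or 0, ws.max_column or 0)
                for ws in wb.worksheets}
    finally:
        wb.close()
```

Both probes deliberately stop at structure: enough to notice a changed shape, never enough to pretend at semantics.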