---
name: ingest-pdf-to-normalized
description: Ingest KUKA PDFs (manuals, application notes, training material, error code refs) from kuka_dataset/raw_sources/ into normalized knowledge entries under kuka_dataset/normalized/. Use when new PDFs are added or when re-ingesting after schema updates.
---

# Ingest PDF to Normalized

Turn a directory of raw KUKA PDFs into typed, schema-validated knowledge entries the agent cell can reliably cite.

## When to Use

- New PDFs added to `kuka_dataset/raw_sources/`.
- `kuka_dataset/INGESTION_SCHEMA.md` changed (re-ingest to conform).
- `kuka_knowledge` MCP search quality is poor (gap indicates missing normalization).

## Prerequisites

- Python 3.11+ with `pdfplumber` or `pypdf` installed (for text extraction).
- Optional: `ocrmypdf` if any PDF is image-only.
- `kuka_dataset/INGESTION_SCHEMA.md` read in context.
- `cowork/schemas/dataset_entry.schema.json` read in context.
- `cowork/templates/INGESTION_ENTRY_TEMPLATE.md` read in context.

## Steps

### 1. Inventory

List every PDF in `kuka_dataset/raw_sources/` with its current file name and size. For each PDF, determine:

- **Document kind**: `vendor_manual`, `application_note`, `training`, `error_code_ref`, `white_paper`, `third_party_integrator`, `community`.
- **Platform**: KR C4, KR C5, KR C2 (legacy), iiQKA, Sunrise.
- **KSS version(s)**: from title page or metadata.
- **Primary topic(s)**: motion, safety, fieldbus, KRL programming, KAREL-equivalent, etc.

Output this inventory to `kuka_dataset/_ingestion_log.md` as a table.

### 2. Extract Text

For each PDF, extract text per page. Preserve page numbers; they are needed for citation.

If a PDF is image-only, OCR it first:

```bash
ocrmypdf input.pdf output.pdf
```

Keep extracted text in a scratch directory (`kuka_dataset/raw_sources/_scratch/`, gitignored).

### 3. Chunk

Per `kuka_dataset/INGESTION_SCHEMA.md`:

- One concept per file. Chunks should be coherent units (e.g., "PTP Motion Instruction," not the entire motion chapter).
- Max ~400 lines per normalized file. Split longer chunks.
- Preserve section hierarchy in markdown headings.

### 4. Categorize

For each chunk, decide the target subdirectory:

- `articles/`: conceptual explanations, how-to. Prefix `ONE__.md`.
- `reference/`: syntax, instruction reference, parameter tables. Prefix `KUKA_REF_.md`.
- `examples/`: code examples with context. Prefix `EG_.md`.
- `protocols/`: fieldbus, EKI, RSI, mxAutomation. Prefix `KUKA__.md`.
- `safety/`: safety content only. Prefix `KUKA_SAFETY_.md`.

### 5. Emit Normalized Entries

For each chunk, write a file under the chosen subdirectory with:

**YAML frontmatter** (required; see `INGESTION_SCHEMA.md` for full spec):

```yaml
---
id: KUKA_REF_PTP_Motion
title: KRL PTP Motion Instruction
topic: motion
kuka_platform: [KR C4, KR C5]
controller: [KSS 8.3, KSS 8.6]
language: KRL
source:
  type: vendor_manual
  title: "KUKA System Software 8.x Operating and Programming Instructions"
  tier: T1
  pages: [412, 418]
  access_date: 2026-04-21
  license: reference-only
revision_date: 2026-04-21
related: [KUKA_REF_LIN_Motion, ONE_motion_termination]
difficulty: intermediate
tags: [motion, ptp, asynchronous]
---
```

**Body**: summary first, syntax/details next, examples last. Cite by page range. Do NOT reproduce more than a short quote verbatim; summarize in your own words.

### 6. Validate

For each file, validate the frontmatter block against `cowork/schemas/dataset_entry.schema.json`. If a field is missing or typed wrong, fix and re-validate.

### 7. Update Manifests

- Append an entry to `kuka_dataset/_manifest.json` with the file's `id`, path, frontmatter summary.
- Add a row to `kuka_dataset/DATASET_INDEX.md` under the appropriate topic section.

### 8. Reindex

Trigger `kuka_knowledge.reindex()` via the MCP tool so new entries are searchable.

### 9. QA Gate

Hand every new entry to the QA agent for validation:

- Frontmatter schema-valid?
- Citations present and correct?
- No verbatim copyright violation?
- Topic / category correct?

QA issues a `REVIEW_ingestion_.md`. Fix any findings before declaring the ingestion done.

### 10. Log

Update `kuka_dataset/_ingestion_log.md` to record which PDFs produced which normalized entries, along with the date and the agent.

## Notes

- Raw PDFs stay in `raw_sources/` and remain git-lfs tracked. Normalized entries are the citable product.
- When a PDF is updated (new KSS version), produce a *new* normalized entry with an updated `revision_date`; keep the old entry for provenance if it is still relevant.
- The Architect and Motion agents will cite normalized entries via `kuka_knowledge.search`; a good normalization schema means they find the right entry on the first search.
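
The size rule in Step 3 (at most ~400 lines per normalized file, split along the section hierarchy) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's chunker: `split_chunk`, `MAX_LINES`, and the choice to treat ~400 as a hard cap are all hypothetical, and `INGESTION_SCHEMA.md` remains the authority on how chunks should actually be cut.

```python
import re

MAX_LINES = 400  # Step 3's "~400 lines", treated here as a hard cap (assumption)
# ATX markdown heading. Lines inside fenced code blocks are not skipped,
# so a "# comment" in a bash fence would also be treated as a cut point.
HEADING = re.compile(r"^#{1,6} ")


def split_chunk(text: str, max_lines: int = MAX_LINES) -> list[str]:
    """Split markdown into pieces of at most max_lines lines.

    Prefers to cut immediately before the last heading that fits in the
    current window; falls back to a hard cut when no heading is found.
    """
    lines = text.splitlines()
    pieces: list[str] = []
    start = 0
    while len(lines) - start > max_lines:
        limit = start + max_lines
        cut = limit  # hard cut unless a heading is found below
        # Scan backwards so the piece is as long as the cap allows.
        for i in range(limit, start, -1):
            if HEADING.match(lines[i]):
                cut = i
                break
        pieces.append("\n".join(lines[start:cut]))
        start = cut
    pieces.append("\n".join(lines[start:]))
    return pieces


if __name__ == "__main__":
    sample = "\n".join(
        ["# PTP Motion"] + ["intro"] * 4 + ["## Syntax"] + ["detail"] * 4
    )
    for n, piece in enumerate(split_chunk(sample, max_lines=6), start=1):
        first_line = piece.splitlines()[0]
        print(n, first_line, len(piece.splitlines()))
```

Cutting just before a heading keeps each piece starting at a section boundary, which is one way to honor "preserve section hierarchy". A piece that begins at a subsection does not carry its parent headings with it; if the schema requires that, the parent heading trail would need to be prepended to continuation pieces.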