--- name: addon-rag-ingestion-pipeline description: Use when adding multi-format RAG ingest, chunk, embed, and retrieval pipelines; pair with architect-python-uv-batch or architect-python-uv-fastapi-sqlalchemy. --- # Add-on: Multi-Format RAG Ingestion Pipeline Use this skill when an existing project needs RAG ingestion/retrieval across multiple document formats. ## Compatibility - Works with `architect-python-uv-batch`. - Works with `architect-python-uv-fastapi-sqlalchemy`. - Can back a Next.js app via a Python worker service. ## Inputs Collect: - `SOURCE_FORMATS`: one or more of `pdf`, `markdown`, `txt`, `html`, `csv`. - `EMBED_PROVIDER`: `openai` or `sentence-transformers`. - `VECTOR_STORE`: `pgvector`, `chroma`, or existing vector layer. - `CHUNK_SIZE`: default `1000`. - `CHUNK_OVERLAP`: default `150`. - `TOP_K`: default `5`. ## Integration Workflow 1. Add dependencies (Python worker path): ```bash uv add pypdf markdown-it-py beautifulsoup4 pandas langchain-text-splitters ``` - If `EMBED_PROVIDER=openai`: `uv add openai` - If `EMBED_PROVIDER=sentence-transformers`: `uv add sentence-transformers` - If `VECTOR_STORE=chroma`: `uv add chromadb` 2. Add modules: ```text src/{{MODULE_NAME}}/rag/ loaders/pdf_loader.py loaders/markdown_loader.py loaders/text_loader.py loaders/html_loader.py loaders/csv_loader.py normalize.py chunking.py embeddings.py indexer.py retriever.py ``` 3. Use a normalized document contract: - `document_id` - `source_path` - `source_type` - `content` - `metadata` (filename/page/section/checksum/ingested_at/model_version) 4. Implement ingestion entrypoint: ```bash uv run {{PROJECT_NAME}} rag-ingest --source ./data/inbox --formats pdf,markdown,txt ``` 5. Implement retrieval entrypoint: ```bash uv run {{PROJECT_NAME}} rag-query --q "question" --top-k 5 ``` - Ensure both commands are wired in the project CLI/script entrypoint. - `rag-query` depends on an existing index from `rag-ingest`; do not run these validation commands in parallel. ## Loader Notes - PDF: extract per page and keep `page_number` in metadata. - Markdown: keep heading hierarchy and section anchors in metadata. - Text: detect encoding fallback (`utf-8`, then `latin-1`). - HTML: strip script/style tags and preserve title/headings where possible. - CSV: convert rows into stable textual records with row identifiers. ## Minimal Defaults ### `normalize.py` ```python import re import unicodedata def normalize_text(raw: str) -> str: text = unicodedata.normalize("NFKC", raw) text = text.replace("\r\n", "\n") text = re.sub(r"[ \t]+", " ", text) text = re.sub(r"\n{3,}", "\n\n", text) return text.strip() ``` ### `chunking.py` ```python from langchain_text_splitters import RecursiveCharacterTextSplitter def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]: splitter = RecursiveCharacterTextSplitter( chunk_size=chunk_size, chunk_overlap=overlap, separators=["\n\n", "\n", ". ", " ", ""], ) return splitter.split_text(text) ``` ## Guardrails - Documentation contract for generated code: - Python: write module docstrings and docstrings for public classes, methods, and functions. - Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers. - Add concise rationale comments only for non-obvious logic, invariants, or safety constraints. - Apply this contract even when using template snippets below; expand templates as needed. - Deduplicate ingestion by checksum to keep re-runs idempotent. - Store embedding model/version so re-indexing can be reasoned about. - Never interpolate user queries into raw SQL vector search. - Keep ingestion async/offline for large corpora; do not block request-response paths. - Preserve citation metadata for retrieval (`source_path`, section, page, row id). ## Validation Checklist - Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic. ```bash uv run {{PROJECT_NAME}} rag-ingest --source ./data/inbox --formats pdf,markdown uv run {{PROJECT_NAME}} rag-query --q "smoke test" --top-k 5 uv run pytest -q ``` ## Decision Justification Rule - Every non-trivial decision must include a concrete justification. - Capture the alternatives considered and why they were rejected. - State tradeoffs and residual risks for the chosen option. - If justification is missing, treat the task as incomplete and surface it as a blocker.