--- name: liteparse description: Local document and PDF parsing with spatial text and bounding boxes. Use for extracting text from PDFs, DOCX, Office files, and images; OCR on scans; layout-preserved JSON for RAG; batch-ingesting paper folders; or page screenshots for multimodal agents — even when the user does not name liteparse. Prefer over MarkItDown when you need bboxes, fast local parsing, or PNG page renders; prefer over the pdf skill for merge/split/forms. license: Apache-2.0 allowed-tools: Read Write Edit Bash compatibility: Python 3.10+. Optional LibreOffice (Office formats) and ImageMagick (images). Bundled Tesseract for OCR. All processing is local — no cloud API required. metadata: {"version": "1.0", "skill-author": "K-Dense Inc."} --- # LiteParse — Local Document Parsing ## Overview LiteParse is a fast, open-source document parser (Rust core, Python/Node bindings) focused on **local, layout-aware text extraction** with bounding boxes. It does not produce Markdown and does not call cloud LLMs. Outputs are **plain text** (layout-preserved) or **structured JSON** with per-page `text_items` (position, font metadata, optional confidence). **Version note:** Examples target **liteparse 2.0.0** (PyPI, May 2026). The upstream V1 branch is legacy; this skill documents **V2 / main** only. For parser selection vs MarkItDown, the `pdf` skill, or LlamaParse, see `references/choosing_a_parser.md`. ## When to Use This Skill Use LiteParse when you need: - **Fast local parsing** of PDFs or converted Office/image files without cloud dependencies - **Spatial text** with bounding boxes for layout-aware RAG, citation grounding, or figure/table region logic - **OCR** on scanned PDFs or images (bundled Tesseract, or a user-run HTTP OCR server) - **Page screenshots** (PNG) for multimodal agents that must see charts, figures, or handwriting - **Batch ingestion** of literature folders, supplementary PDFs, or protocol libraries - **Page subsets** or **password-protected** PDFs ## When Not to Use | Task | Use instead | |------|-------------| | Markdown for LLM ingestion (EPUB, audio, YouTube, HTML) | `markitdown` skill | | Merge/split PDFs, forms, watermarks, rotation | `pdf` skill | | Dense tables, handwriting, production cloud pipelines | [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse/overview) (cloud; sign up separately) | ## Installation ```bash uv pip install "liteparse==2.0.0" ``` This installs the Python bindings and the **`lit`** CLI. Verify: ```bash lit --help python -c "import liteparse; print(liteparse.__version__)" ``` **Optional system tools** (for non-PDF inputs): - **LibreOffice** — Word, Excel, PowerPoint, OpenDocument, CSV/TSV - **ImageMagick** — PNG, JPEG, TIFF, WebP, SVG, etc. Install commands are in `references/ocr_and_formats.md`. **Node.js / TypeScript** (optional): `npm i @llamaindex/liteparse` — see `references/api_reference.md`. --- ## Quick Start ### Python ```python from liteparse import LiteParse parser = LiteParse(quiet=True) result = parser.parse("paper.pdf") print(result.text) for page in result.pages: print(f"Page {page.page_num}: {len(page.text_items)} items") ``` ### CLI ```bash # Layout-preserved text (default) lit parse paper.pdf # Structured JSON with bounding boxes lit parse paper.pdf --format json -o paper.json # Disable OCR on text-native PDFs (faster) lit parse paper.pdf --no-ocr ``` --- ## Core Workflows ### 1. Parse to layout-preserved text Best for quick full-document text or feeding chunkers that do not need coordinates. ```python parser = LiteParse(ocr_enabled=True, quiet=True) result = parser.parse("document.pdf") full_text = result.text ``` ```bash lit parse document.pdf -o output.txt ``` ### 2. Parse to structured JSON (bounding boxes) Use when building layout-aware RAG, highlighting source regions, or joining text with screenshots. ```python import json from liteparse import LiteParse parser = LiteParse(output_format="json", quiet=True) result = parser.parse("document.pdf") # Programmatic access for page in result.pages: for item in page.text_items: bbox = (item.x, item.y, item.width, item.height) # item.text, item.confidence, item.font_name, item.font_size ``` ```bash lit parse document.pdf --format json -o document.json ``` JSON field layout: `references/output_formats.md`. ### 3. Parse specific pages ```python parser = LiteParse(target_pages="1-5,10,15-20", quiet=True) result = parser.parse("long_paper.pdf") ``` ```bash lit parse long_paper.pdf --target-pages "1-5,10" ``` ### 4. Parse from bytes or stdin Useful for uploads, S3 downloads, or piping remote PDFs. ```python with open("document.pdf", "rb") as f: result = parser.parse(f.read()) ``` ```bash curl -sL https://example.com/report.pdf | lit parse - ``` ### 5. Page screenshots for multimodal agents Screenshots capture visual content that text extraction alone misses (figures, complex tables, handwriting). ```python from pathlib import Path parser = LiteParse(dpi=150, quiet=True) shots = parser.screenshot("document.pdf", page_numbers=[1, 2, 3]) out = Path("screenshots") out.mkdir(exist_ok=True) for s in shots: (out / f"page_{s.page_num}.png").write_bytes(s.image_bytes) ``` ```bash lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots lit screenshot document.pdf --dpi 300 -o ./screenshots ``` Combine **JSON parse + screenshots** when an agent needs both coordinates and pixels for the same pages. ### 6. Batch-parse a directory For large corpora, prefer the CLI (parallel OCR workers) or the bundled script. ```bash lit batch-parse ./papers ./parsed --format json --recursive lit batch-parse ./papers ./parsed --extension .pdf --no-ocr ``` ```bash python scripts/batch_parse_dir.py ./papers ./parsed --format json --recursive ``` See `scripts/batch_parse_dir.py` for a Python batch wrapper without network calls. ### 7. OCR configuration OCR is **on by default**. Tesseract is bundled; no extra install for basic English OCR. ```python parser = LiteParse( ocr_enabled=True, ocr_language="eng", # Tesseract codes: fra, deu, etc. num_workers=4, # parallel OCR (default: CPU cores - 1) dpi=150, # higher DPI → better OCR, slower ) ``` ```bash lit parse scan.pdf --ocr-language fra lit parse scan.pdf --no-ocr lit parse scan.pdf --ocr-server-url http://localhost:8080/ocr ``` **Offline / air-gapped:** set `TESSDATA_PREFIX` to a directory of `.traineddata` files, or pass `--tessdata-path`. Details: `references/ocr_and_formats.md`. ### 8. Encrypted PDFs ```python parser = LiteParse(password="secret", quiet=True) result = parser.parse("protected.pdf") ``` ```bash lit parse protected.pdf --password secret ``` ### 9. Search text items by phrase Merge adjacent items and return combined bounding boxes for a phrase (e.g. section titles). ```python from liteparse import search_items page = result.get_page(1) matches = search_items(page.text_items, "Materials and Methods", case_sensitive=False) ``` --- ## Multi-Format Inputs | Category | Extensions (examples) | Requirement | |----------|----------------------|-------------| | PDF | `.pdf` | Native | | Office | `.docx`, `.xlsx`, `.pptx`, `.doc`, `.odt`, … | LibreOffice | | Images | `.png`, `.jpg`, `.tiff`, `.webp`, `.svg`, … | ImageMagick | Files are converted to PDF internally, then parsed. If conversion tools are missing, parsing fails with an actionable error — install the dependency and retry. --- ## Performance Tips - **`--no-ocr`** on born-digital PDFs — largest speedup - **`target_pages`** — parse only methods/supplement sections - **`num_workers`** — scale OCR across CPU cores - **`max_pages`** — cap very large files (default 1000) - **`lit batch-parse`** — directory-scale jobs with `--recursive` and `--extension` - Lower **`dpi`** (e.g. 100) when OCR quality is already sufficient --- ## Reference Files | File | Read when | |------|-----------| | `references/choosing_a_parser.md` | Unsure whether to use LiteParse, MarkItDown, pdf, or LlamaParse | | `references/api_reference.md` | Python/TypeScript API, types, `search_items` | | `references/cli_reference.md` | Full `lit` command flags | | `references/output_formats.md` | JSON schema, bboxes, confidence scores | | `references/ocr_and_formats.md` | Tesseract, HTTP OCR, LibreOffice, ImageMagick | --- ## Troubleshooting | Issue | Fix | |-------|-----| | Office file fails | Install LibreOffice; ensure `soffice` is on PATH (Windows: add LibreOffice `program` dir) | | Image fails | Install ImageMagick; verify `convert` or `magick` works | | OCR poor quality | Increase `--dpi`; try `--ocr-language`; or HTTP OCR server | | OCR slow | `--no-ocr` if not needed; reduce pages; increase `num_workers` | | Air-gapped OCR | `export TESSDATA_PREFIX=/path/to/tessdata` or `--tessdata-path` | | `ParseError` on bytes | Ensure input is valid PDF bytes (Office bytes need a file path + conversion) | --- ## Resources - **GitHub**: https://github.com/run-llama/liteparse - **Docs**: https://developers.llamaindex.ai/liteparse/ - **PyPI**: https://pypi.org/project/liteparse/2.0.0/ - **npm**: https://www.npmjs.com/package/@llamaindex/liteparse - **OCR API spec**: https://github.com/run-llama/liteparse/blob/main/OCR_API_SPEC.md