---
name: large-document-processing
description: Process large documents (200+ pages) with structure preservation, intelligent parsing, and memory-efficient handling. Use when working with complex formatted documents, multi-level hierarchies, or when you need to extract structured data from large files like PDFs, DOCX, or text files.
---

# Large Document Processing

## Overview

A comprehensive skill for processing large documents (200+ pages) with structure preservation, intelligent parsing, and memory-efficient handling. Designed for documents with complex formatting, hierarchical structures, and multi-level indentation.

## Capabilities

- **Multi-format Support**: DOCX, PDF, and text files
- **Structure Preservation**: Maintains document hierarchy, indentation, and formatting
- **Memory Efficiency**: Chunked processing to handle very large documents
- **Intelligent Parsing**: Recognizes headings, lists, dictionary entries, and semantic boundaries
- **Progress Tracking**: Real-time processing status and error recovery
- **Metadata Extraction**: Comprehensive document analysis and statistics

## Core Components

### 1. Advanced Document Parser

Parse complex document structures while preserving formatting and hierarchy.

**Key Features**:

- Hierarchical structure detection (levels 1-10)
- Formatting preservation (bold, italic, fonts, sizes)
- Page-by-page processing for memory efficiency
- Intelligent content classification
- Multi-language support with accent character handling

### 2. Implementation Pattern

```python
from .large_document_processor import LargeDocumentProcessor, ProcessingConfig

# Configure processing
config = ProcessingConfig(
    chunk_size_pages=50,
    parallel_workers=4,
    preserve_formatting=True
)

# Initialize processor
processor = LargeDocumentProcessor(config)

# Process document
results = processor.process_large_document(
    input_file="large_document.docx",
    output_dir="output/processed"
)
```

### 3. Intelligent Text Chunking

```python
from .intelligent_chunker import IntelligentTextChunker, ChunkType

chunker = IntelligentTextChunker(
    max_chunk_size=1024,
    overlap_ratio=0.15,
    preserve_sentences=True
)

chunks = chunker.chunk_document(text, ChunkType.SEMANTIC)
```

## Output Formats

- **Structured JSON**: Complete document hierarchy and metadata
- **Plain text**: Clean extracted text with optional formatting markers
- **Chunked data**: AI-ready text segments with overlap and metadata
- **Statistics report**: Processing metrics and quality analysis

## Best Practices

1. **Memory Management**: Use chunked processing for documents >100MB
2. **Parallel Processing**: Leverage multiple workers for batch operations
3. **Structure Validation**: Verify hierarchy detection accuracy
4. **Progress Tracking**: Provide user feedback for long-running operations

## Dependencies

- `python-docx`: DOCX file processing
- `PyMuPDF`: Advanced PDF processing
- `Pillow`: Image processing for embedded content
- `pathlib`: Cross-platform path handling