--- title: "File Processors Guide" description: Comprehensive file processing system supporting 20+ file types with intelligent content extraction, security sanitization, and provider-agnostic formatting keywords: [ file-processors, multimodal, content-extraction, file-types, processor-registry, document-processing, security-sanitization, ] --- # File Processors Guide NeuroLink includes a comprehensive file processing system that supports 20+ file types with intelligent content extraction, security sanitization, and provider-agnostic formatting. This system enables seamless multimodal AI interactions across all 13 supported providers. ## Overview The file processor system is organized into a modular architecture: ``` src/lib/processors/ ├── base/ # BaseFileProcessor abstract class and types ├── registry/ # ProcessorRegistry singleton for processor selection ├── config/ # MIME types, extensions, language maps, size limits ├── errors/ # FileErrorCode enum and error helpers ├── document/ # Excel, Word, RTF, OpenDocument processors ├── media/ # Video and Audio processors (metadata extraction) ├── archive/ # ZIP, TAR, GZ archive processors (file listing + content extraction) ├── markup/ # SVG, HTML, Markdown, Text processors ├── code/ # SourceCode, Config processors ├── data/ # JSON, YAML, XML processors ├── integration/ # FileProcessorIntegration for registry usage └── cli/ # CLI helpers for file processing ``` ## Quick Start ```typescript import { NeuroLink } from "@juspay/neurolink"; const neurolink = new NeuroLink(); const result = await neurolink.generate({ input: { text: "Summarize this document", files: ["./report.pdf"], }, }); console.log(result.content); ``` ## Supported File Types ### Documents | Type | Extensions | Processor | Features | | ---------------- | ---------------------- | ----------------------- | ---------------------------------------------------- | | **Excel** | `.xlsx`, `.xls` | `ExcelProcessor` | Multi-sheet extraction, cell formatting, data tables | | **Word** | `.docx`, `.doc` | `WordProcessor` | Text extraction, paragraph preservation | | **RTF** | `.rtf` | `RtfProcessor` | Rich text to plain text conversion | | **OpenDocument** | `.odt`, `.ods`, `.odp` | `OpenDocumentProcessor` | LibreOffice/OpenOffice format support | ### Data Files | Type | Extensions | Processor | Features | | -------- | --------------- | --------------- | ------------------------------------------------ | | **JSON** | `.json` | `JsonProcessor` | Validation, pretty-printing, syntax highlighting | | **YAML** | `.yaml`, `.yml` | `YamlProcessor` | Validation, formatting, multi-document support | | **XML** | `.xml` | `XmlProcessor` | Parsing, validation, entity handling | ### Markup Files | Type | Extensions | Processor | Features | | ------------ | ------------------ | ------------------- | --------------------------------------------- | | **HTML** | `.html`, `.htm` | `HtmlProcessor` | OWASP-compliant sanitization, text extraction | | **SVG** | `.svg` | `SvgProcessor` | XSS prevention, text injection (not binary) | | **Markdown** | `.md`, `.markdown` | `MarkdownProcessor` | Formatting preservation, metadata extraction | | **Text** | `.txt` | `TextProcessor` | Plain text handling, encoding detection | ### Source Code | Type | Extensions | Processor | Features | | ------------------- | -------------------------- | --------------------- | ----------------------------------- | | **TypeScript** | `.ts`, `.tsx` | `SourceCodeProcessor` | Language detection, syntax metadata | | **JavaScript** | `.js`, `.jsx`, `.mjs` | `SourceCodeProcessor` | Module detection | | **Python** | `.py` | `SourceCodeProcessor` | Docstring preservation | | **Java** | `.java` | `SourceCodeProcessor` | Package detection | | **Go** | `.go` | `SourceCodeProcessor` | Module awareness | | **Rust** | `.rs` | `SourceCodeProcessor` | Crate detection | | **C/C++** | `.c`, `.cpp`, `.h`, `.hpp` | `SourceCodeProcessor` | Header handling | | **C#** | `.cs` | `SourceCodeProcessor` | Namespace detection | | **Ruby** | `.rb` | `SourceCodeProcessor` | Gem awareness | | **PHP** | `.php` | `SourceCodeProcessor` | Tag handling | | **Swift** | `.swift` | `SourceCodeProcessor` | Framework detection | | **Kotlin** | `.kt`, `.kts` | `SourceCodeProcessor` | Android/JVM awareness | | **Scala** | `.scala` | `SourceCodeProcessor` | SBT integration | | **Shell** | `.sh`, `.bash`, `.zsh` | `SourceCodeProcessor` | Shebang detection | | **SQL** | `.sql` | `SourceCodeProcessor` | Dialect hints | | **And 35+ more...** | Various | `SourceCodeProcessor` | Automatic language detection | ### Configuration Files | Type | Extensions | Processor | Features | | --------------- | ---------------- | ----------------- | ---------------------------------- | | **Environment** | `.env`, `.env.*` | `ConfigProcessor` | Secret masking option | | **INI** | `.ini`, `.cfg` | `ConfigProcessor` | Section parsing | | **TOML** | `.toml` | `ConfigProcessor` | Cargo.toml, pyproject.toml support | | **Properties** | `.properties` | `ConfigProcessor` | Java properties format | ### Media Files | Type | Extensions | Processor | Features | | --------- | ------------------------------------------------------- | ---------------- | -------------------------------------------------------------------------------- | | **Video** | `.mp4`, `.mkv`, `.webm`, `.avi`, `.mov`, `.m4v` | `VideoProcessor` | Duration, resolution, codec, frame rate, bitrate extraction via `music-metadata` | | **Audio** | `.mp3`, `.wav`, `.ogg`, `.flac`, `.aac`, `.m4a`, `.wma` | `AudioProcessor` | Codec, bitrate, sample rate, channels, duration extraction via `music-metadata` | Video and audio files are **not** sent as binary to the AI provider. Instead, the processors extract structured metadata and return it as formatted text, keeping token usage minimal (~50-200 tokens per file). **Example video output:** ``` Video File: presentation.mp4 Duration: 13s | Resolution: 640x360 | Video Codec: h264 Frame Rate: 29.97 fps | Bitrate: 345 kbps Audio: aac, 48000 Hz, 2 channels ``` **Example audio output:** ``` Audio File: recording.mp3 Codec: MPEG 1 Layer 3 | Bitrate: 128 kbps Sample Rate: 44100 Hz | Channels: 2 (Stereo) | Duration: 1:46 ``` ### Archives | Type | Extensions | Processor | Features | | ------- | ------------------------ | ------------------ | ---------------------------------------------------------------------- | | **ZIP** | `.zip` | `ArchiveProcessor` | File listing with sizes, nested content extraction, ZIP bomb detection | | **TAR** | `.tar` | `ArchiveProcessor` | File listing with sizes | | **GZ** | `.gz`, `.tar.gz`, `.tgz` | `ArchiveProcessor` | Gzip decompression, tar content listing | Archive files return a structured listing of their contents with file sizes and optionally extract text from contained files (routing through existing processors). **Example archive output:** ``` Archive: project.tar.gz Total entries: 6 Files: - code/sample.json (60 B) - code/sample.py (195 B) - document/sample.txt (607 B) ``` **Security:** Archive processing includes ZIP bomb detection (compression ratio limits), path traversal prevention, symlink blocking, entry count limits, and aggregate decompression size limits. ## Usage ### SDK Usage ```typescript import { NeuroLink } from "@juspay/neurolink"; const neurolink = new NeuroLink(); // Process multiple file types in a single request const result = await neurolink.generate({ input: { text: "Analyze these files and summarize the key information", files: [ "./data/report.xlsx", // Excel spreadsheet "./config/settings.yaml", // YAML configuration "./src/main.ts", // TypeScript source "./docs/architecture.svg", // SVG diagram (injected as text) "./api/schema.json", // JSON schema ], }, provider: "vertex", }); console.log(result.content); ``` ### CLI Usage ```bash # Single file neurolink generate "Analyze this spreadsheet" --file ./data.xlsx # Multiple files neurolink generate "Compare these configs" \ --file ./config.yaml \ --file ./settings.json \ --file ./app.toml # Mixed with images and PDFs neurolink generate "Explain this codebase" \ --file ./src/main.ts \ --file ./docs/diagram.svg \ --pdf ./docs/spec.pdf \ --image ./screenshot.png ``` ### Stream Mode ```typescript // Streaming with file processing const result = await neurolink.stream({ input: { text: "Walk me through this code step by step", files: ["./src/algorithm.py"], }, }); for await (const chunk of result.stream) { if ("content" in chunk) { process.stdout.write(chunk.content); } } ``` ## Architecture ### ProcessorRegistry The `ProcessorRegistry` is a singleton that manages all file processors with priority-based selection: ```typescript import { ProcessorRegistry } from "@juspay/neurolink"; // Get the singleton instance const registry = ProcessorRegistry.getInstance(); // Register a custom processor (lower priority = higher precedence) registry.register(new MyCustomProcessor(), 50); // Find processor for a file const processor = registry.findProcessor({ filename: "data.xlsx", mimeType: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", size: 1024, }); // Process a file const result = await processor.process(fileInfo, fileContent); ``` ### BaseFileProcessor All processors extend the abstract `BaseFileProcessor` class: ```typescript import { BaseFileProcessor, FileInfo, ProcessedFile } from "@juspay/neurolink"; export class MyProcessor extends BaseFileProcessor { readonly name = "my-processor"; readonly supportedMimeTypes = ["application/x-my-format"]; readonly supportedExtensions = [".myf"]; canProcess(file: FileInfo): boolean { return this.supportedExtensions.includes(file.extension); } async process(file: FileInfo, content: Buffer): Promise { const text = this.extractText(content); return { type: "text", content: text, metadata: { processor: this.name, originalFilename: file.filename, }, }; } getInfo(): ProcessorInfo { return { name: this.name, description: "Processes MY format files", supportedMimeTypes: this.supportedMimeTypes, supportedExtensions: this.supportedExtensions, }; } } ``` ### FileDetector The `FileDetector` utility automatically identifies file types: ```typescript import { FileDetector } from "@juspay/neurolink"; const detector = new FileDetector(); // Detect by extension const type1 = detector.detect("report.xlsx"); // Returns: { type: "xlsx", mimeType: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" } // Detect by content (magic bytes) const type2 = detector.detectFromContent(buffer); // SVG special handling - returns "svg" type, not "image" const type3 = detector.detect("diagram.svg"); // Returns: { type: "svg", mimeType: "image/svg+xml" } ``` ## Security Features ### OWASP-Compliant Sanitization The markup processors include security sanitization to prevent XSS and injection attacks: #### HTML Sanitization ```typescript // HtmlProcessor automatically sanitizes HTML content // - Removes