---
name: office-to-md
description: Convert Office documents (Word, Excel, PowerPoint, PDF) to Markdown format. ONLY use this skill when the user explicitly requests to CONVERT, TRANSFORM or PARSE a specific office file into Markdown. Do NOT trigger for general questions, documentation reading, or discussions about files.
---

# Office Document to Markdown Converter

Convert various Office document formats to structured Markdown with text, table, and image extraction.

## File Description

- `enhanced_parser.py` - Core document parser
- `doc_converter.py` - DOC to DOCX converter (requires LibreOffice)
- `requirements.txt` - Python dependencies

## Install Dependencies

```bash
pip install -r requirements.txt
```

### Additional Dependencies for DOC Format

.doc format requires LibreOffice:

```bash
# Windows: Install LibreOffice from official website
# https://www.libreoffice.org/download/

# Linux
sudo apt install libreoffice

# Mac
brew install --cask libreoffice
```

## Quick Start

### Python Code

```python
from enhanced_parser import EnhancedDocumentParser

# Initialize parser
parser = EnhancedDocumentParser(
    image_base_url="http://localhost:5000",
    image_save_dir="./static/images",
    filter_headers_footers=True  # Filter headers and footers
)

# Parse document
result = parser.parse_document("document.docx")

if result["success"]:
    print(result["markdown"])
    print(f"Extracted {result['images_count']} images")
```

### Start API Service

```bash
# Start service using app.py from project root
python app.py

# Visit http://localhost:5000/analyzer to upload files
```

## Supported Formats

| Format | Extensions | Notes |
|--------|-----------|-------|
| Word | .docx, .doc | .doc requires LibreOffice |
| Excel | .xlsx, .xls | Supports multiple worksheets and date formats |
| PowerPoint | .pptx | Extracts slide text and images |
| PDF | .pdf | Auto-detects tables and images |

## Features

### Word Documents
- Automatic heading level detection
- Convert tables to Markdown tables
- Extract inline images
- Filter headers and footers
- Preserve list formatting

### Excel Workbooks
- Support for multiple worksheets
- Automatic date format detection (prevents display as numbers)
- Convert to Markdown tables
- Extract embedded images

### PowerPoint Presentations
- Extract content by slide
- Extract images and text boxes
- Preserve slide order

### PDF Documents
- Auto-detect tables (line detection + text position detection)
- Extract page images
- Intelligently identify headings and lists
- Output content in original order

## Advanced Options

### DOC Conversion

```bash
# Test LibreOffice configuration
python doc_converter.py
```

### PDF Table Strategy

```python
parser = EnhancedDocumentParser(
    pdf_table_strategy="lines_strict"  # Default: strict line detection, fastest
    # "lines": Normal line detection
    # "text": Based on text position, more accurate but slower
)
```

### Image Processing

```python
parser = EnhancedDocumentParser(
    image_base_url="https://your-domain.com",  # Image access URL
    image_save_dir="./static/images"           # Image save directory
)
```

## Return Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nContent...",
  "images_count": 2,
  "images": [
    {
      "filename": "uuid.png",
      "url": "http://localhost:5000/static/images/uuid.png",
      "size": 12345
    }
  ],
  "file_type": "docx",
  "file_info": {
    "name": "document.docx",
    "size": 45678,
    "paragraphs": 50,
    "tables": 3
  }
}
```

## Common Issues

### DOC Conversion Failed
- Ensure LibreOffice is installed
- Run `python doc_converter.py` to test configuration

### Dates Display as Numbers
- Excel parsing automatically handles date formats
- Ensure you're using the latest version of enhanced_parser.py

### PDF Table Recognition Inaccurate
- Try different pdf_table_strategy parameters
- Use "lines_strict" for standard tables
- Use "text" for complex tables

## File Limitations

- Maximum file size: 160MB
- Supported extensions: docx, doc, pdf, xlsx, xls, pptx
- Automatic cleanup of temporary files