---
name: excel-parser
description: Smart Excel/CSV file parsing with intelligent routing based on file complexity analysis. Analyzes file structure (merged cells, row count, table layout) using lightweight metadata scanning, then recommends optimal processing strategy - either high-speed Pandas mode for standard tables or semantic HTML mode for complex reports. Use when processing Excel/CSV files with unknown or varying structure where optimization between speed and accuracy is needed.
---

# Excel Parser

## Table of Contents

- [Overview](#overview)
- [Core Philosophy: Scout Pattern](#core-philosophy-scout-pattern)
- [When to Use This Skill](#when-to-use-this-skill)
- [Processing Workflow](#processing-workflow)
- [Complexity Scoring Rules](#complexity-scoring-rules)
- [Path A: Pandas Standard Mode](#path-a-pandas-standard-mode)
- [Path B: HTML Semantic Mode](#path-b-html-semantic-mode)
- [Best Practices](#best-practices)
- [Troubleshooting](#troubleshooting)
- [Dependencies](#dependencies)
- [Resources](#resources)

## Overview

Provide intelligent routing strategies for parsing Excel/CSV files by analyzing complexity and choosing the optimal processing path. The skill implements a "Scout Pattern" that scans file metadata before processing to balance speed (Pandas) with accuracy (semantic extraction).

## Core Philosophy: Scout Pattern

Before processing data, deploy a lightweight "scout" to analyze file metadata and make intelligent routing decisions:

1. **Metadata Scanning** - Use `openpyxl` to scan file structure without loading data
2. **Complexity Scoring** - Calculate score based on merged cells, row count, and layout
3. **Path Selection** - Choose between Pandas (fast) or HTML (accurate) processing
4. **Optimized Execution** - Execute with the most appropriate tool for the file type

**Key Principle**: "LLM handles metadata decisions, Pandas/HTML processes bulk data"

## When to Use This Skill

**Use excel-parser when:**
- Processing Excel/CSV files with unknown structure or varying complexity
- Handling files ranging from simple data tables to complex financial reports
- Need to optimize between processing speed and extraction accuracy
- Working with files that may contain merged cells, multi-level headers, or irregular layouts

**Skip this skill when:**
- File structure is already known and documented
- Processing simple, well-structured tables with confirmed format
- Using predefined scripts for specific file formats

## Processing Workflow

### Step 1: Analyze File Complexity

Use the `scripts/complexity_analyzer.py` to scan file metadata:

```bash
python scripts/complexity_analyzer.py <file_path> [sheet_name]
```

**What it analyzes** (without loading data):
- Merged cell distribution (shallow vs deep in the table)
- Row count and data continuity
- Empty row interruptions (indicates multi-table layouts)

**Output** (JSON format):
```json
{
  "is_complex": false,
  "recommended_strategy": "pandas",
  "reasons": ["No deep merges detected", "Row count exceeds 1000, forcing Pandas mode"],
  "stats": {
    "total_rows": 5000,
    "deep_merges": 0,
    "empty_interruptions": 0
  }
}
```

### Step 2: Route to Optimal Strategy

Based on complexity analysis:

- **is_complex = false** → Use **Path A (Pandas Standard Mode)**
- **is_complex = true** → Use **Path B (HTML Semantic Mode)**

### Step 3: Execute Processing

Follow the selected path's workflow to extract data.

## Complexity Scoring Rules

### Rule 1: Deep Merged Cells
- **Condition**: Merged cells appearing beyond row 5
- **Interpretation**: Complex table structure (not just header formatting)
- **Decision**: Mark as complex if >2 deep merges detected
- **Example**: Financial reports with merged category labels in data region

### Rule 2: Empty Row Interruptions
- **Condition**: Multiple empty rows within the table
- **Interpretation**: Multiple sub-tables in single sheet
- **Decision**: Mark as complex if >2 empty row interruptions found
- **Example**: Summary table + detail table in one sheet

### Rule 3: Row Count Override
- **Condition**: Total rows >1000
- **Interpretation**: Too large for HTML processing (token explosion)
- **Decision**: Force Pandas mode regardless of complexity
- **Rationale**: HTML conversion would exceed token limits

### Rule 4: Default (Standard Table)
- **Condition**: No deep merges, continuous data, moderate size
- **Interpretation**: Standard data table
- **Decision**: Use Pandas for optimal speed

## Path A: Pandas Standard Mode

**When**: Simple/large tables (most common case)

**Strategy**: Agent analyzes ONLY the first 20 rows to determine header position, then use Pandas to read full data at native speed.

**Workflow**:

1. **Sample First 20 Rows**
   - Read only the first 20 rows using `pd.read_excel(..., nrows=20)`
   - Convert to CSV format for analysis

2. **Determine Header Position**
   - Examine the sampled rows to identify which row contains the actual column headers
   - Common patterns: Row 0 (standard), Row 1-2 (if title rows exist), Row with distinct column names

3. **Read Full Data**
   - Use `pd.read_excel(..., header=<detected_row>)` to load complete data
   - The header parameter ensures proper column naming

**Token Cost**: ~500 tokens (only 20 rows analyzed)
**Processing Speed**: Very fast (Pandas native speed)

> For implementation details, see `references/smart_excel_router.py`

## Path B: HTML Semantic Mode

**When**: Complex/irregular tables (merged cells, multi-level headers)

**Strategy**: Convert to semantic HTML preserving structure (rowspan/colspan), then extract data understanding the visual layout.

**Workflow**:

1. **Convert to Semantic HTML**
   - Load workbook with `openpyxl`
   - Build HTML table preserving merged cell spans
   - Use `rowspan` and `colspan` attributes to maintain structure

2. **Extract Structured Data**
   - Analyze HTML table structure
   - Identify hierarchical headers from merged cells
   - Extract data preserving semantic relationships

**Token Cost**: Higher (full HTML structure analyzed)
**Processing Speed**: Slower (semantic extraction)
**Use Case**: Only for small (<1000 rows), complex files where Pandas would fail

> For implementation details, see `references/smart_excel_router.py`

## Best Practices

### 1. Trust the Scout
Always run complexity analysis before processing. The metadata scan is fast (<1 second) and prevents wasted effort on wrong approach.

### 2. Respect the Row Count Rule
Never attempt HTML mode on files >1000 rows. Token limits will cause failures.

### 3. Pandas First for Unknown Files
When in doubt, try Pandas mode first. It fails fast and clearly when structure is incompatible.

### 4. Cache Analysis Results
If processing multiple sheets from same file, run analysis once and cache results.

### 5. Preserve Original Files
Never modify the original Excel file during analysis or processing.

## Troubleshooting

### File Cannot Be Opened
- **Symptom**: `FileNotFoundError` or permission errors
- **Causes**: Invalid path, file locked by another process, insufficient permissions
- **Solutions**:
  - Verify file path is correct and file exists
  - Close the file if open in Excel or another application
  - Check read permissions on the file

### Corrupted File Errors
- **Symptom**: `BadZipFile` or `InvalidFileException`
- **Causes**: Incomplete download, file corruption, wrong file extension
- **Solutions**:
  - Re-download or obtain fresh copy of the file
  - Verify file is actual Excel format (not CSV with .xlsx extension)
  - Try opening in Excel to confirm file integrity

### Memory Issues with Large Files
- **Symptom**: `MemoryError` or system slowdown
- **Causes**: File too large for available RAM
- **Solutions**:
  - Use `read_only=True` mode in openpyxl
  - Process file in chunks using Pandas `chunksize` parameter
  - Increase system memory or use machine with more RAM

### Encoding Problems
- **Symptom**: Garbled text or `UnicodeDecodeError`
- **Causes**: Non-UTF8 encoding in source data
- **Solutions**:
  - Specify encoding when reading CSV: `pd.read_csv(..., encoding='gbk')`
  - For Excel, data is usually UTF-8; check source data generation

### HTML Mode Token Overflow
- **Symptom**: Truncated output or API errors
- **Causes**: Complex file exceeds token limits despite row count check
- **Solutions**:
  - Force Pandas mode even for complex files
  - Split sheet into smaller ranges and process separately
  - Extract only essential columns before HTML conversion

### Incorrect Header Detection
- **Symptom**: Wrong columns or data shifted
- **Causes**: Unusual header patterns not caught by sampling
- **Solutions**:
  - Manually specify header row if known
  - Increase sample size beyond 20 rows
  - Use HTML mode for better structure understanding

## Dependencies

Required Python packages:
- `openpyxl` - Metadata scanning and Excel file manipulation
- `pandas` - High-speed data reading and manipulation

## Resources

This skill includes:
- `scripts/complexity_analyzer.py` - Standalone executable for complexity analysis
- `references/smart_excel_router.py` - Complete implementation reference with both processing paths