---
name: data-structure-checker
description: This skill should be used when reading any tabular data file (Excel, CSV, Parquet, ODS). It automatically detects and fixes common data issues including multi-level headers, encoding problems, empty rows/columns, and data type mismatches. Returns a clean DataFrame ready for analysis with zero user intervention.
---

# Data Structure Checker

## Overview

A comprehensive skill for automatically detecting and fixing common data file issues. Input any messy file and receive a clean DataFrame ready for analysis - zero intervention required.

**Auto-fixes**:
- Multi-level/hierarchical headers (flattens with separator)
- Encoding issues (utf-8, cp949, euc-kr, etc.)
- Empty rows and columns
- Data type inference and conversion
- Duplicate column names
- Unicode path issues (Korean/CJK filenames)

## When to Use

This skill should be triggered when:
- Reading any Excel files (.xlsx, .xls)
- Reading any CSV/TSV files
- Reading ODS files (OpenDocument)
- Reading Parquet files
- Encountering "Unnamed:" columns in data
- Dealing with multi-level or hierarchical headers
- Processing Korean/Asian language data files

## Usage

### Reading Data Files

To read any tabular data file with automatic issue detection and fixing, execute the `scripts/checker.py` script:

```python
import sys
sys.path.insert(0, 'skills/data-structure-checker/scripts')
from checker import smart_read

# Read file - handles all issues automatically
df = smart_read('data.xlsx')

# Read with report of fixes applied
df, report = smart_read('data.xlsx', return_report=True)
```

### Diagnosing Files

To analyze a file's structure without reading full data:

```python
from checker import diagnose

result = diagnose('data.xlsx')
# Returns: {'issues': ['multi_level_headers'], 'recommendations': [...]}
```

### Command Line

```bash
# Read and display summary
uv run python skills/data-structure-checker/scripts/checker.py data.xlsx

# Diagnose without reading
uv run python skills/data-structure-checker/scripts/checker.py data.xlsx --diagnose
```

## API Reference

### `smart_read(file_path, separator='_', return_report=False, sheet_name=0)`

Main entry point for reading files with automatic issue resolution.

**Parameters**:
- `file_path`: Path to the file (handles Korean/Unicode filenames)
- `separator`: Character(s) for joining multi-level headers (default: `'_'`)
- `return_report`: If True, return `(DataFrame, report)` tuple
- `sheet_name`: Sheet name or index for Excel files

**Returns**:
- `DataFrame` - Clean data ready for analysis
- Or `(DataFrame, report)` if `return_report=True`

### `diagnose(file_path, sheet_name=0)`

Analyze file structure without reading full data.

**Returns**: Dictionary with detected issues and recommendations.

## Report Structure

When `return_report=True`, the report contains:

```python
{
    'file_path': 'data/file.xlsx',
    'timestamp': '2024-12-17T10:30:00',
    'issues_detected': ['multi_level_headers', 'empty_rows_or_columns'],
    'fixes_applied': [
        'Flattened 3-level headers with "_" separator',
        'Removed 2 empty rows and 0 empty columns'
    ],
    'original_shape': (52, 111),
    'final_shape': (49, 111),
    'header_rows': [0, 1, 2],
    'type_conversions': {'score': 'object -> float64'}
}
```

## Issues Handled

### Multi-Level Headers

Detects and flattens hierarchical headers:

**Before**: `응시자 정보 | Unnamed: 1 | Unnamed: 2`

**After**: `응시자 정보_응시코드 | 응시자 정보_성명 | 응시자 정보_부서`

### Encoding Issues

Auto-detects encoding for CSV files:
- UTF-8 (with/without BOM)
- CP949 (Korean Windows)
- EUC-KR (Korean legacy)
- GBK/GB2312 (Chinese)
- Shift_JIS/EUC-JP (Japanese)

### Empty Rows/Columns

Removes rows and columns where all values are NaN.

### Data Type Inference

Converts string columns to appropriate types:
- Numeric strings → float64/int64
- Date strings → datetime64

### Duplicate Columns

Renames duplicates with suffixes: `['score', 'score', 'score']` → `['score', 'score_1', 'score_2']`

### Unicode Path Issues

Handles Korean/CJK filenames with different Unicode normalizations (NFC/NFD).

## Supported Formats

| Extension | Format | Notes |
|-----------|--------|-------|
| `.xlsx` | Excel | Modern Excel format |
| `.xls` | Excel | Legacy Excel format |
| `.csv` | CSV | Auto-detects encoding |
| `.tsv` | TSV | Tab-separated values |
| `.ods` | ODS | OpenDocument Spreadsheet |
| `.parquet` | Parquet | Columnar format |

## Dependencies

Ensure these packages are installed:

```bash
uv pip install openpyxl xlrd odfpy pyarrow
```

## Integration with Deep Insight

To integrate with the coder agent, replace standard pandas read:

```python
# Instead of:
import pandas as pd
df = pd.read_excel('data.xlsx')

# Use:
sys.path.insert(0, 'skills/data-structure-checker/scripts')
from checker import smart_read
df = smart_read('data.xlsx')
```

## Troubleshooting

### File not found with Korean filename
The skill handles Unicode normalization automatically. Verify the file path is correct.

### Unexpected column names
Check the report's `header_rows` field. To specify header rows explicitly:

```python
sys.path.insert(0, 'skills/data-structure-checker/scripts')
from reader import read_multi_level
df = read_multi_level('data.xlsx', header_rows=[0, 1])
```

### Preserve original types
To skip type inference, create `DataStructureChecker` with `infer_types=False`.