---
name: schema-discoverer
description: Infer schema from sample data files (CSV, JSON, Parquet) and generate type definitions.
allowed-tools: Read Grep Glob Bash
---

# Schema Discoverer

**Audience:** Data engineers and analysts working with unfamiliar data files.

**Goal:** Analyze data files to infer schema and generate type definitions in multiple output formats.

## Workflow

1. **Identify file format** from extension and content
2. **Sample the data** (first 1000 rows for large files)
3. **Infer column types** with confidence scores
4. **Detect patterns** (dates, emails, IDs, categories)
5. **Generate output** in requested format

## Type Inference Rules

### Numeric Detection
- Integer: All values match `^\d+$` with no leading zeros (except "0")
- Float: Values match `^\d+\.\d+$` or scientific notation
- Currency: Values match `^\$?\d{1,3}(,\d{3})*(\.\d{2})?$`

### String Patterns
- Email: `^[\w\.-]+@[\w\.-]+\.\w+$`
- URL: `^https?://`
- UUID: `^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`
- Phone: `^\+?[\d\s\-\(\)]+$` with 10+ digits

### Date Patterns
- ISO: `YYYY-MM-DD` or `YYYY-MM-DDTHH:MM:SS`
- US: `MM/DD/YYYY`
- EU: `DD/MM/YYYY`
- Timestamp: Unix epoch (10 or 13 digits)

### Categorical Detection
- Unique ratio < 5% of total rows
- Repeated values dominate

## Output Formats

### Pydantic Model
```python
from pydantic import BaseModel, Field
from datetime import date
from typing import Literal

class Record(BaseModel):
    id: int = Field(gt=0)
    email: str = Field(pattern=r'^[\w\.-]+@[\w\.-]+\.\w+$')
    status: Literal['active', 'pending', 'inactive']
    created_at: date
    amount: float = Field(ge=0)
```

### Pandera Schema
```python
import pandera as pa

schema = pa.DataFrameSchema({
    "id": pa.Column(int, pa.Check.gt(0), unique=True),
    "email": pa.Column(str, pa.Check.str_matches(r'^[\w\.-]+@')),
    "status": pa.Column(str, pa.Check.isin(['active', 'pending', 'inactive'])),
    "created_at": pa.Column("datetime64[ns]"),
    "amount": pa.Column(float, pa.Check.ge(0)),
})
```

### SQL CREATE TABLE
```sql
CREATE TABLE records (
    id INTEGER PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    status VARCHAR(20) CHECK (status IN ('active', 'pending', 'inactive')),
    created_at DATE NOT NULL,
    amount DECIMAL(10, 2) CHECK (amount >= 0)
);
```

### TypeScript Interface
```typescript
interface Record {
    id: number;
    email: string;
    status: 'active' | 'pending' | 'inactive';
    created_at: string;
    amount: number;
}
```

## Execution Steps

1. Read file sample:
   - CSV: `pd.read_csv(path, nrows=1000)`
   - JSON: `pd.read_json(path, lines=True, nrows=1000)`
   - Parquet: `pd.read_parquet(path).head(1000)`

2. For each column, analyze: null percentage, unique count/ratio, sample values, pattern matches

3. Generate confidence score (0-100) for each type inference

4. Output schema in requested format with comments explaining inference

## Report Format

```markdown
## Schema Discovery Report

**File:** data.csv
**Rows sampled:** 1000 of 50000
**Columns:** 5

| Column | Inferred Type | Confidence | Notes |
|--------|--------------|------------|-------|
| id | integer | 100% | All positive, unique |
| email | string(email) | 98% | 2% invalid format |
| status | categorical | 100% | 3 unique values |
| created_at | date | 95% | ISO format |
| amount | float | 100% | 2 decimals, no negatives |

### Recommendations
- Add NOT NULL constraint to: id, email, created_at
- Consider UNIQUE constraint on: id, email
- Status should be ENUM: active, pending, inactive
```