--- name: schema-discoverer description: Infer schema from sample data files (CSV, JSON, Parquet) and generate type definitions. allowed-tools: Read Grep Glob Bash --- # Schema Discoverer **Audience:** Data engineers and analysts working with unfamiliar data files. **Goal:** Analyze data files to infer schema and generate type definitions in multiple output formats. ## Workflow 1. **Identify file format** from extension and content 2. **Sample the data** (first 1000 rows for large files) 3. **Infer column types** with confidence scores 4. **Detect patterns** (dates, emails, IDs, categories) 5. **Generate output** in requested format ## Type Inference Rules ### Numeric Detection - Integer: All values match `^\d+$` with no leading zeros (except "0") - Float: Values match `^\d+\.\d+$` or scientific notation - Currency: Values match `^\$?\d{1,3}(,\d{3})*(\.\d{2})?$` ### String Patterns - Email: `^[\w\.-]+@[\w\.-]+\.\w+$` - URL: `^https?://` - UUID: `^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$` - Phone: `^\+?[\d\s\-\(\)]+$` with 10+ digits ### Date Patterns - ISO: `YYYY-MM-DD` or `YYYY-MM-DDTHH:MM:SS` - US: `MM/DD/YYYY` - EU: `DD/MM/YYYY` - Timestamp: Unix epoch (10 or 13 digits) ### Categorical Detection - Unique ratio < 5% of total rows - Repeated values dominate ## Output Formats ### Pydantic Model ```python from pydantic import BaseModel, Field from datetime import date from typing import Literal class Record(BaseModel): id: int = Field(gt=0) email: str = Field(pattern=r'^[\w\.-]+@[\w\.-]+\.\w+$') status: Literal['active', 'pending', 'inactive'] created_at: date amount: float = Field(ge=0) ``` ### Pandera Schema ```python import pandera as pa schema = pa.DataFrameSchema({ "id": pa.Column(int, pa.Check.gt(0), unique=True), "email": pa.Column(str, pa.Check.str_matches(r'^[\w\.-]+@')), "status": pa.Column(str, pa.Check.isin(['active', 'pending', 'inactive'])), "created_at": pa.Column("datetime64[ns]"), "amount": pa.Column(float, pa.Check.ge(0)), }) ``` ### SQL CREATE TABLE ```sql CREATE TABLE records ( id INTEGER PRIMARY KEY, email VARCHAR(255) NOT NULL, status VARCHAR(20) CHECK (status IN ('active', 'pending', 'inactive')), created_at DATE NOT NULL, amount DECIMAL(10, 2) CHECK (amount >= 0) ); ``` ### TypeScript Interface ```typescript interface Record { id: number; email: string; status: 'active' | 'pending' | 'inactive'; created_at: string; amount: number; } ``` ## Execution Steps 1. Read file sample: - CSV: `pd.read_csv(path, nrows=1000)` - JSON: `pd.read_json(path, lines=True, nrows=1000)` - Parquet: `pd.read_parquet(path).head(1000)` 2. For each column, analyze: null percentage, unique count/ratio, sample values, pattern matches 3. Generate confidence score (0-100) for each type inference 4. Output schema in requested format with comments explaining inference ## Report Format ```markdown ## Schema Discovery Report **File:** data.csv **Rows sampled:** 1000 of 50000 **Columns:** 5 | Column | Inferred Type | Confidence | Notes | |--------|--------------|------------|-------| | id | integer | 100% | All positive, unique | | email | string(email) | 98% | 2% invalid format | | status | categorical | 100% | 3 unique values | | created_at | date | 95% | ISO format | | amount | float | 100% | 2 decimals, no negatives | ### Recommendations - Add NOT NULL constraint to: id, email, created_at - Consider UNIQUE constraint on: id, email - Status should be ENUM: active, pending, inactive ```