---
name: importing-data
description: Systematic CSV import process - discover structure, design schema, standardize formats, import to database, detect quality issues (component skill for DataPeeker analysis sessions)
---

# Importing Data - Component Skill

## Purpose

Use this skill when:

- Starting a new analysis with CSV data files
- Importing CSV into a relational database systematically
- Ensuring proper schema design and type inference
- Assessing quality before cleaning/analysis begins
- Replacing ad-hoc import workflows

This skill is a **prerequisite** for all DataPeeker analysis workflows and is referenced by all process skills.

**Note:** DataPeeker uses SQLite (`data/analytics.db`), but this process applies to any SQL database. For SQLite-specific commands, see the `using-sqlite` skill.

## Prerequisites

- **CSV file accessible** on the local filesystem
- **Database** with SQL support (DataPeeker uses SQLite at `data/analytics.db`)
- **Workspace created** for the analysis session (via `just start-analysis` or equivalent)
- **Understanding** that this skill creates `raw_*` tables, not final tables (cleaning-data handles finalization)

## Data Import Process

Create a TodoWrite checklist for the 5-phase data import process:

```
Phase 1: CSV Discovery & Profiling - pending
Phase 2: Schema Design & Type Inference - pending
Phase 3: Basic Standardization - pending
Phase 4: Import Execution - pending
Phase 5: Quality Assessment & Reporting - pending
```

Mark each phase as you complete it. Document all findings in numbered markdown files (`01-csv-profile.md` through `05-quality-report.md`) within your analysis workspace directory.

---

## Phase 1: CSV Discovery & Profiling

**Goal:** Understand the CSV file structure, detect encoding and delimiter, and capture sample data for schema design.

### File Discovery

Identify the CSV file(s) to import:

- Ask the user for CSV file path(s)
- Verify the file exists and is readable
- Note the file size for the sampling strategy (>100K rows = sample-based profiling)

**Document:** Record CSV file path, size, and timestamp in `01-csv-profile.md`.

### Encoding Detection

Detect the file encoding to prevent import errors:

```bash
file -I /path/to/file.csv   # -I on macOS; use lowercase -i on Linux
```

Common encodings:

- `charset=us-ascii` or `charset=utf-8` → Standard, no conversion needed
- `charset=iso-8859-1` or `charset=windows-1252` → May need conversion to UTF-8

**Document:** Record the detected encoding. If non-UTF-8, note the conversion requirement.

### Delimiter Detection

Analyze the first few lines to detect the delimiter:

```bash
head -n 5 /path/to/file.csv
```

Check for:

- Comma (`,`) - most common
- Tab (`\t`) - TSV files
- Pipe (`|`) - less common
- Semicolon (`;`) - European CSV files

**Document:** Record the detected delimiter character.

### Header Detection

Determine whether the first row contains headers:

- Read the first row
- Check whether the row contains text labels vs data values
- If ambiguous, ask the user to confirm

**Document:** Record whether headers are present; list header names if found.

### Sample Data Capture

Capture representative samples for schema inference:

```bash
# Header plus first 10 data rows
head -n 11 /path/to/file.csv > /tmp/csv_head_sample.txt

# Last 10 rows
tail -n 10 /path/to/file.csv > /tmp/csv_tail_sample.txt

# Row count (includes the header row)
wc -l /path/to/file.csv
```

**Document:** Include head and tail samples in `01-csv-profile.md` for reference during schema design.
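As a programmatic cross-check on the manual delimiter and header inspection above, Python's standard-library `csv.Sniffer` can detect the dialect from a sample. A minimal sketch, assuming a placeholder file path:

```python
import csv

path = "/path/to/file.csv"  # placeholder; use the path confirmed during File Discovery

with open(path, newline="", encoding="utf-8") as f:
    sample = f.read(8192)  # a few KB is enough for dialect detection
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample)          # raises csv.Error if no dialect is detected
    has_header = sniffer.has_header(sample)  # heuristic; confirm with the user if ambiguous

print(f"delimiter={dialect.delimiter!r} quote={dialect.quotechar!r} header={has_header}")
```

`has_header` is a heuristic, so treat its answer as one more signal alongside the manual check, not a final verdict.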
### Phase 1 Documentation Template

Create `analysis/[session-name]/01-csv-profile.md` with: ./templates/phase-1.md

**CHECKPOINT:** Before proceeding to Phase 2, you MUST have:

- [ ] CSV file path confirmed and file accessible
- [ ] Encoding, delimiter, headers detected
- [ ] Sample data captured (head + tail)
- [ ] `01-csv-profile.md` created with all sections filled
- [ ] Column overview completed with initial observations

---

## Phase 2: Schema Design & Type Inference

**Goal:** Design the database schema by inferring types from CSV samples; propose a table structure with rationale.

### Analyze Column Types

For each column from Phase 1 profiling, infer an appropriate data type.

**Type Inference Rules** (adapt to your database):

1. **Integer types** - Use when:
   - All non-NULL values are whole numbers
   - No decimal points observed
   - Typical for: IDs, counts, years, quantities
   - Examples: INTEGER (SQLite), INT/BIGINT (PostgreSQL/MySQL)

2. **Decimal/Float types** - Use when:
   - Values contain decimal points
   - Monetary amounts, measurements, percentages
   - Examples: REAL (SQLite), NUMERIC/DECIMAL (PostgreSQL), DECIMAL (MySQL)

3. **Text/String types** - Use when:
   - Mixed alphanumeric content
   - Dates/datetimes stored as text (ISO 8601: YYYY-MM-DD or YYYY-MM-DD HH:MM:SS)
   - Categories, names, descriptions
   - Examples: TEXT (SQLite), VARCHAR/TEXT (PostgreSQL/MySQL)
   - **Default choice when unsure**

**Note:** Date/time handling varies by database. SQLite stores dates as TEXT; PostgreSQL/MySQL have native DATE/TIMESTAMP types.

**Document:** For each column, record the inferred type with rationale and sample values.

### Handle NULL Representations

Identify NULL representations in the CSV:

- Empty cells → `NULL` in the database
- Literal strings: "N/A", "null", "NULL", "None", "#N/A" → Convert to `NULL`
- Numeric codes: -999, -1 (if documented as NULL) → Convert to `NULL`

**Document:** List all NULL representations found and the conversion strategy.

### Design Table Schema

Propose a CREATE TABLE statement:

```sql
CREATE TABLE raw_[table_name] (
    [column_1_name] [TYPE],  -- Rationale for type choice
    [column_2_name] [TYPE],  -- Rationale for type choice
    ...
);
```

**Table naming convention:**

- Prefix with `raw_` to indicate unprocessed data
- Use lowercase with underscores: `raw_sales_data`, `raw_customers`
- Derive from the CSV filename or ask the user for a preferred name

**Document:** The full CREATE TABLE statement with inline comments explaining each type choice.

### Present Schema to User

Use the AskUserQuestion tool to present the schema proposal:

- Show the proposed table name
- Show each column with type and rationale
- Ask the user to confirm or request changes

**Document:** The user's approval and any requested modifications.

### Phase 2 Documentation Template

Create `analysis/[session-name]/02-schema-design.md` with: ./templates/phase-2.md

**CHECKPOINT:** Before proceeding to Phase 3, you MUST have:

- [ ] All columns analyzed with type inference rationale
- [ ] NULL representations identified and mapped
- [ ] CREATE TABLE statement drafted
- [ ] User approved schema (via AskUserQuestion)
- [ ] `02-schema-design.md` created with all sections filled

---

## Phase 3: Basic Standardization

**Goal:** Define transformation rules for dates, numbers, whitespace, and text formatting to ensure clean, consistent data in the `raw_*` table.
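Before working through each rule family below, it can help to see what such rules look like as code. A minimal sketch of two illustrative transformations; the source date format and the character-stripping rule are assumptions to be replaced by whatever you document for your columns:

```python
import re
from datetime import datetime

def to_iso_date(value: str, source_format: str = "%m/%d/%Y") -> str:
    """Convert a source date (assumed US-style MM/DD/YYYY here) to ISO 8601."""
    return datetime.strptime(value.strip(), source_format).strftime("%Y-%m-%d")

def normalize_number(value: str) -> str:
    """Strip currency symbols, separators, and percent signs: '$1,234.56' -> '1234.56'.
    Note: European-format values like '1.234,56' need separator swapping first."""
    return re.sub(r"[^\d.\-]", "", value)

assert to_iso_date(" 1/15/2025 ") == "2025-01-15"
assert normalize_number("$1,234.56") == "1234.56"
```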
### Date Format Standardization

**Target format:** ISO 8601

- Dates: `YYYY-MM-DD` (e.g., 2025-01-15)
- Datetimes: `YYYY-MM-DD HH:MM:SS` (e.g., 2025-01-15 14:30:00)

**Common source formats to convert:**

- `MM/DD/YYYY` or `M/D/YYYY` → `YYYY-MM-DD`
- `DD/MM/YYYY` or `D/M/YYYY` → `YYYY-MM-DD` (verify with the user which part is month vs day)
- `YYYY/MM/DD` → `YYYY-MM-DD` (slash to hyphen)
- `Mon DD, YYYY` → `YYYY-MM-DD` (e.g., "Jan 15, 2025" → "2025-01-15")
- Timestamps with T separator: `YYYY-MM-DDTHH:MM:SS` → Keep as-is (valid ISO 8601)

**Document:** List each date column, the source format detected, the target format, and the conversion logic.

### Number Format Normalization

**Remove non-numeric characters:**

- Currency symbols: `$123.45` → `123.45`
- Comma separators: `1,234.56` → `1234.56`
- Percentage signs: `45%` → `45` or `0.45` (document the choice)
- Units: `25kg`, `100m` → `25`, `100` (document the unit in the column name/comments)

**Decimal handling:**

- Preserve decimal points
- Convert European format if detected: `1.234,56` → `1234.56` (verify with the user)

**Document:** List each numeric column, formatting issues found, and normalization rules.

### Whitespace and Text Normalization

**Whitespace cleaning:**

- Trim leading/trailing whitespace from all TEXT columns
- Normalize internal whitespace: multiple spaces → single space
- Normalize line endings: `\r\n` or `\r` → `\n`

**Text case standardization** (optional, apply selectively):

- IDs/codes: often uppercase for consistency
- Names: title case or preserve original
- Free text: preserve original case
- **Document the choice per column type**

**Document:** Which columns get whitespace cleaning and which get case normalization.

### NULL Standardization

Apply the NULL representation mapping from Phase 2:

- Convert all identified NULL representations to actual SQLite `NULL`
- Empty strings → `NULL` for numeric/date columns
- Empty strings → preserve as empty string `''` for TEXT columns (document the choice)

**Document:** NULL conversions applied and the count of conversions per column.

### Phase 3 Documentation Template

Create `analysis/[session-name]/03-standardization-rules.md` with: ./templates/phase-3.md

**CHECKPOINT:** Before proceeding to Phase 4, you MUST have:

- [ ] Date standardization rules defined for all date columns
- [ ] Number normalization rules defined for all numeric columns
- [ ] Whitespace/text rules defined
- [ ] NULL conversion mapping finalized
- [ ] `03-standardization-rules.md` created with verification queries

---

## Phase 4: Import Execution

**Goal:** Execute the import with standardization rules applied; verify row count and data integrity.

### Generate CREATE TABLE Statement

From the Phase 2 schema design, finalize the CREATE TABLE statement:

```sql
CREATE TABLE IF NOT EXISTS raw_[table_name] (
    [column_1] [TYPE],
    [column_2] [TYPE],
    ...
);
```

**Execute:**

```bash
sqlite3 data/analytics.db < create_table.sql
```

**Verify table created:**

```sql
-- Check table exists
.tables

-- Inspect schema
PRAGMA table_info(raw_[table_name]);
```

**Document:** Paste the table creation confirmation and schema output.

### Import CSV with Standardization

**Import method options:**

**Option 1: SQLite `.import` command** (for simple cases):

```bash
# --skip 1 omits the header row (available in SQLite 3.32+)
sqlite3 data/analytics.db <<'EOF'
.mode csv
.import --skip 1 /path/to/file.csv raw_[table_name]
EOF
```
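The bare `.import` loads values verbatim, so it suits files that need no in-flight changes. When the Phase 3 rules must be applied during import, one alternative is a small loader script. A minimal sketch using Python's standard `csv` and `sqlite3` modules; the table name `raw_example`, the three-column layout, and the `standardize` body are placeholder assumptions to be replaced by your documented rules:

```python
import csv
import sqlite3

# NULL tokens identified in Phase 2 (extend from your documented mapping)
NULL_TOKENS = {"", "N/A", "null", "NULL", "None", "#N/A"}

def standardize(value: str):
    """Apply Phase 3 rules: trim/collapse whitespace, map NULL tokens to None."""
    cleaned = " ".join(value.split())
    return None if cleaned in NULL_TOKENS else cleaned

conn = sqlite3.connect("data/analytics.db")
with open("/path/to/file.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    rows = ([standardize(v) for v in row] for row in reader)
    # Placeholder table and columns: substitute the schema approved in Phase 2
    conn.executemany(
        "INSERT INTO raw_example (col_1, col_2, col_3) VALUES (?, ?, ?)",
        rows,
    )
conn.commit()
conn.close()
```

The resulting row count can then be verified against the `wc -l` figure captured in Phase 1 (minus the header row).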