--- name: cleaning-data description: Systematic data quality remediation - detect duplicates/outliers/inconsistencies, design cleaning strategy, execute transformations, verify results (component skill for DataPeeker analysis sessions) --- # Cleaning Data - Component Skill ## Purpose Use this skill when: - Have completed the `importing-data` skill with quality report generated - Need to address data quality issues before analysis (duplicates, outliers, NULL handling, free text categorization) - Want systematic approach to cleaning decisions with documented rationale - Need to create clean tables ready for process skills (exploratory-analysis, guided-investigation, etc.) - Following DataPeeker principle: **cleaning is ALWAYS mandatory** even if minimal issues found This skill is a **prerequisite** for all DataPeeker analysis workflows and consumes the quality report from importing-data. ## Prerequisites Before using this skill, you MUST: 1. Have completed the `importing-data` skill successfully 2. Have access to `05-quality-report.md` generated by importing-data 3. Have `raw_*` table(s) in `data/analytics.db` 4. Be familiar with basic SQLite for validation queries 5. Understand data quality concepts: duplicates, outliers, NULL handling, categorical standardization ## Data Cleaning Process Create a TodoWrite checklist for the 5-phase data cleaning process: ``` Phase 1: Quality Report Review - pending Phase 2: Issue Detection (Agent-Delegated) - pending Phase 3: Cleaning Strategy Design - pending Phase 4: Cleaning Execution - pending Phase 5: Verification & Documentation - pending ``` Mark each phase as you complete it. Document all findings in numbered markdown files (`01-cleaning-scope.md` through `05-verification-report.md`) within your analysis workspace directory. --- ## Phase 1: Quality Report Review **Goal:** Read quality report from importing-data, understand detected issues, prioritize for cleaning based on impact and severity. 
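The NULL-percentage figures from the quality report can be re-derived directly from the raw table as a sanity check. A minimal sketch with Python's built-in `sqlite3` module — the `raw_orders` table and its columns are hypothetical stand-ins for whatever `raw_*` table importing-data produced:

```python
import sqlite3

# Tiny stand-in for data/analytics.db; table and column names
# (raw_orders, customer_id, amount) are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(1, 10.0), (2, None), (None, 5.0), (4, 7.5)],
)

# Recompute NULL percentage per column, mirroring the quality report
for col in ("customer_id", "amount"):
    total, nulls = conn.execute(
        f"SELECT COUNT(*), COUNT(*) - COUNT({col}) FROM raw_orders"
    ).fetchone()
    pct = 100.0 * nulls / total
    print(f"{col}: {pct:.1f}% NULL")  # flag columns above the 10% threshold
```

`COUNT(col)` skips NULLs while `COUNT(*)` does not, so the difference is the NULL count — the same arithmetic the report uses.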
### Read Quality Report from importing-data **Locate the quality report:** ``` analysis/[session-name]/05-quality-report.md ``` This report (generated by importing-data Phase 5) contains: - Table schema and row counts - NULL percentages per column - Duplicate counts and examples - Outlier flags (3 MAD threshold) per numeric column - Free text candidates (columns with >50% uniqueness) - Summary of quality concerns **Extract key information:** - Which columns have >10% NULLs? - How many duplicate rows exist (exact duplicates)? - Which numeric columns have outliers? - Which text columns need categorization? **Document:** Summarize findings from quality report. ### Prioritize Issues Using Framework **Issue Prioritization Matrix:** Evaluate each issue on three dimensions: **1. Impact (% of rows/columns affected)** - **High:** >10% of rows affected - **Medium:** 1-10% of rows affected - **Low:** <1% of rows affected **2. Severity (effect on analysis validity)** - **Critical:** Makes analysis invalid or misleading (e.g., key column >50% NULL) - **Significant:** Reduces data quality for important columns (e.g., duplicates, inconsistent categories) - **Minor:** Affects edge cases only (e.g., outliers that are legitimate) **3. 
Effort (complexity to resolve)** - **Low:** Simple removal, exclusion, or standardization (1-2 SQL queries) - **Medium:** Requires sub-agent for categorization or pattern analysis (3-5 queries) - **High:** Complex deduplication, manual review, or domain expertise needed (>5 queries) **Combine dimensions to assign priority:** | Impact | Severity | Effort | Priority | Action Timing | |--------|-----------|--------|--------------|---------------| | High | Critical | Any | **CRITICAL** | Must address | | High | Significant| Low/Med| **HIGH** | Must address | | Medium | Critical | Low/Med| **HIGH** | Must address | | Any | Any | High | **MEDIUM** | Address if time permits | | Low | Minor | Any | **LOW** | Document, may skip | **Document:** Create prioritized issue table in `01-cleaning-scope.md`. ### Define Cleaning Scope and Objectives Create `analysis/[session-name]/01-cleaning-scope.md` with: ./templates/phase-1.md **CHECKPOINT:** Before proceeding to Phase 2, you MUST have: - [ ] Read and understood `05-quality-report.md` from importing-data - [ ] Extracted all detected issues (NULLs, duplicates, outliers, free text, FK orphans) - [ ] Applied prioritization matrix (impact × severity × effort) - [ ] Reviewed FK relationships and orphaned records (if multiple tables) - [ ] Defined cleaning objectives with success criteria - [ ] `01-cleaning-scope.md` created with all sections filled --- ## Phase 2: Issue Detection (Agent-Delegated) **Goal:** Deep-dive investigation of prioritized data quality issues using sub-agents to prevent context pollution. **CRITICAL:** This phase MUST use sub-agent delegation. DO NOT analyze data in main agent context. 
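The Phase 1 prioritization matrix can be encoded as a small lookup so triage stays consistent across issues. A sketch, reading the matrix rows top-down; the MEDIUM default for combinations the matrix leaves implicit is an assumption:

```python
# Hypothetical encoding of the Phase 1 prioritization matrix
# (Impact x Severity x Effort -> Priority, rows applied top-down).
def priority(impact: str, severity: str, effort: str) -> str:
    if impact == "High" and severity == "Critical":
        return "CRITICAL"                      # must address
    if effort == "High":
        return "MEDIUM"                        # address if time permits
    if impact == "High" and severity == "Significant":
        return "HIGH"
    if impact == "Medium" and severity == "Critical":
        return "HIGH"
    if impact == "Low" and severity == "Minor":
        return "LOW"                           # document, may skip
    return "MEDIUM"  # assumption: unlisted combinations default to MEDIUM

print(priority("High", "Critical", "Low"))  # CRITICAL
```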
### Detection 1: Duplicate Records

**Use dedicated duplicate detection agents.**

**For exact duplicates:** Invoke the `detect-exact-duplicates` agent:

```
Task tool with agent: detect-exact-duplicates
Parameters:
- table_name: raw_[actual_table_name]
- key_columns: [columns that define uniqueness, from Phase 1 scope]
```

**For near-duplicates (fuzzy matching):** Invoke the `detect-near-duplicates` agent:

```
Task tool with agent: detect-near-duplicates
Parameters:
- table_name: raw_[actual_table_name]
- text_column: [specific text column flagged in Phase 1]
```

Repeat for each text column requiring fuzzy matching.

**Document findings in `02-detected-issues.md` using the template below.**

---

### Detection 2: Outliers (MAD-Based)

**Use the dedicated outlier detection agent.**

For each numeric column flagged in Phase 1, invoke the `detect-outliers` agent:

```
Task tool with agent: detect-outliers
Parameters:
- table_name: raw_[actual_table_name]
- numeric_col: [specific numeric column from Phase 1 scope]
```

Repeat for each numeric column requiring outlier analysis.

**Document findings in `02-detected-issues.md` using the template below.**

---

### Detection 3: Referential Integrity Validation

**If multiple tables exist with FK relationships identified in Phase 1:**

**Use the dedicated FK validation agent.**

For each FK relationship flagged in Phase 1, invoke the `detect-foreign-keys` agent (focused validation mode):

```
Task tool with agent: detect-foreign-keys
Parameters:
- database_path: data/analytics.db
- child_table: [specific child table]
- child_column: [FK column]
- parent_table: [specific parent table]
- parent_column: [PK column]
```

The agent will:
- Confirm value overlap percentage (validate Phase 1 findings)
- Identify specific orphaned record IDs
- Assess orphan patterns (recent vs old, specific categories, etc.)
- Quantify impact on analysis (% of records affected in joins)

**If single table:** Skip this detection and document "N/A - Single table" in the detected issues report.
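At bottom, the orphan check the agent performs is an anti-join: child rows whose FK value matches no parent PK. A minimal `sqlite3` sketch, with hypothetical `raw_orders`/`raw_customers` tables standing in for the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for data/analytics.db
conn.execute("CREATE TABLE raw_customers (customer_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO raw_customers VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [(10, 1), (11, 2), (12, 99)])  # customer 99 has no parent

# Anti-join: LEFT JOIN then keep child rows with no parent match
orphans = conn.execute("""
    SELECT COUNT(*)
    FROM raw_orders o
    LEFT JOIN raw_customers c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL AND o.customer_id IS NOT NULL
""").fetchone()[0]
print(orphans)  # 1
```

The `IS NOT NULL` guard keeps genuinely-NULL FKs from being miscounted as orphans.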
**Document findings in `02-detected-issues.md` using template below.** --- ### Review Sub-Agent Findings After all sub-agents return findings: **For duplicates:** - Are exact duplicates truly identical (all columns match)? - Are near-duplicates legitimate variations or data entry errors? - Which duplicate groups should be merged vs kept separate? **For outliers:** - Are outliers data errors or legitimate extreme values? - Do outliers follow any pattern (seasonal, geographic, product-specific)? - Which outliers should be excluded vs capped vs flagged? **For FK orphans (if multiple tables):** - Are orphaned records recent (may resolve soon) or old (permanent issue)? - Do orphans follow a pattern (specific categories, time periods)? - Can orphans be matched to parent records through fuzzy matching? - Should orphans be excluded, flagged, or have placeholder parents created? **Document:** Observations and preliminary decisions for Strategy Phase. ### Create Detected Issues Report Create `analysis/[session-name]/02-detected-issues.md` with: ./templates/phase-2.md **CHECKPOINT:** Before proceeding to Phase 3, you MUST have: - [ ] Delegated duplicate detection to sub-agent (exact and near-duplicates) - [ ] Delegated outlier detection to sub-agent (MAD-based, all flagged columns) - [ ] Delegated FK validation to sub-agent (if multiple tables with relationships) - [ ] Reviewed all sub-agent findings and documented observations - [ ] Created `02-detected-issues.md` with all sections filled (including FK orphans if applicable) - [ ] Identified specific records/issues for Phase 3 strategy decisions - [ ] Listed implications for Phase 3 (what user needs to decide) --- ## Phase 3: Cleaning Strategy Design **Goal:** Design cleaning approach for each detected issue type, present options with trade-offs, get user confirmation before execution. 
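When reviewing outlier findings carried into this phase, it helps to see what the 3 MAD rule actually computes. A sketch using Python's `statistics` module with illustrative values; the 1.4826 factor rescales the MAD so it is comparable to a standard deviation under normality:

```python
import statistics

values = [10, 12, 11, 13, 12, 11, 14, 250]  # one obvious extreme

median = statistics.median(values)
# Median absolute deviation, scaled to be sigma-comparable
mad = statistics.median([abs(v - median) for v in values]) * 1.4826

# The 3 MAD rule: flag values more than 3 scaled-MADs from the median
outliers = [v for v in values if abs(v - median) > 3 * mad]
print(median, round(mad, 4), outliers)  # 12.0 1.4826 [250]
```

Note that unlike mean/standard-deviation thresholds, the median and MAD barely move when 250 is present — which is why the rule survives the very outliers it is hunting.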
### Review Detected Issues from Phase 2 From `02-detected-issues.md`, identify all issue types requiring decisions: - Exact duplicates: [count] - Near-duplicates: [count] - Outliers per column: [counts] - Free text categorization: [columns] ### Decision Framework: Duplicates **For exact duplicates, choose ONE approach:** **Option A: Keep First Occurrence** - **Method:** ORDER BY rowid, keep lowest rowid per duplicate group - **Pros:** Simple, deterministic, fast - **Cons:** First may not be most complete/accurate - **Use when:** No quality difference between duplicates **Option B: Keep Most Complete** - **Method:** Rank by completeness (fewest NULLs), keep best per group - **Pros:** Preserves maximum information - **Cons:** More complex, requires completeness scoring - **Use when:** Duplicates have varying data quality **Option C: Merge Records** - **Method:** Combine non-NULL values from all duplicates - **Pros:** No data loss - **Cons:** Complex, may create inconsistencies - **Use when:** Duplicates have complementary information **For near-duplicates, choose ONE approach:** **Option A: Auto-Merge High Confidence (>95%)** - **Method:** Apply fuzzy matching agent's high confidence mappings automatically - **Pros:** Efficient, addresses most obvious issues - **Cons:** Small risk of incorrect merges - **Use when:** Trust fuzzy matching agent's assessment **Option B: Manual Review All** - **Method:** Review every near-duplicate group before merging - **Pros:** Zero incorrect merges - **Cons:** Time-consuming - **Use when:** Data quality is critical **Option C: Skip Near-Duplicates** - **Method:** Only handle exact duplicates, leave fuzzy matches as-is - **Pros:** Conservative, no risk - **Cons:** Misses data quality improvements - **Use when:** Near-duplicates are legitimate variations **Document chosen approach in `03-cleaning-strategy.md`** --- ### Decision Framework: Outliers **For each numeric column with outliers, choose ONE approach:** **Option A: Exclude 
Outliers** - **Method:** Filter out rows where value > 3 MAD from median - **Pros:** Clean dataset, no extreme values skewing analysis - **Cons:** Data loss, may exclude legitimate extremes - **Use when:** Outliers are clearly data errors **Option B: Cap at Threshold** - **Method:** Set outliers to 3 MAD threshold (winsorization) - **Pros:** Preserves row count, reduces extreme influence - **Cons:** Distorts actual values - **Use when:** Want to preserve rows but limit extreme influence **Option C: Flag and Keep** - **Method:** Add outlier_flag column, keep all data - **Pros:** No data loss, analysts can filter if needed - **Cons:** Outliers may still skew analysis if not filtered - **Use when:** Outliers might be legitimate, need analyst judgment **Option D: Keep As-Is** - **Method:** No transformation - **Pros:** Preserves true data - **Cons:** Extremes may dominate analysis - **Use when:** Outliers are legitimate (VIPs, seasonal spikes) **Document chosen approach per column in `03-cleaning-strategy.md`** --- ### Decision Framework: Free Text Categorization **If free text columns flagged in Phase 2:** **Step 1: Invoke categorization agent** ``` Task tool with agent: categorize-free-text Parameters: - column_name: [specific text column] - unique_values: [list from Phase 2 detection] - business_context: [optional context about what values represent] ``` **Step 2: Review agent's proposed categories** Agent will return: - Proposed category schema (3-10 categories) - Value mappings with confidence levels - Ambiguous/uncategorizable values flagged **Step 3: Choose categorization approach:** **Option A: Accept Agent Proposal** - **Method:** Use agent's categories and mappings as-is - **Pros:** Fast, leverages semantic analysis - **Cons:** May miss business context - **Use when:** Agent's categories make sense for analysis **Option B: Modify Categories** - **Method:** Adjust agent's proposal (rename, merge, split categories) - **Pros:** Incorporates business knowledge - 
**Cons:** Requires manual refinement - **Use when:** Agent's categories are close but need tweaking **Option C: Manual Categorization** - **Method:** Define categories and mappings from scratch - **Pros:** Full control, perfect fit for business needs - **Cons:** Time-consuming - **Use when:** Agent's proposal doesn't fit business model **Option D: Keep As-Is** - **Method:** No categorization - **Pros:** Preserves original data - **Cons:** High-cardinality text column harder to analyze - **Use when:** Free text values are inherently unique (IDs, descriptions) **Document chosen approach in `03-cleaning-strategy.md`** --- ### Decision Framework: Referential Integrity (If Multiple Tables) **For FK orphaned records identified in Phase 2:** For each FK relationship with orphaned records, choose ONE approach: **Option A: Exclude Orphaned Records** - **Method:** Filter out child records where FK doesn't match any parent PK - **Pros:** Clean referential integrity, INNER JOINs work correctly - **Cons:** Data loss - **Use when:** Orphaned records are data errors with no business value **Option B: Preserve with NULL** - **Method:** Set orphaned FK values to NULL (retain child records) - **Pros:** Preserves child row count, makes orphans explicit - **Cons:** Loses relationship information, NULL handling required in queries - **Use when:** Child records have value even without parent context **Option C: Flag and Keep** - **Method:** Add `[fk_column]_orphan_flag` column, keep original FK value - **Pros:** No data loss, analysts can filter as needed - **Cons:** Referential integrity violated until analyst filters - **Use when:** Need investigation before deciding, orphans may resolve **Option D: Create Placeholder Parent** - **Method:** Insert synthetic parent record (e.g., id=-1, name="Unknown"), map orphans to it - **Pros:** Preserves referential integrity AND child rows, INNER JOINs work - **Cons:** Creates synthetic data, may skew parent-level aggregations - **Use when:** 
JOINs required but can't lose child records (e.g., orders with unknown customer) **Document chosen approach per FK relationship in `03-cleaning-strategy.md`** --- ### Decision Framework: Business Rules (Optional) **If business rules were defined in Phase 1 scope:** For each rule, choose approach for violations: **Option A: Exclude Violating Records** - **Method:** Filter out rows that fail validation - **Pros:** Clean dataset, only valid data - **Cons:** Data loss - **Use when:** Invalid data cannot be corrected **Option B: Cap/Coerce to Valid Range** - **Method:** Adjust values to meet constraints - **Pros:** Preserves rows - **Cons:** Changes actual data - **Use when:** Violations are minor (e.g., age 150 → cap at 120) **Option C: Flag and Keep** - **Method:** Add validation_flag column - **Pros:** No data loss, transparent - **Cons:** Invalid data present - **Use when:** Need to investigate violations before deciding **Document chosen approach per rule in `03-cleaning-strategy.md`** --- ### Create Cleaning Strategy Document Create `analysis/[session-name]/03-cleaning-strategy.md` with: ./templates/phase-3.md **CHECKPOINT:** Before proceeding to Phase 4, you MUST have: - [ ] Reviewed all detected issues from Phase 2 - [ ] Chosen approach for duplicates (exact and near) - [ ] Chosen approach for outliers (per numeric column) - [ ] Reviewed free text categorization agent proposal (if applicable) - [ ] Chosen approach for free text categorization - [ ] Chosen approach for FK orphans (if multiple tables with relationships) - [ ] Defined business rule handling (if applicable) - [ ] User confirmed all strategies via checkpoint review - [ ] `03-cleaning-strategy.md` created with all decisions documented --- ## Phase 4: Cleaning Execution **Goal:** Execute approved cleaning strategies, create clean_* tables, track all exclusions and transformations. **CRITICAL:** All transformations use CREATE TABLE AS SELECT pattern. Keep raw_* tables intact. 
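The CREATE TABLE AS SELECT pattern can be exercised end-to-end against a throwaway in-memory database before touching `data/analytics.db`. Table and column names below are illustrative, and `ROW_NUMBER()` assumes SQLite 3.25+ (bundled with recent Python builds):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for data/analytics.db
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [(1, 10.0), (1, 10.0), (2, 5.0)])  # order_id 1 duplicated

# Keep-first dedup: raw_orders stays intact, clean_orders gets one row per key
conn.execute("""
CREATE TABLE clean_orders AS
WITH ranked AS (
    SELECT order_id, amount,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY rowid) AS rn
    FROM raw_orders
)
SELECT order_id, amount FROM ranked WHERE rn = 1
""")

raw_n = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
clean_n = conn.execute("SELECT COUNT(*) FROM clean_orders").fetchone()[0]
print(raw_n, clean_n)  # 3 2
```

Because the raw table is never mutated, any cleaning decision can be reversed by dropping the `clean_*` table and re-running with a different strategy.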
### Transformation 1: Remove Duplicates

**Based on Strategy from Phase 3:**

**For Exact Duplicates (Keep First approach):**

```sql
-- Create clean table without exact duplicates
CREATE TABLE clean_[table_name] AS
WITH ranked_records AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY [key_columns]
               ORDER BY rowid
           ) as rn
    FROM raw_[table_name]
)
SELECT [all_columns]  -- Exclude rn column
FROM ranked_records
WHERE rn = 1;
```

**For Exact Duplicates (Keep Most Complete approach):**

```sql
-- Create clean table keeping most complete record per duplicate group
CREATE TABLE clean_[table_name] AS
WITH completeness_scored AS (
    SELECT *,
           ([count non-NULL columns formula]) as completeness_score,
           ROW_NUMBER() OVER (
               PARTITION BY [key_columns]
               -- Repeat the formula here: SQLite cannot reference the
               -- completeness_score alias inside the window definition
               ORDER BY ([count non-NULL columns formula]) DESC, rowid
           ) as rn
    FROM raw_[table_name]
)
SELECT [all_columns]  -- Exclude rn and completeness_score
FROM completeness_scored
WHERE rn = 1;
```

**For Near-Duplicates (Auto-Merge High Confidence approach):**

```sql
-- Create mapping table from the near-duplicate agent's output
-- (SQLite names VALUES columns column1, column2, ...)
CREATE TABLE [column]_near_dup_mapping AS
SELECT column1 AS original_value, column2 AS canonical_value
FROM (VALUES
    ('[value1]', '[canonical]'),
    ('[value2]', '[canonical]')
    -- ... one row per high-confidence mapping
);

-- Apply mapping
UPDATE clean_[table_name]
SET [text_column] = (
    SELECT canonical_value
    FROM [column]_near_dup_mapping
    WHERE original_value = [text_column]
)
WHERE [text_column] IN (SELECT original_value FROM [column]_near_dup_mapping);
```

**Verification:**

```sql
-- Verify duplicate removal
SELECT COUNT(*) as clean_count FROM clean_[table_name];

-- SQLite's COUNT(DISTINCT ...) accepts only a single expression;
-- for multi-column keys use the GROUP BY ... HAVING check from Phase 5
SELECT COUNT(*) - COUNT(DISTINCT [key_column]) as remaining_dups
FROM clean_[table_name];
-- Expected: 0 remaining duplicates

-- Exclusion count
SELECT [raw_count] - [clean_count] as excluded_duplicates;
```

**Document in `04-cleaning-execution.md`:**
- SQL executed
- Before count: [N] rows
- After count: [N] rows
- Duplicates removed: [N] rows ([X]%)

---

### Transformation 2: Handle Outliers

**Based on Strategy from Phase 3:**

**For Outliers (Exclude approach):**

```sql
-- Create clean table excluding outliers
-- (the stats CTE must follow CREATE TABLE ... AS, not precede it)
CREATE TABLE clean_[table_name] AS
WITH stats AS (
    SELECT [median_calculation] as median,
           [mad_calculation] * 1.4826 as mad
    FROM raw_[table_name]
    WHERE [numeric_col] IS NOT NULL
)
SELECT r.*
FROM raw_[table_name] r
CROSS JOIN stats s
WHERE ABS(r.[numeric_col] - s.median) <= 3 * s.mad
   OR r.[numeric_col] IS NULL;  -- Keep NULLs
```

**For Outliers (Cap at Threshold approach - Winsorization):**

```sql
CREATE TABLE clean_[table_name] AS
WITH stats AS (
    SELECT [median_calculation] as median,
           [mad_calculation] * 1.4826 as mad
    FROM raw_[table_name]
    WHERE [numeric_col] IS NOT NULL
)
SELECT [other_columns],
       CASE
           WHEN [numeric_col] > median + 3 * mad THEN median + 3 * mad
           WHEN [numeric_col] < median - 3 * mad THEN median - 3 * mad
           ELSE [numeric_col]
       END as [numeric_col]
FROM raw_[table_name]
CROSS JOIN stats;
```

**For Outliers (Flag and Keep approach):**

```sql
CREATE TABLE clean_[table_name] AS
WITH stats AS (
    SELECT [median_calculation] as median,
           [mad_calculation] * 1.4826 as mad
    FROM raw_[table_name]
    WHERE [numeric_col] IS NOT NULL
)
SELECT r.*, CASE
WHEN ABS(r.[numeric_col] - s.median) > 3 * s.mad THEN 1
            ELSE 0
       END as [numeric_col]_outlier_flag
FROM raw_[table_name] r
CROSS JOIN stats s;
```

**Verification:**

```sql
-- Verify outlier handling
SELECT COUNT(*) as clean_count FROM clean_[table_name];

-- For Exclude approach: check no outliers remain
WITH stats AS (...)
SELECT COUNT(*) as remaining_outliers
FROM clean_[table_name], stats
WHERE ABS([numeric_col] - median) > 3 * mad;
-- Expected: 0

-- For Cap approach: check values at thresholds
SELECT MIN([numeric_col]), MAX([numeric_col]) FROM clean_[table_name];

-- For Flag approach: check flag distribution
SELECT [numeric_col]_outlier_flag, COUNT(*)
FROM clean_[table_name]
GROUP BY [numeric_col]_outlier_flag;
```

**Document in `04-cleaning-execution.md`:**
- Approach used per column
- SQL executed
- Before/after row counts (if excluding)
- Outliers affected: [N] rows ([X]%)

---

### Transformation 3: Categorize Free Text

**Based on Strategy from Phase 3:**

**If using agent's proposed categories:**

```sql
-- Create category mapping from agent output
-- (SQLite names VALUES columns column1, column2, ...)
CREATE TABLE [column]_category_mapping AS
SELECT column1 AS original_value, column2 AS category
FROM (VALUES
    ('[value]', '[Category 1]'),
    ('[value]', '[Category 1]'),
    ('[value]', '[Category 2]')
    -- ... one row per mapped value
);

-- Apply categorization
CREATE TABLE clean_[table_name] AS
SELECT r.[other_columns],
       COALESCE(m.category, 'Other') as [column]_category
FROM raw_[table_name] r
LEFT JOIN [column]_category_mapping m ON r.[text_column] = m.original_value;
```

**If handling uncategorizable values:**

```sql
-- Option A: Exclude uncategorizable
CREATE TABLE clean_[table_name] AS
SELECT r.*, m.category as [column]_category
FROM raw_[table_name] r
INNER JOIN [column]_category_mapping m ON r.[text_column] = m.original_value;
-- INNER JOIN excludes unmapped values

-- Option B: Map to "Other" category (shown in previous query with COALESCE)
```

**Verification:**

```sql
-- Verify categorization coverage
SELECT [column]_category,
       COUNT(*) as count,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as pct
FROM clean_[table_name]
GROUP BY [column]_category
ORDER BY count DESC;

-- Check for unmapped values (if kept)
SELECT COUNT(*) as unmapped
FROM clean_[table_name]
WHERE [column]_category IS NULL OR [column]_category = 'Other';
```

**Document in `04-cleaning-execution.md`:**
- Category schema used
- SQL executed
- Distribution by category
- Unmapped/excluded values: [N] rows ([X]%)

---

### Transformation 4: Business Rule Validation (if applicable)

**Based on Strategy from Phase 3:**

```sql
-- Exclude rows violating business rules
CREATE TABLE clean_[table_name] AS
SELECT *
FROM raw_[table_name]
WHERE [rule_1_validation]
  AND [rule_2_validation] ...;

-- Example rules:
-- WHERE age BETWEEN 0 AND 120
--   AND amount > 0
--   AND date BETWEEN '2020-01-01' AND '2025-12-31'
```

**Verification:**

```sql
-- Verify no violations remain
SELECT COUNT(*) as violations
FROM clean_[table_name]
WHERE NOT ([rule_1_validation] AND [rule_2_validation] ...);
-- Expected: 0
```

**Document in `04-cleaning-execution.md`:**
- Rules enforced
- SQL executed
- Violations excluded: [N] rows ([X]%)

---

### Transformation 5: Referential Integrity Enforcement (if multiple tables)

**Based
on Strategy from Phase 3:**

**For Orphaned Records (Exclude approach):**

```sql
-- Remove orphaned child records (Option A from Phase 3)
CREATE TABLE clean_child_table AS
SELECT c.*
FROM raw_child_table c
INNER JOIN raw_parent_table p ON c.fk_column = p.pk_column;
-- INNER JOIN automatically excludes orphans
```

**For Orphaned Records (Preserve with NULL approach):**

```sql
-- Set orphaned FK values to NULL (Option B from Phase 3)
-- List child columns explicitly so the rewritten fk_column
-- replaces the original instead of duplicating it
CREATE TABLE clean_child_table AS
SELECT c.[other_child_columns],
       CASE WHEN p.pk_column IS NULL THEN NULL
            ELSE c.fk_column
       END as fk_column
FROM raw_child_table c
LEFT JOIN raw_parent_table p ON c.fk_column = p.pk_column;
```

**For Orphaned Records (Flag and Keep approach):**

```sql
-- Add orphan flag column (Option C from Phase 3)
CREATE TABLE clean_child_table AS
SELECT c.*,
       CASE WHEN p.pk_column IS NULL AND c.fk_column IS NOT NULL THEN 1
            ELSE 0
       END as fk_column_orphan_flag
FROM raw_child_table c
LEFT JOIN raw_parent_table p ON c.fk_column = p.pk_column;
```

**For Orphaned Records (Create Placeholder Parent approach):**

```sql
-- Step 1: Create placeholder parent record (Option D from Phase 3)
-- (column list and VALUES must match one-for-one)
INSERT INTO raw_parent_table (pk_column, name, other_fields)
VALUES (-1, 'Unknown', NULL);

-- Step 2: Remap orphans to placeholder (list child columns explicitly,
-- excluding the original fk_column)
CREATE TABLE clean_child_table AS
SELECT c.[other_child_columns],
       CASE WHEN p.pk_column IS NULL THEN -1
            ELSE c.fk_column
       END as fk_column
FROM raw_child_table c
LEFT JOIN raw_parent_table p ON c.fk_column = p.pk_column;
```

**Verification:**

```sql
-- Verify no orphans remain (for Exclude approach)
SELECT COUNT(*) as orphans
FROM clean_child_table c
LEFT JOIN clean_parent_table p ON c.fk_column = p.pk_column
WHERE p.pk_column IS NULL AND c.fk_column IS NOT NULL;
-- Expected: 0

-- Verify NULL remapping (for Preserve with NULL approach)
SELECT COUNT(*) as nulled_fks
FROM clean_child_table
WHERE fk_column IS NULL;
-- Expected: [count of orphans from Phase 2]

-- Verify flag accuracy (for Flag and Keep approach)
SELECT fk_column_orphan_flag, COUNT(*) FROM
clean_child_table GROUP BY fk_column_orphan_flag; -- Expected: flag=1 count matches orphan count from Phase 2 ``` **Document in `04-cleaning-execution.md`:** - FK relationship handled - Approach used (Exclude/Preserve/Flag/Placeholder) - SQL executed - Orphans affected: [N] rows ([X]%) - JOIN behavior after transformation [If single table: "N/A - Single table analysis"] --- ### Combined Transformation Approach **If multiple transformations needed, use CTE chain:** ```sql CREATE TABLE clean_[table_name] AS WITH -- Step 1: Remove duplicates deduped AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY [key_cols] ORDER BY rowid) as rn FROM raw_[table_name] ), no_dups AS ( SELECT [all_columns] FROM deduped WHERE rn = 1 ), -- Step 2: Handle outliers outliers_removed AS ( SELECT d.* FROM no_dups d CROSS JOIN (SELECT [median], [mad] FROM ...) stats WHERE ABS(d.[numeric_col] - stats.median) <= 3 * stats.mad ), -- Step 3: Apply categorization categorized AS ( SELECT o.*, COALESCE(m.category, 'Other') as [column]_category FROM outliers_removed o LEFT JOIN [column]_category_mapping m ON o.[text_col] = m.original_value ), -- Step 4: Enforce business rules final AS ( SELECT * FROM categorized WHERE [rule_validations] ) SELECT * FROM final; ``` **Verification of combined transformations:** ```sql -- Row count reconciliation SELECT (SELECT COUNT(*) FROM raw_[table_name]) as raw_count, (SELECT COUNT(*) FROM clean_[table_name]) as clean_count, (SELECT COUNT(*) FROM raw_[table_name]) - (SELECT COUNT(*) FROM clean_[table_name]) as total_excluded; ``` --- ### Create Cleaning Execution Log Create `analysis/[session-name]/04-cleaning-execution.md` with: ./templates/phase-4.md **CHECKPOINT:** Before proceeding to Phase 5, you MUST have: - [ ] Executed all transformations from Phase 3 strategy - [ ] Created clean_[table_name] table in data/analytics.db - [ ] Verified each transformation with specific checks - [ ] Reconciled row counts (raw = clean + exclusions) - [ ] Documented all exclusions 
with reasons and counts - [ ] Spot-checked sample records before/after - [ ] `04-cleaning-execution.md` created with all results documented --- ## Phase 5: Verification & Documentation **Goal:** Validate cleaning results, quantify quality improvements, document complete audit trail from raw to clean. ### Verify Row Count Reconciliation **Critical validation - MUST match exactly:** ```sql -- Count raw table SELECT COUNT(*) as raw_count FROM raw_[table_name]; -- Count clean table SELECT COUNT(*) as clean_count FROM clean_[table_name]; -- Calculate exclusions from Phase 4 log -- Expected: raw_count = clean_count + total_exclusions ``` **Document:** Confirm reconciliation passes. If mismatch, investigate before proceeding. --- ### Verify Transformation Results **For each transformation applied in Phase 4:** **Duplicate Removal Verification:** ```sql -- Confirm no duplicates remain SELECT [key_columns], COUNT(*) as occurrences FROM clean_[table_name] GROUP BY [key_columns] HAVING COUNT(*) > 1; -- Expected: 0 rows returned ``` **Outlier Handling Verification:** ```sql -- For Exclude approach: confirm no outliers remain WITH stats AS ( SELECT [median], [mad] FROM ... 
) SELECT COUNT(*) as remaining_outliers FROM clean_[table_name], stats WHERE ABS([numeric_col] - median) > 3 * mad; -- Expected: 0 rows -- For Cap approach: confirm values at thresholds SELECT MIN([numeric_col]) as min_val, MAX([numeric_col]) as max_val FROM clean_[table_name]; -- Expected: min >= (median - 3*MAD), max <= (median + 3*MAD) -- For Flag approach: check flag accuracy SELECT [numeric_col]_outlier_flag, COUNT(*) FROM clean_[table_name] GROUP BY [numeric_col]_outlier_flag; -- Expected: distribution matches Phase 4 execution log ``` **Free Text Categorization Verification:** ```sql -- Confirm all values categorized SELECT COUNT(*) as uncategorized FROM clean_[table_name] WHERE [column]_category IS NULL; -- Expected: 0 (unless "keep uncategorized" was strategy) -- Verify category distribution SELECT [column]_category, COUNT(*) as count, ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as pct FROM clean_[table_name] GROUP BY [column]_category ORDER BY count DESC; -- Expected: matches Phase 4 execution results ``` **Business Rule Verification (if applicable):** ```sql -- Confirm no violations remain SELECT COUNT(*) as violations FROM clean_[table_name] WHERE NOT ([rule_1_condition] AND [rule_2_condition] ...); -- Expected: 0 rows ``` **Document all verification results:** - ✓ Pass: Expected result confirmed - ✗ Fail: Unexpected result, requires investigation --- ### Compare Data Quality Metrics **Before vs After comparison:** ```sql -- Completeness comparison SELECT 'raw' as table_name, COUNT(*) as total_rows, COUNT([col1]) as [col1]_non_null, ROUND(100.0 * COUNT([col1]) / COUNT(*), 2) as [col1]_completeness_pct FROM raw_[table_name] UNION ALL SELECT 'clean' as table_name, COUNT(*), COUNT([col1]), ROUND(100.0 * COUNT([col1]) / COUNT(*), 2) FROM clean_[table_name]; ``` **Create quality improvement table:** | Metric | Raw Table | Clean Table | Improvement | |--------|-----------|-------------|-------------| | Total rows | [N] | [N] | -[X]% (exclusions) | | 
Completeness ([col1]) | [X]% | [X]% | +[X] pct points |
| Duplicate groups | [N] | 0 | -[N] (100%) |
| Outliers ([col2]) | [N] | 0 | -[N] (100%) |
| Free text unique values | [N] | [N categories] | -[X]% (categorization) |

**Document:** Quality improvements quantified with specific deltas.

---

### Spot Check Sample Records

**Select representative samples to manually verify:**

```sql
-- Records that were in duplicate groups
-- (look up by key columns: CREATE TABLE AS assigns fresh rowids, so
-- raw-table rowids recorded in Phase 2 will not match the clean table)
SELECT * FROM clean_[table_name]
WHERE [key_columns] IN ([duplicate-group keys from Phase 2]);
-- Verify: correct record kept per strategy

-- Records with outliers (if flagged/capped, not excluded)
SELECT * FROM clean_[table_name]
WHERE [numeric_col]_outlier_flag = 1;
-- Verify: flag accurate, values capped if applicable

-- Records with categorized free text
SELECT [original_col], [col]_category
FROM clean_[table_name]
LIMIT 20;
-- Verify: categories make sense, mapping correct
```

**Document:** Manual verification confirms automated transformations worked correctly.

---

### Assess Limitations and Confidence

**Document what this cleaning did NOT address:**
- **Scope limitations:** [Issues identified but not addressed - e.g., "Date range not validated"]
- **Data coverage:** [Time periods, geographies, categories not covered]
- **Assumptions made:** [Business rules assumed without domain validation]
- **Edge cases:** [Unusual values handled a specific way]

**Confidence assessment:**
- **High confidence:** [Transformations with clear validation - e.g., "Exact duplicate removal"]
- **Medium confidence:** [Transformations with some subjectivity - e.g., "Free text categorization"]
- **Low confidence / Needs review:** [Transformations requiring domain expertise - e.g., "Outliers might be legitimate"]

**Document:** Honest assessment of what was cleaned and what wasn't, with confidence levels.
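The reconciliation identity (raw = clean + exclusions) can be asserted mechanically rather than eyeballed. A sketch with placeholder counts standing in for the figures logged in `04-cleaning-execution.md`:

```python
# Exclusion counts as they would appear in the Phase 4 execution log
# (illustrative numbers, not real data)
raw_count = 10_000
clean_count = 9_837
exclusions = {
    "exact_duplicates": 120,
    "outliers_excluded": 35,
    "rule_violations": 8,
}

total_excluded = sum(exclusions.values())
assert raw_count == clean_count + total_excluded, (
    f"Reconciliation failed: {raw_count} != {clean_count} + {total_excluded}"
)
print("Reconciliation passed:", total_excluded, "rows excluded")
```

If the assertion fires, some exclusion went unlogged (or a transformation silently dropped rows) — exactly the failure mode this phase exists to catch.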
---

### Create Verification Report

Create `analysis/[session-name]/05-verification-report.md` using the template at `./templates/phase-5.md`.

**CHECKPOINT:** Before concluding the cleaning-data skill, you MUST have:

- [ ] Verified row count reconciliation (raw = clean + exclusions)
- [ ] Validated all transformations with specific queries
- [ ] Quantified quality improvements with before/after metrics
- [ ] Spot-checked sample records manually
- [ ] Documented limitations and assumptions honestly
- [ ] Assessed confidence level for each transformation
- [ ] Created complete exclusion accounting table
- [ ] Created `05-verification-report.md` with all sections filled
- [ ] clean_* table ready for analysis process skills

---

## Common Rationalizations

### "The data looks clean after Phase 2, I can skip Phase 3 strategy design"

**Why this is wrong:** Detecting issues isn't the same as deciding how to fix them. Different approaches (exclude vs cap vs flag) have different analytical implications.

**Do instead:** Always complete Phase 3 with explicit decision frameworks. Document why you chose each approach, with user confirmation.

### "I'll just exclude all outliers automatically, no need to review them"

**Why this is wrong:** Some outliers are legitimate (VIP customers, seasonal spikes, rare events). Automatic exclusion loses valuable data.

**Do instead:** Complete Phase 2 detection with agent analysis. Review patterns in Phase 3. Choose an approach based on business context, not just a statistical threshold.

### "The fuzzy matching agent found near-duplicates, I'll merge them all"

**Why this is wrong:** 90-95% similarity doesn't mean identical. "John Smith" vs "John Smyth" might be the same person OR two different people.

**Do instead:** Review confidence levels in Phase 3. Auto-merge only high-confidence matches (>95%). Manually review medium-confidence matches. Document decisions.
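The confidence tiers for near-duplicate handling can be sketched with Python's stdlib `difflib`; the exact thresholds (0.95 / 0.90) and the sample name pairs are illustrative assumptions, and a real session would use whatever similarity scores the detect-near-duplicates agent reports:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical candidate pairs from near-duplicate detection
pairs = [
    ("John Smith", "John Smith "),   # trailing whitespace only
    ("John Smith", "John Smyth"),    # typo OR genuinely different person
    ("John Smith", "Jane Smith"),    # different first name
]

for a, b in pairs:
    score = similarity(a.strip(), b.strip())
    if score > 0.95:
        decision = "auto-merge"
    elif score >= 0.90:
        decision = "manual review"
    else:
        decision = "keep separate"
    print(f"{a!r} vs {b!r}: {score:.2f} -> {decision}")
```

The middle tier is the point of the exercise: a 0.90-0.95 score is exactly the "John Smith" vs "John Smyth" case that needs a human decision rather than an automatic merge.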
### "I don't need to document exclusions, I can remember what I removed" **Why this is wrong:** Undocumented exclusions break audit trail. When results are questioned, you can't explain what data was excluded or why. **Do instead:** Complete Phase 4 execution log with exclusion summary table. Document every excluded record with reason and count. Reconcile in Phase 5. ### "Verification is just running the same queries again, waste of time" **Why this is wrong:** Phase 5 verification checks RESULTS, not execution. Queries can run without errors but produce wrong results (logic bugs, wrong thresholds, incorrect mappings). **Do instead:** Always complete Phase 5 with before/after comparisons, spot checks, and manual inspection. Verification catches transformation bugs. ### "I found one issue, fixed it, done with cleaning" **Why this is wrong:** Data quality issues cluster. If you found duplicates, likely also have outliers, NULLs, and inconsistencies. One fix doesn't make data "clean". **Do instead:** Complete all 5 phases systematically. Phase 2 detects ALL issue types. Address all prioritized issues in Phases 3-4. ### "The clean table has fewer rows, that's proof it's better" **Why this is wrong:** Smaller isn't always better. Excluding 50% of data might remove all the interesting variation. Quality ≠ quantity reduction. **Do instead:** Complete Phase 5 with quality improvement quantification. Measure completeness, consistency, validity improvements - not just row count reduction. ### "I'll categorize free text myself, faster than using the agent" **Why this is wrong:** Manual categorization is inconsistent, misses patterns, and pollutes main agent context with hundreds of unique values. **Do instead:** Always delegate free text analysis to categorize-free-text agent in Phase 3. Agent provides structured mapping with confidence levels. Review and adjust if needed. 
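Once the agent's mapping has been reviewed, applying it is mechanical and the Phase 5 "uncategorized" check falls out directly. A hedged sketch; the `clean_tickets` table, `issue` / `issue_category` columns, and the mapping values are all hypothetical:

```python
import sqlite3

# Illustrative agent-produced mapping: raw free-text value -> category
mapping = {
    "cant log in": "login",
    "password reset": "login",
    "slow dashboard": "performance",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clean_tickets (issue TEXT, issue_category TEXT)")
conn.executemany(
    "INSERT INTO clean_tickets (issue) VALUES (?)",
    [("cant log in",), ("slow dashboard",), ("weird beep sound",)],
)

# Apply the reviewed mapping row by row
for raw_value, category in mapping.items():
    conn.execute(
        "UPDATE clean_tickets SET issue_category = ? WHERE issue = ?",
        (category, raw_value),
    )
conn.commit()

# Phase 5 check: how many values remain uncategorized?
uncategorized = conn.execute(
    "SELECT COUNT(*) FROM clean_tickets WHERE issue_category IS NULL"
).fetchone()[0]
print(f"uncategorized: {uncategorized}")  # 'weird beep sound' has no mapping
```

Any nonzero remainder either goes back to the agent for another mapping pass or is documented as a deliberate "keep uncategorized" decision in the strategy.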
### "Business rules failed for 2%, I'll just delete those rows and move on" **Why this is wrong:** 2% violations might indicate systematic data quality issue (bad data entry, import error, logic flaw). Deleting hides the problem. **Do instead:** Investigate violations in Phase 2. Document why they violate rules in Phase 3. Consider whether to exclude, correct, or flag in strategy. Document pattern in Phase 5. ### "Phase 5 validation passed, I'm done - no need to document limitations" **Why this is wrong:** All cleaning has limitations and assumptions. Pretending otherwise misleads analysts who use the clean data. **Do instead:** Complete Phase 5 limitations section honestly. Document what was NOT cleaned, assumptions made, edge cases, confidence levels. Transparency builds trust. --- ## Summary This skill ensures systematic, documented data cleaning with quality validation by: 1. **Prioritized scope definition:** Read quality report from importing-data, apply impact × severity × effort framework - ensures high-value issues addressed first, not random fixes. 2. **Structured decision-making:** Present options with trade-offs for duplicates, outliers, free text, business rules - gets user confirmation before execution, prevents undocumented assumptions. 3. **Agent-delegated detection:** Use dedicated sub-agents (detect-exact-duplicates, detect-near-duplicates, detect-outliers, categorize-free-text) - prevents context pollution while ensuring thorough analysis. 4. **Explicit strategy approval:** Document chosen approach per issue type in Phase 3 with rationale - creates decision audit trail, enables strategy review if results questioned. 5. **Transformation transparency:** Execute cleaning with CREATE TABLE AS SELECT, preserve raw_* tables, track all exclusions - maintains complete audit trail from raw to clean. 6. 
**Rigorous verification:** Validate transformations, quantify quality improvements, spot-check samples, document limitations - ensures clean data is actually clean and limitations are known. Follow this process and you'll create well-documented clean tables with validated quality improvements, complete audit trail from raw data to analysis-ready data, and honest assessment of what was cleaned and what limitations remain. ---