--- name: data-cleaning-standards description: Clean and standardize vehicle insurance CSV/Excel data. Use when handling missing values, fixing data formats, removing duplicates, or standardizing fields. Mentions "clean data", "handle nulls", "standardize", "duplicates", or "normalize". allowed-tools: Read, Edit, Grep, Glob --- # Data Cleaning Standards Clean and standardize vehicle insurance data following established business rules. ## When to Activate Use this skill when the user: - Says "clean the data" or "standardize data" - Mentions "missing values", "null handling", or "fill missing" - Asks "remove duplicates" or "deduplicate" - Wants to "normalize dates" or "standardize formats" - Mentions data preparation or preprocessing ## Quick Start Workflow ``` Step 1: Handle Missing Values ↓ Step 2: Remove Duplicates ↓ Step 3: Standardize Formats ↓ Step 4: Handle Outliers ``` --- ## Step 1: Handle Missing Values ### 1.1 Missing Value Strategy by Field Type | Field Type | Strategy | Fill Value | Why | |-----------|----------|------------|-----| | 三级机构 | Lookup from mapping | From mapping table | Authoritative source | | 团队简称 | Keep null | `''` (empty string) | Optional field | | 签单/批改保费 | Delete row | N/A | Critical metric | | 手续费含税 | Fill zero | `0` | Legitimate zero commission | | 是否续保 | Keep null | `''` | Display as "Unknown" | | String fields | Fill empty | `''` | Avoid None errors | | Numeric fields | Delete or 0 | Depends on field | Case by case | ### 1.2 Implementation ```python def handle_missing_values(df, staff_mapping): """Apply missing value strategy""" # 1. 三级机构 - lookup from mapping if '三级机构' in df.columns: for idx in df[df['三级机构'].isnull()].index: staff = df.at[idx, '业务员'] mapped_info = lookup_staff_info(staff, staff_mapping) if mapped_info: df.at[idx, '三级机构'] = mapped_info['三级机构'] # 2. 手续费 - fill zero if '手续费含税' in df.columns: df['手续费含税'] = df['手续费含税'].fillna(0) # 3. 签单保费 - delete missing rows before = len(df) df = df[df['签单/批改保费'].notnull()] after = len(df) if before > after: print(f"Deleted {before - after} rows with missing premium") # 4. String fields - fill empty string_cols = df.select_dtypes(include=['object']).columns df[string_cols] = df[string_cols].fillna('') return df ``` --- ## Step 2: Remove Duplicates ### 2.1 Deduplication Rules **Composite Key**: `保单号` + `投保确认时间` **Keep Strategy**: `keep='last'` (most recent record) **Why**: Same policy may have multiple updates (批改) ### 2.2 Implementation ```python def remove_duplicates(df): """Remove duplicate records""" # Ensure date is datetime df['投保确认时间'] = pd.to_datetime(df['投保确认时间'], errors='coerce') # Use duplicated() to avoid type issues before = len(df) dup_mask = df.duplicated(subset=['保单号', '投保确认时间'], keep='last') df = df[~dup_mask] after = len(df) if before > after: print(f"Removed {before - after} duplicates") return df ``` --- ## Step 3: Standardize Formats ### 3.1 Date Standardization **Target Format**: `datetime64[ns]` ```python def standardize_dates(df): """Convert all date fields to datetime""" date_cols = ['刷新时间', '投保确认时间', '保险起期'] for col in date_cols: if col in df.columns: df[col] = pd.to_datetime(df[col], errors='coerce') invalid = df[col].isnull().sum() if invalid > 0: print(f"⚠️ {col}: {invalid} invalid dates converted to NaT") return df ``` ### 3.2 Numeric Standardization **Target Format**: `float64` ```python def standardize_numerics(df): """Convert numeric fields to float""" numeric_cols = ['签单/批改保费', '签单数量', '手续费', '手续费含税', '增值税'] for col in numeric_cols: if col in df.columns: df[col] = pd.to_numeric(df[col], errors='coerce') return df ``` ### 3.3 String Standardization **Rules**: - Strip leading/trailing whitespace - Map variations (Y/N → 是/否) ```python def standardize_strings(df): """Clean string fields""" # Strip whitespace string_cols = df.select_dtypes(include=['object']).columns for col in string_cols: df[col] = df[col].astype(str).str.strip() # Standardize yes/no fields yes_no_cols = ['是否续保', '是否新能源', '是否过户车', '是否异地车'] for col in yes_no_cols: if col in df.columns: df[col] = df[col].map({ 'Y': '是', 'N': '否', 'y': '是', 'n': '否', '1': '是', '0': '否' }).fillna(df[col]) return df ``` --- ## Step 4: Handle Outliers ### 4.1 Outlier Detection Rules **Important**: **NEVER delete negative premiums** (legitimate business data) | Outlier Type | Detection Rule | Action | |-------------|---------------|---------| | Negative premium | `< 0` | ✅ **KEEP** (refunds/adjustments) | | Extreme premium | `< -1M or > 100K` | ⚠️ Flag only | | Zero commission | `== 0` | ✅ KEEP (normal) | | Negative amount | `保额 < 0` | ⚠️ Flag only | ### 4.2 Implementation ```python def detect_outliers(df): """Detect outliers but DO NOT delete""" outliers = {} # 1. Extreme premium (flag only) if '签单/批改保费' in df.columns: extreme = df[ (df['签单/批改保费'] < -1000000) | (df['签单/批改保费'] > 100000) ] if len(extreme) > 0: outliers['extreme_premium'] = { 'count': len(extreme), 'samples': extreme['保单号'].head(5).tolist() } # 2. Negative amounts (flag only) if '签单/批改保额' in df.columns: negative_amt = df[df['签单/批改保额'] < 0] if len(negative_amt) > 0: outliers['negative_amount'] = { 'count': len(negative_amt), 'samples': negative_amt['保单号'].head(5).tolist() } # Report outliers without deleting if outliers: print("⚠️ Detected outliers (kept in data):") for key, info in outliers.items(): print(f" - {key}: {info['count']} records") return df, outliers ``` **Critical Rule**: Never filter out negative premiums: ```python # ❌ WRONG - DO NOT DO THIS df = df[df['签单/批改保费'] > 0] # ✅ CORRECT - Keep all values total_premium = df['签单/批改保费'].sum() # May be negative ``` --- ## Complete Cleaning Pipeline ### All-in-One Function ```python def clean_data(df, staff_mapping): """Complete cleaning pipeline""" print("Starting data cleaning pipeline...") initial_count = len(df) # Step 1: Handle missing values df = handle_missing_values(df, staff_mapping) print(f"✓ Step 1: Handled missing values") # Step 2: Remove duplicates df = remove_duplicates(df) print(f"✓ Step 2: Removed duplicates") # Step 3: Standardize formats df = standardize_dates(df) df = standardize_numerics(df) df = standardize_strings(df) print(f"✓ Step 3: Standardized formats") # Step 4: Detect outliers (no deletion) df, outliers = detect_outliers(df) print(f"✓ Step 4: Detected outliers") final_count = len(df) print(f"\nCleaning complete: {initial_count} → {final_count} records") return df, outliers ``` --- ## Common Use Cases ### Case 1: "Clean my CSV file" ```python import pandas as pd import json # Load data df = pd.read_csv('data.csv', encoding='utf-8-sig') mapping = json.load(open('staff_mapping.json')) # Run full pipeline df_clean, outliers = clean_data(df, mapping) # Save cleaned data df_clean.to_csv('data_cleaned.csv', index=False, encoding='utf-8-sig') ``` ### Case 2: "Handle missing institution fields" ```python # Focus on Step 1 - missing value handling df = handle_missing_values(df, staff_mapping) # Check results missing_before = df_original['三级机构'].isnull().sum() missing_after = df['三级机构'].isnull().sum() print(f"Fixed {missing_before - missing_after} missing institutions") ``` ### Case 3: "Remove duplicate policies" ```python # Focus on Step 2 - deduplication df_unique = remove_duplicates(df) print(f"Removed {len(df) - len(df_unique)} duplicates") ``` --- ## Cleaning Checklist Before cleaning: - [ ] Backup original data - [ ] Load staff mapping file - [ ] Check file encoding (use `utf-8-sig` for Excel exports) During cleaning: - [ ] Handle missing values (Step 1) - [ ] Remove duplicates (Step 2) - [ ] Standardize formats (Step 3) - [ ] Detect outliers (Step 4) After cleaning: - [ ] Verify record count change is reasonable - [ ] Check critical fields are complete - [ ] Review outlier report - [ ] Save cleaned data with new filename --- ## Troubleshooting ### "Many records deleted" **Check**: Are you accidentally deleting negative premiums? ```python # Check negative premium count negative_count = (df['签单/批改保费'] < 0).sum() print(f"Negative premiums: {negative_count} (should be kept)") ``` ### "Date conversion creates many NaT" **Solution**: Check date format ```python # Inspect date column print(df['投保确认时间'].head()) # Try different format df['投保确认时间'] = pd.to_datetime( df['投保确认时间'], format='%Y/%m/%d', # Adjust format errors='coerce' ) ``` ### "Duplicates not removed" **Check**: Ensure date column is datetime type ```python print(df['投保确认时间'].dtype) # Should be datetime64[ns] ``` --- ## Related Files **Data processor**: [backend/data_processor.py](../../backend/data_processor.py) - See `_clean_data()` method (lines 132-156) - See `merge_with_existing()` method (lines 158-192) **Field definitions**: [docs/FIELD_MAPPING.md](../../docs/FIELD_MAPPING.md) **Related Skills**: - `field-validation` - Run this BEFORE cleaning - `staff-mapping-management` - Update mapping table --- **Skill Version**: v1.0 **Created**: 2025-11-09 **File Size**: ~310 lines **Focuses On**: Data cleaning only (not validation or mapping)