---
name: scientific-data-preprocessing
description: "⚠️ CRITICAL USER EXPERIENCE-BASED SKILL - ALWAYS CONSULT BEFORE DATA PREPROCESSING ⚠️ Prevents catastrophic errors (88.9% error rate in V1.0 case study) through multi-level feature analysis, data leakage detection, and semantic validation. MANDATORY for: data preprocessing, feature engineering, standardization, normalization, interpolation, missing value handling, feature selection, or ANY data transformation task. Covers grouped time-series, cross-sectional, panel data. Detects: time travel leakage, causal inversion, ID misuse, semantic-numeric fallacies, distribution blindness. User's hard-won lessons from real project failures."
---

# Scientific Data Preprocessing Skill

⚠️ **CRITICAL: USER'S HARD-WON EXPERIENCE - MANDATORY CONSULTATION** ⚠️

This skill encapsulates painful lessons learned from real preprocessing disasters (88.9% error rate documented). **ALWAYS use this skill for planning, reflection, and validation when ANY data preprocessing is involved.**

**Why this skill is mandatory:**
- Based on actual project failures (V1.0, V2.0 case studies)
- Prevents data leakage that causes production disasters
- Catches semantic errors AI agents commonly make
- Saves weeks of debugging and model retraining

**When to invoke (DO NOT SKIP):**
- ✅ Before starting ANY data preprocessing task
- ✅ During preprocessing for reflection and validation
- ✅ After preprocessing for comprehensive audit
- ✅ When reviewing AI-generated preprocessing code

---

## Core Mission

Prevent catastrophic preprocessing errors in grouped time-series data by applying multi-level feature analysis and respecting data structure boundaries.

## When to Use This Skill

**MANDATORY consultation - trigger immediately when:**

### Data Preprocessing Tasks (ALWAYS)
- Any data cleaning, transformation, or preparation work
- Loading and preparing data for modeling
- Creating training/test splits
- Handling missing values (imputation, deletion)
- Feature scaling/normalization/standardization
- Encoding categorical variables
- Feature engineering or construction
- Feature selection or dimensionality reduction

### Data Structure Types (ALWAYS)
- Preprocessing time-series data with natural groupings (matches, sessions, patients, experiments)
- Sports analytics (tennis, basketball, etc.)
- Medical/clinical data with patient groupings
- Panel data or longitudinal studies
- Any grouped/hierarchical data structure

### Quality Assurance (ALWAYS)
- Auditing existing preprocessing for data leakage or semantic errors
- Reviewing AI-generated preprocessing code for common pitfalls
- Validating preprocessing before model training
- Debugging unexpected model performance

### Critical Checkpoints (NEVER SKIP)
- ✅ **BEFORE**: Planning preprocessing strategy
- ✅ **DURING**: Reflecting on decisions and checking for errors
- ✅ **AFTER**: Comprehensive validation and audit

**Trigger keywords that MUST invoke this skill:**
- "preprocess", "preprocessing", "data cleaning", "data preparation"
- "standardize", "normalize", "scale", "transform"
- "impute", "fill missing", "handle NaN"
- "encode", "one-hot", "categorical"
- "feature engineering", "feature selection", "feature construction"
- "train test split", "cross validation split"
- "interpolate", "smooth", "aggregate"
## Not For / Boundaries

This skill does NOT:
- Handle purely cross-sectional data (ungrouped, single timepoint)
- Make domain-specific feature engineering decisions (you decide business logic)
- Choose ML models (focuses on preprocessing only)
- Handle distributed/big data infrastructure (assumes data fits in memory)

Required inputs before proceeding (all three are illustrated in the preflight sketch below):
1. Confirmation that data has groups (e.g., match_id, patient_id, session_id)
2. Understanding of whether goal is within-group (relative) or cross-group (absolute) comparison
3. Domain constraints on data ranges/units
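As a concrete illustration of these three prerequisites, here is a minimal preflight sketch. The helper name, the `match_id` column, and the example ranges are hypothetical assumptions, not part of the skill - adapt them to your schema:

```python
import pandas as pd

def preflight_check(df: pd.DataFrame, group_col: str, goal: str, constraints: dict) -> None:
    """Hypothetical helper: verify the three required inputs before any preprocessing."""
    # 1. Data must actually have groups
    assert group_col in df.columns, f"Missing group column: {group_col}"
    assert df[group_col].nunique() > 1, "Only one group found - is this really grouped data?"
    # 2. The analysis goal must be stated explicitly (drives within-group vs global choices)
    assert goal in ('relative', 'absolute'), "Goal must be 'relative' or 'absolute'"
    # 3. Domain constraints must be declared for every continuous feature
    for col, (lo, hi) in constraints.items():
        assert df[col].dropna().between(lo, hi).all(), \
            f"{col} violates domain range [{lo}, {hi}]"

# Example call (assumed schema, for illustration only):
# preflight_check(df, 'match_id', goal='relative', constraints={'speed_mph': (0, 170)})
```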
## Quick Reference

### Multi-Level Feature Analysis Framework

**Level 1: Data Type**
```python
# Check data types
df.dtypes  # int64, float64, object, etc.
```

**Level 2: Feature Type Classification**
```python
# Binary (0/1)
binary_features = [col for col in df.columns if df[col].nunique() == 2]

# Categorical (finite discrete values)
categorical_features = list(df.select_dtypes(include='object').columns)

# Continuous (infinite possible values)
continuous_features = [col for col in df.select_dtypes(include=['float64', 'int64']).columns
                       if df[col].nunique() > 10]
```

**Level 3: Data Structure**
```python
# Check for grouping
print(f"Number of groups: {df['group_id'].nunique()}")
print(f"Avg points per group: {df.groupby('group_id').size().mean():.1f}")

# Check for time-series
df_sorted = df.sort_values(['group_id', 'timestamp'])
```

**Level 4: Physical Meaning**
```python
# Validate physical ranges
assert df['speed_mph'].max() < 200, "Speed exceeds physical limit"
assert df['distance_meters'].min() >= 0, "Negative distance impossible"
```
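If it helps to run the four levels as a single pass, the following sketch strings them together into one checklist. The function and the example ranges are illustrative assumptions, not a prescribed API:

```python
import pandas as pd

def multi_level_audit(df: pd.DataFrame, group_col: str, ranges: dict) -> None:
    """Sketch: run Levels 1-4 of the framework as one checklist."""
    # Level 1: raw data types
    print(df.dtypes)
    # Level 2: feature-type classification
    binary = [c for c in df.columns if df[c].nunique() == 2]
    categorical = list(df.select_dtypes(include='object').columns)
    continuous = [c for c in df.select_dtypes(include='number').columns
                  if df[c].nunique() > 10]
    print(f"binary={binary}, categorical={categorical}, continuous={continuous}")
    # Level 3: data structure (grouping)
    print(f"{df[group_col].nunique()} groups, "
          f"avg size {df.groupby(group_col).size().mean():.1f}")
    # Level 4: physical meaning
    for col, (lo, hi) in ranges.items():
        ok = df[col].dropna().between(lo, hi).all()
        print(f"{col}: physical range [{lo}, {hi}] {'✅ OK' if ok else '❌ VIOLATED'}")

# multi_level_audit(df, 'match_id', ranges={'speed_mph': (0, 170)})
```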
### Critical Processing Decision Tree

```python
# Decision: Within-group or global processing?
def choose_processing_scope(data, feature, goal):
    """
    goal = 'relative' → within-group (e.g., "this point was intense FOR THIS MATCH")
    goal = 'absolute' → global (e.g., "this was an intense point OVERALL")
    """
    if goal == 'relative':
        return 'within_group'
    elif goal == 'absolute':
        return 'global'
    else:
        raise ValueError("Goal must be 'relative' or 'absolute'")
```

### Pattern 1: Within-Group Interpolation (CORRECT)

```python
from scipy.interpolate import CubicSpline
import numpy as np

# ✅ CORRECT: Interpolate within each group
for group_id in df['match_id'].unique():
    mask = df['match_id'] == group_id
    group_data = df.loc[mask, 'speed_mph']

    # Get valid (non-NaN) positions within this group only
    valid_idx = group_data.notna()
    valid_positions = np.where(valid_idx)[0]
    valid_values = group_data[valid_idx].values

    if len(valid_positions) >= 4 and (~valid_idx).any():
        cs = CubicSpline(valid_positions, valid_values)
        missing_positions = np.where(~valid_idx)[0]
        # Assign via the group's own row labels to avoid index-alignment bugs
        df.loc[group_data.index[~valid_idx], 'speed_mph'] = cs(missing_positions)
```

### Pattern 2: Global Interpolation (WRONG - Don't Do This)

```python
# ❌ WRONG: Cross-group interpolation
# This interpolates between match A's last point and match B's first point!
cs = CubicSpline(
    np.where(df['speed_mph'].notna())[0],  # ❌ All indices globally
    df['speed_mph'].dropna().values
)
df.loc[df['speed_mph'].isna(), 'speed_mph'] = cs(
    np.where(df['speed_mph'].isna())[0]
)
```

### Pattern 3: Within-Group Standardization (for Relative Analysis)

```python
from sklearn.preprocessing import StandardScaler

# ✅ CORRECT: Standardize within each match
for match_id in df['match_id'].unique():
    mask = df['match_id'] == match_id
    scaler = StandardScaler()
    df.loc[mask, 'distance_run_std_within'] = scaler.fit_transform(
        df.loc[mask, ['distance_run']]
    ).ravel()

# Interpretation: z=+2 means "2 std above average FOR THIS MATCH"
```

### Pattern 4: Global Standardization (for Absolute Comparison)

```python
# ✅ CORRECT: Global standardization (when appropriate)
scaler = StandardScaler()
df['distance_run_std_global'] = scaler.fit_transform(df[['distance_run']])

# Interpretation: z=+2 means "2 std above average ACROSS ALL MATCHES"
```
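One caveat the "time travel" item in Pattern 9 alludes to: if a train/test split will follow, fit any global scaler on the training rows only, then reuse those statistics on the test set. A minimal sketch, assuming `train_df`/`test_df` already exist (see the group-aware split sketched after Pattern 9):

```python
from sklearn.preprocessing import StandardScaler

# Assumes train_df / test_df already exist (ideally split by match_id, not by row)
scaler = StandardScaler()
# ✅ Fit on training rows only - test-set statistics stay unseen
train_df['distance_run_std'] = scaler.fit_transform(train_df[['distance_run']]).ravel()
# ✅ Reuse the training mean/std on the test set (no "time travel")
test_df['distance_run_std'] = scaler.transform(test_df[['distance_run']]).ravel()
```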
### Pattern 5: Feature Type Processing Rules

```python
import pandas as pd

# Binary variables (0/1) - KEEP AS-IS
binary_cols = ['is_ace', 'is_winner', 'is_error']
# ❌ NEVER standardize these! They have semantic meaning as 0/1

# Categorical variables - ONE-HOT ENCODE
df_encoded = pd.get_dummies(df, columns=['server', 'serve_number'], dtype=int)

# Continuous variables - STANDARDIZE (within-group or global)
continuous_cols = ['distance_run', 'rally_count', 'speed_mph']
# ✅ Apply Pattern 3 or 4 based on goal
```

### Pattern 6: Sliding Window Features (for Momentum)

```python
# ✅ CORRECT: Sliding window for momentum analysis
window = 10
df['win_rate_last10'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# ❌ WRONG: Cumulative features (loses temporal locality)
df['cumulative_points_won'] = df.groupby('match_id')['point_won'].cumsum()
# This just increases monotonically and correlates with point_number
```

### Pattern 7: Data Quality Validation

```python
def validate_data_quality(df, feature, expected_range):
    """Validate before processing"""
    # Check range
    assert df[feature].min() >= expected_range[0], f"{feature} below minimum"
    assert df[feature].max() <= expected_range[1], f"{feature} above maximum"

    # Check for anomalies
    mean = df[feature].mean()
    std = df[feature].std()
    if std > mean:
        print(f"⚠️ WARNING: {feature} has std > mean (highly skewed or errors)")

    # Check missing pattern (compare against half the average group size)
    missing_by_group = df.groupby('match_id')[feature].apply(lambda x: x.isna().sum())
    avg_group_size = len(df) / df['match_id'].nunique()
    if missing_by_group.max() > avg_group_size * 0.5:
        print(f"⚠️ WARNING: {feature} has >50% missing in some groups")

# Example
validate_data_quality(df, 'speed_mph', expected_range=(50, 165))
```

### Pattern 8: Detect Processing Scope Automatically

```python
def detect_processing_scope(df, group_col, feature_col):
    """
    Recommend within-group vs global based on variance structure
    """
    # Calculate variance components
    within_group_var = df.groupby(group_col)[feature_col].var().mean()
    global_var = df[feature_col].var()

    # Intraclass correlation (clipped at 0 in case of sampling noise)
    between_group_var = max(global_var - within_group_var, 0)
    icc = between_group_var / global_var

    if icc > 0.5:
        return 'within_group', f"High between-group variance (ICC={icc:.2f})"
    else:
        return 'global', f"Low between-group variance (ICC={icc:.2f})"

scope, reason = detect_processing_scope(df, 'match_id', 'distance_run')
print(f"Recommended: {scope} - {reason}")
```

### Pattern 9: Data Leakage Detection

```python
def detect_data_leakage(df, target_col, feature_cols, id_cols):
    """
    Critical checks for data leakage and AI common pitfalls
    """
    issues = []

    # 1. ID Leakage: High cardinality variables as features
    for col in feature_cols:
        if col in id_cols:
            issues.append(f"❌ FATAL: {col} is an ID - NEVER use as feature")
            continue
        # Check if looks like ID (>50% unique)
        uniqueness = df[col].nunique() / len(df)
        if uniqueness > 0.5:
            issues.append(f"⚠️ {col}: {uniqueness*100:.1f}% unique - possible ID leakage")

    # 2. Causal Inversion: Perfect correlation with target
    for col in feature_cols:
        if col == target_col:
            continue
        if df[col].dtype in ['int64', 'float64']:
            corr = abs(df[[col, target_col]].corr().iloc[0, 1])
            if corr > 0.95:
                issues.append(f"❌ FATAL: {col} correlation={corr:.3f} - likely consequence of target!")

    # 3. Meaningless Numeric: Codes treated as numbers
    for col in feature_cols:
        if df[col].dtype in ['int64', 'float64']:
            # Pattern: High values, many uniques, looks like code
            if df[col].min() > 1000 and df[col].nunique() > 100:
                issues.append(f"⚠️ {col}: Looks like code (zipcode/ID) - should be categorical")

    # 4. Time Travel: Check if standardization used global statistics
    # (Requires knowing if train/test split was done first)

    # Print report
    if issues:
        print("=" * 60)
        print("DATA LEAKAGE AUDIT")
        print("=" * 60)
        for issue in issues:
            print(issue)
        print("=" * 60)
    else:
        print("✅ No obvious leakage detected")

    return issues

# Example usage
issues = detect_data_leakage(
    df,
    target_col='point_won',
    feature_cols=['speed_mph', 'user_id', 'distance_run'],
    id_cols=['match_id', 'user_id']
)
```
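A complement to these audits: when splitting grouped data into train/test sets, keep each group intact so the leakage Pattern 9 hunts for never enters in the first place. A minimal sketch using scikit-learn's `GroupShuffleSplit`; the 80/20 ratio and the `match_id` column are assumptions:

```python
from sklearn.model_selection import GroupShuffleSplit

# Split whole matches into train/test - rows from one match never straddle the split
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df['match_id']))
train_df, test_df = df.iloc[train_idx].copy(), df.iloc[test_idx].copy()

# Sanity check: no match_id appears on both sides
assert set(train_df['match_id']).isdisjoint(set(test_df['match_id']))
```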
### Pattern 10: Distribution-Aware Scaling

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.preprocessing import StandardScaler, RobustScaler

def smart_scaler_selection(df, col):
    """
    Choose scaler based on distribution characteristics
    """
    data = df[col].dropna()

    # Check distribution
    skewness = skew(data)
    kurt = kurtosis(data)
    print(f"{col}: skewness={skewness:.2f}, kurtosis={kurt:.2f}")

    if abs(skewness) < 0.5 and abs(kurt) < 3:
        # Roughly normal
        print("  → StandardScaler (data is roughly normal)")
        return StandardScaler(), None
    elif skewness > 1:
        # Right-skewed (long tail)
        print("  → Log transform + StandardScaler (right-skewed)")
        return StandardScaler(), 'log'
    else:
        # Heavy outliers or non-normal
        print("  → RobustScaler (heavy outliers)")
        return RobustScaler(), None

# Example usage (continuous_features as classified in Level 2)
for col in continuous_features:
    scaler, transform = smart_scaler_selection(df, col)
    if transform == 'log':
        df[f'{col}_log'] = np.log1p(df[col])
        df[f'{col}_scaled'] = scaler.fit_transform(df[[f'{col}_log']])
    else:
        df[f'{col}_scaled'] = scaler.fit_transform(df[[col]])
```

## Examples

### Example 1: Tennis Match Preprocessing (Complete Pipeline)

**Input:**
- CSV with 7,284 rows, 31 matches
- Features: `speed_mph`, `distance_run`, `rally_count`, `is_ace`, `server`
- Goal: Analyze momentum (relative intensity within each match)

**Steps:**

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 1. Load and inspect
df = pd.read_csv('tennis_data.csv')
print(f"Matches: {df['match_id'].nunique()}")
print(f"Features: {df.dtypes}")

# 2. Classify features
binary_features = ['is_ace', 'is_winner', 'is_break_point']
categorical_features = ['server', 'serve_number']
continuous_features = ['distance_run', 'speed_mph', 'rally_count']

# 3. Validate data quality
for feat in continuous_features:
    print(f"\n{feat}:")
    print(df[feat].describe())
    # Check for impossible values
    if feat == 'speed_mph':
        assert df[feat].max() < 170, "Speed exceeds world record!"

# 4. Handle missing values (within-group)
for match_id in df['match_id'].unique():
    mask = df['match_id'] == match_id
    for feat in continuous_features:
        if df.loc[mask, feat].isna().any():
            # Simple linear interpolation within match
            df.loc[mask, feat] = df.loc[mask, feat].interpolate(method='linear')

# 5. One-hot encode categorical
df = pd.get_dummies(df, columns=categorical_features, dtype=int)

# 6. Standardize continuous features WITHIN each match
for feat in continuous_features:
    df[f'{feat}_std'] = np.nan
    for match_id in df['match_id'].unique():
        mask = df['match_id'] == match_id
        scaler = StandardScaler()
        df.loc[mask, f'{feat}_std'] = scaler.fit_transform(
            df.loc[mask, [feat]]
        ).ravel()

# 7. Create sliding window features
window = 10
df['win_rate_last10'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# 8. KEEP binary features as 0/1 (don't transform!)
# binary_features are already correct

print("\n✅ Preprocessing complete!")
print(f"Final shape: {df.shape}")
print(f"Standardized features: {[f for f in df.columns if f.endswith('_std')]}")
```

**Expected output:**
- Binary features remain 0/1
- Categorical features one-hot encoded (e.g., `server_1`, `server_2`)
- Continuous features have both original and `_std` versions
- `_std` features have mean≈0, std≈1 WITHIN each match
- Sliding window features capture local momentum
- No missing values

### Example 2: Detecting Cross-Group Contamination

**Input:**
- Preprocessed data where you suspect cross-group standardization

**Steps:**

```python
# Check if standardization was done correctly
def check_within_group_standardization(df, group_col, feature_std_col):
    """
    Verify that standardized feature has mean≈0, std≈1 within each group
    """
    results = df.groupby(group_col)[feature_std_col].agg(['mean', 'std'])

    # Within-group standardization: each group should have mean≈0, std≈1
    if (results['mean'].abs() < 0.1).all() and (results['std'].between(0.9, 1.1)).all():
        print("✅ CORRECT: Within-group standardization detected")
        return True
    # Global standardization: groups will have varying means and stds
    else:
        print("❌ WRONG: Global standardization detected!")
        print("Group means:", results['mean'].values[:5])
        print("Group stds:", results['std'].values[:5])
        return False

check_within_group_standardization(df, 'match_id', 'distance_run_std')
```

**Expected output:**
- CORRECT: All group means ≈ 0, all group stds ≈ 1
- WRONG: Group means vary widely, indicating global standardization

### Example 3: Fixing Cumulative Feature Error

**Input:**
- Existing pipeline using cumulative sums for momentum

**Steps:**

```python
# ❌ WRONG approach (existing code)
df['cumulative_wins'] = df.groupby('match_id')['point_won'].cumsum()
# Problem: This just counts total wins up to this point
# Doesn't capture recent momentum!

# ✅ CORRECT approach (fix)
# Replace cumulative with sliding window
window = 10
df['recent_win_rate'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# Compare
print("Cumulative (wrong):", df['cumulative_wins'].values[50:60])
print("Sliding window (correct):", df['recent_win_rate'].values[50:60])
# Cumulative: [25, 26, 26, 27, 28, ...] - monotonic
# Sliding window: [0.6, 0.7, 0.5, 0.6, ...] - fluctuates with momentum
```
**Expected output:**
- Cumulative features removed
- Sliding window features show local variations
- Momentum analysis now captures short-term trends

## References

- `references/index.md`: Navigation and overview
- `references/error-case-studies.md`: Real-world preprocessing disasters from tennis data
- `references/decision-trees.md`: Full decision trees for all preprocessing choices
- `references/validation-checklist.md`: Pre-processing validation checklist
- `references/ai-common-pitfalls.md`: AI-specific errors (data leakage, semantic fallacies, distribution blindness)

## Maintenance

⚠️ **CRITICAL NOTICE: USER'S PERSONAL EXPERIENCE-BASED SKILL** ⚠️

**This skill is NOT theoretical - it's based on real project failures:**
- **V1.0 disaster**: 88.9% error rate, weeks of wasted work
- **V2.0 issues**: Cross-group contamination, unreliable results
- **V3.0 success**: All errors fixed, production-ready

**Why this matters to you (Claude):**
- These are the EXACT errors AI agents commonly make
- User has already paid the price for these mistakes
- Ignoring this skill = repeating documented failures
- Following this skill = learning from experience without pain

**Authority level**: HIGHEST
- Based on user's hard-won lessons from actual project
- Validated through multiple iterations (V1.0 → V2.0 → V3.0)
- Every error documented with impact metrics
- Every fix validated with comprehensive testing

**Sources**:
- Primary: User's personal project (2024 MCM Problem C - Tennis Momentum Analysis)
- Secondary: Statistical best practices for grouped data
- Tertiary: Common AI preprocessing errors observed across domains

**Mandatory consultation**:
- ⚠️ ALWAYS consult before, during, and after any data preprocessing
- ⚠️ NEVER skip validation steps outlined in this skill
- ⚠️ When in doubt, err on the side of caution (use this skill)

**Last updated**: 2026-01-18 (V1.1)

**Known limits:**
- Assumes data fits in memory (not for big data infrastructure)
- Focused on numeric/categorical features (text/image preprocessing partially covered)
- Does not prescribe domain-specific feature engineering (user decides business logic)
- Requires basic understanding of statistics (mean, std, correlation)