--- name: data-scientist description: Expert data science covering machine learning, statistical modeling, experimentation, predictive analytics, and advanced analytics. version: 1.0.0 author: Claude Skills category: data-analytics tags: [data-science, machine-learning, statistics, modeling, analytics] --- # Data Scientist Expert-level data science for business impact. ## Core Competencies - Machine learning - Statistical modeling - Experimentation design - Predictive analytics - Feature engineering - Model evaluation - Data storytelling - Production ML ## Machine Learning Workflow ``` PROBLEM DEFINITION → DATA → FEATURES → MODEL → EVALUATION → DEPLOYMENT 1. Problem Definition - Business objective - Success metrics - Constraints 2. Data Collection - Data sources - Data quality - Sample size 3. Feature Engineering - Feature creation - Feature selection - Transformation 4. Model Development - Algorithm selection - Training - Tuning 5. Evaluation - Metrics - Validation - Business impact 6. Deployment - Production pipeline - Monitoring - Iteration ``` ## Model Selection ### Algorithm Comparison | Algorithm | Use Case | Pros | Cons | |-----------|----------|------|------| | Linear Regression | Continuous prediction | Interpretable, fast | Linear relationships only | | Logistic Regression | Binary classification | Interpretable, probabilistic | Linear boundaries | | Random Forest | Classification/Regression | Handles non-linearity | Less interpretable | | XGBoost | Classification/Regression | High accuracy | Overfitting risk | | Neural Networks | Complex patterns | Flexible | Requires lots of data | ### Model Selection Framework ```python def select_model(problem_type, data_size, interpretability_need, accuracy_need): """ problem_type: 'classification' or 'regression' data_size: 'small' (<10K), 'medium' (10K-1M), 'large' (>1M) interpretability_need: 'high', 'medium', 'low' accuracy_need: 'high', 'medium', 'low' """ if interpretability_need == 'high': if problem_type == 'classification': return 'Logistic Regression' else: return 'Linear Regression' if data_size == 'small': return 'Random Forest' if accuracy_need == 'high': if data_size == 'large': return 'Neural Network' else: return 'XGBoost' return 'Random Forest' ``` ## Feature Engineering ### Feature Types ```python # Numerical Features def engineer_numerical(df, col): features = { f'{col}_log': np.log1p(df[col]), f'{col}_sqrt': np.sqrt(df[col]), f'{col}_squared': df[col] ** 2, f'{col}_binned': pd.cut(df[col], bins=5, labels=False) } return pd.DataFrame(features) # Categorical Features def engineer_categorical(df, col): # One-hot encoding dummies = pd.get_dummies(df[col], prefix=col) # Target encoding target_mean = df.groupby(col)['target'].mean() target_encoded = df[col].map(target_mean) # Frequency encoding freq = df[col].value_counts(normalize=True) freq_encoded = df[col].map(freq) return dummies, target_encoded, freq_encoded # Time Features def engineer_time(df, col): df[col] = pd.to_datetime(df[col]) features = { f'{col}_hour': df[col].dt.hour, f'{col}_day': df[col].dt.day, f'{col}_dayofweek': df[col].dt.dayofweek, f'{col}_month': df[col].dt.month, f'{col}_is_weekend': df[col].dt.dayofweek.isin([5, 6]).astype(int), f'{col}_hour_sin': np.sin(2 * np.pi * df[col].dt.hour / 24), f'{col}_hour_cos': np.cos(2 * np.pi * df[col].dt.hour / 24) } return pd.DataFrame(features) ``` ### Feature Selection ```python from sklearn.feature_selection import mutual_info_classif, RFE from sklearn.ensemble import RandomForestClassifier def select_features(X, y, method='importance', n_features=20): if method == 'importance': model = RandomForestClassifier(n_estimators=100) model.fit(X, y) importance = pd.Series(model.feature_importances_, index=X.columns) return importance.nlargest(n_features).index.tolist() elif method == 'mutual_info': mi_scores = mutual_info_classif(X, y) mi_series = pd.Series(mi_scores, index=X.columns) return mi_series.nlargest(n_features).index.tolist() elif method == 'rfe': model = RandomForestClassifier(n_estimators=100) rfe = RFE(model, n_features_to_select=n_features) rfe.fit(X, y) return X.columns[rfe.support_].tolist() ``` ## Model Evaluation ### Classification Metrics ```python from sklearn.metrics import ( accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report ) def evaluate_classification(y_true, y_pred, y_proba=None): metrics = { 'accuracy': accuracy_score(y_true, y_pred), 'precision': precision_score(y_true, y_pred), 'recall': recall_score(y_true, y_pred), 'f1': f1_score(y_true, y_pred), } if y_proba is not None: metrics['auc_roc'] = roc_auc_score(y_true, y_proba) print(classification_report(y_true, y_pred)) print(confusion_matrix(y_true, y_pred)) return metrics ``` ### Regression Metrics ```python from sklearn.metrics import ( mean_absolute_error, mean_squared_error, r2_score, mean_absolute_percentage_error ) def evaluate_regression(y_true, y_pred): metrics = { 'mae': mean_absolute_error(y_true, y_pred), 'mse': mean_squared_error(y_true, y_pred), 'rmse': np.sqrt(mean_squared_error(y_true, y_pred)), 'r2': r2_score(y_true, y_pred), 'mape': mean_absolute_percentage_error(y_true, y_pred) } return metrics ``` ## Experimentation ### A/B Test Design ```python from scipy import stats def calculate_sample_size(baseline_rate, mde, alpha=0.05, power=0.8): """ Calculate required sample size per variant baseline_rate: Current conversion rate (e.g., 0.05) mde: Minimum detectable effect (e.g., 0.1 for 10% lift) alpha: Significance level power: Statistical power """ effect_size = baseline_rate * mde z_alpha = stats.norm.ppf(1 - alpha / 2) z_beta = stats.norm.ppf(power) p = baseline_rate p_new = p + effect_size n = (2 * p * (1 - p) * (z_alpha + z_beta) ** 2) / (effect_size ** 2) return int(np.ceil(n)) def analyze_ab_test(control, treatment, alpha=0.05): """ Analyze A/B test results control: array of 0/1 outcomes for control treatment: array of 0/1 outcomes for treatment """ n_control = len(control) n_treatment = len(treatment) p_control = control.mean() p_treatment = treatment.mean() # Pooled proportion p_pool = (control.sum() + treatment.sum()) / (n_control + n_treatment) # Standard error se = np.sqrt(p_pool * (1 - p_pool) * (1/n_control + 1/n_treatment)) # Z-statistic z = (p_treatment - p_control) / se # P-value (two-tailed) p_value = 2 * (1 - stats.norm.cdf(abs(z))) # Confidence interval ci_low = (p_treatment - p_control) - 1.96 * se ci_high = (p_treatment - p_control) + 1.96 * se return { 'control_rate': p_control, 'treatment_rate': p_treatment, 'lift': (p_treatment - p_control) / p_control, 'z_statistic': z, 'p_value': p_value, 'significant': p_value < alpha, 'confidence_interval': (ci_low, ci_high) } ``` ## Statistical Analysis ### Hypothesis Testing ```python from scipy import stats # T-test def compare_means(group1, group2): stat, p_value = stats.ttest_ind(group1, group2) effect_size = (group1.mean() - group2.mean()) / np.sqrt( (group1.std()**2 + group2.std()**2) / 2 ) return {'t_statistic': stat, 'p_value': p_value, 'cohens_d': effect_size} # Chi-square def test_independence(contingency_table): chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table) return {'chi2': chi2, 'p_value': p_value, 'degrees_of_freedom': dof} # Correlation def analyze_correlation(x, y): pearson_r, pearson_p = stats.pearsonr(x, y) spearman_r, spearman_p = stats.spearmanr(x, y) return { 'pearson': {'r': pearson_r, 'p_value': pearson_p}, 'spearman': {'r': spearman_r, 'p_value': spearman_p} } ``` ## Project Template ```markdown # Data Science Project: [Name] ## Business Objective [What business problem are we solving?] ## Success Metrics - Primary: [Metric] - Secondary: [Metric] ## Data - Sources: [List] - Size: [Rows/Features] - Time period: [Dates] ## Methodology 1. [Step 1] 2. [Step 2] ## Results ### Model Performance | Metric | Value | |--------|-------| | [Metric] | [Value] | ### Business Impact - [Impact 1] - [Impact 2] ## Recommendations 1. [Recommendation] ## Next Steps - [Next step] ## Appendix [Technical details] ``` ## Reference Materials - `references/ml_algorithms.md` - Algorithm deep dives - `references/feature_engineering.md` - Feature engineering patterns - `references/experimentation.md` - A/B testing guide - `references/statistics.md` - Statistical methods ## Scripts ```bash # Model trainer python scripts/train_model.py --config model_config.yaml # Feature importance analyzer python scripts/feature_importance.py --model model.pkl --data test.csv # A/B test analyzer python scripts/ab_analyzer.py --control control.csv --treatment treatment.csv # Model evaluator python scripts/evaluate_model.py --model model.pkl --test test.csv ```