--- name: tooluniverse-statistical-modeling description: Perform statistical modeling and regression analysis on biomedical datasets. Supports linear regression, logistic regression (binary/ordinal/multinomial), mixed-effects models, Cox proportional hazards survival analysis, Kaplan-Meier estimation, and comprehensive model diagnostics. Extracts odds ratios, hazard ratios, confidence intervals, p-values, and effect sizes. Designed to solve BixBench statistical reasoning questions involving clinical/experimental data. Use when asked to fit regression models, compute odds ratios, perform survival analysis, run statistical tests, or interpret model coefficients from provided data. --- # Statistical Modeling for Biomedical Data Analysis Comprehensive statistical modeling skill for fitting regression models, survival models, and mixed-effects models to biomedical data. Produces publication-quality statistical summaries with odds ratios, hazard ratios, confidence intervals, and p-values. ## Features ✅ **Linear Regression** - OLS for continuous outcomes with diagnostic tests ✅ **Logistic Regression** - Binary, ordinal, and multinomial models with odds ratios ✅ **Survival Analysis** - Cox proportional hazards and Kaplan-Meier curves ✅ **Mixed-Effects Models** - LMM/GLMM for hierarchical/repeated measures data ✅ **ANOVA** - One-way/two-way ANOVA, per-feature ANOVA for omics data ✅ **Model Diagnostics** - Assumption checking, fit statistics, residual analysis ✅ **Statistical Tests** - t-tests, chi-square, Mann-Whitney, Kruskal-Wallis, etc. ## Quick Start ### Binary Logistic Regression ```python import statsmodels.formula.api as smf import numpy as np # Fit logistic regression model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0) # Extract odds ratios odds_ratios = np.exp(model.params) conf_int = np.exp(model.conf_int()) print(f"Odds Ratio for exposure: {odds_ratios['exposure']:.4f}") print(f"95% CI: ({conf_int.loc['exposure', 0]:.4f}, {conf_int.loc['exposure', 1]:.4f})") print(f"P-value: {model.pvalues['exposure']:.6f}") ``` ### Cox Proportional Hazards ```python from lifelines import CoxPHFitter # Fit Cox model cph = CoxPHFitter() cph.fit(df[['time', 'event', 'treatment', 'age', 'stage']], duration_col='time', event_col='event') # Get hazard ratio hr = cph.hazard_ratios_['treatment'] print(f"Hazard Ratio: {hr:.4f}") print(f"Concordance: {cph.concordance_index_:.4f}") ``` See `QUICK_START.md` for 8 complete examples. ## Model Selection Decision Tree ``` START: What type of outcome variable? │ ├─ CONTINUOUS (height, blood pressure, score) │ ├─ Independent observations → Linear Regression (OLS) │ ├─ Repeated measures → Mixed-Effects Model (LMM) │ └─ Count data → Poisson/Negative Binomial │ ├─ BINARY (yes/no, disease/healthy) │ ├─ Independent observations → Logistic Regression │ ├─ Repeated measures → Logistic Mixed-Effects (GLMM/GEE) │ └─ Rare events → Firth logistic regression │ ├─ ORDINAL (mild/moderate/severe, stages I/II/III/IV) │ └─ Ordinal Logistic Regression (Proportional Odds) │ ├─ MULTINOMIAL (>2 unordered categories) │ └─ Multinomial Logistic Regression │ └─ TIME-TO-EVENT (survival time + censoring) ├─ Regression → Cox Proportional Hazards └─ Survival curves → Kaplan-Meier ``` ## When to Use Apply this skill when user asks: - "What is the odds ratio of X associated with Y?" - "What is the hazard ratio for treatment?" - "Fit a linear regression of Y on X1, X2, X3" - "Perform ordinal logistic regression for severity outcome" - "What is the Kaplan-Meier survival estimate at time T?" - "What is the percentage reduction in odds ratio after adjusting for confounders?" - "Run a mixed-effects model with random intercepts" - "Compute the interaction term between A and B" - "What is the F-statistic from ANOVA comparing groups?" - "Test if gene/miRNA expression differs across cell types" - "Perform one-way ANOVA on expression data" ## Workflow ### Phase 0: Data Validation **Goal**: Load data, identify variable types, check for missing values. **⚠️ CRITICAL: Identify the Outcome Variable First** Before any analysis, verify what you're actually predicting: 1. **Read the full question** - Look for "predict [outcome]", "model [outcome]", or "dependent variable" 2. **Examine available columns** - List all columns in the dataset 3. **Match question to data** - Find the column that matches the described outcome 4. **Verify outcome exists** - Don't create outcome variables from predictors **Common mistake (bix-51-q3 example)**: - ❌ Question mentions "obesity" → Assumed outcome = BMI ≥ 30 (circular logic with BMI predictor) - ✅ Read full question → Actual outcome = treatment response (PR vs non-PR) - **Always check data columns first**: `print(df.columns.tolist())` ```python import pandas as pd import numpy as np # Load data df = pd.read_csv('data.csv') # Check structure print(f"Observations: {len(df)}") print(f"Variables: {len(df.columns)}") print(f"Missing: {df.isnull().sum().sum()}") # Detect variable types for col in df.columns: n_unique = df[col].nunique() if n_unique == 2: print(f"{col}: binary") elif n_unique <= 10 and df[col].dtype == 'object': print(f"{col}: categorical ({n_unique} levels)") elif df[col].dtype in ['float64', 'int64']: print(f"{col}: continuous (mean={df[col].mean():.2f})") ``` ### Phase 1: Model Fitting **Goal**: Fit appropriate model based on outcome type. #### Linear Regression ```python import statsmodels.formula.api as smf # R-style formula (recommended) model = smf.ols('outcome ~ predictor1 + predictor2 + age', data=df).fit() # Results print(f"R-squared: {model.rsquared:.4f}") print(f"AIC: {model.aic:.2f}") print(model.summary()) ``` #### Logistic Regression ```python # Fit model model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0) # Odds ratios ors = np.exp(model.params) ci = np.exp(model.conf_int()) for var in ['exposure', 'age', 'sex_M']: print(f"{var}: OR={ors[var]:.4f}, CI=({ci.loc[var, 0]:.4f}, {ci.loc[var, 1]:.4f})") ``` #### Ordinal Logistic Regression ```python from statsmodels.miscmodels.ordinal_model import OrderedModel # Prepare ordered outcome severity_order = ['Mild', 'Moderate', 'Severe'] df['severity'] = pd.Categorical(df['severity'], categories=severity_order, ordered=True) y = df['severity'].cat.codes # Fit model X = pd.get_dummies(df[['exposure', 'age', 'sex']], drop_first=True, dtype=float) model = OrderedModel(y, X, distr='logit').fit(method='bfgs', disp=0) # Odds ratios ors = np.exp(model.params[:len(X.columns)]) print(f"Exposure OR: {ors[0]:.4f}") ``` #### Cox Proportional Hazards ```python from lifelines import CoxPHFitter # Fit model cph = CoxPHFitter() cph.fit(df[['time', 'event', 'treatment', 'age']], duration_col='time', event_col='event') # Hazard ratios print(f"HR (treatment): {cph.hazard_ratios_['treatment']:.4f}") print(f"Concordance: {cph.concordance_index_:.4f}") ``` See `references/` for detailed examples of each model type. #### Statistical Tests (t-test, ANOVA, Chi-square) **One-way ANOVA**: Compare means across ≥3 groups ```python from scipy import stats # Single ANOVA (one outcome, multiple groups) group1 = df[df['celltype'] == 'CD4']['expression'] group2 = df[df['celltype'] == 'CD8']['expression'] group3 = df[df['celltype'] == 'CD14']['expression'] f_stat, p_value = stats.f_oneway(group1, group2, group3) print(f"F-statistic: {f_stat:.4f}, p-value: {p_value:.6f}") ``` **⚠️ CRITICAL: Multi-feature ANOVA Decision Tree** When data has **multiple features** (genes, miRNAs, metabolites, etc.), there are TWO approaches: ``` Question: "What is the F-statistic comparing [feature] expression across groups?" DECISION TREE: │ ├─ Does question specify "the F-statistic" (singular)? │ │ │ ├─ YES, singular → Likely asking for SPECIFIC FEATURE(S) F-statistic │ │ │ │ │ ├─ Are there thousands of features (genes, miRNAs)? │ │ │ YES → Per-feature approach (Method B below) │ │ │ │ │ └─ Is there one feature of interest? │ │ YES → Single feature ANOVA (Method A below) │ │ │ └─ NO, asks about "all features" or "genes" (plural)? │ YES → Aggregate approach or per-feature summary │ └─ When unsure: Calculate PER-FEATURE and report summary statistics ``` **Method A: Aggregate ANOVA** (all features combined) - Use when: Testing overall expression differences across all features - Result: Single F-statistic representing global effect ```python # Flatten all features across all samples per group groups_agg = [] for celltype in ['CD4', 'CD8', 'CD14']: samples = df[df['celltype'] == celltype] # Flatten: all features × all samples in this group all_values = expression_matrix.loc[:, samples.index].values.flatten() groups_agg.append(all_values) f_stat_agg, p_value = stats.f_oneway(*groups_agg) print(f"Aggregate F-statistic: {f_stat_agg:.4f}") # Result: Very large F-statistic (e.g., 153.8) ``` **Method B: Per-feature ANOVA** (each feature separately) ⭐ RECOMMENDED for gene expression - Use when: Testing EACH feature individually (most common in genomics) - Result: Distribution of F-statistics (one per feature) ```python # Calculate F-statistic FOR EACH FEATURE separately per_feature_f_stats = [] for feature in expression_matrix.index: # For each gene/miRNA/metabolite groups = [] for celltype in ['CD4', 'CD8', 'CD14']: samples = df[df['celltype'] == celltype] # Get expression of THIS feature in THIS cell type values = expression_matrix.loc[feature, samples.index].values groups.append(values) f_stat, _ = stats.f_oneway(*groups) if not np.isnan(f_stat): per_feature_f_stats.append((feature, f_stat)) # Summary statistics f_values = [f for _, f in per_feature_f_stats] print(f"Per-feature F-statistics:") print(f" Median: {np.median(f_values):.4f}") print(f" Mean: {np.mean(f_values):.4f}") print(f" Range: [{np.min(f_values):.4f}, {np.max(f_values):.4f}]") # Find features in specific range (e.g., 0.76-0.78) target_features = [(name, f) for name, f in per_feature_f_stats if 0.76 <= f <= 0.78] if target_features: print(f"Features with F ∈ [0.76, 0.78]: {len(target_features)}") for name, f_val in target_features: print(f" {name}: F = {f_val:.6f}") ``` **Key Differences**: | Aspect | Method A (Aggregate) | Method B (Per-feature) | |--------|---------------------|------------------------| | **Interpretation** | Overall expression difference | Feature-specific differences | | **Result** | 1 F-statistic | N F-statistics (N = # features) | | **Typical value** | Very large (e.g., 153.8) | Small to large (e.g., 0.1 to 100+) | | **Use case** | Global effect size | Gene/biomarker discovery | | **Common in** | Rarely used | **Genomics, proteomics, metabolomics** ⭐ | **Real-world example (BixBench bix-36-q1)**: - Question: "What is the F-statistic comparing miRNA expression across immune cell types?" - Expected: 0.76-0.78 - Method A (aggregate): 153.836 ❌ **WRONG** - Method B (per-miRNA): Found 2 miRNAs with F ∈ [0.76, 0.78] ✅ **CORRECT** **Default assumption for gene expression data**: Use **Method B (per-feature)** ### Phase 2: Model Diagnostics **Goal**: Check model assumptions and fit quality. #### OLS Diagnostics ```python from scipy import stats as scipy_stats from statsmodels.stats.diagnostic import het_breuschpagan # Residual normality residuals = model.resid sw_stat, sw_p = scipy_stats.shapiro(residuals) print(f"Shapiro-Wilk: p={sw_p:.4f} (normal if p>0.05)") # Heteroscedasticity bp_stat, bp_p, _, _ = het_breuschpagan(residuals, model.model.exog) print(f"Breusch-Pagan: p={bp_p:.4f} (homoscedastic if p>0.05)") # VIF (multicollinearity) from statsmodels.stats.outliers_influence import variance_inflation_factor X = model.model.exog for i in range(1, X.shape[1]): # Skip intercept vif = variance_inflation_factor(X, i) print(f"{model.model.exog_names[i]}: VIF={vif:.2f}") ``` #### Proportional Hazards Test ```python # Test PH assumption for Cox model results = cph.check_assumptions(df, p_value_threshold=0.05, show_plots=False) if len(results) == 0: print("✅ Proportional hazards assumption met") else: print(f"⚠️ PH violated for: {results}") ``` See `references/troubleshooting.md` for common diagnostic issues. ### Phase 3: Interpretation **Goal**: Generate publication-quality summary. #### Odds Ratio Interpretation ```python def interpret_odds_ratio(or_val, ci_lower, ci_upper, p_value): """Interpret odds ratio with clinical meaning.""" if or_val > 1: pct_increase = (or_val - 1) * 100 direction = f"{pct_increase:.1f}% increase in odds" else: pct_decrease = (1 - or_val) * 100 direction = f"{pct_decrease:.1f}% decrease in odds" sig = "significant" if p_value < 0.05 else "not significant" ci_contains_null = ci_lower <= 1 <= ci_upper return f"{direction} (OR={or_val:.4f}, 95% CI [{ci_lower:.4f}, {ci_upper:.4f}], p={p_value:.6f}, {sig})" ``` ## Common BixBench Patterns ### Pattern 1: Odds Ratio from Ordinal Regression **Question**: "What is the odds ratio of disease severity associated with exposure?" **Solution**: 1. Identify ordinal outcome (mild/moderate/severe) 2. Fit ordinal logistic regression (proportional odds model) 3. Extract OR = exp(coefficient for exposure) 4. Report with CI and p-value ### Pattern 2: Percentage Reduction in Odds **Question**: "What is the percentage reduction in OR after adjusting for confounders?" **Solution**: ```python # Unadjusted model model_crude = smf.logit('outcome ~ exposure', data=df).fit(disp=0) or_crude = np.exp(model_crude.params['exposure']) # Adjusted model model_adj = smf.logit('outcome ~ exposure + age + sex', data=df).fit(disp=0) or_adj = np.exp(model_adj.params['exposure']) # Percentage reduction pct_reduction = (or_crude - or_adj) / or_crude * 100 print(f"Percentage reduction: {pct_reduction:.1f}%") ``` ### Pattern 3: Interaction Effects **Question**: "What is the odds ratio for the interaction between A and B?" **Solution**: ```python # Fit model with interaction model = smf.logit('outcome ~ A * B + age', data=df).fit(disp=0) # Interaction OR interaction_coef = model.params['A:B'] interaction_or = np.exp(interaction_coef) print(f"Interaction OR: {interaction_or:.4f}") ``` ### Pattern 4: Survival Analysis **Question**: "What is the hazard ratio for treatment?" **Solution**: 1. Load survival data (time, event, covariates) 2. Fit Cox proportional hazards model 3. Extract HR = exp(coefficient) 4. Report with CI and concordance index ### Pattern 5: Multi-feature ANOVA (Gene Expression) **Question**: "What is the F-statistic comparing miRNA expression across cell types?" **Solution**: 1. Identify that data has multiple features (genes/miRNAs) 2. Use **per-feature ANOVA** (NOT aggregate) 3. Calculate F-statistic for EACH feature separately 4. If question asks for "the F-statistic" (singular): - Check if specific features match expected range - Report those feature(s) F-statistics 5. If question asks for summary: report median/mean/distribution **Critical**: For gene expression data, default to per-feature ANOVA. Aggregate ANOVA gives F-statistics ~200× larger and is rarely correct. See `references/bixbench_patterns.md` for 15+ question patterns. ## Statsmodels vs Scikit-learn | Use Case | Library | Reason | |----------|---------|--------| | **Inference** (p-values, CIs, ORs) | **statsmodels** | Full statistical output | | **Prediction** (accuracy, AUC) | **scikit-learn** | Better prediction tools | | **Mixed-effects models** | **statsmodels** | Only option | | **Regularization** (LASSO, Ridge) | **scikit-learn** | Better optimization | | **Survival analysis** | **lifelines** | Specialized library | **General rule**: Use statsmodels for BixBench questions (they ask for p-values, ORs, HRs). ## Python Package Requirements ``` statsmodels>=0.14.0 scikit-learn>=1.3.0 lifelines>=0.27.0 pandas>=2.0.0 numpy>=1.24.0 scipy>=1.10.0 ``` ## File Structure ``` tooluniverse-statistical-modeling/ ├── SKILL.md # This file ├── QUICK_START.md # 8 quick examples ├── EXAMPLES.md # Legacy examples (kept for reference) ├── TOOLS_REFERENCE.md # ToolUniverse tool catalog ├── test_skill.py # Comprehensive test suite ├── references/ │ ├── logistic_regression.md # Detailed logistic examples │ ├── ordinal_logistic.md # Ordinal logit guide │ ├── cox_regression.md # Survival analysis guide │ ├── linear_models.md # OLS and mixed-effects │ ├── bixbench_patterns.md # 15+ question patterns │ └── troubleshooting.md # Diagnostic issues └── scripts/ ├── format_statistical_output.py # Format results for reporting └── model_diagnostics.py # Automated diagnostics ``` ## Key Principles 1. **Data-first approach** - Always inspect and validate data before modeling 2. **Model selection by outcome type** - Use decision tree above 3. **Assumption checking** - Verify model assumptions (linearity, proportional hazards, etc.) 4. **Complete reporting** - Always report effect sizes, CIs, p-values, and model fit statistics 5. **Confounder awareness** - Adjust for confounders when specified or clinically relevant 6. **Reproducible analysis** - All code must be deterministic and reproducible 7. **Robust error handling** - Graceful handling of convergence failures, separation, collinearity 8. **Round correctly** - Match the precision requested (typically 2-4 decimal places) ## Completeness Checklist Before finalizing any statistical analysis: - [ ] **Outcome variable identified**: Verified which column is the actual outcome (not assumed) - [ ] **Data validated**: N, missing values, variable types confirmed - [ ] **Multi-feature data identified**: If data has multiple features (genes, miRNAs), use per-feature approach - [ ] **Model appropriate**: Outcome type matches model family - [ ] **Assumptions checked**: Relevant diagnostics performed - [ ] **Effect sizes reported**: OR/HR/Cohen's d with CIs - [ ] **P-values reported**: With appropriate correction if needed - [ ] **Model fit assessed**: R-squared, AIC/BIC, concordance - [ ] **Results interpreted**: Plain-language interpretation - [ ] **Precision correct**: Numbers rounded appropriately - [ ] **Confounders addressed**: Adjusted analyses if applicable ## References - **statsmodels**: https://www.statsmodels.org/ - **lifelines**: https://lifelines.readthedocs.io/ - **scikit-learn**: https://scikit-learn.org/ - **Ordinal models**: statsmodels.miscmodels.ordinal_model.OrderedModel ## ToolUniverse Integration While this skill is primarily computational, ToolUniverse tools can provide data: | Use Case | Tools | |----------|-------| | Clinical trial data | `clinical_trials_search` | | Drug safety outcomes | `FAERS_calculate_disproportionality` | | Gene-disease associations | `OpenTargets_target_disease_evidence` | | Biomarker data | `fda_pharmacogenomic_biomarkers` | See `TOOLS_REFERENCE.md` for complete tool catalog. ## Support For detailed examples and troubleshooting: - **Logistic regression**: `references/logistic_regression.md` - **Ordinal models**: `references/ordinal_logistic.md` - **Survival analysis**: `references/cox_regression.md` - **Linear/mixed models**: `references/linear_models.md` - **BixBench patterns**: `references/bixbench_patterns.md` - **Diagnostics**: `references/troubleshooting.md`