---
name: statistical-analysis
description: >-
  Guided statistical analysis: test choice, assumption checks, effect sizes, power,
  APA reporting. Pick tests, verify assumptions, or format results for publication.
  Covers frequentist (t-test, ANOVA, chi-square, regression, correlation, survival,
  count, reliability) and Bayesian. Use statsmodels or pymc-bayesian-modeling to fit.
license: CC-BY-4.0
---

# Statistical Analysis

## Overview

Statistical analysis is the systematic process of selecting appropriate tests, verifying assumptions, quantifying effect magnitudes, and reporting results. This knowhow guides test selection, assumption diagnostics, and APA-style reporting for frequentist and Bayesian analyses in academic research.

## Key Concepts

### Frequentist vs Bayesian Framework

| Aspect | Frequentist | Bayesian |
|--------|-------------|----------|
| Core output | p-value, confidence interval | Posterior distribution, credible interval |
| Interpretation | "How likely are these data if H0 is true?" | "How likely is H1 given the data?" |
| Null support | Cannot support H0 (only fail to reject) | Can quantify evidence for H0 via Bayes factor |
| Prior info | Not used | Incorporated via prior distributions |
| Sample size | Requires adequate power | Works with any sample size |
| Best for | Standard analyses, large samples | Small samples, prior info, complex models |

### Statistical vs Practical Significance

A statistically significant result (p < .05) may be trivially small in practice. Always report:

- **Effect size**: Magnitude of the effect (Cohen's d, eta-squared, r, R-squared)
- **Confidence interval**: Precision of the estimate
- **Context**: Clinical/practical relevance in the domain

### Common Effect Sizes

| Test | Effect Size | Small | Medium | Large |
|------|-------------|-------|--------|-------|
| t-test | Cohen's d | 0.20 | 0.50 | 0.80 |
| t-test (small n) | Hedges' g | 0.20 | 0.50 | 0.80 |
| ANOVA | partial eta-squared | 0.01 | 0.06 | 0.14 |
| ANOVA | omega-squared | 0.01 | 0.06 | 0.14 |
| Correlation | r | 0.10 | 0.30 | 0.50 |
| Regression | R-squared | 0.02 | 0.13 | 0.26 |
| Regression | f-squared | 0.02 | 0.15 | 0.35 |
| Chi-square | Cramer's V | 0.07 | 0.21 | 0.35 |
| Chi-square 2x2 | phi coefficient | 0.10 | 0.30 | 0.50 |

Cohen's benchmarks are guidelines, not rigid thresholds -- domain context always matters.

### Assumptions Overview

Most parametric tests require:

1. **Independence**: Observations are independent of each other
2. **Normality**: Data (or residuals) are approximately normally distributed
3. **Homogeneity of variance**: Groups have similar variances (for group comparisons)
4. **Linearity**: Relationship between variables is linear (for regression)

When assumptions are violated:

- **Normality violated, n > 30**: Proceed -- parametric tests are robust with large samples
- **Normality violated, n < 30**: Use a non-parametric alternative
- **Variance heterogeneity**: Use Welch's correction (t-test) or Welch's ANOVA
- **Linearity violated**: Add polynomial terms, transform variables, or use GAMs

### Test-Specific Assumption Workflows

**T-test assumptions**: (1) Check normality per group with Shapiro-Wilk + Q-Q plots. (2) Check homogeneity with Levene's test. (3) If normality is violated: Mann-Whitney U (independent) or Wilcoxon signed-rank (paired). If variances are heterogeneous: use Welch's t-test.

**ANOVA assumptions**: (1) Normality per group. (2) Homogeneity via Levene's test. (3) For repeated measures: check sphericity (Mauchly's test); if violated, apply the Greenhouse-Geisser (epsilon < 0.75) or Huynh-Feldt (epsilon > 0.75) correction. (4) If normality is violated: Kruskal-Wallis (independent) or Friedman (repeated).

**Linear regression assumptions**: (1) Linearity via residuals-vs-fitted plot. (2) Independence via Durbin-Watson test (1.5-2.5 acceptable). (3) Homoscedasticity via Breusch-Pagan test + scale-location plot. (4) Normality of residuals via Q-Q plot + Shapiro-Wilk. (5) Multicollinearity via VIF (>10 = severe, >5 = moderate).

**Logistic regression assumptions**: (1) Independence. (2) Linearity of the log-odds with continuous predictors (Box-Tidwell test). (3) No perfect multicollinearity (VIF). (4) Adequate sample size (10-20 events per predictor minimum).
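These workflows map directly onto a handful of scipy and statsmodels calls. Below is a minimal sketch, assuming a two-group comparison and a small OLS model on simulated data; the variable names, simulated values, and decision thresholds are illustrative, not part of any bundled script:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(42)
group_a, group_b = rng.normal(5.0, 1.0, 40), rng.normal(5.6, 1.4, 40)

# --- Two-group workflow: normality per group, then homogeneity of variance ---
normal = all(stats.shapiro(g).pvalue > .05 for g in (group_a, group_b))
equal_var = stats.levene(group_a, group_b).pvalue > .05

if not normal:
    test = stats.mannwhitneyu(group_a, group_b)                 # non-parametric fallback
elif not equal_var:
    test = stats.ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test
else:
    test = stats.ttest_ind(group_a, group_b)                    # Student's t-test
print(f"statistic = {test.statistic:.2f}, p = {test.pvalue:.3f}")

# --- Regression workflow: Durbin-Watson, Breusch-Pagan, VIF ---
X = sm.add_constant(rng.normal(size=(100, 2)))                  # constant + 2 predictors
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=100)
fit = sm.OLS(y, X).fit()

print("Durbin-Watson:", durbin_watson(fit.resid))               # ~2 = independent residuals
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan p:", lm_p)                                 # p < .05 -> heteroscedastic
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print("VIF per predictor:", vifs)                               # >5 moderate, >10 severe
```

The same check-then-branch pattern extends to repeated-measures designs (Mauchly's test, then a Greenhouse-Geisser or Huynh-Feldt correction).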
### Specialized Test Categories

Beyond the main decision flowchart, several specialized test families address specific data types:

**Survival / time-to-event analysis**:

- **Log-rank test**: Compares survival curves between groups (non-parametric)
- **Cox proportional hazards**: Models time-to-event with covariates; assumes proportional hazards
- **Parametric survival models**: Weibull, exponential, log-normal for known distributional forms
- Use when the outcome is time until an event (death, relapse, failure) with possible censoring

**Count outcome models**:

- **Poisson regression**: For count data where the mean approximately equals the variance
- **Negative binomial regression**: For overdispersed counts (variance > mean)
- **Zero-inflated models**: For excess zeros beyond what Poisson/NB predicts
- Use when the outcome is a count (number of events, incidents, occurrences); see the sketch at the end of this subsection

**Agreement and reliability**:

- **Cohen's kappa**: Inter-rater agreement for categorical ratings (2 raters)
- **Fleiss' kappa / Krippendorff's alpha**: Agreement for >2 raters
- **Intraclass correlation coefficient (ICC)**: Reliability of continuous ratings
- **Cronbach's alpha**: Internal consistency of multi-item scales
- **Bland-Altman analysis**: Agreement between two measurement methods (continuous)
- Use when assessing measurement reliability or inter-rater consistency

**Categorical data extensions**:

- **McNemar's test**: Paired binary outcomes (2x2)
- **Cochran's Q test**: Paired binary outcomes (3+ conditions)
- **Cochran-Armitage trend test**: Ordered categories in contingency tables
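To make the count-model family concrete, here is a minimal sketch using statsmodels' GLM interface on simulated counts; the dispersion-ratio cutoff and the fixed negative-binomial alpha are illustrative simplifications (in practice alpha is usually estimated), and the outcome is simulated without a real covariate effect purely to show the overdispersion check:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = sm.add_constant(rng.normal(size=(200, 1)))
# Negative-binomial-like counts: extra variance relative to a Poisson process
y = rng.negative_binomial(n=2, p=0.3, size=200)

# Fit Poisson first, then check for overdispersion
poisson = sm.GLM(y, X, family=sm.families.Poisson()).fit()
dispersion = poisson.pearson_chi2 / poisson.df_resid
print(f"Dispersion ratio: {dispersion:.2f}")        # >> 1 suggests overdispersion

if dispersion > 1.5:                                # illustrative cutoff
    # Negative binomial handles variance > mean; alpha is fixed here, not estimated
    nb = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=1.0)).fit()
    print("NB coefficients:", nb.params)
```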
## Decision Framework

### Test Selection Flowchart

```
What is your research question?
|
+-- Comparing GROUPS on a continuous outcome?
|   |
|   +-- How many groups?
|   |   +-- 2 groups
|   |   |   +-- Independent -> Independent t-test (or Mann-Whitney U)
|   |   |   +-- Paired/repeated -> Paired t-test (or Wilcoxon signed-rank)
|   |   +-- 3+ groups
|   |       +-- Independent -> One-way ANOVA (or Kruskal-Wallis)
|   |       +-- Repeated -> Repeated-measures ANOVA (or Friedman)
|   |
|   +-- Multiple factors? -> Factorial ANOVA / Mixed ANOVA
|   +-- With covariates? -> ANCOVA
|
+-- Testing a RELATIONSHIP between variables?
|   |
|   +-- Both continuous?
|   |   +-- Normal -> Pearson correlation
|   |   +-- Non-normal or ordinal -> Spearman correlation
|   |
|   +-- Predicting continuous outcome?
|   |   +-- 1 predictor -> Simple linear regression
|   |   +-- Multiple predictors -> Multiple linear regression
|   |
|   +-- Predicting categorical outcome?
|   |   +-- Binary -> Logistic regression
|   |   +-- Ordinal -> Ordinal logistic regression
|   |
|   +-- Predicting count outcome?
|   |   +-- Equidispersed -> Poisson regression
|   |   +-- Overdispersed -> Negative binomial regression
|   |   +-- Excess zeros -> Zero-inflated Poisson/NB
|   |
|   +-- Time-to-event outcome?
|       +-- Compare survival curves -> Log-rank test
|       +-- With covariates -> Cox proportional hazards
|
+-- Testing ASSOCIATION between categorical variables?
|   +-- Expected cell count >= 5 -> Chi-square test
|   +-- Expected cell count < 5 -> Fisher's exact test
|   +-- Ordered categories -> Cochran-Armitage trend test
|   +-- Paired categories -> McNemar's test
|
+-- Assessing AGREEMENT / RELIABILITY?
    +-- Categorical, 2 raters -> Cohen's kappa
    +-- Categorical, >2 raters -> Fleiss' kappa
    +-- Continuous ratings -> ICC
    +-- Two measurement methods -> Bland-Altman analysis
    +-- Internal consistency -> Cronbach's alpha
```

### Quick Reference Table

| Research Question | Data Type | Normal? | Test | Non-parametric Alternative |
|-------------------|-----------|---------|------|---------------------------|
| 2 independent groups | Continuous | Yes | Independent t-test | Mann-Whitney U |
| 2 paired groups | Continuous | Yes | Paired t-test | Wilcoxon signed-rank |
| 3+ independent groups | Continuous | Yes | One-way ANOVA | Kruskal-Wallis |
| 3+ repeated groups | Continuous | Yes | Repeated-measures ANOVA | Friedman test |
| 2 variables | Continuous | Yes | Pearson r | Spearman rho |
| Predict continuous | Mixed | -- | Linear regression | -- |
| Predict binary | Mixed | -- | Logistic regression | -- |
| Predict counts | Count | -- | Poisson / Negative binomial | -- |
| Time-to-event | Survival | -- | Cox PH / Log-rank | -- |
| 2 categorical | Categorical | -- | Chi-square / Fisher's exact | -- |
| Rater agreement | Categorical | -- | Cohen's kappa / Fleiss' kappa | -- |
| Method agreement | Continuous | -- | Bland-Altman / ICC | -- |

## Best Practices

1. **Pre-register analyses** when possible to distinguish confirmatory from exploratory findings. Specify the primary outcome, tests, and correction methods before data collection
2. **Always check assumptions before interpreting results**. Run normality tests (Shapiro-Wilk), homogeneity tests (Levene's), and residual diagnostics. Document the results even when assumptions are met
3. **Report effect sizes with confidence intervals** for every test. p-values alone are insufficient -- effect sizes convey practical importance
4. **Report all planned analyses**, including non-significant findings. Selective reporting inflates false positive rates
5. **Use appropriate multiple comparison corrections**: Bonferroni (conservative), Holm (step-down, less conservative), or FDR/Benjamini-Hochberg (for many tests). Choose based on the number of comparisons and the acceptable error rate (see the sketch after this list)
6. **Visualize data before and after analysis**. Box plots for group comparisons, scatter plots for correlations, residual plots for regression diagnostics
7. **Conduct sensitivity analyses** to assess robustness: re-run with outliers removed, different transformations, or alternative tests
8. **Anti-pattern -- p-hacking**: Testing multiple outcomes, subgroups, or model specifications until p < .05 inflates false positives. Pre-register to avoid this
9. **Anti-pattern -- HARKing** (Hypothesizing After Results are Known): Presenting exploratory findings as confirmatory undermines scientific integrity
10. **Anti-pattern -- misinterpreting non-significance**: Failure to reject H0 does not mean H0 is true. Use Bayesian methods or equivalence testing to support the null
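For practice 5, the common correction methods sit behind a single statsmodels function. A minimal sketch, using made-up p-values from five planned comparisons:

```python
from statsmodels.stats.multitest import multipletests

# Raw p-values from, e.g., five planned pairwise comparisons (illustrative values)
p_raw = [0.003, 0.021, 0.047, 0.160, 0.512]

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(p_raw, alpha=0.05, method=method)
    print(f"{method:>10}: adjusted p = {[round(p, 3) for p in p_adj]}, reject = {list(reject)}")
```

Reporting both the raw and adjusted p-values keeps the correction transparent (see Common Pitfalls, item 5).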
## Common Pitfalls

1. **Misinterpreting p-values as the probability that the hypothesis is true**. p-values measure P(data | H0), not P(H0 | data). *How to avoid*: Use precise language: "If the null hypothesis were true, the probability of observing data this extreme is p = ..."
2. **Confusing statistical significance with practical importance**. A large sample can make trivially small effects significant. *How to avoid*: Always report and interpret effect sizes alongside p-values
3. **Running post-hoc power analysis after a non-significant result**. Post-hoc power is a mathematical function of the p-value and adds no new information. *How to avoid*: Use a sensitivity analysis instead -- determine what effect size the study could detect at 80% power (see the sketch after this list)
4. **Ignoring assumption violations and proceeding with parametric tests**. *How to avoid*: Run assumption checks systematically. Use Welch's corrections, non-parametric alternatives, or transformations when assumptions are violated
5. **Multiple comparisons without correction**. Running 20 tests at alpha = .05 gives a ~64% chance of at least one false positive. *How to avoid*: Apply Bonferroni, Holm, or FDR correction. Report both corrected and uncorrected p-values
6. **Treating ordinal data as continuous**. Likert scales are ordinal -- means and standard deviations assume equal intervals. *How to avoid*: Use non-parametric tests (Mann-Whitney, Kruskal-Wallis) or ordinal regression
7. **Ignoring missing data patterns**. Listwise deletion assumes MCAR, which is rarely true. *How to avoid*: Assess the missingness mechanism (MCAR, MAR, MNAR). Use multiple imputation for MAR data
8. **Confusing correlation with causation**. Observational studies cannot establish causal relationships, regardless of effect size. *How to avoid*: Use causal language only for experimental designs with random assignment
9. **Not reporting non-significant results**. Publication bias and the file-drawer effect distort the literature. *How to avoid*: Report all pre-registered analyses. Consider registered reports
10. **Using one-tailed tests to "improve" significance**. One-tailed tests should be pre-specified based on strong directional hypotheses. *How to avoid*: Default to two-tailed tests. Only use one-tailed tests when justified a priori
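Two of the remedies above can be shown concretely: a sensitivity analysis in place of post-hoc power (pitfall 3), and an equivalence test when you need to support the null (Best Practices, item 10). A minimal sketch, assuming 40 participants per group and an equivalence margin of ±0.5 raw units; the margin and the simulated data are purely illustrative:

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.weightstats import ttost_ind

# Sensitivity analysis: what effect size could this study detect at 80% power?
detectable_d = TTestIndPower().solve_power(nobs1=40, alpha=0.05, power=0.80, ratio=1.0)
print(f"Smallest detectable effect: d = {detectable_d:.2f}")

# Equivalence test (TOST): can we support the null within +/- 0.5 raw units?
rng = np.random.default_rng(3)
group_a, group_b = rng.normal(10.0, 2.0, 40), rng.normal(10.1, 2.0, 40)
p_tost, lower, upper = ttost_ind(group_a, group_b, low=-0.5, upp=0.5, usevar="unequal")
print(f"TOST p = {p_tost:.3f}")   # p < .05 -> difference lies within the equivalence bounds
```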
## Workflow

### Standard Analysis Pipeline

1. **Define research question and hypotheses**
   - State H0 and H1 explicitly
   - Specify the primary outcome and covariates
2. **Select statistical test** (use the Decision Framework above)
   - Match the test to the data type, design, and assumptions
   - Plan multiple comparison corrections if needed
3. **Conduct a priori power analysis**
   - Specify the target effect size (from the literature or clinical relevance)
   - Set alpha = .05, power = .80 (minimum); determine the required n
   - Libraries: `statsmodels.stats.power`, `pingouin` (see the sketch after this list)
4. **Inspect and clean data**
   - Check for missing data patterns (MCAR/MAR/MNAR)
   - Identify outliers (IQR method: below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, or |z| > 3)
   - Verify variable types and coding
   - The original `assumption_checks.py` script provides automated normality, homogeneity, and outlier detection with visualization
5. **Check assumptions** (see Test-Specific Assumption Workflows above)
   - Normality: Shapiro-Wilk test + Q-Q plots (visual inspection primary for n > 50)
   - Homogeneity: Levene's test + box plots
   - Linearity: Residual plots (for regression)
   - Sphericity: Mauchly's test (for repeated measures)
   - Document results and remedial actions
6. **Run primary analysis**
   - Execute the planned test with an appropriate library (scipy.stats, pingouin, statsmodels)
   - Calculate the effect size and confidence interval
   - For Bayesian analyses: specify priors, run MCMC, check convergence (see `references/bayesian_statistics.md`)
7. **Conduct post-hoc and secondary analyses**
   - Post-hoc pairwise comparisons (Tukey HSD, Bonferroni)
   - Sensitivity analyses (remove outliers, alternative methods)
   - Exploratory analyses (clearly labeled)
8. **Report results in APA format**
   - Descriptive statistics (M, SD, n per group)
   - Test statistic, degrees of freedom, exact p-value
   - Effect size with confidence interval
   - See `references/reporting_standards.md` for templates
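Steps 3, 6, and 8 chain together in a few lines. A minimal sketch on simulated data, assuming a two-group design; the Welch degrees of freedom and the standard error of d use standard textbook approximations rather than any bundled helper:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

# Step 3: a priori power analysis -- n per group to detect d = 0.5 at 80% power
n_required = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required n per group: {int(np.ceil(n_required))}")

# (data collection would happen here; simulated for illustration)
rng = np.random.default_rng(11)
treatment = rng.normal(5.6, 2.0, 64)
control = rng.normal(4.8, 2.0, 64)

# Step 6: Welch's t-test plus Cohen's d with an approximate 95% CI
t_stat, p_val = stats.ttest_ind(treatment, control, equal_var=False)
n1, n2 = len(treatment), len(control)
v1, v2 = treatment.var(ddof=1), control.var(ddof=1)
df_welch = (v1 / n1 + v2 / n2) ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
sd_pooled = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
d = (treatment.mean() - control.mean()) / sd_pooled
se_d = np.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))  # Hedges & Olkin approximation
ci_low, ci_high = d - 1.96 * se_d, d + 1.96 * se_d

# Step 8: APA-style report string
print(f"t({df_welch:.1f}) = {t_stat:.2f}, p = {p_val:.3f}, "
      f"d = {d:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

If pingouin is available, `pingouin.ttest` bundles the test statistic, Cohen's d, and confidence interval output into a single call, replacing most of the manual arithmetic above.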
## Bundled Resources

- **`references/effect_sizes_and_power.md`** -- Detailed guide to calculating, interpreting, and reporting effect sizes (Cohen's d, Hedges' g, Glass's delta, eta-squared, omega-squared, partial eta-squared, phi coefficient, standardized beta, f-squared, Cramer's V, odds ratio); a priori, sensitivity, and correlation power analysis with code examples. Condensed from the 582-line original.
- **`references/bayesian_statistics.md`** -- Comprehensive Bayesian analysis guide: Bayes' theorem, prior specification, ROPE (Region of Practical Equivalence), prior sensitivity analysis, Bayesian t-test/ANOVA/correlation/regression, hierarchical models, model comparison (WAIC/LOO), convergence diagnostics. Condensed from the 662-line original.
- **`references/reporting_standards.md`** -- APA-style reporting templates for t-tests, ANOVA, regression, correlation, chi-square, non-parametric, and Bayesian analyses; pre-registration guidance; methods section templates (participants, design, measures); null results reporting; reporting checklist. Condensed from the 470-line original.

### Fully-Consolidated Files (no separate reference file)

- **`test_selection_guide.md`** (130 lines original) -- Fully consolidated into the Decision Framework (flowchart + Quick Reference Table) and the Specialized Test Categories subsection in Key Concepts. Combined coverage: flowchart (~35 lines) + Quick Reference Table (~15 lines) + Specialized Test Categories (~35 lines) = ~85 lines covering all original capabilities. Original content on sample size considerations, multiple comparisons, and missing data was consolidated into Best Practices and Common Pitfalls. Omitted: study design considerations (RCTs, observational, clustered data) -- general guidance covered by the statsmodels-statistical-modeling skill.
- **`assumptions_and_diagnostics.md`** (370 lines original) -- Fully consolidated into Key Concepts (Assumptions Overview + Test-Specific Assumption Workflows) and Workflow Steps 4-5. Combined coverage: Assumptions Overview (~12 lines) + Test-Specific Assumption Workflows (~20 lines) + Workflow Steps 4-5 (~16 lines) = ~48 lines. The original contained detailed code blocks for each assumption check; since this is Knowhow (not Skill), code is referenced rather than reproduced. Key diagnostic thresholds are preserved (VIF > 10, Durbin-Watson 1.5-2.5, variance ratio < 2-3). Omitted: extensive Python code blocks for individual checks (normality, homogeneity, linearity, logistic regression diagnostics) -- available in the scipy.stats and pingouin documentation. Sample size rules of thumb are covered in Workflow Step 3.

### Script Disposition

- **`assumption_checks.py`** (540 lines) -- Contains 6 functions: `check_normality()`, `check_normality_per_group()`, `check_homogeneity_of_variance()`, `check_linearity()`, `detect_outliers()`, `comprehensive_assumption_check()`. As a Knowhow entry, script functions are referenced in Workflow Step 4 rather than reproduced inline. Key capabilities (Shapiro-Wilk, Levene's, IQR/z-score outlier detection, Q-Q plots) are described in Assumptions Overview and Test-Specific Assumption Workflows. Users needing automated checking should use scipy.stats and pingouin directly, following the patterns described.

### Intentional Omissions

- Time series methods (ARIMA, ACF/PACF) -- specialized topic beyond the core statistical testing scope
- Mixed-effects models / GEE -- covered by the statsmodels-statistical-modeling skill
- Bootstrap and permutation tests -- mentioned in passing; detailed implementation deferred to computational statistics resources

## Further Reading

- Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences* (2nd ed.)
- Field, A. (2013). *Discovering Statistics Using IBM SPSS Statistics* (4th ed.)
- Gelman, A., & Hill, J. (2006). *Data Analysis Using Regression and Multilevel/Hierarchical Models*
- Kruschke, J. K. (2014). *Doing Bayesian Data Analysis* (2nd ed.)
- APA Publication Manual: https://apastyle.apa.org/
- Cross Validated (stats Q&A): https://stats.stackexchange.com/

## Related Skills

- **statsmodels-statistical-modeling** -- Implementing OLS, GLM, Logit, and time-series models programmatically
- **pymc-bayesian-modeling** -- Full Bayesian modeling with MCMC sampling
- **scikit-learn-machine-learning** -- Predictive modeling, cross-validation, classification
- **matplotlib-scientific-plotting** -- Creating publication-quality statistical figures
- **hypothesis-generation** -- Structured hypothesis formulation before statistical testing