--- title: 'Statistical Inference Intro' author: "Kevin Donovan" date: "`r format(Sys.time(), '%B %d, %Y')`" output: slidy_presentation --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE, fig.width=6, fig.height=6) library(tidyverse) library(readr) library(shiny) library(rmarkdown) ``` # Introduction **Previous Sessions**: focused on data management and visualization with coding **Now**: focus on statistical analysis of data **Objectives** - Conceptual understanding of statistics - Common pitfalls - Analytic methods - Method implementation in R # Populations, Samples, Data Properties **Common Viewpoint**:

**Limitations**: - Assumes full information can be obtained - Assumes randomness ONLY from sampling part of group - Doesn't work with many examples of interest (coin tossing) - Doesn't allow for large/"infinite" sampling # Populations, Samples, Data Properties **Alternative Viewpoint**: Data Generating Mechanism

Ex.: What is the **probability** of seeing a heads after flipping coin? # Populations, Samples, Data Properties Statistical analysis uses both **inductive** and **deductive** reasoning

We **infer** systematic properties and **test** hypotheses based on data Inferences are **probabilistic** due to unavoidable uncertainty from only seeing output of process # Parameters and Estimation **Parameters** = Fixed properties of "black box" Determine "signal" present in process with random variation being "noise" **Examples**: - Variable Distribution - Mean, Median ("center") - Variance, Standard Deviation ("spread") - Probability of outcomes # Parameters and Estimation ```{r example_properties, fig.width=7, fig.height=7} plot(x=seq(from = -3, to = 3, by=0.25), y=dnorm(x=seq(from = -3, to = 3, by=0.25)), type = "b", xlab="", ylab="", main="Standard Normal Distribution\nMean = 0; Standard Deviation = 1") ``` # Parameters and Estimation **First Step**: Choose parameter(s) we want to estimate **Second Step**: Determine method for estimation **Third Step**: Assess estimation **Concern**: How do we take into account the probabilistic uncertainty?

# Parameters and Estimation **Estimator** = function of observed data **Sample** = $(X_1, X_2, \ldots, X_n)$ from chosen population **Exs.**: Sample Mean = $(X_1 + X_2 + \ldots + X_n)/n = \bar{X}$ Sample Variance = $[(X_1 - \bar{X})^2+(X_2 - \bar{X})^2+ \ldots+(X_n - \bar{X})^2]/n = \hat{\sigma^2}$ # Parameters and Estimation Estimate will vary from sample to sample. How to measure this variability? ```{r estimate_sim_variance, fig.width=6, fig.height=6} no_of_samples <- 10 n <- 50 estimate_vector_mean <- rep(NA, no_of_samples) estimate_vector_var <- rep(NA, no_of_samples) for(i in 1:no_of_samples){ set.seed(i) data <- rnorm(n) # sample mean estimate_vector_mean[i] <- mean(data) estimate_vector_var[i] <- var(data) } plot(x=seq(from = -3, to = 3, by=0.25), y=dnorm(x=seq(from = -3, to = 3, by=0.25)), type = "b", xlab="", ylab="", main="Standard Normal Distribution\nMean = 0; Standard Deviation = 1") paged_table(data.frame("Sample_Mean" = round(estimate_vector_mean, 4), "Sample_Variance" = round(estimate_vector_var, 4), "n"=n)) ``` **Answer**: Standard Error and Confidence Intervals # Parameters and Estimation **Standard Error** = Variance of Estimator **Confidence Interval** = Interval of plausible values for parameter based on sample

# Parameters and Estimation **CI Interpretation**: 95% of intervals from sampled data will include true parameter value ```{r confident_int_plot, results='asis'} no_of_samples <- 20 n <- 50 estimate_list <- list() for(i in 1:no_of_samples){ set.seed(i) data <- rnorm(n) estimate_list[[i]] <- list() estimate_list[[i]][["sample"]] <- i estimate_list[[i]][["mean_est"]] <- mean(data) estimate_list[[i]][["confint_lower"]] <- t.test(data)$conf.int[1] estimate_list[[i]][["confint_upper"]] <- t.test(data)$conf.int[2] estimate_list[[i]] <- data.frame(estimate_list[[i]]) %>% mutate(color_int = ifelse(confint_lower<0 & confint_upper>0, TRUE, FALSE)) } estimate_df <- do.call("rbind", estimate_list) ggplot(data = estimate_df, mapping=aes(x=sample, y=mean_est))+ geom_errorbar(mapping=aes(ymin=confint_lower, ymax=confint_upper, y=mean_est, color=color_int))+ scale_color_manual(values=c("red", "black"))+ labs(y="Sample Mean (95% CI)", x="Sample Number")+ geom_hline(yintercept = 0)+ coord_flip()+ theme_bw()+ theme(legend.position = "none") ``` # Hypothesis Testing

# Hypothesis Testing **Statistical Analogue**: Hypothesis Testing for parameter(s) **Example**: **Scientific Hypothesis**: Infants at 12 months with Autism (ASD) have larger brain volumes then those without ASD **Statistical Hypothesis**: $\mu_{ASD, 12}>\mu_{ASD_{Neg}, 12}$ where $\mu_{x, 12}=$ mean of total brain volume at 12 months of age for group "x" # Hypothesis Testing **For Scientific Testing**: Focused on **falsifying** claim **For Statistical Testing**: Uncertainty $\implies$ **cannot** "falsify" claim **instead** judge evidence to see if one has "enough" to "reject" claim # Hypothesis Testing **Process**: **Step One**: Determine **null** (baseline claim) and **alternative** hypotheses **Step Two**: Determine quantities used as evidence to make decision (test statistic) **Step Three**: Evaluate evidence # Hypothesis Testing **Evaluating Evidence**: **Method 1**: Binary decision - Choose to reject or fail to reject null - **Do not** "accept" the null or "prove" the null or alternative are true - **Pro**: Direct conclusion made, may be needed for clinical action - **Con**: Binary decisions often made using some threshold, reduces info in data

**Method 2**: Continuous evaluation - **Pro**: More complete reflection of uncertainty in data, thresholds may be arbitrary - **Con**: Conclusions less decisive, may be diffcult to put results into action

# Hypothesis Testing **Test Statistic**: Function of data used to evaluate null hypothesis claim **Example**: Recall our standard normal distribution example ```{r example_properties_2, fig.width=7, fig.height=7} plot(x=seq(from = -3, to = 3, by=0.25), y=dnorm(x=seq(from = -3, to = 3, by=0.25)), type = "b", xlab="", ylab="", main="Standard Normal Distribution\nMean = 0; Standard Deviation = 1") ``` Given this population, suppose we wanted to evaluate the following: $H_0$ (Null): mean $=0$ $H_1$ (Alt.): mean $\neq 0$ **Idea**: Let's compare the sample mean from a single sample to 0 **Issue**: How to determine "difference" from 0 given random sampling variability? **Idea**: Determine distribution of sample mean to determine correspondance with $H_0$ # Hypothesis Testing **Sample Mean Distribution**: Function of data used to evaluate null hypothesis claim ```{r sample_mean_dist, results='asis'} sample_reps <- c(50, 100, 500, 1000) n <- 100 estimate_list <- list() for(j in 1:length(sample_reps)){ no_of_samples <- sample_reps[j] for(i in 1:no_of_samples){ set.seed(i) data <- rnorm(n) estimate_list[[i]] <- list() estimate_list[[i]][["sample"]] <- i estimate_list[[i]][["mean_est"]] <- mean(data) estimate_list[[i]][["confint_lower"]] <- t.test(data)$conf.int[1] estimate_list[[i]][["confint_upper"]] <- t.test(data)$conf.int[2] estimate_list[[i]] <- data.frame(estimate_list[[i]]) %>% mutate(color_int = ifelse(confint_lower<0 & confint_upper>0, TRUE, FALSE)) } estimate_df <- do.call("rbind", estimate_list) print(ggplot(data = estimate_df, mapping=aes(x=mean_est))+ geom_histogram()+ labs(x="", y="", title=paste0("Distribution of sample mean across simulated samples with n = ", n), subtitle = paste0("Number of samples = ", no_of_samples))+ theme_bw()+ theme(axis.text.y=element_blank(), axis.ticks.y=element_blank())) } ``` # Hypothesis Testing **Note**: Need enough samples to accurately estimate distribution (i.e. sample size) **Observation**: Samples goes up $\implies$ estimated distribution of **sample mean** "converges" to normal curve Turns out, this is a **mathematical fact** for **any** distribution, with random samples Referred to as the **Central Limit Theorem (CLT)** # Hypothesis Testing **Recall**: Need distribution of sample mean to determine "rare" outcome **From CLT**, we know ```{r sample_mean_clt, results='asis'} sample_reps <- 1000 n <- 100 estimate_list <- list() for(j in 1:length(sample_reps)){ no_of_samples <- sample_reps[j] for(i in 1:no_of_samples){ set.seed(i) data <- rnorm(n) estimate_list[[i]] <- list() estimate_list[[i]][["sample"]] <- i estimate_list[[i]][["mean_est"]] <- mean(data) estimate_list[[i]][["confint_lower"]] <- t.test(data)$conf.int[1] estimate_list[[i]][["confint_upper"]] <- t.test(data)$conf.int[2] estimate_list[[i]] <- data.frame(estimate_list[[i]]) %>% mutate(color_int = ifelse(confint_lower<0 & confint_upper>0, TRUE, FALSE)) } estimate_df <- do.call("rbind", estimate_list) print(ggplot(data = estimate_df, mapping=aes(x=mean_est))+ geom_histogram()+ labs(x="", y="", title=paste0("Distribution of sample mean across simulated samples with n = ", n), subtitle = paste0("Number of samples = ", no_of_samples))+ theme_bw()+ theme(axis.text.y=element_blank(), axis.ticks.y=element_blank())) } ``` is an accurate estimate of this distribution, **assuming the null hypothesis** of mean = 0 # Hypothesis Testing Suppose we obtain a sample of size **n=100** and compute the sample mean Let's compare this observed sample mean to the expected distribution of values under the null ```{r sample_mean_clt_vs_obs, results='asis'} sample_reps <- 1000 n <- 100 estimate_list <- list() for(j in 1:length(sample_reps)){ no_of_samples <- sample_reps[j] for(i in 1:no_of_samples){ set.seed(i) data <- rnorm(n) estimate_list[[i]] <- list() estimate_list[[i]][["sample"]] <- i estimate_list[[i]][["mean_est"]] <- mean(data) estimate_list[[i]][["confint_lower"]] <- t.test(data)$conf.int[1] estimate_list[[i]][["confint_upper"]] <- t.test(data)$conf.int[2] estimate_list[[i]] <- data.frame(estimate_list[[i]]) %>% mutate(color_int = ifelse(confint_lower<0 & confint_upper>0, TRUE, FALSE)) } estimate_df <- do.call("rbind", estimate_list) # Get another sample mean to compare set.seed(1) data <- rnorm(n) mean_check <- mean(data) print(ggplot(data = estimate_df, mapping=aes(x=mean_est))+ geom_histogram()+ labs(x="", y="", title=paste0("Distribution of sample mean across simulated samples with n = ", n), subtitle = paste0("Number of samples = ", no_of_samples))+ geom_vline(xintercept = mean_check)+ geom_text(aes(x=mean_check, y=120, label="Obs. Sample Mean"))+ theme_bw()+ theme(axis.text.y=element_blank(), axis.ticks.y=element_blank())) } ``` # Hypothesis Testing Can we quantify this comparison? Yes 1) Compute Test Statistic value = statistic whose distriubtion is known under null In this case from above, sample mean = one possible test statistic 2) Compute "probability-based" measure = probability of observing sample mean as or extreme then seen in data - **low** probability = **unlikely** event under null $\implies$ observed sample **poorly matches** with null - **high** probability = **likely** event under null $\implies$ observed sample **matches** with null This measure is denoted as a **p-value** = measure of correspondence between sample and null # Hypothesis Testing Let's compute the p-value in the above example ```{r sample_mean_clt_pval, results='asis'} sample_reps <- 1000 n <- 100 estimate_list <- list() no_of_samples <- sample_reps for(i in 1:no_of_samples){ set.seed(i) data <- rnorm(n) estimate_list[[i]] <- list() estimate_list[[i]][["sample"]] <- i estimate_list[[i]][["mean_est"]] <- mean(data) estimate_list[[i]][["confint_lower"]] <- t.test(data)$conf.int[1] estimate_list[[i]][["confint_upper"]] <- t.test(data)$conf.int[2] estimate_list[[i]] <- data.frame(estimate_list[[i]]) %>% mutate(color_int = ifelse(confint_lower<0 & confint_upper>0, TRUE, FALSE)) } estimate_df <- do.call("rbind", estimate_list) # Get another sample mean to compare set.seed(1) data <- rnorm(n) mean_check <- mean(data) estimate_df$comp_obs_sample = ifelse(abs(estimate_df$mean_est)>abs(mean_check), TRUE, FALSE) print(ggplot(data = estimate_df, mapping=aes(x=mean_est))+ geom_histogram(aes(fill=comp_obs_sample))+ labs(x="", y="", title=paste0("Distribution of sample mean across simulated samples with n = ", n), subtitle = paste0("Number of samples = ", no_of_samples))+ geom_vline(xintercept = mean_check)+ geom_text(aes(x=mean_check, y=120, label="Obs. Sample Mean"))+ scale_fill_manual(values=c("black", "red"))+ theme_bw()+ theme(axis.text.y=element_blank(), axis.ticks.y=element_blank(), legend.position = "none")) paste0("P-value is ", round(2*pnorm(abs(mean_check)/(1/sqrt(n)), lower.tail = FALSE), 3)) ``` Can see much of distribution is shaded in $\implies$ observed sample **not** unusual under null Note that CTL assumes that the true population standard deviation is **known** This is rarely the case and often needs to be estimated This results in the following test statistic: $T=\frac{\bar{X}-\mu_0}{\hat{\sigma}/\sqrt{n}}$ where $\mu_0$ = mean under $H_0$ which has a **T distribution** with $n-1$ degrees of freedom # Hypothesis Testing Finding a test statistic when looking at the mean in this scenario is easy **However** n general it can be quite challenging for more complicated scenarios **Recap**: Testing process 1) Determine parameters of interest 2) Define null and alternative hypotheses 3) Derive test statistic 4) Translate test statistic to more interpretable measure (e.g. p-value) # Hypothesis Testing **Complications**: Many considerations need to be taken into account when testing 1) Multiple testing and **Type I error** 2) **Hypothesis generating** analyses vs **hypothesis testing** analyses 3) [**P-value hacking**](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106#:~:text=Inflation%20bias%2C%20also%20known%20as,results%20%5B12%E2%80%9315%5D) and [**hypothesis generation from data**](https://www2.psych.ubc.ca/~schaller/528Readings/Kerr1998.pdf) These (and many more) aspects will be covered in future sessions # Songs of the session [Mexican Grand Prix by Mogwai](https://www.youtube.com/watch?v=BJZQ9xBrkWo) [Your Hand in Mine by Explosion in the Sky](https://www.youtube.com/watch?v=cdiY6kijYHE) [Pacific Theme by Broken Social Club](https://www.youtube.com/watch?v=cdiY6kijYHE) [The Big Ship by Brian Eno](https://www.youtube.com/watch?v=lCCJc_V8_MQ)