--- title: "Bootstrap for sampling distribution of sample mean" --- ## Assessing assumptions - Our $t$-tests assume normality of variable being tested - but, Central Limit Theorem says that normality matters less if sample is "large" - in practice "approximate normality" is enough, but how do we assess whether what we have is normal enough? - so far, use histogram/boxplot and make a call, allowing for sample size. ## What actually has to be normal - is: **sampling distribution of sample mean** - the distribution of sample mean over *all possible samples* - but we only have *one* sample! - Idea: assume our sample is representative of the population, and draw samples from our sample (!), with replacement. - This gives an idea of what different samples from the population might look like. - Called *bootstrap*, after expression "to pull yourself up by your own bootstraps". ## Packages ```{r} library(tidyverse) ``` ## Blue Jays attendances ```{r bootstrap-R-1, echo=FALSE, message=FALSE} jays <- read_csv("http://ritsokiguess.site/datafiles/jays15-home.csv") set.seed(457299) ``` ```{r bootstrap-R-2} jays$attendance ``` - A bootstrap sample: ```{r bootstrap-R-3} s <- sample(jays$attendance, replace = TRUE) s ``` - It is easier to see what is happening if we sort both the actual attendances and the bootstrap sample: ```{r} sort(jays$attendance) sort(s) ``` ## Getting mean of bootstrap sample - A bootstrap sample is same size as original, but contains repeated values (eg. 15062) and missing ones (42917). - We need the mean of our bootstrap sample: ```{r bootstrap-R-4} mean(s) ``` - This is a little different from the mean of our actual sample: ```{r bootstrap-R-5} mean(jays$attendance) ``` - Want a sense of how the sample mean might vary, if we were able to take repeated samples from our population. - Idea: take lots of *bootstrap* samples, and see how *their* sample means vary. ## Setting up bootstrap sampling - Begin by setting up a dataframe that contains a row for each bootstrap sample. I usually call this column `sim`. Do just 4 to get the idea: ```{r bootstrap-R-6} tibble(sim = 1:4) ``` ## Drawing the bootstrap samples - Then set up to work one row at a time, and draw a bootstrap sample of the attendances in each row: ```{r bootstrap-R-7} tibble(sim = 1:4) %>% rowwise() %>% mutate(sample = list(sample(jays$attendance, replace = TRUE))) ``` - Each row of our dataframe contains *all* of a bootstrap sample of 25 observations drawn with replacement from the attendances. ## Sample means - Find the mean of each sample: ```{r bootstrap-R-8} tibble(sim = 1:4) %>% rowwise() %>% mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% mutate(my_mean = mean(sample)) ``` - These are (four simulated values of) the bootstrapped sampling distribution of the sample mean. 
## Make a histogram of them

- rather pointless with only four values, but to get the idea:

```{r bootstrap-R-9}
tibble(sim = 1:4) %>%
  rowwise() %>%
  mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>%
  mutate(my_mean = mean(sample)) %>%
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 3) -> g
```

## The (pointless) histogram

```{r bootstrap-R-10}
g
```

## Now do again with a decent number of bootstrap samples

- say 1000, and put a decent number of bins on the histogram also:

```{r bootstrap-R-11}
tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>%
  mutate(my_mean = mean(sample)) %>%
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) -> g
```

## The (better) histogram

```{r bootstrap-R-12}
g
```

## Comments

- This is very close to normal.
- The bootstrap says that the sampling distribution of the sample mean is close to normal, even though the distribution of the data is not.
- A sample size of 25 is big enough to overcome the skewness that we saw.
- This is the Central Limit Theorem in practice.
- It is surprisingly powerful.
- Thus, the $t$-test is actually perfectly good here.

## Comments on the code 1/2

- You might have been wondering about this:

```{r bootstrap-R-13}
tibble(sim = 1:4) %>%
  rowwise() %>%
  mutate(sample = list(sample(jays$attendance, replace = TRUE)))
```

## Comments on the code 2/2

- how did we squeeze all 25 sample values into one cell?
  - `sample` is a so-called "list-column", which can contain anything.
- why did we have to put `list()` around the `sample()`?
  - because `sample()` produces a whole collection of numbers, not just a single one
  - the `list()` signals this: "make a list-column of samples".

## Two samples

- Assumption: *both* samples are from a normal distribution.
- In this case, each sample should be "normal enough" given its sample size, since the Central Limit Theorem will help.
- Use the bootstrap on each group independently, as above.

## Kids learning to read

```{r bootstrap-R-14, echo=FALSE, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/drp.txt"
kids <- read_delim(my_url, " ")
kids
```

```{r bootstrap-R-15}
ggplot(kids, aes(x = group, y = score)) + geom_boxplot()
```

## Getting just the control group

- Use `filter` to select the rows where something is true:

```{r bootstrap-R-16}
kids %>% filter(group == "c") -> controls
controls
```

## Bootstrap these

```{r bootstrap-R-17}
tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(sample = list(sample(controls$score, replace = TRUE))) %>%
  mutate(my_mean = mean(sample)) %>%
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10)
```

## ... and the treatment group

```{r bootstrap-R-19}
kids %>% filter(group == "t") -> treats
tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(sample = list(sample(treats$score, replace = TRUE))) %>%
  mutate(my_mean = mean(sample)) %>%
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 15)
```

## Comments

- The sampling distributions of the sample means both look pretty normal, though the treatment group's is a tiny bit left-skewed.
- As we thought: no problems with our two-sample $t$ at all.
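- For completeness, a sketch of what the two-sample $t$-test might look like here (Welch test, two-sided by default; the original analysis may well have used a one-sided alternative):

```{r}
# Sketch only: Welch two-sample t-test comparing reading scores by group.
# Two-sided by default; a one-sided alternative may be more appropriate,
# depending on the research question.
t.test(score ~ group, data = kids)
```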