--- title: "Bootstrap for sampling distribution of sample mean" --- ## Assessing assumptions - Our $t$-tests assume normality of variable being tested - but, Central Limit Theorem says that normality matters less if sample is "large" - in practice "approximate normality" is enough, but how do we assess whether what we have is normal enough? - so far, use histogram/boxplot and make a call, allowing for sample size. ## What actually has to be normal - is: **sampling distribution of sample mean** - the distribution of sample mean over *all possible samples* - but we only have *one* sample! - Idea: assume our sample is representative of the population, and draw samples from our sample (!), with replacement. - This gives an idea of what different samples from the population might look like. - Called *bootstrap*, after expression "to pull yourself up by your own bootstraps". ## Packages ```{r} library(tidyverse) ``` ## Blue Jays attendances ```{r bootstrap-R-1, echo=FALSE, message=FALSE} jays <- read_csv("http://ritsokiguess.site/datafiles/jays15-home.csv") set.seed(457299) ``` ```{r bootstrap-R-2} jays$attendance ``` - A bootstrap sample: ```{r bootstrap-R-3} s <- sample(jays$attendance, replace = TRUE) s ``` - It is easier to see what is happening if we sort both the actual attendances and the bootstrap sample: ```{r} sort(jays$attendance) sort(s) ``` ## Getting mean of bootstrap sample - A bootstrap sample is same size as original, but contains repeated values (eg. 15062) and missing ones (42917). - We need the mean of our bootstrap sample: ```{r bootstrap-R-4} mean(s) ``` - This is a little different from the mean of our actual sample: ```{r bootstrap-R-5} mean(jays$attendance) ``` - Want a sense of how the sample mean might vary, if we were able to take repeated samples from our population. - Idea: take lots of *bootstrap* samples, and see how *their* sample means vary. ## Setting up bootstrap sampling - Begin by setting up a dataframe that contains a row for each bootstrap sample. I usually call this column `sim`. Do just 4 to get the idea: ```{r bootstrap-R-6} tibble(sim = 1:4) ``` ## Drawing the bootstrap samples - Then set up to work one row at a time, and draw a bootstrap sample of the attendances in each row: ```{r bootstrap-R-7} tibble(sim = 1:4) %>% rowwise() %>% mutate(sample = list(sample(jays$attendance, replace = TRUE))) ``` - Each row of our dataframe contains *all* of a bootstrap sample of 25 observations drawn with replacement from the attendances. ## Sample means - Find the mean of each sample: ```{r bootstrap-R-8} tibble(sim = 1:4) %>% rowwise() %>% mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% mutate(my_mean = mean(sample)) ``` - These are (four simulated values of) the bootstrapped sampling distribution of the sample mean. 
## Make a histogram of them

- rather pointless with only four values, but to get the idea:

```{r bootstrap-R-9}
tibble(sim = 1:4) %>%
  rowwise() %>%
  mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>%
  mutate(my_mean = mean(sample)) %>%
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 3) -> g
```

## The (pointless) histogram

```{r bootstrap-R-10}
g
```

## Now do again with a decent number of bootstrap samples

- say 1000, and put a decent number of bins on the histogram also:

```{r bootstrap-R-11}
tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>%
  mutate(my_mean = mean(sample)) %>%
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) -> g
```

## The (better) histogram

```{r bootstrap-R-12}
g
```

## Comments

- This is very close to normal.
- The bootstrap says that the sampling distribution of the sample mean is close to normal, even though the distribution of the data is not.
- A sample size of 25 is big enough to overcome the skewness that we saw.
- This is the Central Limit Theorem in practice.
- It is surprisingly powerful.
- Thus, the $t$-test is actually perfectly good here.

## Comments on the code 1/2

- You might have been wondering about this:

```{r bootstrap-R-13}
tibble(sim = 1:4) %>%
  rowwise() %>%
  mutate(sample = list(sample(jays$attendance, replace = TRUE)))
```

## Comments on the code 2/2

- how did we squeeze all 25 sample values into one cell?
  - `sample` is a so-called "list-column", which can contain anything.
- why did we have to put `list()` around the `sample()`?
  - because `sample()` produces a whole collection of numbers, not just a single one
  - the `list()` signals this: "make a list-column of samples".

## Two samples

- Assumption: *both* samples are from a normal distribution.
- In this case, each sample should be "normal enough" given its sample size, since the Central Limit Theorem will help.
- Use the bootstrap on each group independently, as above.

## Kids learning to read

```{r bootstrap-R-14, echo=FALSE, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/drp.txt"
kids <- read_delim(my_url, " ")
kids
```

```{r bootstrap-R-15}
ggplot(kids, aes(x = group, y = score)) + geom_boxplot()
```

## Getting just the control group

- Use `filter` to select the rows where something is true:

```{r bootstrap-R-16}
kids %>% filter(group == "c") -> controls
controls
```

## Bootstrap these

```{r bootstrap-R-17}
tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(sample = list(sample(controls$score, replace = TRUE))) %>%
  mutate(my_mean = mean(sample)) %>%
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10)
```

## ... and the treatment group

```{r bootstrap-R-19}
kids %>% filter(group == "t") -> treats
tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(sample = list(sample(treats$score, replace = TRUE))) %>%
  mutate(my_mean = mean(sample)) %>%
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 15)
```

## Comments

- The sampling distributions of the sample means both look pretty normal, though the treatment group's is a tiny bit left-skewed.
- As we thought: no problems with our two-sample $t$ at all.
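- For completeness, a sketch of what the two-sample $t$-test might look like here (Welch test, two-sided by default; the original analysis may well have used a one-sided alternative):

```{r}
# Sketch only: Welch two-sample t-test comparing reading scores by group.
# Two-sided by default; a one-sided alternative may be more appropriate,
# depending on the research question.
t.test(score ~ group, data = kids)
```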