--- title: "The bootstrap revisited" --- ## Packages for this section ```{r bootstrap-1} library(tidyverse) library(bootstrap) ``` Source: [Hesterberg et al](https://www.researchgate.net/publication/265399426_Bootstrap_Methods_and_Permutation_Tests) ## Is my sampling distribution normal enough? - Recall IRS data (used as a motivation for the sign test) : ```{r bootstrap-2, include=FALSE} my_url <- "http://ritsokiguess.site/datafiles/irs.txt" irs <- read_csv(my_url) ``` ```{r bootstrap-3, fig.height=5} ggplot(irs, aes(x=Time))+geom_histogram(bins=10) ``` - $t$ procedure for the mean would not be a good idea because the distribution is skewed. ## What *actually* matters - It's not the distribution of the *data* that has to be approx normal (for a $t$ procedure). - What matters is the *sampling distribution of the sample mean*. - If the sample size is large enough, the sampling distribution will be normal enough even if the data distribution is not. - This is why we had to consider the sample size as well as the shape. - But how do we know whether this is the case or not? We only have *one* sample. ## The (nonparametric) bootstrap - Typically, our sample will be reasonably representative of the population. - Idea: pretend the sample *is* the population, and sample from it *with replacement*. - Calculate test statistic, and repeat many times. - This gives an idea of how our statistic might vary in repeated samples: that is, its sampling distribution. - Called the **bootstrap distribution** of the test statistic. - If the bootstrap distribution is approx normal, infer that the true sampling distribution also approx normal, therefore inference about the mean such as $t$ is good enough. - If not, we should be more careful. ## Why it works - We typically estimate population parameters by using the corresponding sample thing: eg. estimate population mean using sample mean. - This called **plug-in principle**. - The fraction of sample values less than a value $x$ called the **empirical distribution function** (as a function of $x$). - By plug-in principle, the empirical distribution function is an estimate of the population CDF. - In this sense, the sample *is* an estimate of the population, and so sampling from it is an estimate of sampling from the population. ## Bootstrapping the IRS data - Sampling with replacement is done like this (the default sample size is as long as the original data): ```{r bootstrap-4} boot <- sample(irs$Time, replace=T) mean(boot) ``` - That's one bootstrapped mean. We need a whole bunch. ## A whole bunch - Use the same idea as for simulating power: ```{r bootstrap-5, echo=F} set.seed(457299) ``` \small ```{r bootstrap-6} tibble(sim = 1:1000) %>% rowwise() %>% mutate(boot_sample = list(sample(irs$Time, replace = TRUE))) ``` \normalsize ## Get the mean of each of those \small ```{r bootstrap-7} tibble(sim = 1:1000) %>% rowwise() %>% mutate(boot_sample = list(sample(irs$Time, replace = TRUE))) %>% mutate(my_mean = mean(boot_sample)) -> samples samples ``` \normalsize ## Sampling distribution of sample mean ```{r bootstrap-8, fig.height=4} ggplot(samples, aes(x=my_mean)) + geom_histogram(bins=10) ``` - Is that a slightly long right tail? ## Normal quantile plot might be better than a histogram: ```{r bootstrap-9, fig.height=3.5} ggplot(samples, aes(sample = my_mean)) + stat_qq()+stat_qq_line() ``` - a very very slight right-skewness, but very close to normal. 
## Confidence interval from the bootstrap distribution

There are (at least) two ways:

- percentile bootstrap interval: take the 2.5 and 97.5 percentiles (to get the middle 95%). This is easy, but not always the best:

\small
```{r bootstrap-10}
(b_p <- quantile(samples$my_mean, c(0.025, 0.975)))
```
\normalsize

- bootstrap $t$: use the SD of the bootstrap distribution as the SE of the estimator of the mean, and make a $t$ interval:

\small
```{r bootstrap-11}
n <- length(irs$Time)
t_star <- qt(0.975, n - 1)
b_t <- with(samples, mean(my_mean) + c(-1, 1) * t_star * sd(my_mean))
b_t
```
\normalsize

## Comparing

- Get the ordinary $t$ interval:

```{r bootstrap-12}
my_names <- c("LCL", "UCL")
o_t <- t.test(irs$Time)$conf.int
```

- Compare the two bootstrap intervals with the ordinary $t$ interval:

```{r bootstrap-13}
tibble(limit = my_names, o_t, b_t, b_p)
```

- The bootstrap $t$ and the ordinary $t$ are very close.
- The percentile bootstrap interval is noticeably shorter (which is common) and shifted higher (because of the skewness).

## Which to prefer?

- If the intervals agree, then they are all good.
- If they disagree, they are all bad!
- In that case, use the BCa interval (over).

## Bias correction and acceleration

- This is from *An Introduction to the Bootstrap*, by Bradley Efron and Robert J. Tibshirani.
- There is a way of correcting the CI for skewness in the bootstrap distribution, called the BCa (bias-corrected and accelerated) method.
- It is complicated (see the Efron and Tibshirani book), but it is implemented in the `bootstrap` package.

## Run this on the IRS data:

```{r bootstrap-14}
bca <- bcanon(irs$Time, 1000, mean)
bca$confpoints
```

## Use the 2.5% and 97.5% points for the CI

```{r bootstrap-15}
bca$confpoints %>% 
  as_tibble() %>% 
  filter(alpha %in% c(0.025, 0.975)) %>% 
  pull(`bca point`) -> b_bca
b_bca
```

## Comparing

```{r bootstrap-16}
tibble(limit = my_names, o_t, b_t, b_p, b_bca)
```

- The BCa interval says that the mean should be estimated even higher than the bootstrap percentile interval does.
- The BCa interval is the one to trust.

## Bootstrapping the correlation

Recall the soap data:

```{r bootstrap-17, message=FALSE}
url <- "http://ritsokiguess.site/datafiles/soap.txt"
soap <- read_delim(url, " ")
soap
```

## Scatterplot

```{r bootstrap-18, fig.height=3.75}
ggplot(soap, aes(x = speed, y = scrap, colour = line)) + 
  geom_point() + geom_smooth(method = "lm", se = FALSE)
```

## Comments

- Line B produces less scrap for any given speed.
- For line B, estimate the correlation between speed and scrap (with a confidence interval).

## Extract the line B data; standard correlation test

```{r bootstrap-19}
soap %>% filter(line == "b") -> line_b
with(line_b, cor.test(speed, scrap))
```

```{r bootstrap-20, include=FALSE}
o_c <- with(line_b, cor.test(speed, scrap))$conf.int
```

## Bootstrapping a correlation 1/2

- This illustrates a different technique: we need to keep the $x$ and $y$ values *together*.
- Sample *rows* of the data frame, rather than individual values of `speed` and `scrap` (newer dplyr spells this differently; see the aside after the percentile interval):

\scriptsize
```{r bootstrap-21}
line_b %>% sample_frac(replace = TRUE)
```
\normalsize

## Bootstrapping a correlation 2/2

1000 times:

```{r bootstrap-22}
tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(boot_df = list(sample_frac(line_b, replace = TRUE))) %>% 
  mutate(my_cor = with(boot_df, cor(speed, scrap))) -> cors
```

## A picture of this

```{r bootstrap-23, fig.height=4}
ggplot(cors, aes(x = my_cor)) + geom_histogram(bins = 15)
```

## Comments and next steps

- This is very left-skewed.
- The bootstrap percentile interval is:

```{r bootstrap-24}
(b_p <- quantile(cors$my_cor, c(0.025, 0.975)))
```

- We probably need the BCa interval instead.
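## Aside: sampling rows in current dplyr

- `sample_frac` still works, but is superseded in current dplyr; a sketch of the same row resampling with `slice_sample` (same idea; not used in the rest of these slides):

```{r bootstrap-slice}
# prop = 1: a resample as big as the original data, rows drawn with replacement
line_b %>% slice_sample(prop = 1, replace = TRUE)
```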
## Getting the BCa interval 1/2

- To use `bcanon`, write a function that takes a vector of row numbers and returns the correlation between `speed` and `scrap` for those rows:

\footnotesize
```{r bootstrap-25}
theta <- function(rows, d) {
  d %>% slice(rows) %>% with(., cor(speed, scrap))
}
theta(1:3, line_b)
line_b %>% slice(1:3)
```
\normalsize

- That looks about right.

## Getting the BCa interval 2/2

- Inputs to `bcanon` are now:
  - the row numbers (1 through 12 in our case: 12 rows in `line_b`)
  - the number of bootstrap samples
  - the function we just wrote
  - the data frame:

```{r bootstrap-26}
points <- bcanon(1:12, 1000, theta, line_b)$confpoints
points %>% 
  as_tibble() %>% 
  filter(alpha %in% c(0.025, 0.975)) %>% 
  pull(`bca point`) -> b_bca
b_bca
```

## Comparing the results

```{r bootstrap-27}
tibble(limit = my_names, o_c, b_p, b_bca)
```

- The bootstrap percentile interval doesn't go down far enough.
- The BCa interval seems to do a better job of capturing the skewness of the distribution.
- The ordinary confidence interval for the correlation is very similar to the BCa one, and thus seems to be trustworthy here, even though the correlation has a very skewed distribution. (`cor.test` uses the Fisher $z$ transformation, which "spreads out" correlations close to 1.)

## The $z$-transformed bootstrapped correlations

\small
```{r bootstrap-28}
cors %>% 
  mutate(z = 0.5 * log((1 + my_cor) / (1 - my_cor))) %>% 
  ggplot(aes(sample = z)) + stat_qq() + stat_qq_line()
```
\normalsize
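## Aside: the Fisher $z$ interval by hand

- To see where `cor.test`'s interval comes from, a sketch of the standard Fisher $z$ construction (assuming the usual $1/\sqrt{n-3}$ standard error on the $z$ scale); compare with `o_c` above:

```{r bootstrap-fisher}
r <- with(line_b, cor(speed, scrap))  # sample correlation
n_b <- nrow(line_b)                   # 12 rows in line_b
z <- 0.5 * log((1 + r) / (1 - r))     # Fisher z, also atanh(r)
z_ci <- z + c(-1, 1) * qnorm(0.975) / sqrt(n_b - 3)  # normal interval for z
tanh(z_ci)                            # back-transform to the correlation scale
```

- The back-transformed interval is not symmetric about the sample correlation, which is how the transformation accommodates the skewness.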