--- title: "The bootstrap revisited" --- ## Packages for this section ```{r bootstrap-1} library(tidyverse) library(bootstrap) ``` Source: [Hesterberg et al](https://www.researchgate.net/publication/265399426_Bootstrap_Methods_and_Permutation_Tests) ## Is my sampling distribution normal enough? - Recall IRS data (used as a motivation for the sign test) : ```{r bootstrap-2, include=FALSE} my_url <- "http://ritsokiguess.site/datafiles/irs.txt" irs <- read_csv(my_url) ``` ```{r bootstrap-3, fig.height=5} ggplot(irs, aes(x=Time))+geom_histogram(bins=10) ``` - $t$ procedure for the mean would not be a good idea because the distribution is skewed. ## What *actually* matters - It's not the distribution of the *data* that has to be approx normal (for a $t$ procedure). - What matters is the *sampling distribution of the sample mean*. - If the sample size is large enough, the sampling distribution will be normal enough even if the data distribution is not. - This is why we had to consider the sample size as well as the shape. - But how do we know whether this is the case or not? We only have *one* sample. ## The (nonparametric) bootstrap - Typically, our sample will be reasonably representative of the population. - Idea: pretend the sample *is* the population, and sample from it *with replacement*. - Calculate test statistic, and repeat many times. - This gives an idea of how our statistic might vary in repeated samples: that is, its sampling distribution. - Called the **bootstrap distribution** of the test statistic. - If the bootstrap distribution is approx normal, infer that the true sampling distribution also approx normal, therefore inference about the mean such as $t$ is good enough. - If not, we should be more careful. ## Why it works - We typically estimate population parameters by using the corresponding sample thing: eg. estimate population mean using sample mean. - This called **plug-in principle**. - The fraction of sample values less than a value $x$ called the **empirical distribution function** (as a function of $x$). - By plug-in principle, the empirical distribution function is an estimate of the population CDF. - In this sense, the sample *is* an estimate of the population, and so sampling from it is an estimate of sampling from the population. ## Bootstrapping the IRS data - Sampling with replacement is done like this (the default sample size is as long as the original data): ```{r bootstrap-4} boot <- sample(irs$Time, replace=T) mean(boot) ``` - That's one bootstrapped mean. We need a whole bunch. ## A whole bunch - Use the same idea as for simulating power: ```{r bootstrap-5, echo=F} set.seed(457299) ``` \small ```{r bootstrap-6} tibble(sim = 1:1000) %>% rowwise() %>% mutate(boot_sample = list(sample(irs$Time, replace = TRUE))) ``` \normalsize ## Get the mean of each of those \small ```{r bootstrap-7} tibble(sim = 1:1000) %>% rowwise() %>% mutate(boot_sample = list(sample(irs$Time, replace = TRUE))) %>% mutate(my_mean = mean(boot_sample)) -> samples samples ``` \normalsize ## Sampling distribution of sample mean ```{r bootstrap-8, fig.height=4} ggplot(samples, aes(x=my_mean)) + geom_histogram(bins=10) ``` - Is that a slightly long right tail? ## Normal quantile plot might be better than a histogram: ```{r bootstrap-9, fig.height=3.5} ggplot(samples, aes(sample = my_mean)) + stat_qq()+stat_qq_line() ``` - a very very slight right-skewness, but very close to normal. 
## Confidence interval from the bootstrap distribution

There are (at least) two ways:

- percentile bootstrap interval: take the 2.5 and 97.5 percentiles (to get the middle 95%). This is easy, but not always the best:

\small
```{r bootstrap-10}
(b_p <- quantile(samples$my_mean, c(0.025, 0.975)))
```
\normalsize

- bootstrap $t$: use the SD of the bootstrap distribution as the SE of the estimator of the mean, and make a $t$ interval:

\small
```{r bootstrap-11}
n <- length(irs$Time)
t_star <- qt(0.975, n - 1)
b_t <- with(samples, mean(my_mean) + c(-1, 1) * t_star * sd(my_mean))
b_t
```
\normalsize

## Comparing

- Get the ordinary $t$ interval:

```{r bootstrap-12}
my_names <- c("LCL", "UCL")
o_t <- t.test(irs$Time)$conf.int
```

- Compare the two bootstrap intervals with the ordinary $t$ interval:

```{r bootstrap-13}
tibble(limit = my_names, o_t, b_t, b_p)
```

- The bootstrap $t$ and the ordinary $t$ are very close.
- The percentile bootstrap interval is noticeably shorter (which is common) and shifted higher (because of the skewness).

## Which to prefer?

- If the intervals agree, then they are all good.
- If they disagree, they are all bad!
- In that case, use the BCa interval (over).

## Bias correction and acceleration

- This is from *An Introduction to the Bootstrap*, by Bradley Efron and Robert J. Tibshirani.
- There is a way of correcting the CI for skewness in the bootstrap distribution, called the BCa (bias-corrected and accelerated) method.
- It is complicated (see the Efron and Tibshirani book), but it is implemented in the `bootstrap` package.

## Run this on the IRS data:

```{r bootstrap-14}
bca <- bcanon(irs$Time, 1000, mean)
bca$confpoints
```

## Use the 2.5% and 97.5% points for the CI

```{r bootstrap-15}
bca$confpoints %>% 
  as_tibble() %>% 
  filter(alpha %in% c(0.025, 0.975)) %>% 
  pull(`bca point`) -> b_bca
b_bca
```

## Comparing

```{r bootstrap-16}
tibble(limit = my_names, o_t, b_t, b_p, b_bca)
```

- The BCa interval says that the mean should be estimated even higher than the bootstrap percentile interval does.
- The BCa interval is the one to trust.

## Bootstrapping the correlation

Recall the soap data:

```{r bootstrap-17, message=FALSE}
url <- "http://ritsokiguess.site/datafiles/soap.txt"
soap <- read_delim(url, " ")
soap
```

## Scatterplot

```{r bootstrap-18, fig.height=3.75}
ggplot(soap, aes(x = speed, y = scrap, colour = line)) + 
  geom_point() + geom_smooth(method = "lm", se = FALSE)
```

## Comments

- Line B produces less scrap for any given speed.
- For line B, estimate the correlation between speed and scrap (with a confidence interval).

## Extract the line B data; standard correlation test

```{r bootstrap-19}
soap %>% filter(line == "b") -> line_b
with(line_b, cor.test(speed, scrap))
```

```{r bootstrap-20, include=FALSE}
o_c <- with(line_b, cor.test(speed, scrap))$conf.int
```

## Bootstrapping a correlation 1/2

- This illustrates a different technique: we need to keep the $x$ and $y$ values *together*.
- Sample *rows* of the data frame, rather than individual values of `speed` and `scrap` (newer dplyr spells this differently; see the aside after the percentile interval):

\scriptsize
```{r bootstrap-21}
line_b %>% sample_frac(replace = TRUE)
```
\normalsize

## Bootstrapping a correlation 2/2

1000 times:

```{r bootstrap-22}
tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(boot_df = list(sample_frac(line_b, replace = TRUE))) %>% 
  mutate(my_cor = with(boot_df, cor(speed, scrap))) -> cors
```

## A picture of this

```{r bootstrap-23, fig.height=4}
ggplot(cors, aes(x = my_cor)) + geom_histogram(bins = 15)
```

## Comments and next steps

- This is very left-skewed.
- The bootstrap percentile interval is:

```{r bootstrap-24}
(b_p <- quantile(cors$my_cor, c(0.025, 0.975)))
```

- We probably need the BCa interval instead.
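## Aside: sampling rows in current dplyr

- `sample_frac` still works, but is superseded in current dplyr; a sketch of the same row resampling with `slice_sample` (same idea; not used in the rest of these slides):

```{r bootstrap-slice}
# prop = 1: a resample as big as the original data, rows drawn with replacement
line_b %>% slice_sample(prop = 1, replace = TRUE)
```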
## Getting the BCa interval 1/2

- To use `bcanon`, write a function that takes a vector of row numbers and returns the correlation between `speed` and `scrap` for those rows:

\footnotesize
```{r bootstrap-25}
theta <- function(rows, d) {
  d %>% slice(rows) %>% with(., cor(speed, scrap))
}
theta(1:3, line_b)
line_b %>% slice(1:3)
```
\normalsize

- That looks about right.

## Getting the BCa interval 2/2

- Inputs to `bcanon` are now:
  - the row numbers (1 through 12 in our case: 12 rows in `line_b`)
  - the number of bootstrap samples
  - the function we just wrote
  - the data frame:

```{r bootstrap-26}
points <- bcanon(1:12, 1000, theta, line_b)$confpoints
points %>% 
  as_tibble() %>% 
  filter(alpha %in% c(0.025, 0.975)) %>% 
  pull(`bca point`) -> b_bca
b_bca
```

## Comparing the results

```{r bootstrap-27}
tibble(limit = my_names, o_c, b_p, b_bca)
```

- The bootstrap percentile interval doesn't go down far enough.
- The BCa interval seems to do a better job of capturing the skewness of the distribution.
- The ordinary confidence interval for the correlation is very similar to the BCa one, and thus seems to be trustworthy here, even though the correlation has a very skewed distribution. (`cor.test` uses the Fisher $z$ transformation, which "spreads out" correlations close to 1.)

## The $z$-transformed bootstrapped correlations

\small
```{r bootstrap-28}
cors %>% 
  mutate(z = 0.5 * log((1 + my_cor) / (1 - my_cor))) %>% 
  ggplot(aes(sample = z)) + stat_qq() + stat_qq_line()
```
\normalsize
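## Aside: the Fisher $z$ interval by hand

- To see where `cor.test`'s interval comes from, a sketch of the standard Fisher $z$ construction (assuming the usual $1/\sqrt{n-3}$ standard error on the $z$ scale); compare with `o_c` above:

```{r bootstrap-fisher}
r <- with(line_b, cor(speed, scrap))  # sample correlation
n_b <- nrow(line_b)                   # 12 rows in line_b
z <- 0.5 * log((1 + r) / (1 - r))     # Fisher z, also atanh(r)
z_ci <- z + c(-1, 1) * qnorm(0.975) / sqrt(n_b - 3)  # normal interval for z
tanh(z_ci)                            # back-transform to the correlation scale
```

- The back-transformed interval is not symmetric about the sample correlation, which is how the transformation accommodates the skewness.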