--- title: "Using the computer for statistical inference" author: Abhijit Dasgupta date: BIOF 339 --- ```{r setup, child = here::here('slides/templates/setup.Rmd'), include=FALSE} ``` --- class: middle, center # A bit of computational inference --- ## Computational inference - Using the computer and simulation to simulate null distributions or sampling distributions of statistics - _sampling distribution_ is the distribution of values we'd see for a statistics if we repeated the experiment over and over - _null distribution_ is the distribution of values we'd expect to see if the null hypothesis we're using happens to be true. ![](../img/infer.png) --- ## Simulation R has several standard distributions programmed in, from which random numbers can be drawn and distributions visualized --- ## Simulation .pull-left[ The Gaussian or normal distribution ```{r sim1, eval = F, echo = T} x <- rnorm(10000) # 10,000 random numbers from standard normal distribution hist(x) # This is the base R way to create a histogram ``` ] .pull-right[ ```{r 09-Modeling-1, eval=T, echo = F, ref.label="sim1"} ``` ] --- ## Simulation .pull-left[ ```{r sim2, eval = F, echo = T} # Toss a fair coin 10 times, count number of heads # Repeat 10,000 times x <- rbinom(10000, size = 10, prob = 0.5) hist(x) ``` ] .pull-right[ ```{r 09-Modeling-2, eval=T, echo = F, ref.label="sim2"} ``` ] --- ## Simulation .pull-left[ ```{r sim3, eval = F, echo = T} # 10,000 random dumbers from a chi-square distribution # with 1 d.f. x <- rchisq(10000, 1) hist(x) ``` ] .pull-right[ ```{r 09-Modeling-3, eval=T, echo = F, ref.label="sim3"} ``` ] --- ## Simulations Simulations on a computer aren't exactly random, but *pseudo-random*. They form a _complex, deterministic_ series of numbers which have some properties. You can set the starting point of the series. It's called the **seed**. --- ```{r sim, eval = F, echo = T} set.seed(28954) #<< dd <- data.frame(x = rnorm(100)) dd$y <- rnorm(100) plt1 <- ggplot(dd, aes(x))+geom_histogram() plt2 <- ggplot(dd, aes( y)) + geom_histogram() cowplot::plot_grid(plt1, plt2, nrow=1) ``` ```{r 09-Modeling-4, eval=T, echo = F, ref.label="sim", fig.height=3.5} ``` --- ```{r sim4, eval = F, echo = T} set.seed(28954) #<< dd <- data.frame(x = rnorm(100)) set.seed(28954) #<< dd$y <- rnorm(100) plt1 <- ggplot(dd, aes(x))+geom_histogram() plt2 <- ggplot(dd, aes( y)) + geom_histogram() cowplot::plot_grid(plt1, plt2, nrow=1) ``` ```{r 09-Modeling-5, eval=T, echo = F, ref.label="sim4", fig.height=3.5} ``` .bg-washed-blue.b --- class: middle, center ## Always set a seed to ensure replicability of simulation experiments --- ## Permutation tests We will use the R package `infer` for this section, as well as the `pbc` data. We'll first do a permutation test to see if bilirubin is significantly different by treatment group. A classical thing to do would be a `t.test` or a `wilcox.test`. ```{r 09-Modeling-6, echo=T} library(survival) t.test(bili~trt, data=pbc) ``` --- ## Permutation tests In a permutation test, we assume the null hypothesis of no difference between the groups, which means - if we shuffled the group memberships (re-assigned individuals to different treatments) nothing should change. If we compute the test statistic over different permutations, we'll get the distribution of test statistic values we'd see if the null hypothesis was true. --- ## Permutation tests .pull-left[ ```{r simt, echo=T, eval=F} library(infer) set.seed(10193) sims <- pbc %>% mutate(trt = as.factor(trt)) %>% specify(bili ~ trt) %>% hypothesize(null = 'independence') %>% generate(reps = 1000, type = 'permute') %>% calculate(stat = 'diff in means', order = c('1','2')) visualize(sims) ``` ```{r 09-Modeling-7, echo=T} library(infer) (obs_stat <- pbc %>% mutate(trt=as.factor(trt)) %>% specify(bili ~ trt) %>% calculate(stat='diff in means',order = c('1','2'))) ``` ] .pull-right[ ```{r 09-Modeling-8, ref.label='simt', echo = F, eval=T} ``` ] --- ## Permutation tests .pull-left[ ```{r simt2, echo=T, eval=F} library(infer) set.seed(10193) sims <- pbc %>% mutate(trt = as.factor(trt)) %>% specify(bili ~ trt) %>% # trt must be a factor hypothesize(null = 'independence') %>% generate(reps = 1000, type = 'permute') %>% calculate(stat = 'diff in means', order = c('1','2')) visualize(sims) + shade_p_value(obs_stat, direction = 'both') ``` ```{r 09-Modeling-9, echo=T} sims %>% get_pvalue(obs_stat, direction ='both') ``` ] .pull-right[ ```{r 09-Modeling-10, ref.label='simt2', echo = F, eval=T} ``` ] --- ## Bootstrapping for confidence intervals Suppose we want to get a confidence interval for the mean bilirubin level overall. The bootstrap samples the original data **with replacement** to get a dataset of the same size. Since sampling is with replacement, some observations are repeated, some are omitted. Strong theory from the 80s and 90s says that if we repeatedly take bootstrap samples of our data and compute the sample means, their distribution will be very close to the true sampling distribution of the sample mean. --- ## Bootstrapping for confidence intervals .pull-left[ ```{r 09-Modeling-11,echo=T, eval=T} x <- pbc %>% specify(response = bili) %>% generate(reps = 1000, type = 'bootstrap') %>% #<< calculate('mean') ``` ```{r boot, echo=F, eval=F} visualize(x) ``` ```{r 09-Modeling-12, echo=T} (ci <- x %>% get_confidence_interval()) ``` ] .pull-right[ ```{r 09-Modeling-13, echo=F, eval=T, ref.label='boot'} ``` ] --- ## Resources 1. Several [infer](https://infer.netlify.com/articles/two_sample_t.html) [examples](https://infer.netlify.com/articles/flights_examples.html) 1. The [R Companion](https://rcompanion.org/handbook/K_01.html) chapter on permutation tests 1. [This site](https://mac-theobio.github.io/QMEE/permutation_examples.html) gives alternative methods for doing permutation tests in R 1. The [coin](http://coin.r-forge.r-project.org/) package provides a richer set of permutation tests, but `infer` covers what you need most often.