--- title: "Statistical Inference: Power" editor: markdown: wrap: 72 --- ## Packages ```{r} #| message: false library(tidyverse) ``` ## Errors in testing What can happen: | | Decision | | |:---------------|:------------------|:----------------| | **Truth** | **Do not reject** | **Reject null** | | **Null true** | Correct | Type I error | | **Null false** | Type II error | Correct | Tension between truth and decision about truth (imperfect). - Prob. of type I error denoted $\alpha$. Usually fix $\alpha$, eg. $\alpha = 0.05$. - Prob. of type II error denoted $\beta$. Determined by the planned experiment. Low $\beta$ good. - Prob. of not making type II error called **power** (= $1 - \beta$). *High* power good. ## Power - Suppose $H_0 : \theta = 10$, $H_a : \theta \ne 10$ for some parameter $\theta$. - Suppose $H_0$ wrong. What does that say about $\theta$? - Not much. Could have $\theta = 11$ or $\theta = 8$ or $\theta = 496$. In each case, $H_0$ wrong. - How likely a type II error is depends on what $\theta$ is: - If $\theta = 496$, should be able to reject $H_0 : \theta = 10$ even for small sample, so $\beta$ should be small (power large). - If $\theta = 11$, might have hard time rejecting $H_0$ even with large sample, so $\beta$ would be larger (power smaller). - Power depends on true parameter value, and on sample size. - So we play "what if": "if $\theta$ were 11 (or 8 or 496), what would power be?". ## Figuring out power - Time to figure out power is before you collect any data, as part of planning process. - Need to have idea of what kind of departure from null hypothesis of interest to you, eg. average improvement of 5 points on reading test scores. (Subject-matter decision, not statistical one.) - Then, either: - "I have this big a sample and this big a departure I want to detect. What is my power for detecting it?" - "I want to detect this big a departure with this much power. How big a sample size do I need?" ## How to understand/estimate power? - Suppose we test $H_0 : \mu = 10$ against $H_a : \mu \ne 10$, where $\mu$ is population mean. - Suppose in actual fact, $\mu = 8$, so $H_0$ is wrong. We want to reject it. How likely is that to happen? - Need population SD (take $\sigma = 4$) and sample size (take $n = 15$). In practice, get $\sigma$ from pilot/previous study, and take the $n$ we plan to use. - Idea: draw a random sample from the true distribution, test whether its mean is 10 or not. - Repeat previous step "many" times. - "Simulation". ## Making it go - Random sample of 15 normal observations with mean 8 and SD 4: ```{r} #| echo: false set.seed(457299) ``` ```{r} x <- rnorm(15, 8, 4) x ``` - Test whether `x` from population with mean 10 or not (over): ## ...continued ```{r} t.test(x, mu = 10) ``` - Fail to reject the mean being 10 (a Type II error). ## or get just P-value ```{r} ans <- t.test(x, mu = 10) str(ans) ans$p.value ``` ## Run this lots of times - without a loop! - use `rowwise` to work one random sample at a time - draw random samples from the truth - test that $\mu = 10$ - get P-value - Count up how many of the P-values are 0.05 or less. ## In code ```{r inference-2-R-5, echo=FALSE} set.seed(457299) ``` ```{r inference-2-R-6} tibble(sim = 1:1000) %>% rowwise() %>% mutate(my_sample = list(rnorm(15, 8, 4))) %>% mutate(t_test = list(t.test(my_sample, mu = 10))) %>% mutate(p_val = t_test$p.value) %>% count(p_val <= 0.05) ``` We correctly rejected 422 times out of 1000, so the estimated power is 0.422. ## Try again with bigger sample ```{r inference-2-R-6a} tibble(sim = 1:1000) %>% rowwise() %>% mutate(my_sample = list(rnorm(40, 8, 4))) %>% mutate(t_test = list(t.test(my_sample, mu = 10))) %>% mutate(p_val = t_test$p.value) %>% count(p_val <= 0.05) ``` ## Calculating power - Simulation approach very flexible: will work for any test. But answer different each time because of randomness. - In some cases, for example 1-sample and 2-sample t-tests, power can be calculated. - `power.t.test`. Input `delta` is difference between null and true mean: ```{r inference-2-R-7} power.t.test(n = 15, delta = 10-8, sd = 4, type = "one.sample") ``` ## Comparison of results | Method | Power | |:-------------------|:-------| | Simulation | 0.422 | | **`power.t.test`** | 0.4378 | - Simulation power is similar to calculated power; to get more accurate value, repeat more times (eg. 10,000 instead of 1,000), which takes longer. - CI for power based on simulation approx. $0.42 \pm 0.03$. - With this small a sample size, the power is not great. With a bigger sample, the sample mean should be closer to 8 most of the time, so would reject $H_0 : \mu = 10$ more often. ## Calculating required sample size - Often, when planning a study, we do not have a particular sample size in mind. Rather, we want to know how big a sample to take. This can be done by asking how big a sample is needed to achieve a certain power. - The simulation approach does not work naturally with this, since you have to supply a sample size. - For that, you try different sample sizes until you get power close to what you want. - For the power-calculation method, you supply a value for the power, but leave the sample size missing. - Re-use the same problem: $H_0 : \mu = 10$ against 2-sided alternative, true $\mu = 8$, $\sigma = 4$, but now aim for power 0.80. ## Using power.t.test - No `n=`, replaced by a `power=`: ```{r inference-2-R-8} power.t.test(power=0.80, delta=10-8, sd=4, type="one.sample") ``` - Sample size must be a whole number, so round up to 34 (to get at least as much power as you want). one-sided test? ```{r} power.t.test(power=0.80, delta=10-8, sd=4, type="one.sample", alternative = "one.sided") ``` ## Power curves - Rather than calculating power for one sample size, or sample size for one power, might want a picture of relationship between sample size and power. - Or, likewise, picture of relationship between difference between true and null-hypothesis means and power. - Called power curve. - Build and plot it yourself. ## Building it - If you feed power.t.test a collection ("vector") of values, it will do calculation for each one. - Do power for variety of sample sizes, from 10 to 100 in steps of 10: ```{r inference-2-R-9} ns <- seq(10,100,10) ns ``` \small - Calculate powers: ```{r inference-2-R-10} ans<- power.t.test(n=ns, delta=10-8, sd=4, type="one.sample") ans str(ans) ans$power ``` \normalsize ## Building a plot (1/2) - Make a data frame out of the values to plot: ```{r inference-2-R-11} d <- tibble(n=ns, power=ans$power) d ``` ## Building a plot (2/2) - Plot these as points joined by lines, and add horizontal line at 1 (maximum power): ```{r inference-2-R-12} g <- ggplot(d, aes(x = n, y = power)) + geom_point() + geom_line() + geom_hline(yintercept = 1, linetype = "dashed") ``` ## The power curve ```{r inference-2-R-13} g ``` ## Another way to do it: ```{r inference-2-R-14} tibble(n=ns) %>% rowwise() %>% mutate(power_output = list(power.t.test(n = n, delta = 10-8, sd = 4, type = "one.sample"))) %>% mutate(power = power_output$power) %>% ggplot(aes(x=n, y=power)) + geom_point() + geom_line() + geom_hline(yintercept=1, linetype="dashed") -> g2 ``` ## The power curve done the other way ```{r inference-2-R-15} g2 ``` ## Power curves for means - Can also investigate power as it depends on what the true mean is (the farther from null mean 10, the higher the power will be). - Investigate for two different sample sizes, 15 and 30. - First make all combos of mean and sample size: ```{r inference-2-R-16} means <- seq(6,10,0.5) means ns <- c(15,30) ns combos <- crossing(mean=means, n=ns) ``` ## The combos \scriptsize ```{r inference-2-R-17} combos ``` \normalsize ## Calculate and plot - Calculate the powers, carefully: ```{r inference-2-R-18} ans <- with(combos, power.t.test(n=n, delta=10-mean, sd=4, type="one.sample")) ans$power ``` ## Make a data frame to plot, pulling things from the right places: ```{r inference-2-R-20} d <- tibble(n=factor(combos$n), mean=combos$mean, power=ans$power) d ``` ## then make the plot: ```{r inference-2-R-21} g <- ggplot(d, aes(x = mean, y = power, colour = n)) + geom_point() + geom_line() + geom_hline(yintercept = 1, linetype = "dashed") + geom_vline(xintercept = 10, linetype = "dotted") ``` ## The power curves ```{r inference-2-R-22, fig.height=3.8} g ``` ## Comments - When `mean=10`, that is, the true mean equals the null mean, $H_0$ is actually true, and the probability of rejecting it then is $\alpha = 0.05$. - As the null gets more wrong (mean decreases), it becomes easier to correctly reject it. - The blue power curve is above the red one for any mean \< 10, meaning that no matter how wrong $H_0$ is, you always have a greater chance of correctly rejecting it with a larger sample size. - Previously, we had $H_0 : \mu = 10$ and a true $\mu = 8$, so a mean of 8 produces power 0.42 and 0.80 as shown on the graph. - With $n = 30$, a true mean that is less than about 7 is almost certain to be correctly rejected. (With $n = 15$, the true mean needs to be less than 6.) ## Two-sample power ```{r inference-2-R-25, echo=FALSE} #| message = FALSE my_url <- "http://ritsokiguess.site/datafiles/drp.txt" kids <- read_delim(my_url," ") ``` - For kids learning to read, had sample sizes of 22 (approx) in each group - and these group SDs: ```{r inference-2-R-26} kids %>% group_by(group) %>% summarize(n=n(), s=sd(score)) ``` ## Setting up - suppose a 5-point improvement in reading score was considered important (on this scale) - in a 2-sample test, null (difference of) mean is zero, so `delta` is true difference in means - what is power for these sample sizes, and what sample size would be needed to get power up to 0.80? - SD in both groups has to be same in `power.t.test`, so take as 14. ## Calculating power for sample size 22 (per group) ```{r pow1} power.t.test(n=22, delta=5, sd=14, type="two.sample", alternative="one.sided") ``` ## sample size for power 0.8 ```{r pow2} power.t.test(power=0.80, delta=5, sd=14, type="two.sample", alternative="one.sided") ``` ## Comments - The power for the sample sizes we have is very small (to detect a 5-point increase). - To get power 0.80, we need 98 kids in *each* group!