--- title: "Power of hypothesis tests" editor: markdown: wrap: 72 --- ## Packages ```{r} #| message: false library(tidyverse) ``` ## Errors in testing What can happen: | | Decision | | |:---------------|:------------------|:----------------| | **Truth** | **Do not reject** | **Reject null** | | **Null true** | Correct | Type I error | | **Null false** | Type II error | Correct | Tension between truth and decision about truth (imperfect). ## ... continued - Prob. of type I error denoted $\alpha$. Usually fix $\alpha$, eg. $\alpha = 0.05$. - Prob. of type II error denoted $\beta$. Determined by the planned experiment. Low $\beta$ good. - Prob. of not making type II error called **power** (= $1 - \beta$). *High* power good. ## Power 1/2 - Suppose $H_0 : \theta = 10$, $H_a : \theta \ne 10$ for some parameter $\theta$. - Suppose $H_0$ wrong. What does that say about $\theta$? - Not much. Could have $\theta = 11$ or $\theta = 8$ or $\theta = 496$. In each case, $H_0$ wrong. ## Power 2/2 - How likely a type II error is depends on what $\theta$ is: - If $\theta = 496$, should reject $H_0 : \theta = 10$ even for small sample, so $\beta$ small (power large). - If $\theta = 11$, hard to reject $H_0$ even with large sample, so $\beta$ would be larger (power smaller). - Power depends on true parameter value, and on sample size. - So we play "what if": "if $\theta$ were 11 (or 8 or 496), what would power be?". ## Figuring out power 1/2 - Time to figure out power is before you collect any data, as part of planning process. - Need to have idea of what kind of departure from null hypothesis of interest to you, eg. average improvement of 5 points on reading test scores. (Subject-matter decision, not statistical one.) ## Figuring out power 2/2 - Then, either: - "I have this big a sample and this big a departure I want to detect. What is my power for detecting it?" - "I want to detect this big a departure with this much power. How big a sample size do I need?" ## How to understand/estimate power? - Suppose we test $H_0 : \mu = 10$ against $H_a : \mu \ne 10$, where $\mu$ is population mean. - Suppose in actual fact, $\mu = 8$, so $H_0$ is wrong. We want to reject it. How likely is that to happen? - Need population SD (take $\sigma = 4$) and sample size (take $n = 15$). In practice, get $\sigma$ from pilot/previous study, and take the $n$ we plan to use. - Idea: draw a random sample from the true distribution, test whether its mean is 10 or not. - Repeat previous step "many" times: simulation. ## Making it go - Random sample of 15 normal observations with mean 8 and SD 4: ```{r} #| echo: false set.seed(457299) ``` ```{r} x <- rnorm(15, 8, 4) x ``` - Test whether `x` from population with mean 10 or not (over): ## ...continued ```{r} t.test(x, mu = 10) ``` - P-value 0.081, so fail to reject the mean being 10 (a Type II error). ## or get just P-value ```{r} ans <- t.test(x, mu = 10) ans$p.value ``` ## Run this lots of times via simulation - draw random samples from the truth - test that $\mu = 10$ - get P-value - Count up how many of the P-values are 0.05 or less. ## In code ```{r inference-2-R-5, echo=FALSE} set.seed(457299) ``` ```{r inference-2-R-6} tibble(sim = 1:1000) %>% rowwise() %>% mutate(my_sample = list(rnorm(15, 8, 4))) %>% mutate(t_test = list(t.test(my_sample, mu = 10))) %>% mutate(p_val = t_test$p.value) %>% count(p_val <= 0.05) ``` We correctly rejected 422 times out of 1000, so the estimated power is 0.422. ## Try again with bigger sample ```{r inference-2-R-6a} tibble(sim = 1:1000) %>% rowwise() %>% mutate(my_sample = list(rnorm(40, 8, 4))) %>% mutate(t_test = list(t.test(my_sample, mu = 10))) %>% mutate(p_val = t_test$p.value) %>% count(p_val <= 0.05) ``` Power is (much) larger with a bigger sample. ## How accurate is my simulation? - At our chosen $\alpha$, each simulated test independently either rejects or not with some probability $p$ that I am trying to estimate (the power) - Estimating a population probability using the sample proportion (the number of simulated rejections out of the number of simulated tests) - hence, `prop.test`. - inputs: number of rejections, number of simulations. ## Sample size 15, rejected 422 times ```{r} prop.test(422, 1000) ``` 95% CI for power: 0.391 to 0.453 ## To estimate power more accurately - Run more *simulations*: Change 1000 to eg 10,000: ```{r} tibble(sim = 1:10000) %>% rowwise() %>% mutate(my_sample = list(rnorm(15, 8, 4))) %>% mutate(t_test = list(t.test(my_sample, mu = 10))) %>% mutate(p_val = t_test$p.value) %>% count(p_val <= 0.05) ``` ## Accuracy of power now ```{r} prop.test(4353, 10000) ``` 0.426 to 0.445, about factor $\sqrt{10}$ shorter because number of simulations 10 times bigger. ## Calculating power 1/2 - Simulation approach very flexible: will work for any test. But answer different each time because of randomness. - In some cases, for example 1-sample and 2-sample t-tests, power can be calculated. - `power.t.test`. ## Calculating power 2/2 - Input `delta` is difference between null and true mean: ```{r inference-2-R-7} power.t.test(n = 15, delta = 10-8, sd = 4, type = "one.sample") ``` ## Comparison of results | Method | Power | |:-------------------|:-------| | Simulation (10000) | 0.4353 | | **`power.t.test`** | 0.4378 | - Simulation power is similar to calculated power; to get more accurate value, repeat more times (eg. 100,000 instead of 10,000), which takes longer. - With this small a sample size, the power is not great. With a bigger sample, the sample mean should be closer to 8 most of the time, so would reject $H_0 : \mu = 10$ more often. ## Calculating required sample size - Often, when planning a study, we do not have a particular sample size in mind. Rather, we want to know how big a sample to take. This can be done by asking how big a sample is needed to achieve a certain power. - The simulation approach does not work naturally with this, since you have to supply a sample size. - For that, you try different sample sizes until you get power close to what you want. - For the power-calculation method, you supply a value for the power, but leave the sample size missing. ## Using power.t.test - Re-use the same problem: $H_0 : \mu = 10$ against 2-sided alternative, true $\mu = 8$, $\sigma = 4$, but now aim for power 0.80. - No `n=`, replaced by a `power=`: ```{r inference-2-R-8} power.t.test(power=0.80, delta=10-8, sd=4, type="one.sample") ``` - Sample size must be a whole number, so round up to 34 (to get at least as much power as you want). ## Power curves - Rather than calculating power for one sample size, or sample size for one power, might want a picture of relationship between sample size and power. - Or, likewise, picture of relationship between difference between true and null-hypothesis means and power. - Called power curve. - Build and plot it yourself. ## Building it: ```{r inference-2-R-14} tibble(n=seq(10, 100, 10)) %>% rowwise() %>% mutate(power_output = list(power.t.test(n = n, delta = 10-8, sd = 4, type = "one.sample"))) %>% mutate(power = power_output$power) %>% ggplot(aes(x=n, y=power)) + geom_point() + geom_line() + geom_hline(yintercept=1, linetype="dashed") -> g2 ``` ## The power curve ```{r inference-2-R-15} #| fig-height: 5 g2 ``` ## Power curves for means - Can also investigate power as it depends on what the true mean is (the farther from null mean 10, the higher the power will be). - Investigate for two different sample sizes, 15 and 30. - First make all combos of mean and sample size: ```{r inference-2-R-16} means <- seq(6,10,0.5) ns <- c(15,30) combos <- crossing(mean=means, n=ns) ``` ## The combos \scriptsize ```{r inference-2-R-17} combos ``` \normalsize ## Calculate powers: ```{r inference-2-R-18} combos %>% rowwise() %>% mutate(power_stuff = list(power.t.test(n=n, delta=10-mean, sd=4, type="one.sample"))) %>% mutate(power = power_stuff$power) -> powers ``` ## then make the plot: ```{r inference-2-R-21} g <- ggplot(powers, aes(x = mean, y = power, colour = factor(n))) + geom_point() + geom_line() + geom_hline(yintercept = 1, linetype = "dashed") + geom_vline(xintercept = 10, linetype = "dotted") ``` - Need `n` as categorical so that `colour` works properly. ## The power curves ```{r inference-2-R-22, fig.height=5} g ``` ## Comments 1/2 - When `mean=10`, that is, the true mean equals the null mean, $H_0$ is actually true, and the probability of rejecting it then is $\alpha = 0.05$. - As the null gets more wrong (mean decreases), it becomes easier to correctly reject it. - The blue power curve is above the red one for any mean \< 10, meaning that no matter how wrong $H_0$ is, you always have a greater chance of correctly rejecting it with a larger sample size. ## Comments 2/2 - Previously, we had $H_0 : \mu = 10$ and a true $\mu = 8$, so a mean of 8 produces power 0.42 and 0.80 as shown on the graph. - With $n = 30$, a true mean that is less than about 7 is almost certain to be correctly rejected. (With $n = 15$, the true mean needs to be less than 6.) ## Two-sample power ```{r inference-2-R-25, echo=FALSE} #| message = FALSE my_url <- "http://ritsokiguess.site/datafiles/drp.txt" kids <- read_delim(my_url," ") ``` - For kids learning to read, had sample sizes of 22 (approx) in each group - and these group SDs: ```{r inference-2-R-26} kids %>% group_by(group) %>% summarize(n=n(), s=sd(score)) ``` ## Setting up - suppose a 5-point improvement in reading score was considered important (on this scale) - in a 2-sample test, null (difference of) mean is zero, so `delta` is true difference in means - what is power for these sample sizes, and what sample size would be needed to get power up to 0.80? - SD in both groups has to be same in `power.t.test`, so take as 14. ## Calculating power for sample size 22 (per group) ```{r pow1} power.t.test(n=22, delta=5, sd=14, type="two.sample", alternative="one.sided") ``` ## sample size for power 0.8 ```{r pow2} power.t.test(power=0.80, delta=5, sd=14, type="two.sample", alternative="one.sided") ``` ## Comments - The power for the sample sizes we have is very small (to detect a 5-point increase). - To get power 0.80, we need 98 kids in *each* group!