---
title: "Power of hypothesis tests"
editor: 
  markdown: 
    wrap: 72
---

## Packages

```{r}
#| message: false
library(tidyverse)
```

## Errors in testing

What can happen:

|                | Decision          |                 |
|:---------------|:------------------|:----------------|
| **Truth**      | **Do not reject** | **Reject null** |
| **Null true**  | Correct           | Type I error    |
| **Null false** | Type II error     | Correct         |

Tension between truth and decision about truth (imperfect).

## ... continued

-   Prob. of type I error denoted $\alpha$. Usually fix $\alpha$, eg.
    $\alpha = 0.05$.
-   Prob. of type II error denoted $\beta$. Determined by the planned
    experiment. Low $\beta$ good.
-   Prob. of not making type II error called **power** (= $1 - \beta$).
    *High* power good.

## Power 1/2

-   Suppose $H_0 : \theta = 10$, $H_a : \theta \ne 10$ for some
    parameter $\theta$.
-   Suppose $H_0$ wrong. What does that say about $\theta$?
-   Not much. Could have $\theta = 11$ or $\theta = 8$ or
    $\theta = 496$. In each case, $H_0$ wrong.
    
## Power 2/2
    
-   How likely a type II error is depends on what $\theta$ is:
    -   If $\theta = 496$, should reject $H_0 : \theta = 10$
        even for small sample, so $\beta$ small (power large).
    -   If $\theta = 11$, hard to reject $H_0$ even with
        large sample, so $\beta$ would be larger (power smaller).
-   Power depends on true parameter value, and on sample size.
-   So we play "what if": "if $\theta$ were 11 (or 8 or 496), what would
    power be?".

## Figuring out power 1/2

-   Time to figure out power is before you collect any data, as part of
    planning process.
-   Need to have idea of what kind of departure from null hypothesis of
    interest to you, eg. average improvement of 5 points on reading test
    scores. (Subject-matter decision, not statistical one.)

## Figuring out power 2/2

-   Then, either:
    -   "I have this big a sample and this big a departure I want to
        detect. What is my power for detecting it?"
    -   "I want to detect this big a departure with this much power. How
        big a sample size do I need?"

## How to understand/estimate power?

-   Suppose we test $H_0 : \mu = 10$ against $H_a : \mu \ne 10$, where
    $\mu$ is population mean.
-   Suppose in actual fact, $\mu = 8$, so $H_0$ is wrong. We want to
    reject it. How likely is that to happen?
-   Need population SD (take $\sigma = 4$) and sample size (take
    $n = 15$). In practice, get $\sigma$ from pilot/previous study, and
    take the $n$ we plan to use.
-   Idea: draw a random sample from the true distribution, test whether
    its mean is 10 or not.
-   Repeat previous step "many" times: simulation.

## Making it go

-   Random sample of 15 normal observations with mean 8 and SD 4:

```{r}
#| echo: false
set.seed(457299)
```

```{r}
x <- rnorm(15, 8, 4)
x
```

-   Test whether `x` from population with mean 10 or not (over):

## ...continued

```{r}
t.test(x, mu = 10)
```

-   P-value 0.081, so fail to reject the mean being 10 (a Type II error).

## or get just P-value

```{r}
ans <- t.test(x, mu = 10)
ans$p.value
```

## Run this lots of times via simulation

-   draw random samples from the truth
-   test that $\mu = 10$
-   get P-value
-   Count up how many of the P-values are 0.05 or less.

## In code

```{r inference-2-R-5, echo=FALSE}
set.seed(457299)
```

```{r inference-2-R-6}
tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(15, 8, 4))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 10))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)
```

We correctly rejected 422 times out of 1000, so the estimated power is
0.422.

## Try again with bigger sample

```{r inference-2-R-6a}
tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(40, 8, 4))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 10))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)
```
 
Power is (much) larger with a bigger sample. 
 
## How accurate is my simulation?

- At our chosen $\alpha$, each simulated test independently either rejects or not with some probability $p$ that I am trying to estimate (the power)
- Estimating a population probability using the sample proportion (the number of simulated rejections out of the number of simulated tests)
- hence, `prop.test`.
- inputs: number of rejections, number of simulations.

## Sample size 15, rejected 422 times

```{r}
prop.test(422, 1000)
```

95% CI for power: 0.391 to 0.453

## To estimate power more accurately

- Run more *simulations*:

Change 1000 to eg 10,000:

```{r}
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(15, 8, 4))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 10))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)

```

## Accuracy of power now

```{r}
prop.test(4353, 10000)
```

0.426 to 0.445, about factor $\sqrt{10}$ shorter because number of simulations 10 times bigger.


## Calculating power 1/2

-   Simulation approach very flexible: will work for any test. But
    answer different each time because of randomness.
-   In some cases, for example 1-sample and 2-sample t-tests, power can
    be calculated.
-   `power.t.test`. 


## Calculating power 2/2

- Input `delta` is difference between null and true
    mean:

```{r inference-2-R-7}
power.t.test(n = 15, delta = 10-8, sd = 4, type = "one.sample")
```

## Comparison of results

| Method             | Power  |
|:-------------------|:-------|
| Simulation (10000) | 0.4353 |
| **`power.t.test`** | 0.4378 |

-   Simulation power is similar to calculated power; to get more
    accurate value, repeat more times (eg. 100,000 instead of 10,000),
    which takes longer.
-   With this small a sample size, the power is not great. With a bigger
    sample, the sample mean should be closer to 8 most of the time, so
    would reject $H_0 : \mu = 10$ more often.

## Calculating required sample size

-   Often, when planning a study, we do not have a particular sample
    size in mind. Rather, we want to know how big a sample to take. This
    can be done by asking how big a sample is needed to achieve a
    certain power.
-   The simulation approach does not work naturally with this, since you
    have to supply a sample size.
    -   For that, you try different sample sizes until you get power
        close to what you want.
-   For the power-calculation method, you supply a value for the power,
    but leave the sample size missing.

## Using power.t.test

-   Re-use the same problem: $H_0 : \mu = 10$ against 2-sided
    alternative, true $\mu = 8$, $\sigma = 4$, but now aim for power
    0.80.
-   No `n=`, replaced by a `power=`:

```{r inference-2-R-8}
power.t.test(power=0.80, delta=10-8, sd=4, type="one.sample")
```


-   Sample size must be a whole number, so round up to 34 (to get at
    least as much power as you want).

## Power curves

-   Rather than calculating power for one sample size, or sample size
    for one power, might want a picture of relationship between sample
    size and power.
-   Or, likewise, picture of relationship between difference between
    true and null-hypothesis means and power.
-   Called power curve.
-   Build and plot it yourself.


## Building it:

```{r inference-2-R-14}
tibble(n=seq(10, 100, 10)) %>% rowwise() %>%
  mutate(power_output = 
           list(power.t.test(n = n, delta = 10-8, sd = 4, 
                             type = "one.sample"))) %>% 
  mutate(power = power_output$power) %>% 
  ggplot(aes(x=n, y=power)) + geom_point() + geom_line() +
    geom_hline(yintercept=1, linetype="dashed") -> g2
```

## The power curve

```{r inference-2-R-15}
#| fig-height: 5
g2
```

## Power curves for means

-   Can also investigate power as it depends on what the true mean is
    (the farther from null mean 10, the higher the power will be).
-   Investigate for two different sample sizes, 15 and 30.
-   First make all combos of mean and sample size:

```{r inference-2-R-16}
means <- seq(6,10,0.5)
ns <- c(15,30)
combos <- crossing(mean=means, n=ns)
```

## The combos

\scriptsize

```{r inference-2-R-17}
combos
```

\normalsize

## Calculate powers:

```{r inference-2-R-18}
combos %>% 
  rowwise() %>% 
  mutate(power_stuff = list(power.t.test(n=n, delta=10-mean, sd=4, 
                              type="one.sample"))) %>% 
  mutate(power = power_stuff$power) -> powers
```

## then make the plot:

```{r inference-2-R-21}
g  <-  ggplot(powers, aes(x = mean, y = power, colour = factor(n))) +
  geom_point() + geom_line() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  geom_vline(xintercept = 10, linetype = "dotted")
```

- Need `n` as categorical so that `colour` works properly.

## The power curves

```{r inference-2-R-22, fig.height=5}
g
```

## Comments 1/2

-   When `mean=10`, that is, the true mean equals the null mean, $H_0$
    is actually true, and the probability of rejecting it then is
    $\alpha = 0.05$.
-   As the null gets more wrong (mean decreases), it becomes easier to
    correctly reject it.
-   The blue power curve is above the red one for any mean \< 10,
    meaning that no matter how wrong $H_0$ is, you always have a greater
    chance of correctly rejecting it with a larger sample size.

## Comments 2/2


-   Previously, we had $H_0 : \mu = 10$ and a true $\mu = 8$, so a mean
    of 8 produces power 0.42 and 0.80 as shown on the graph.
-   With $n = 30$, a true mean that is less than about 7 is almost
    certain to be correctly rejected. (With $n = 15$, the true mean
    needs to be less than 6.)

## Two-sample power

```{r inference-2-R-25, echo=FALSE}
#| message = FALSE
my_url <- "http://ritsokiguess.site/datafiles/drp.txt"
kids <- read_delim(my_url," ")
```

-   For kids learning to read, had sample sizes of 22 (approx) in each
    group
-   and these group SDs:

```{r inference-2-R-26}
kids %>% group_by(group) %>% 
  summarize(n=n(), s=sd(score))
```

## Setting up

-   suppose a 5-point improvement in reading score was considered
    important (on this scale)
-   in a 2-sample test, null (difference of) mean is zero, so `delta` is
    true difference in means
-   what is power for these sample sizes, and what sample size would be
    needed to get power up to 0.80?
-   SD in both groups has to be same in `power.t.test`, so take as 14.

## Calculating power for sample size 22 (per group)

```{r pow1}
power.t.test(n=22, delta=5, sd=14, type="two.sample", 
             alternative="one.sided")
```

## sample size for power 0.8

```{r pow2}
power.t.test(power=0.80, delta=5, sd=14, type="two.sample", 
             alternative="one.sided")
```

## Comments

-   The power for the sample sizes we have is very small (to detect a
    5-point increase).
-   To get power 0.80, we need 98 kids in *each* group!