---
title: "Statistical Inference: Power"
editor: 
  markdown: 
    wrap: 72
---

## Packages

```{r}
#| message: false
library(tidyverse)
```

## Errors in testing

What can happen:

|                | Decision          |                 |
|:---------------|:------------------|:----------------|
| **Truth**      | **Do not reject** | **Reject null** |
| **Null true**  | Correct           | Type I error    |
| **Null false** | Type II error     | Correct         |

Tension between truth and decision about truth (imperfect).

-   Prob. of type I error denoted $\alpha$. Usually fix $\alpha$, eg.
    $\alpha = 0.05$.
-   Prob. of type II error denoted $\beta$. Determined by the planned
    experiment. Low $\beta$ good.
-   Prob. of not making type II error called **power** (= $1 - \beta$).
    *High* power good.

## Power

-   Suppose $H_0 : \theta = 10$, $H_a : \theta \ne 10$ for some
    parameter $\theta$.
-   Suppose $H_0$ wrong. What does that say about $\theta$?
-   Not much. Could have $\theta = 11$ or $\theta = 8$ or
    $\theta = 496$. In each case, $H_0$ wrong.
-   How likely a type II error is depends on what $\theta$ is:
    -   If $\theta = 496$, should be able to reject $H_0 : \theta = 10$
        even for small sample, so $\beta$ should be small (power large).
    -   If $\theta = 11$, might have hard time rejecting $H_0$ even with
        large sample, so $\beta$ would be larger (power smaller).
-   Power depends on true parameter value, and on sample size.
-   So we play "what if": "if $\theta$ were 11 (or 8 or 496), what would
    power be?".

## Figuring out power

-   Time to figure out power is before you collect any data, as part of
    planning process.
-   Need to have idea of what kind of departure from null hypothesis of
    interest to you, eg. average improvement of 5 points on reading test
    scores. (Subject-matter decision, not statistical one.)
-   Then, either:
    -   "I have this big a sample and this big a departure I want to
        detect. What is my power for detecting it?"
    -   "I want to detect this big a departure with this much power. How
        big a sample size do I need?"

## How to understand/estimate power?

-   Suppose we test $H_0 : \mu = 10$ against $H_a : \mu \ne 10$, where
    $\mu$ is population mean.
-   Suppose in actual fact, $\mu = 8$, so $H_0$ is wrong. We want to
    reject it. How likely is that to happen?
-   Need population SD (take $\sigma = 4$) and sample size (take
    $n = 15$). In practice, get $\sigma$ from pilot/previous study, and
    take the $n$ we plan to use.
-   Idea: draw a random sample from the true distribution, test whether
    its mean is 10 or not.
-   Repeat previous step "many" times.
-   "Simulation".

## Making it go

-   Random sample of 15 normal observations with mean 8 and SD 4:

```{r}
#| echo: false
set.seed(457299)
```

```{r}
x <- rnorm(15, 8, 4)
x
```

-   Test whether `x` from population with mean 10 or not (over):

## ...continued

```{r}
t.test(x, mu = 10)
```

-   Fail to reject the mean being 10 (a Type II error).

## or get just P-value

```{r}
ans <- t.test(x, mu = 10)
str(ans)
ans$p.value
```

## Run this lots of times

-   without a loop!
-   use `rowwise` to work one random sample at a time
-   draw random samples from the truth
-   test that $\mu = 10$
-   get P-value
-   Count up how many of the P-values are 0.05 or less.

## In code

```{r inference-2-R-5, echo=FALSE}
set.seed(457299)
```

```{r inference-2-R-6}
tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(15, 8, 4))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 10))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)
```

We correctly rejected 422 times out of 1000, so the estimated power is
0.422.

## Try again with bigger sample

```{r inference-2-R-6a}
tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(40, 8, 4))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 10))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)
```

## Calculating power

-   Simulation approach very flexible: will work for any test. But
    answer different each time because of randomness.
-   In some cases, for example 1-sample and 2-sample t-tests, power can
    be calculated.
-   `power.t.test`. Input `delta` is difference between null and true
    mean:

```{r inference-2-R-7}
power.t.test(n = 15, delta = 10-8, sd = 4, type = "one.sample")
```

## Comparison of results

| Method             | Power  |
|:-------------------|:-------|
| Simulation         | 0.422  |
| **`power.t.test`** | 0.4378 |

-   Simulation power is similar to calculated power; to get more
    accurate value, repeat more times (eg. 10,000 instead of 1,000),
    which takes longer.
-   CI for power based on simulation approx. $0.42 \pm 0.03$.
-   With this small a sample size, the power is not great. With a bigger
    sample, the sample mean should be closer to 8 most of the time, so
    would reject $H_0 : \mu = 10$ more often.

## Calculating required sample size

-   Often, when planning a study, we do not have a particular sample
    size in mind. Rather, we want to know how big a sample to take. This
    can be done by asking how big a sample is needed to achieve a
    certain power.
-   The simulation approach does not work naturally with this, since you
    have to supply a sample size.
    -   For that, you try different sample sizes until you get power
        close to what you want.
-   For the power-calculation method, you supply a value for the power,
    but leave the sample size missing.
-   Re-use the same problem: $H_0 : \mu = 10$ against 2-sided
    alternative, true $\mu = 8$, $\sigma = 4$, but now aim for power
    0.80.

## Using power.t.test

-   No `n=`, replaced by a `power=`:

```{r inference-2-R-8}
power.t.test(power=0.80, delta=10-8, sd=4, type="one.sample")
```


-   Sample size must be a whole number, so round up to 34 (to get at
    least as much power as you want).

## One-sided test

```{r}
power.t.test(power=0.80, delta=10-8, sd=4, type="one.sample", alternative = "one.sided")
```

## Power curves

-   Rather than calculating power for one sample size, or sample size
    for one power, might want a picture of relationship between sample
    size and power.
-   Or, likewise, picture of relationship between difference between
    true and null-hypothesis means and power.
-   Called power curve.
-   Build and plot it yourself.

## Building it

-   If you feed `power.t.test` a collection ("vector") of values, it will
    do calculation for each one.
-   Do power for variety of sample sizes, from 10 to 100 in steps of 10:

```{r inference-2-R-9}
ns <- seq(10,100,10)
ns
```

\small

-   Calculate powers:

```{r inference-2-R-10}
ans<- power.t.test(n=ns, delta=10-8, sd=4, type="one.sample")
ans
str(ans)
ans$power
```

\normalsize

## Building a plot (1/2)

-   Make a data frame out of the values to plot:

```{r inference-2-R-11}
d <- tibble(n=ns, power=ans$power)
d
```

## Building a plot (2/2)

-   Plot these as points joined by lines, and add horizontal line at 1
    (maximum power):

```{r inference-2-R-12}
g <-   ggplot(d, aes(x = n, y = power)) + geom_point() + 
  geom_line() + 
  geom_hline(yintercept = 1, linetype = "dashed")
```

## The power curve

```{r inference-2-R-13}
g
```

## Another way to do it:

```{r inference-2-R-14}
tibble(n=ns) %>% rowwise() %>%
  mutate(power_output = 
           list(power.t.test(n = n, delta = 10-8, sd = 4, 
                             type = "one.sample"))) %>% 
  mutate(power = power_output$power) %>% 
  ggplot(aes(x=n, y=power)) + geom_point() + geom_line() +
    geom_hline(yintercept=1, linetype="dashed") -> g2
```

## The power curve done the other way

```{r inference-2-R-15}
g2
```

## Power curves for means

-   Can also investigate power as it depends on what the true mean is
    (the farther from null mean 10, the higher the power will be).
-   Investigate for two different sample sizes, 15 and 30.
-   First make all combos of mean and sample size:

```{r inference-2-R-16}
means <- seq(6,10,0.5)
means
ns <- c(15,30)
ns
combos <- crossing(mean=means, n=ns)
```

## The combos

\scriptsize

```{r inference-2-R-17}
combos
```

\normalsize

## Calculate and plot

-   Calculate the powers, carefully:

```{r inference-2-R-18}
ans <- with(combos, power.t.test(n=n, delta=10-mean, sd=4, 
                              type="one.sample"))
ans$power
```

## Make a data frame to plot, pulling things from the right places:

```{r inference-2-R-20}
d <- tibble(n=factor(combos$n), mean=combos$mean,
power=ans$power)
# d <- tibble(n=combos$n, mean=combos$mean,
#          power=ans$power)
d
```

## then make the plot:

```{r inference-2-R-21}
g  <-  ggplot(d, aes(x = mean, y = power, colour = n)) +
  geom_point() + geom_line() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  geom_vline(xintercept = 10, linetype = "dotted")
```

## The power curves

```{r inference-2-R-22, fig.height=3.8}
g
```

## Comments

-   When `mean=10`, that is, the true mean equals the null mean, $H_0$
    is actually true, and the probability of rejecting it then is
    $\alpha = 0.05$.
-   As the null gets more wrong (mean decreases), it becomes easier to
    correctly reject it.
-   The blue power curve is above the red one for any mean \< 10,
    meaning that no matter how wrong $H_0$ is, you always have a greater
    chance of correctly rejecting it with a larger sample size.
-   Previously, we had $H_0 : \mu = 10$ and a true $\mu = 8$, so a mean
    of 8 produces power 0.42 and 0.80 as shown on the graph.
-   With $n = 30$, a true mean that is less than about 7 is almost
    certain to be correctly rejected. (With $n = 15$, the true mean
    needs to be less than 6.)

## Two-sample power

```{r inference-2-R-25, echo=FALSE}
#| message = FALSE
my_url <- "http://ritsokiguess.site/datafiles/drp.txt"
kids <- read_delim(my_url," ")
```

-   For kids learning to read, had sample sizes of 22 (approx) in each
    group
-   and these group SDs:

```{r inference-2-R-26}
kids %>% group_by(group) %>% 
  summarize(n=n(), s=sd(score))
```

## Setting up

-   suppose a 5-point improvement in reading score was considered
    important (on this scale)
-   in a 2-sample test, nul(difference of) mean is zero, so `delta` is
    true difference in means
-   what is power for these sample sizes, and what sample size would be
    needed to get power up to 0.80?
-   SD in both groups has to be same in `power.t.test`, so take as 14.

## Calculating power for sample size 22 (per group)

```{r pow1}
power.t.test(n=22, delta=5, sd=14, type="two.sample", 
             alternative="one.sided")
```

## sample size for power 0.8

```{r pow2}
power.t.test(power=0.80, delta=5, sd=14, type="two.sample", 
             alternative="one.sided")
```

## Comments

-   The power for the sample sizes we have is very small (to detect a
    5-point increase).
-   To get power 0.80, we need 98 kids in *each* group!