---
title: "Analysis of variance revisited"
editor: 
  markdown: 
    wrap: 72
---

## Analysis of variance

-   Analysis of variance used with:

    -   counted/measured response

    -   categorical explanatory variable(s)

    -   that is, data divided into groups, and see if response
        significantly different among groups

    -   or, see whether knowing group membership helps to predict
        response.

## Two stages

-   Typically two stages:

    -   $F$-test to detect *any* differences among/due to groups

    -   if $F$-test significant, do *multiple comparisons* to see which
        groups significantly different from which.

-   Need special multiple comparisons method because just doing (say)
    two-sample $t$-tests on each pair of groups gives too big a chance
    of finding "significant" differences by accident.

## Packages

These:

```{r bAnova-1}
library(tidyverse)
library(broom)
library(car) # for Levene's text
```

## Example: Pain threshold and hair colour

-   Do people with different hair colour have different abilities to
    deal with pain?

-   Men and women of various ages divided into 4 groups by hair colour:
    light and dark blond, light and dark brown.

-   Each subject given a pain sensitivity test resulting in pain
    threshold score: higher score is higher pain tolerance.

-   19 subjects altogether.

## The data

In `hairpain.txt` (some):

![](Screenshot_2023-08-10_11-37-10.png)

## Summarizing the groups

\footnotesize

```{r bAnova-2, message=F}
my_url <- "http://ritsokiguess.site/datafiles/hairpain.txt"
hairpain <- read_delim(my_url, " ")
hairpain %>%
  group_by(hair) %>%
  summarize(
    n = n(),
    xbar = mean(pain),
    s = sd(pain)
  )
```

\normalsize

Brown-haired people seem to have lower pain tolerance.

## Boxplot

```{r tartuffo,fig.height=3.5}
ggplot(hairpain, aes(x = hair, y = pain)) + geom_boxplot()
```

## Assumptions

-   Data should be:

    -   normally distributed within each group

    -   same spread for each group

-   `darkbrown` group has upper outlier (suggests not normal)

-   `darkblond` group has smaller IQR than other groups.

-   But, groups *small*.

-   Shrug shoulders and continue for moment.

## Testing equality of SDs

-   via **Levene's test** in package `car`:

\small

```{r bAnova-3}
leveneTest(pain ~ hair, data = hairpain)
```

\normalsize

-   No evidence (at all) of difference among group SDs.

-   Possibly because groups *small*.

## Analysis of variance

\small

```{r bAnova-4}
hairpain.1 <- aov(pain ~ hair, data = hairpain)
summary(hairpain.1)
```

\normalsize

-   P-value small: the mean pain tolerances for the four groups are
    *not* all the same.

-   Which groups differ from which, and how?

## Multiple comparisons

-   Which groups differ from which? Multiple comparisons method. Lots.

-   Problem: by comparing all the groups with each other, doing many
    tests, have large chance to (possibly incorrectly) reject $H_0:$
    groups have equal means.

-   4 groups: 6 comparisons (1 vs 2, 1 vs 3, \ldots, 3 vs 4). 5 groups:
    10 comparisons. Thus 6 (or 10) chances to make mistake.

-   Get "familywise error rate" of 0.05 (whatever), no matter how many
    comparisons you’re doing.

-   My favourite: Tukey, or "honestly significant differences": how far
    apart might largest, smallest group means be (if actually no
    differences). Group means more different: significantly different.

## Tukey

-   `TukeyHSD:`

\footnotesize

```{r bAnova-5}
TukeyHSD(hairpain.1)
```

\normalsize

## The old-fashioned way

-   List group means in order

-   Draw lines connecting groups that are *not* significantly different:

```         
darkbrown lightbrown  darkblond lightblond
37.4      42.5       51.2       59.2
-------------------------
                     ---------------
```

-   `lightblond` significantly higher than everything except `darkblond`
    (at $\alpha=0.05$).

-   `darkblond` in middle ground: not significantly less than
    `lightblond`, not significantly greater than `darkbrown` and
    `lightbrown`.

-   More data might resolve this.

-   Looks as if blond-haired people do have higher pain tolerance, but
    not completely clear.

## Some other multiple-comparison methods

-   Work any time you do $k$ tests at once (not just ANOVA).

    -   **Bonferroni**: multiply all P-values by $k$.

    -   **Holm**: multiply smallest P-value by $k$, next-smallest by
        $k-1$, etc.

    -   **False discovery rate**: multiply smallest P-value by $k/1$,
        2nd-smallest by $k/2$, \ldots, $i$-th smallest by $k/i$.

-   Stop after non-rejection.

## Example

-   P-values 0.005, 0.015, 0.03, 0.06 (4 tests all done at once) Use
    $\alpha=0.05$.

-   Bonferroni:

    -   Multiply all P-values by 4 (4 tests).

    -   Reject only 1st null.

-   Holm:

    -   Times smallest P-value by 4: $0.005*4=0.020<0.05$, reject.

    -   Times next smallest by 3: $0.015*3=0.045<0.05$, reject.

    -   Times next smallest by 2: $0.03*2=0.06>0.05$, do not reject.
        Stop.

## \ldots Continued

-   With P-values 0.005, 0.015, 0.03, 0.06:

-   False discovery rate:

    -   Times smallest P-value by 4: $0.005*4=0.02<0.05$: reject.

    -   Times second smallest by $4/2$: $0.015*4/2=0.03<0.05$, reject.

    -   Times third smallest by $4/3$: $0.03*4/3=0.04<0.05$, reject.

    -   Times fourth smallest by $4/4$: $0.06*4/4=0.06>0.05$, do not
        reject. Stop.

## `pairwise.t.test`

\tiny

```{r bAnova-6 }
with(hairpain, pairwise.t.test(pain, hair, p.adj = "none"))
with(hairpain, pairwise.t.test(pain, hair, p.adj = "holm"))
```

\normalsize

## `pairwise.t.test` part 2

\tiny

```{r bAnova-7 }
with(hairpain, pairwise.t.test(pain, hair, p.adj = "fdr"))
with(hairpain, pairwise.t.test(pain, hair, p.adj = "bon"))
```

\normalsize

## Comments

-   P-values all adjusted upwards from "none".

-   Required because 6 tests at once.

-   Highest P-values for Bonferroni: most "conservative".

-   Prefer Tukey or FDR or Holm.

-   Tukey only applies to ANOVA, not to other cases of multiple testing.

## Rats and vitamin B

-   What is the effect of dietary vitamin B on the kidney?

-   A number of rats were randomized to receive either a B-supplemented
    diet or a regular diet.

-   Desired to control for initial size of rats, so classified into size
    classes `lean` and `obese`.

-   After 20 weeks, rats' kidneys weighed.

-   Variables:

    -   Response: `kidneyweight` (grams).

    -   Explanatory: `diet`, `ratsize`.

-   Read in data:

```{r bAnova-8 }
my_url <- "http://ritsokiguess.site/datafiles/vitaminb.txt"
vitaminb <- read_delim(my_url, " ")
```

## The data

```{r bAnova-9 }
vitaminb
```

## Grouped boxplot

```{r bAnova-10, fig.height=4.5}
ggplot(vitaminb, aes(
  x = ratsize, y = kidneyweight,
  fill = diet
)) + geom_boxplot()
```

## What's going on? 

-   Calculate group means:

\small

```{r bAnova-11 }
summary <- vitaminb %>%
  group_by(ratsize, diet) %>%
  summarize(n = n(), mean = mean(kidneyweight))
summary
```

\normalsize

-   Rat size: a large and consistent effect.

-   Diet: small/no effect (compare same rat size, different diet).

-   Effect of rat size *same* for each diet: no interaction.

## ANOVA with interaction

```{r bAnova-12 }
vitaminb.1 <- aov(kidneyweight ~ ratsize * diet,
  data = vitaminb
)
summary(vitaminb.1)
```

-   Significance/nonsignificance as we expected.
-   Note no significant interaction (can be removed).

## Interaction plot

-   Plot mean of response variable against one of the explanatory, using
    other one as groups. Start from `summary`:

```{r bAnova-13 }
g <- ggplot(summary, aes(
  x = ratsize, y = mean,
  colour = diet, group = diet
)) +
  geom_point() + geom_line()
```

-   For this, have to give *both* `group` and `colour`.

## The interaction plot

```{r bAnova-14, fig.height=6}
g
```

Lines basically parallel, indicating no interaction.

## Take out interaction

```{r bAnova-15}
vitaminb.2 <- update(vitaminb.1, . ~ . - ratsize:diet)
summary(vitaminb.2)
```

-   No Tukey for `diet`: not significant.

-   No Tukey for `ratsize`: only two sizes, and already know that obese
    rats have larger kidneys than lean ones.

-   Bottom line: diet has no effect on kidney size once you control for
    size of rat.

```{r}
TukeyHSD(vitaminb.2)
```

## Assessing assumptions: residuals

-   In two-way ANOVA, not many observations per treatment group.
-   Difficult to check for normality / equal spreads.
-   *But*, any regular ANOVA also a regression.
-   Use regression residual ideas.
-   In ANOVA, one fitted value per treatment group (based on means).
-   Residual: observation minus fitted value.

## Previous ANOVA as regression

\footnotesize

```{r bAnova-16}
vitaminb.3 <- lm(kidneyweight ~ ratsize + diet, data = vitaminb)
summary(vitaminb.3)
```

\normalsize

## Reproduce ANOVA

```{r bAnova-17}
  drop1(vitaminb.3, test = "F") 
```

-   ANOVA and regression `drop1` output always the same.
-   this time, ANOVA and regression `summary` output have same P-values,
    but only because categorical variables both have two levels.

## Are the residuals normal?

```{r bAnova-18}
#| fig-height: 5
ggplot(vitaminb.3, aes(sample=.resid)) + 
  stat_qq() + stat_qq_line()
```

## Residuals against fitted

```{r bAnova-19}
#| fig-height: 5
ggplot(vitaminb.3, aes(x=.fitted, y=.resid)) + geom_point()
```

## Comments

-   2 rat sizes, 2 diets: only $2 \times 2 = 4$ different fitted values
-   larger fitted values have greater spread (fan-out, transformation?)
-   add residuals to data to plot residuals against size, diet
    (`augment` from `broom`):

```{r bAnova-20}
vitaminb.3 %>% augment(vitaminb) -> vitaminb.3a
```

-   explanatory `ratsize`, `diet` categorical, so plot resid vs. them
    with *boxplots*.

## Residuals vs rat size

```{r bAnova-21}
#| fig-height: 5
ggplot(vitaminb.3a, aes(x = ratsize, y = .resid)) + 
  geom_boxplot()
```

## Residuals vs diet

```{r bAnova-22}
#| fig-height: 5
ggplot(vitaminb.3a, aes(x = diet, y = .resid)) + 
  geom_boxplot()
```

## Comments

-   there are low outliers on the plot against diet
-   residuals for obese rats seem more spread out than for lean rats
-   case for transformation of rat weights
-   however, story from our analysis very clear:
    -   rat size strongly significant
    -   diet nowhere near significant
-   and so expect transformation to make no difference to conclusions.

## The auto noise data

In 1973, the President of Texaco cited an automobile filter developed by
Associated Octel Company as effective in reducing pollution. However,
questions had been raised about the effects of filter silencing. He
referred to the data included in the report (and below) as evidence that
the silencing properties of the Octel filter were at least equal to
those of standard silencers.

```{r bAnova-23 }
u <- "http://ritsokiguess.site/datafiles/autonoise.txt"
autonoise <- read_table(u)
```

## The data

```{r bAnova-24 }
autonoise
```

## Making boxplot

-   Make a boxplot, but have combinations of filter type and engine
    size.

-   Use grouped boxplot again, thus:

```{r bAnova-25 }
g <- autonoise %>%
  ggplot(aes(x = size, y = noise, fill = type)) +
  geom_boxplot()
```

## The boxplot

-   See difference in engine noise between Octel and standard is larger
    for medium engine size than for large or small.

-   Some evidence of differences in spreads (ignore for now):

```{r bAnova-26}
#| fig-height: 4.5
g
```

## ANOVA

```{r bAnova-27}
autonoise.1 <- aov(noise ~ size * type, data = autonoise)
summary(autonoise.1)
```

-   The interaction is significant, as we suspected from the boxplots.

-   The within-group spreads don't look very equal, but only based on 6
    obs each.

## Tukey: ouch!

\scriptsize

```{r bAnova-28}
autonoise.2 <- TukeyHSD(autonoise.1)
autonoise.2$`size:type`
```

\normalsize

## Interaction plot

-   This time, don't have summary of mean noise for each size-type
    combination.

-   One way is to compute summaries (means) first, and feed into
    `ggplot` as in vitamin B example.

-   Or, have `ggplot` compute them for us, thus:

```{r bAnova-29 }
g <- ggplot(autonoise, aes(
  x = size, y = noise,
  colour = type, group = type
)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun = mean, geom = "line")
```

## Interaction plot

The lines are definitely *not* parallel, showing that the effect of
`type` is different for medium-sized engines than for others:

```{r bAnova-30}
#| fig-height: 5
g
```

## If you don't like that...

... then compute the means first:

```{r bAnova-31}
autonoise %>%
  group_by(size, type) %>%
  summarize(mean_noise = mean(noise)) %>%
  ggplot(aes(
    x = size, y = mean_noise, group = type,
    colour = type
  )) + geom_point() + geom_line() -> g
```

## Interaction plot again

```{r}
#| fig-height: 5

g
```


## Simple effects for auto noise example

-   In auto noise example, weren't interested in all comparisons between
    car size and filter type combinations.

-   Wanted to demonstrate (lack of) difference between filter types *for
    each engine type*.

-   These are called **simple effects** of one variable (filter type)
    conditional on other variable (engine type).

-   To do this, pull out just the data for small cars, compare noise for
    the two filter types. Then repeat for medium and large cars. (Three
    one-way ANOVAs.)

## Do it using `dplyr tools`

-   Small cars:

```{r bAnova-32 }
autonoise %>%
  filter(size == "S") %>%
  aov(noise ~ type, data = .) %>%
  summary()
```

-   No filter difference for small cars.

-   For Medium, change `S` to `M` and repeat.

## Simple effect of filter type for medium cars

\small

```{r bAnova-33 }
autonoise %>%
  filter(size == "M") %>%
  aov(noise ~ type, data = .) %>%
  summary()
```

-   There *is* an effect of filter type for medium cars. Look at means
    to investigate (over).

## Mean noise for each filter type

... for medium engine size:

```{r bAnova-34 }
autonoise %>%
  filter(size == "M") %>%
  group_by(type) %>%
  summarize(m = mean(noise))
```

-   Octel filters produce *less* noise for medium cars.

## Large cars

-   Large cars:

```{r bAnova-35 }
autonoise %>%
  filter(size == "L") %>%
  aov(noise ~ type, data = .) %>%
  summary()
```

-   No significant difference again.

## All at once, using split/apply/combine

The "split" part:

```{r bAnova-36}
autonoise %>%
  group_by(size) %>%
  nest()
```

Now have *three* rows, with the data frame for each size encoded as *one
element* of this data frame.

## Apply

-   Write function to do `aov` on a data frame with columns `noise` and
    `type`, returning P-value:

```{r bAnova-37 }
aov_pval <- function(x) {
  noise.1 <- aov(noise ~ type, data = x)
  gg <- tidy(noise.1)
  gg$p.value[1]
}
```

-   Test it:

```{r bAnova-38 }
autonoise %>%
  filter(size == "L") %>%
  aov_pval()
```

-   Check.

## Combine

-   Apply this function to each of the nested data frames (one per
    engine size):

```{r bAnova-39}
autonoise %>%
  nest_by(size) %>% 
  rowwise() %>% 
  mutate(p_val = aov_pval(data)) %>% 
  select(-data)
```

## Tidy up

-   The `data` column was stepping-stone to getting answer. Don't need
    it any more:

\small

```{r bAnova-40}
autonoise %>%
  nest_by(size) %>% 
  rowwise() %>% 
  mutate(p_val = aov_pval(data)) %>% 
  select(-data) -> simple_effects
simple_effects
```

## Simultaneous tests

-   When testing simple effects, doing several tests at once. (In this
    case, 3.) Should adjust P-values for this. Eg. Bonferroni:

```{r bAnova-41}
simple_effects %>% 
  mutate(p_val_adj = p_val * 3)
```

-   No change in rejection decisions.

-   Octel filters sig. better in terms of noise for medium cars, and not
    sig. different for other sizes.

-   Octel filters never significantly worse than standard ones.

## Confidence intervals

-   Perhaps better way of assessing simple effects: look at *confidence
    intervals* rather than tests.

-   Gives us sense of accuracy of estimation, and thus whether
    non-significance might be lack of power: \`\`absence of evidence is
    not evidence of absence''.

-   Works here because *two* filter types, using `t.test` for each
    engine type.

-   Want to show that the Octel filter is equivalent to or better than
    the standard filter, in terms of engine noise.

## Equivalence and noninferiority

-   Known as "equivalence testing" in medical world. A good read:
    [link](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3019319/). Basic
    idea: decide on size of difference $\delta$ that would be considered
    "equivalent", and if CI entirely inside $\pm \delta$, have evidence
    in favour of equivalence.

-   We really want to show that the Octel filters are "no worse" than
    the standard one: that is, equivalent *or better* than standard
    filters.

-   Such a "noninferiority test" done by checking that `upper limit` of
    CI, new minus old, is *less* than $\delta$. (This requires careful
    thinking about (i) which way around the difference is and (ii)
    whether a higher or lower value is better.)

## CI for small cars

Same idea as for simple effect test:

```{r bAnova-42 }
autonoise %>%
  filter(size == "S") %>%
  t.test(noise ~ type, data = .) %>%
  pluck("conf.int")
```

## CI for medium cars

```{r bAnova-43 }
autonoise %>%
  filter(size == "M") %>%
  t.test(noise ~ type, data = .) %>%
  pluck("conf.int")
```

## CI for large cars

```{r bAnova-44 }
autonoise %>%
  filter(size == "L") %>%
  t.test(noise ~ type, data = .) %>%
  pluck("conf.int")
```

## Or, all at once: split/apply/combine

```{r bAnova-45}
ci_func <- function(x) {
  tt <- t.test(noise ~ type, data = x)
  tt$conf.int
}

autonoise %>% nest_by(size) %>%
  rowwise() %>% 
  mutate(ci = list(ci_func(data))) %>%
  unnest_wider(ci, names_sep = "_") -> cis
```

## Results

```{r bAnova-46}
cis %>% select(size, starts_with("ci"))
```

## Procedure

-   Function to get CI of difference in noise means for types of filter
    on input data frame

-   Nest by `size` (mini-df `data` per size)

-   Calculate CI for each thing in `data`: CI is two numbers long

-   `unnest` `ci` column (wider) to see two numbers in each CI.

## CIs and noninferiority test

-   Suppose we decide that a 20 dB difference would be considered
    equivalent. (I have no idea whether that is reasonable.)

-   Intervals:

```{r bAnova-47}
cis %>% select(-data)
```

## Comments

-   In all cases, upper limit of CI is less than 20 dB. The Octel
    filters are "noninferior" to the standard ones.

-   Caution: we did 3 procedures at once again. The true confidence
    level is not 95%. (Won't worry about that here.)

## Contrasts in ANOVA

-   Sometimes, don't want to compare *all* groups, only *some* of them.

-   Might be able to specify these comparisons ahead of time; other
    comparisons of no interest.

-   Wasteful to do ANOVA and Tukey.

## Example: chainsaw kickback

-   From
    [link](http://www.ohio.edu/plantbio/staff/mccarthy/quantmet/lectures/ANOVA2.pdf).

-   Forest manager concerned about safety of chainsaws issued to field
    crew. 4 models of chainsaws, measure "kickback" (degrees of
    deflection) for 5 of each:

```         

 A  B  C  D
-----------
42 28 57 29
17 50 45 29
24 44 48 22
39 32 41 34
43 61 54 30
```

-   So far, standard 1-way ANOVA: what differences are there among
    models?

## chainsaw kickback (2)

-   But: models A and D are designed to be used at home, while models B
    and C are industrial models.

- Suggests these comparisons of interest:

  - home vs. industrial

  - the two home models A vs. D

  - the two industrial models B vs. C.

- Don't need to compare *all* the pairs of models.

## What is a contrast?

- Contrast is a linear combination of group means.

- Notation: $\mu_A$ for (population) mean of group $A$, and so on. In example:

  - compare two home models: $H_0: \mu_A-\mu_D=0$.

  - compare two industrial models: $H_0: \mu_B-\mu_C=0$.

  - compare average of two home models vs. average of two industrial
    models: $H_0: \frac{1}{2}(\mu_A+\mu_D)-{1\over 2}(\mu_B+\mu_C)=0$ or
    $H_0: 0.5\mu_A-0.5\mu_B-0.5\mu_C+0.5\mu_D=0$.

- Note that coefficients of contrasts add to 0, 
and right-hand side is 0.

## Contrasts in R

-   Comparing two home models A and D ($\mu_A-\mu_D=0$):

```{r bAnova-48 }
c.home <- c(1, 0, 0, -1)
```

-   Comparing two industrial models B and C ($\mu_B-\mu_C=0$):

```{r bAnova-49 }
c.industrial <- c(0, 1, -1, 0)
```

-   Comparing home average vs. industrial average
    ($0.5\mu_A-0.5\mu_B-0.5\mu_C+0.5\mu_D=0$):

```{r bAnova-50 }
c.home.ind <- c(0.5, -0.5, -0.5, 0.5)
```

## Orthogonal contrasts

-   What happens if we multiply the contrast coefficients one by one?

```{r bAnova-51 }
# c.home
c.industrial
c.home.ind
# c.home * c.industrial
# c.home * c.home.ind
c.industrial * c.home.ind
```

-   in each case, the results **add up to zero**. Such contrasts are
    called **orthogonal**.

## Orthogonal contrasts (2)

-   Compare these:

\normalsize

```{r bAnova-52 }
c1 <- c(1, -1, 0)
c2 <- c(0, 1, -1)
sum(c1 * c2)
```

\normalsize

Not zero, so `c1` and `c2` are *not* orthogonal.

-   Orthogonal contrasts are much easier to deal with.

-   Can use non-orthogonal contrasts, but more trouble (beyond us).

## Read in data

```{r bAnova-53, message=FALSE}
url <- "http://ritsokiguess.site/datafiles/chainsaw.txt"
chain.wide <- read_table(url)
chain.wide
```

## Tidying

Need all the kickbacks in *one* column:

```{r bAnova-54 }
chain.wide %>% 
  pivot_longer(A:D, names_to = "model", 
               names_ptypes = list(model = factor()), 
               values_to = "kickback") -> chain
```

## Starting the analysis (2)

The proper data frame:

```{r bAnova-55 }
chain 
```

## Setting up contrasts

```{r bAnova-56 }
m <- cbind(c.home, c.industrial, c.home.ind)
m
contrasts(chain$model) <- m
```

## ANOVA *as if regression*

\footnotesize

```{r bAnova-57 }
chain.1 <- lm(kickback ~ model, data = chain)
summary(chain.1)
```

\normalsize

## Conclusions

```{r bAnova-58 }
tidy(chain.1) %>% select(term, p.value)
```

-   Two home models not sig. diff. (P-value 0.51)

-   Two industrial models not sig. diff. (P-value 0.34)

-   Home, industrial models *are* sig. diff. (P-value 0.0032).

## Means by model

-   The means:

```{r bAnova-59 }
chain %>%
  group_by(model) %>%
  summarize(mean.kick = mean(kickback)) %>%
  arrange(desc(mean.kick))
```

-   Home models A & D have less kickback than industrial ones B & C.

-   Makes sense because industrial users should get training to cope
    with additional kickback.