--- title: "Analysis of variance revisited" --- ## Analysis of variance * Analysis of variance used with: * counted/measured response * categorical explanatory variable(s) * that is, data divided into groups, and see if response significantly different among groups * or, see whether knowing group membership helps to predict response. ## Two stages * Typically two stages: * $F$-test to detect *any* differences among/due to groups * if $F$-test significant, do *multiple comparisons* to see which groups significantly different from which. * Need special multiple comparisons method because just doing (say) two-sample $t$-tests on each pair of groups gives too big a chance of finding "significant" differences by accident. ## Packages These: {r bAnova-1} library(tidyverse) library(broom) library(car) # for Levene's text  ## Example: Pain threshold and hair colour * Do people with different hair colour have different abilities to deal with pain? * Men and women of various ages divided into 4 groups by hair colour: light and dark blond, light and dark brown. * Each subject given a pain sensitivity test resulting in pain threshold score: higher score is higher pain tolerance. * 19 subjects altogether. ## The data In hairpain.txt (some): ![](Screenshot_2023-08-10_11-37-10.png) ## Summarizing the groups \footnotesize {r bAnova-2, message=F} my_url <- "http://ritsokiguess.site/datafiles/hairpain.txt" hairpain <- read_delim(my_url, " ") hairpain %>% group_by(hair) %>% summarize( n = n(), xbar = mean(pain), s = sd(pain) )  \normalsize Brown-haired people seem to have lower pain tolerance. ## Boxplot {r tartuffo,fig.height=3.5} ggplot(hairpain, aes(x = hair, y = pain)) + geom_boxplot()  ## Assumptions * Data should be: * normally distributed within each group * same spread for each group * darkbrown group has upper outlier (suggests not normal) * darkblond group has smaller IQR than other groups. * But, groups *small*. * Shrug shoulders and continue for moment. ## Testing equality of SDs * via **Levene's test** in package car: \small {r bAnova-3} leveneTest(pain ~ hair, data = hairpain)  \normalsize * No evidence (at all) of difference among group SDs. * Possibly because groups *small*. ## Analysis of variance \small {r bAnova-4} hairpain.1 <- aov(pain ~ hair, data = hairpain) summary(hairpain.1)  \normalsize * P-value small: the mean pain tolerances for the four groups are *not* all the same. * Which groups differ from which, and how? ## Multiple comparisons * Which groups differ from which? Multiple comparisons method. Lots. * Problem: by comparing all the groups with each other, doing many tests, have large chance to (possibly incorrectly) reject $H_0:$ groups have equal means. * 4 groups: 6 comparisons (1 vs 2, 1 vs 3, \ldots, 3 vs 4). 5 groups: 10 comparisons. Thus 6 (or 10) chances to make mistake. * Get "familywise error rate" of 0.05 (whatever), no matter how many comparisons youâ€™re doing. * My favourite: Tukey, or "honestly significant differences": how far apart might largest, smallest group means be (if actually no differences). Group means more different: significantly different. ## Tukey * TukeyHSD: \footnotesize {r bAnova-5} TukeyHSD(hairpain.1)  \normalsize ## The old-fashioned way * List group means in order * Draw lines connecting groups that are *not* significantly different:  darkbrown lightbrown darkblond lightblond 37.4 42.5 51.2 59.2 ------------------------- ---------------  * lightblond significantly higher than everything except darkblond (at $\alpha=0.05$). * darkblond in middle ground: not significantly less than lightblond, not significantly greater than darkbrown and lightbrown. * More data might resolve this. * Looks as if blond-haired people do have higher pain tolerance, but not completely clear. ## Some other multiple-comparison methods * Work any time you do $k$ tests at once (not just ANOVA). * **Bonferroni**: multiply all P-values by $k$. * **Holm**: multiply smallest P-value by $k$, next-smallest by $k-1$, etc. * **False discovery rate**: multiply smallest P-value by $k/1$, 2nd-smallest by $k/2$, \ldots, $i$-th smallest by $k/i$. * Stop after non-rejection. ## Example * P-values 0.005, 0.015, 0.03, 0.06 (4 tests all done at once) Use $\alpha=0.05$. * Bonferroni: * Multiply all P-values by 4 (4 tests). * Reject only 1st null. * Holm: * Times smallest P-value by 4: $0.005*4=0.020<0.05$, reject. * Times next smallest by 3: $0.015*3=0.045<0.05$, reject. * Times next smallest by 2: $0.03*2=0.06>0.05$, do not reject. Stop. ## \ldots Continued * With P-values 0.005, 0.015, 0.03, 0.06: * False discovery rate: * Times smallest P-value by 4: $0.005*4=0.02<0.05$: reject. * Times second smallest by $4/2$: $0.015*4/2=0.03<0.05$, reject. * Times third smallest by $4/3$: $0.03*4/3=0.04<0.05$, reject. * Times fourth smallest by $4/4$: $0.06*4/4=0.06>0.05$, do not reject. Stop. ## pairwise.t.test \tiny {r bAnova-6 } with(hairpain, pairwise.t.test(pain, hair, p.adj = "none")) with(hairpain, pairwise.t.test(pain, hair, p.adj = "holm"))  \normalsize ## pairwise.t.test part 2 \tiny {r bAnova-7 } with(hairpain, pairwise.t.test(pain, hair, p.adj = "fdr")) with(hairpain, pairwise.t.test(pain, hair, p.adj = "bon"))  \normalsize ## Comments * P-values all adjusted upwards from "none". * Required because 6 tests at once. * Highest P-values for Bonferroni: most "conservative". * Prefer Tukey or FDR or Holm. * Tukey only applies to ANOVA, not to other cases of multiple testing. ## Rats and vitamin B * What is the effect of dietary vitamin B on the kidney? * A number of rats were randomized to receive either a B-supplemented diet or a regular diet. * Desired to control for initial size of rats, so classified into size classes lean and obese. * After 20 weeks, rats' kidneys weighed. * Variables: * Response: kidneyweight (grams). * Explanatory: diet, ratsize. * Read in data: \small {r bAnova-8 } my_url <- "http://ritsokiguess.site/datafiles/vitaminb.txt" vitaminb <- read_delim(my_url, " ")  \normalsize ## The data {r bAnova-9 } vitaminb  ## Grouped boxplot {r bAnova-10, fig.height=3.0} ggplot(vitaminb, aes( x = ratsize, y = kidneyweight, fill = diet )) + geom_boxplot()  ## What's going on? * Calculate group means: \footnotesize {r bAnova-11 } summary <- vitaminb %>% group_by(ratsize, diet) %>% summarize(mean = mean(kidneyweight)) summary  \normalsize * Rat size: a large and consistent effect. * Diet: small/no effect (compare same rat size, different diet). * Effect of rat size *same* for each diet: no interaction. ## ANOVA with interaction {r bAnova-12 } vitaminb.1 <- aov(kidneyweight ~ ratsize * diet, data = vitaminb ) summary(vitaminb.1)  - Significance/nonsignificance as we expected. - Note no significant interaction (can be removed). ## Interaction plot * Plot mean of response variable against one of the explanatory, using other one as groups. Start from summary: {r bAnova-13 } g <- ggplot(summary, aes( x = ratsize, y = mean, colour = diet, group = diet )) + geom_point() + geom_line()  * For this, have to give *both* group and colour. ## The interaction plot {r bAnova-14, fig.height=3.7} g  Lines basically parallel, indicating no interaction. ## Take out interaction \small {r bAnova-15} vitaminb.2 <- update(vitaminb.1, . ~ . - ratsize:diet) summary(vitaminb.2)  \normalsize * No Tukey for diet: not significant. * No Tukey for ratsize: only two sizes, and already know that obese rats have larger kidneys than lean ones. * Bottom line: diet has no effect on kidney size once you control for size of rat. ## Assessing assumptions: residuals - In two-way ANOVA, not many observations per treatment group. - Difficult to check for normality / equal spreads. - *But*, any regular ANOVA also a regression. - Use regression residual ideas. - In ANOVA, one fitted value per treatment group (based on means). - Residual: observation minus fitted value. ## Previous ANOVA as regression \scriptsize {r bAnova-16} vitaminb.3 <- lm(kidneyweight ~ ratsize + diet, data = vitaminb) summary(vitaminb.3)  \normalsize ## Reproduce ANOVA \small {r bAnova-17} drop1(vitaminb.3, test = "F")  \normalsize - ANOVA and regression drop1 output always the same. - this time, ANOVA and regression summary output have same P-values, but only because categorical variables both have two levels. ## Are the residuals normal? {r bAnova-18} ggplot(vitaminb.3, aes(sample=.resid)) + stat_qq() + stat_qq_line()  ## Residuals against fitted {r bAnova-19} ggplot(vitaminb.3, aes(x=.fitted, y=.resid)) + geom_point()  ## Comments - 2 rat sizes, 2 diets: only $2 \times 2 = 4$ different fitted values - larger fitted values have greater spread (fan-out, transformation?) - add residuals to data to plot residuals against size, diet (augment from broom): {r bAnova-20} vitaminb.3 %>% augment(vitaminb) -> vitaminb.3a  - explanatory ratsize, diet categorical, so plot resid vs. them with *boxplots*. ## Residuals vs rat size {r bAnova-21} ggplot(vitaminb.3a, aes(x = ratsize, y = .resid)) + geom_boxplot()  ## Residuals vs diet {r bAnova-22} ggplot(vitaminb.3a, aes(x = diet, y = .resid)) + geom_boxplot()  ## Comments - there are low outliers on the plot against diet - residuals for obese rats seem more spread out than for lean rats - case for transformation of rat weights - however, story from our analysis very clear: - rat size strongly significant - diet nowhere near significant - and so expect transformation to make no difference to conclusions. ## The auto noise data In 1973, the President of Texaco cited an automobile filter developed by Associated Octel Company as effective in reducing pollution. However, questions had been raised about the effects of filter silencing. He referred to the data included in the report (and below) as evidence that the silencing properties of the Octel filter were at least equal to those of standard silencers. {r bAnova-23 } u <- "http://ritsokiguess.site/datafiles/autonoise.txt" autonoise <- read_table(u)  ## The data {r bAnova-24 } autonoise  ## Making boxplot * Make a boxplot, but have combinations of filter type and engine size. * Use grouped boxplot again, thus: {r bAnova-25 } g <- autonoise %>% ggplot(aes(x = size, y = noise, fill = type)) + geom_boxplot()  ## The boxplot * See difference in engine noise between Octel and standard is larger for medium engine size than for large or small. * Some evidence of differences in spreads (ignore for now): {r bAnova-26, fig.height=2.7} g  ## ANOVA \small {r bAnova-27} autonoise.1 <- aov(noise ~ size * type, data = autonoise) summary(autonoise.1)  \normalsize * The interaction is significant, as we suspected from the boxplots. * The within-group spreads don't look very equal, but only based on 6 obs each. ## Tukey: ouch! \scriptsize {r bAnova-28} autonoise.2 <- TukeyHSD(autonoise.1) autonoise.2$size:type  \normalsize ## Interaction plot * This time, don't have summary of mean noise for each size-type combination. * One way is to compute summaries (means) first, and feed into ggplot as in vitamin B example. * Or, have ggplot compute them for us, thus: {r bAnova-29 } g <- ggplot(autonoise, aes( x = size, y = noise, colour = type, group = type )) + stat_summary(fun = mean, geom = "point") + stat_summary(fun = mean, geom = "line")  ## Interaction plot The lines are definitely *not* parallel, showing that the effect of type is different for medium-sized engines than for others: {r bAnova-30, fig.height=3} g  ## If you don't like that\ldots \ldots then compute the means first, in a pipeline: \footnotesize {r bAnova-31, fig.height=2.2} autonoise %>% group_by(size, type) %>% summarize(mean_noise = mean(noise)) %>% ggplot(aes( x = size, y = mean_noise, group = type, colour = type )) + geom_point() + geom_line()  \normalsize ## Simple effects for auto noise example * In auto noise example, weren't interested in all comparisons between car size and filter type combinations. * Wanted to demonstrate (lack of) difference between filter types *for each car type*. * These are called **simple effects** of one variable (filter type) conditional on other variable (car type). * To do this, pull out just the data for small cars, compare noise for the two filter types. Then repeat for medium and large cars. (Three one-way ANOVAs.) ## Do it using dplyr tools * Small cars: {r bAnova-32 } autonoise %>% filter(size == "S") %>% aov(noise ~ type, data = .) %>% summary()  * No filter difference for small cars. * For Medium, change S to M and repeat. ## Simple effect of filter type for medium cars \small {r bAnova-33 } autonoise %>% filter(size == "M") %>% aov(noise ~ type, data = .) %>% summary()  \normalsize * There *is* an effect of filter type for medium cars. Look at means to investigate (over). ## Mean noise for each filter type \ldots for medium engine size: {r bAnova-34 } autonoise %>% filter(size == "M") %>% group_by(type) %>% summarize(m = mean(noise))  * Octel filters produce *less* noise for medium cars. ## Large cars * Large cars: {r bAnova-35 } autonoise %>% filter(size == "L") %>% aov(noise ~ type, data = .) %>% summary()  * No significant difference again. ## All at once, using split/apply/combine The "split" part: {r bAnova-36} autonoise %>% group_by(size) %>% nest()  Now have *three* rows, with the data frame for each size encoded as *one element* of this data frame. ## Apply * Write function to do aov on a data frame with columns noise and type, returning P-value: {r bAnova-37 } aov_pval <- function(x) { noise.1 <- aov(noise ~ type, data = x) gg <- tidy(noise.1) gg$p.value[1] }  * Test it: {r bAnova-38 } autonoise %>% filter(size == "L") %>% aov_pval()  * Check. ## Combine * Apply this function to each of the nested data frames (one per engine size): {r bAnova-39} autonoise %>% nest_by(size) %>% rowwise() %>% mutate(p_val = aov_pval(data))  ## Tidy up * The data column was stepping-stone to getting answer. Don't need it any more: \small {r bAnova-40} autonoise %>% nest_by(size) %>% rowwise() %>% mutate(p_val = aov_pval(data)) %>% select(-data) -> simple_effects simple_effects  \normalsize ## Simultaneous tests * When testing simple effects, doing several tests at once. (In this case, 3.) Have to adjust P-values for this. Eg.\ Holm: \footnotesize {r bAnova-41} simple_effects %>% ungroup() %>% arrange(p_val) %>% mutate(multiplier = 4 - row_number()) %>% mutate(p_val_adj = p_val * multiplier)  \normalsize \footnotesize * No change in rejection decisions. * Octel filters sig.\ better in terms of noise for medium cars, and not sig.\ different for other sizes. * Octel filters never significantly worse than standard ones. \normalsize ## Confidence intervals * Perhaps better way of assessing simple effects: look at *confidence intervals* rather than tests. * Gives us sense of accuracy of estimation, and thus whether non-significance might be lack of power: absence of evidence is not evidence of absence''. * Works here because *two* filter types, using t.test for each engine type. * Want to show that the Octel filter is equivalent to or better than the standard filter, in terms of engine noise. ## Equivalence and noninferiority * Known as "equivalence testing" in medical world. A good read: [link](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3019319/). Basic idea: decide on size of difference $\delta$ that would be considered "equivalent", and if CI entirely inside $\pm \delta$, have evidence in favour of equivalence. * We really want to show that the Octel filters are "no worse" than the standard one: that is, equivalent *or better* than standard filters. * Such a "noninferiority test" done by checking that upper limit of CI, new minus old, is *less* than $\delta$. (This requires careful thinking about (i) which way around the difference is and (ii) whether a higher or lower value is better.) ## CI for small cars Same idea as for simple effect test: {r bAnova-42 } autonoise %>% filter(size == "S") %>% t.test(noise ~ type, data = .) %>% pluck("conf.int")  ## CI for medium cars {r bAnova-43 } autonoise %>% filter(size == "M") %>% t.test(noise ~ type, data = .) %>% pluck("conf.int")  ## CI for large cars {r bAnova-44 } autonoise %>% filter(size == "L") %>% t.test(noise ~ type, data = .) %>% pluck("conf.int")  ## Or, all at once: split/apply/combine {r bAnova-45} ci_func <- function(x) { tt <- t.test(noise ~ type, data = x) tt$conf.int } autonoise %>% nest_by(size) %>% rowwise() %>% mutate(ci = list(ci_func(data))) %>% unnest_wider(ci, names_sep = "_") -> cis  \normalsize ## Results {r bAnova-46} cis %>% select(size, starts_with("ci"))  ## Procedure * Function to get CI of difference in noise means for types of filter on input data frame * Nest by size (mini-df data per size) * Calculate CI for each thing in data: CI is two numbers long * unnest ci column (wider) to see two numbers in each CI. ## CIs and noninferiority test * Suppose we decide that a 20 dB difference would be considered equivalent. (I have no idea whether that is reasonable.) * Intervals: \vspace{2ex} \small {r bAnova-47} cis %>% select(-data)  \normalsize ## Comments * In all cases, upper limit of CI is less than 20 dB. The Octel filters are "noninferior" to the standard ones. * Caution: we did 3 procedures at once again. The true confidence level is not 95\%. (Won't worry about that here.) ## Contrasts in ANOVA * Sometimes, don't want to compare *all* groups, only *some* of them. * Might be able to specify these comparisons ahead of time; other comparisons of no interest. * Wasteful to do ANOVA and Tukey. ## Example: chainsaw kickback * From [link](http://www.ohio.edu/plantbio/staff/mccarthy/quantmet/lectures/ANOVA2.pdf). * Forest manager concerned about safety of chainsaws issued to field crew. 4 models of chainsaws, measure "kickback" (degrees of deflection) for 5 of each:  A B C D ----------- 42 28 57 29 17 50 45 29 24 44 48 22 39 32 41 34 43 61 54 30  * So far, standard 1-way ANOVA: what differences are there among models? ## chainsaw kickback (2) * But: models A and D are designed to be used at home, while models B and C are industrial models. * Suggests these comparisons of interest: * home vs.\ industrial * the two home models A vs.\ D * the two industrial models B vs.\ C. * Don't need to compare *all* the pairs of models. ## What is a contrast? * Contrast is a linear combination of group means. * Notation:$\mu_A$for (population) mean of group$A$, and so on. * In example, compare two home models:$H_0: \mu_A-\mu_D=0$. * Compare two industrial models:$H_0: \mu_B-\mu_C=0$. * Compare average of two home models vs.\ average of two industrial models:$H_0: \frac{1}{2}(\mu_A+\mu_D)-{1\over 2}(\mu_B+\mu_C)=0$or$H_0: 0.5\mu_A-0.5\mu_B-0.5\mu_C+0.5\mu_D=0$. * Note that coefficients of contrasts add to 0, and right-hand side is 0. ## Contrasts in R * Comparing two home models A and D ($\mu_A-\mu_D=0$): {r bAnova-48 } c.home <- c(1, 0, 0, -1)  * Comparing two industrial models B and C ($\mu_B-\mu_C=0$): {r bAnova-49 } c.industrial <- c(0, 1, -1, 0)  * Comparing home average vs.\ industrial average ($0.5\mu_A-0.5\mu_B-0.5\mu_C+0.5\mu_D=0$): {r bAnova-50 } c.home.ind <- c(0.5, -0.5, -0.5, 0.5)  ## Orthogonal contrasts * What happens if we multiply the contrast coefficients one by one? {r bAnova-51 } c.home * c.industrial c.home * c.home.ind c.industrial * c.home.ind  * in each case, the results **add up to zero**. Such contrasts are called **orthogonal**. ## Orthogonal contrasts (2) * Compare these: \normalsize {r bAnova-52 } c1 <- c(1, -1, 0) c2 <- c(0, 1, -1) sum(c1 * c2)  \normalsize Not zero, so c1 and c2 are *not* orthogonal. * Orthogonal contrasts are much easier to deal with. * Can use non-orthogonal contrasts, but more trouble (beyond us). ## Read in data \small {r bAnova-53, message=F} url <- "http://ritsokiguess.site/datafiles/chainsaw.txt" chain.wide <- read_table(url) chain.wide  \normalsize ## Tidying Need all the kickbacks in *one* column: {r bAnova-54 } chain.wide %>% pivot_longer(A:D, names_to = "model", names_ptypes = list(model=factor()), values_to = "kickback") -> chain  ## Starting the analysis (2) The proper data frame: {r bAnova-55 } chain  ## Setting up contrasts {r bAnova-56 } m <- cbind(c.home, c.industrial, c.home.ind) m contrasts(chain$model) <- m  ## ANOVA *as if regression* \scriptsize {r bAnova-57 } chain.1 <- lm(kickback ~ model, data = chain) summary(chain.1)  \normalsize ## Conclusions {r bAnova-58 } tidy(chain.1) %>% select(term, p.value)  * Two home models not sig.\ diff.\ (P-value 0.51) * Two industrial models not sig.\ diff.\ (P-value 0.34) * Home, industrial models *are* sig.\ diff.\ (P-value 0.0032). ## Means by model * The means: \footnotesize {r bAnova-59 } chain %>% group_by(model) %>% summarize(mean.kick = mean(kickback)) %>% arrange(desc(mean.kick))  \small * Home models A \& D have less kickback than industrial ones B \& C. * Makes sense because industrial users should get training to cope with additional kickback. \normalsize