--- title: "Multivariate Analysis of Variance" editor: markdown: wrap: 72 --- ## Multivariate analysis of variance - Standard ANOVA has just one response variable. - What if you have more than one response? - Try an ANOVA on each response separately. - But might miss some kinds of interesting dependence between the responses that distinguish the groups. ## Packages ```{r bManova-1, results='hide', message=FALSE} library(car) # may need to install first library(tidyverse) library(MVTests) # also may need to install ``` ## Small example - Measure yield and seed weight of plants grown under 2 conditions: low and high amounts of fertilizer. - Data (fertilizer, yield, seed weight): ```{r bManova-2 } url <- "http://ritsokiguess.site/datafiles/manova1.txt" hilo <- read_delim(url, " ") ``` - 2 responses, yield and seed weight. ## The data ```{r bManova-3 } hilo ``` ## Boxplot for yield for each fertilizer group ```{r ferto,size="small",fig.height=3} ggplot(hilo, aes(x = fertilizer, y = yield)) + geom_boxplot() ``` Yields overlap for fertilizer groups. ## Boxplot for weight for each fertilizer group ```{r casteldisangro,size="small",fig.height=3} ggplot(hilo, aes(x = fertilizer, y = weight)) + geom_boxplot() ``` Weights overlap for fertilizer groups. ## ANOVAs for yield and weight \small ```{r bManova-4 } hilo.y <- aov(yield ~ fertilizer, data = hilo) summary(hilo.y) hilo.w <- aov(weight ~ fertilizer, data = hilo) summary(hilo.w) ``` \normalsize Neither response depends significantly on fertilizer. But... ## Plotting both responses at once - Have two response variables (not more), so can plot the response variables against *each other*, labelling points by which fertilizer group they're from. \footnotesize - First, create data frame with points $(31,14)$ and $(38,10)$ (why? Later): ```{r bManova-5, size="footnotesize"} d <- tribble( ~line_x, ~line_y, 31, 14, 38, 10 ) ``` - Then plot data as points, and add line through points in `d`: ```{r bManova-6 } ggplot(hilo, aes(x = yield, y = weight, colour = fertilizer)) + geom_point() + geom_line(data = d, aes(x = line_x, y = line_y, colour = NULL)) -> g ``` \normalsize ## The plot ```{r charlecombe, echo=F, fig.height=4} g ``` ## Comments - Graph construction: - Joining points in `d` by line. - `geom_line` inherits `colour` from `aes` in `ggplot`. - Data frame `d` has no `fertilizer` (previous `colour`), so have to unset. - Results: - High-fertilizer plants have both yield and weight high. - True even though no sig difference in yield or weight individually. - Drew line separating highs from lows on plot. ## MANOVA finds multivariate differences - Is difference found by diagonal line significant? MANOVA finds out. \footnotesize ```{r bManova-7, echo=FALSE} options(width = 60) ``` ```{r bManova-8} response <- with(hilo, cbind(yield, weight)) hilo.1 <- manova(response ~ fertilizer, data = hilo) summary(hilo.1) ``` \normalsize - Yes! Difference between groups is *diagonally*, not just up/down (weight) or left-right (yield). The *yield-weight combination* matters. ## Strategy - Create new response variable by gluing together columns of responses, using `cbind`. - Use `manova` with new response, looks like `lm` otherwise. - With more than 2 responses, cannot draw graph. What then? - If MANOVA test significant, cannot use Tukey. What then? - Use *discriminant analysis* (of which more later). ## Another way to do MANOVA using `Manova` from package `car`: ```{r, include=FALSE} w <- getOption("width") options(width = 132) ``` \tiny ```{r bManova-10} hilo.2.lm <- lm(response ~ fertilizer, data = hilo) hilo.2 <- Manova(hilo.2.lm) summary(hilo.2) ``` \normalsize ```{r, include=FALSE} options(width = w) ``` ## Comments - Same result as small-m `manova`. - `Manova` will also do *repeated measures*, coming up later. ## Assumptions - normality of each response variable within each treatment group - this is actually *multivariate* normality, with correlations - equal spreads: each response variable has same variances and correlations (with other response variables) within each treatment group. Here: - yield has same spread for low and high fertilizer - weight has same spread for low and high fertilizer - correlation between yield and weight is same for low and high fertilizer - test equal spread using Box's $M$ test - a certain amount of unequalness is OK, so only a concern if P-value from $M$-test is very small (eg. less than 0.001). ## Assumptions for yield-weight data For normal quantile plots, need "extra-long" with all the data values in one column: ```{r} hilo %>% pivot_longer(-fertilizer, names_to = "xname", values_to = "xvalue") %>% ggplot(aes(sample = xvalue)) + stat_qq() + stat_qq_line() + facet_grid(xname ~ fertilizer, scales = "free") -> g ``` There are only four observations per response variable - treatment group combination, so graphs are not very informative (over): ## The plots ```{r, fig.height=4} g ``` ## Box M test - Make sure package `MVTests` loaded first. - inputs: - the response matrix (or, equivalently, the response-variable columns from your dataframe) - the column with the grouping variable in it (most easily gotten with `$`). ```{r} hilo %>% select(yield, weight) -> numeric_values summary(BoxM(numeric_values, hilo$fertilizer)) ``` - No problem at all with unequal spreads. ## Another example: peanuts - Three different varieties of peanuts (mysteriously, 5, 6 and 8) planted in two different locations. - Three response variables: `y`, `smk` and `w`. ```{r bManova-11, size="footnotesize"} u <- "http://ritsokiguess.site/datafiles/peanuts.txt" peanuts.orig <- read_delim(u, " ") ``` ## The data \small ```{r bManova-12} peanuts.orig ``` \normalsize ## Setup for analysis ```{r bManova-13 } peanuts.orig %>% mutate( location = factor(location), variety = factor(variety) ) -> peanuts response <- with(peanuts, cbind(y, smk, w)) head(response) ``` ## Analysis (using `Manova`) ```{r, include=FALSE} w <- getOption("width") options(width = 100) ``` \tiny ```{r bManova-14} peanuts.1 <- lm(response ~ location * variety, data = peanuts) peanuts.2 <- Manova(peanuts.1) summary(peanuts.2) ``` \normalsize ```{r, include=FALSE} options(width = w) ``` ## Comments - Interaction not quite significant, but main effects are. - Combined response variable `(y,smk,w)` definitely depends on location and on variety - Weak dependence of `(y,smk,w)` on the location-variety *combination.* - Understanding that dependence beyond our scope right now. ## Comments - this time there are only six observations per location and four per variety, so normality is still difficult to be confident about - `y` at location 1 seems to be the worst for normality (long tails / outliers), and maybe `y` at location 2 is skewed left, but the others are not bad - there is some evidence of unequal spread (slopes of lines), but is it bad enough to worry about? (Box M-test, over). ## Box's M tests - One for location, one for variety: ```{r} summary(BoxM(response, peanuts$location)) summary(BoxM(response, peanuts$variety)) ``` - Neither of these P-values is low enough to worry about. (Remember, the P-value here has to be *really* small to indicate a problem.)