--- title: "Multivariate models" date: '`r paste("First created on 2015-02-15. Updated on", Sys.Date())`' output: html_document: toc: true toc_float: true --- Home Page - http://dmcglinn.github.io/quant_methods/ GitHub Repo - https://github.com/dmcglinn/quant_methods ### Source Code Link https://raw.githubusercontent.com/dmcglinn/quant_methods/gh-pages/lessons/multivariate_models.Rmd The goal of this lesson is to introduce multivariate ordination analyses. ## Readings * Chapters 5 and 6 of *Numerical Ecology with R* * Chapters 9 and 11 of *Numerical Ecology* 2nd edition ## Online Docs * [Ordination analysis by David Zeleny](https://www.davidzeleny.net/anadat-r/doku.php/en:ordination) * [The Ordination Webpage by Mike Palmer](http://ordination.okstate.edu/) - great for term definitions, layman's explanation of how the methods differ, and how ecologists should interpret * [Vegan: an introduction to ordination by Jari Oksanen](http://cran.r-project.org/web/packages/vegan/vignettes/intro-vegan.pdf) - A brief demonstration of an ordination analysis in the R package vegan * [Multivariate Analysis of Ecological Communities in R: vegan tutorial by Jari Oksanen](https://john-quensen.com/wp-content/uploads/2018/10/Oksanen-Jari-vegantutor.pdf) - A more thorough of ordination in the R package vegan ## Outline * Overview of ordination methods * Create a community matrix. * Indirect or Unconstrained Ordination - Principal Components Analysis (PCA) - Correspondence Analysis (CA) - Detrended Correspondence Analysis (DCA) - Non-metric Multidimensional Scaling (NMDS) * Direct or Constrained Ordination - Redundancy Analysis (RDA) - Canonical Correspondence Analysis (CCA) - Hypothesis Testing - Model Comparison - Variance partitioning ## Overview of multivariate ordination methods Multivariate statistical analyses are generalizations of the univariate approaches we examined in the last lesson. In the case of a univariate approach we considered a model in which $$\hat{y_i} = b_0 1 + b_1 x_{i1} + \cdots + b_p x_{ip} + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad \text{(equ. 1)}$$ We are interested in using multivariate methods when we no longer are just interested in explaining variance in the vector *y* instead we would like to understand the dominant patterns and sources of variation in a matrix of response variables **Y** that are all of the same type (i.e., same units). Often times, we may be simply be interested in if we can represent our response matrix **Y** using a relatively small number of dimensions (without any reference to an explanatory matrix). For example, if **Y** has *n* samples (i.e., rows) and *p* descriptors (i.e., columns) then we can fully represent all the variation in **Y** using *p* hypothetical axes; however, what would a p-dimensional graph look like - any dimensionality > 3 is difficult for the human mind to comprehend. Furthermore, is that degree of complexity even needed to see the major trends in **Y**? Picture trying to interpret a pairs plot that has $p*(p-1)/2$ graphs (one graph for each unique combination of columns in **Y**). The good news is that in general, the information in matrices can be reduced down to a few major axes of variation. Ordination is a term that describes this process of reducing the information needed to represent a matrix. The term derives from the idea to ordinate or put things into order. With ordination approaches we are attempting to take a high-dimensional data matrix and explain its patterns with a small number of axes. 
Although this is a very general concept across scientific disciplines, the term ordination is generally only used in ecology. There are generally considered to be two types of ordination:

1. Indirect or unconstrained ordination, in which only a single matrix is analyzed
2. Direct or constrained ordination, in which one matrix is used to explain the variance of another matrix

The figure below (Legendre and Legendre 1998, Fig. 11.1) visually illustrates this conceptual shift from modeling a vector *y* (panel b) to a matrix **Y**. Indirect ordination does not use an explanatory matrix **X** (panel a), while direct ordination methods do (panel c).

![](./figures/Fig11-1_legendre.png)

Today we will examine both indirect and direct methods of ordination. In general, ordination is frequently used when exploring patterns in datasets graphically; however, it can also be used to carry out hypothesis testing. Despite their sometimes terrifying names, ordination methods have a close kinship with multiple regression (as illustrated in the figure above). One key difference is that the response variable is multivariate rather than univariate; however, with many approaches the underlying algebra is very similar between regression and ordination.

Overview of Ordination methods (adapted from )

```{r, echo = FALSE}
library(knitr)
# setup the R environment for knitting the markdown doc properly
opts_knit$set(root.dir='../')
ord <- read.csv('./ordination_table.csv')
names(ord) <- c("Acronym", "Name", "Type of analysis", "Algorithm", "R function")
kable(ord)
```

This [presentation](./community_structure_slides_with_notes.pdf) by [Abel Valdivia](https://twitter.com/AbelValdivia) provides a review of the types of ordination and examples of their graphical output. Additionally, this [key](http://ordination.okstate.edu/key.htm) created by Mike Palmer provides a decision tree that can help guide your choice of methods.

## Create a community matrix

The first very common challenge when working with multivariate analyses is to construct the multivariate matrix we wish to analyze. Essentially, a community matrix is a cross-tab structure in which each descriptor element (e.g., species identity) is a column id and each sample element (e.g., site identity) is a row id. A cross-tab structure is a very inefficient way to store your raw data, and in many cases we must aggregate the raw data, which can be accomplished using a for loop or other simple functions. Using the Great Smoky Mountains tree dataset that was used in HW 3, I will demonstrate how to do this with a computationally inefficient but easy-to-understand for loop. As a reminder, the metadata for the tree data is located [here](../data/tree_metadata.txt).

Quick note: In this lesson we will use the R package `dummies` to create dummy variables. This package is now defunct and no longer on the main package repository CRAN. To install it you will have to use the package `remotes` and the function `install_github`; specifically, once `remotes` is installed, use `remotes::install_github('cran/dummies')`.
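Since `dummies` is unmaintained, it is worth knowing that base R can build the same kind of dummy (indicator) matrix with `model.matrix`. A toy sketch (the factor `f` is just for illustration):

```{r}
# base R alternative to dummies::dummy(): one 0/1 column per factor level
f <- factor(c('a', 'b', 'a', 'c'))
model.matrix(~ f - 1)
```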
```{r load data}
# load relevant packages and code for today's lesson
library(vegan)
library(dummies) # this package is no longer on CRAN; see note above
source('./scripts/utility_functions.R')

# load data
tree <- read.csv('./data/treedata_subset.csv')

# create a community site x species matrix by summing species cover values
# we can do this with a for loop but it takes a while to run
uni_sp <- unique(tree$spcode)
uni_site <- unique(tree$plotID)
```

```{r for loop, eval=FALSE}
comm <- matrix(0, ncol=length(uni_sp), nrow=length(uni_site))
colnames(comm) <- uni_sp
rownames(comm) <- uni_site
for (i in seq_along(uni_sp)) {
    for (j in seq_along(uni_site)) {
        comm[j, i] <- mean(tree$cover[tree$spcode == uni_sp[i] &
                                      tree$plotID == uni_site[j]])
    }
}
comm[1:5, 1:5]
```

Alternatively, we could use the function `tapply`, which is much more efficient.

```{r tapply}
# alternatively we can use a tapply function
comm <- tapply(tree$cover,
               INDEX = list(tree$plotID, tree$spcode),
               mean)
# examine the community matrix
comm[1:5, 1:5]
# replace the NAs with zeros
comm <- ifelse(is.na(comm), 0, comm)
comm[1:5, 1:5]
```

Now that we've created our matrix, it is usually a good idea to make sure everything looks OK - the row and column sums (i.e., marginal sums) provide one reasonable, if coarse, metric of how the data are structured overall.

```{r marginal sums}
# visually explore the cover variable between species and sites
uni_sp <- unique(tree$spcode)
sp_sum <- apply(comm, 2, sum)
site_sum <- apply(comm, 1, sum)
par(mfrow=c(2,2))
hist(sp_sum)
col <- colorRampPalette(c('red', 'orange', 'blue'))
sp_cols <- col(length(uni_sp)) # one color per species
plot(sp_sum[order(sp_sum, decreasing=TRUE)], type='o', col='red', lwd=2,
     xlab='Sp Rank', ylab='Sum Cover')
hist(site_sum)
plot(site_sum[order(site_sum, decreasing=TRUE)], type='o', col='red', lwd=2,
     xlab='Site Rank', ylab='Sum Cover')
par(mfrow=c(1,1))
```

Above we can see that most species have small total covers (i.e., most species are rare), which is a fairly universal ecological law, so this seems correct. We can also see that most sites have a total cover of approximately 40, and that this distribution is not as highly skewed as the species marginal totals.

NOTE: These plots are not terribly helpful for understanding the variability in the community matrix. They are just useful for looking at gross aggregate properties of the matrix; therefore, they can be useful for spotting errors in the matrix.

Let's get our environmental variables ready to use and start to conduct some actual analyses.

## Create an explanatory matrix

In the tree dataset, each site has one set of environmental measurements. These are replicated across the rows of the `tree` data object.

```{r, echo=FALSE}
with(tree, head(tree[plotID == "ATBN-01-0403", ]))
```

So we just need to pull out the environmental data for a single one of those rows. In the end we need the explanatory matrix to have the same number of rows, in the same order, as the community matrix. We could write a for loop to do this, but we'll take the easier approach of using the function `aggregate`.

```{r create env matrix}
cols_to_keep <- c('elev', 'tci', 'streamdist', 'disturb', 'beers')
env <- aggregate(tree[ , cols_to_keep], by = list(tree$plotID),
                 function(x) x[1])
# aggregate has a side effect of adding a column to the output, which is
# the unique plotID in this case. We just need to move that so it is a
# row name instead.
row.names(env) <- env[ , 1]
env <- env[ , -1]
head(env)

# before we can use this explanatory matrix we need to check
# that its rows are in the same order as our response matrix
all.equal(rownames(comm), rownames(env))

# now that we've completed that check we can rename the rows to something
# more manageable
rownames(comm) <- 1:nrow(comm)
rownames(env) <- 1:nrow(env)
```
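If that `all.equal` check had failed, the fix would be to reorder one matrix to match the other before renaming anything. A minimal sketch (hypothetical here, since the rows already matched):

```{r, eval=FALSE}
# defensive reordering in case env rows were not already in comm's order
env <- env[match(rownames(comm), rownames(env)), ]
```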
## Indirect or Unconstrained Ordination

### Principal Components Analysis (PCA)

Principal Components Analysis (PCA) is a useful tool for either

1. examining correlations between the columns of a matrix
2. potentially reducing the set of explanatory variables included in a model

Simplified description of the PCA algorithm (from [Zelený 2018](https://www.davidzeleny.net/anadat-r/doku.php/en:pca)), with a by-hand sketch of these steps shown after the figures below:

a. Take the matrix of samples x species (or, generally, samples x descriptors), and display each sample in the multidimensional space where each dimension is defined by the abundance of one species (or descriptor). In this way, the samples will form a cloud located in the multidimensional space.
b. Calculate the centroid of the cloud.
c. Move the centers of the axes to this centroid.
d. Rotate the axes in such a way that the first axis goes through the cloud in the direction of the highest variance, the second is perpendicular to the first and goes in the direction of the second highest variance, and so on. The positions of the samples on the resulting rotated axes are the sample scores on the ordination axes.

Figure 1: PCA ordination of five samples and two species. (Fig. 9.2 from Legendre & Legendre 1998.)

Figure 2: 3D schema of the PCA ordination algorithm.

Check out the nice interactive visualization at [Explained Visually](http://setosa.io/ev/principal-component-analysis/).
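Here is a minimal by-hand sketch of steps b-d in base R. Using `eigen` on the covariance matrix is one of several equivalent ways to find the rotation; `prcomp` and vegan's `rda` (used next) do this more robustly.

```{r}
# steps b & c: center the cloud of samples on its centroid
Y_cen <- scale(comm, center = TRUE, scale = FALSE)
# step d: the eigenvectors of the covariance matrix give the rotated axes
eig <- eigen(cov(Y_cen))
# sample scores = centered data projected onto the rotated axes
scores <- Y_cen %*% eig$vectors
```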
We'll start by using PCA to simply examine the correlations in the tree species covers that are contained in the community matrix.

```{r comm pca}
tree_pca <- rda(comm)
tree_pca
```

Like the object returned by the function `lm`, the output of `rda` is just a named list. We can get a sense of the structure of this named list using the function `str`.

```{r using str}
str(tree_pca)
```

This is useful because, for example, if we wanted to pull out the eigenvalues of the analysis we would see that they are stored under `$CA$eig`. We can access each of the objects stored in a named list as we would reference columns in a `data.frame` using the `$`.

```{r eigenvalues}
tree_pca$tot.chi
tree_pca$CA$eig
# the eigenvalues sum up to equal the total inertia
sum(tree_pca$CA$eig)
# inertia in a PCA is equal to the total variance of the matrix because PCA
# uses the usual Euclidean distance as its metric of how different samples are:
# 1 / (n-1) * sum(x - xbar)^2 where n is sample size and xbar is the mean x value
# you can verify that this is equal to the sum of the variances (i.e., diagonal
# elements) in the species covariance matrix
sum(diag(cov(comm)))
# or if you want to do it by hand with matrix algebra
# first center the community matrix by subtracting the means
cen_comm <- apply(comm, 2, function(x) x - mean(x))
# then compute the covariance matrix (note the n - 1 denominator, to match cov)
cov_comm <- (nrow(comm) - 1)^-1 * t(cen_comm) %*% cen_comm
sum(diag(cov_comm))
# the ratio of the eigenvalue to the total variance is the amount of
# variance explained by each PCA axis
round(tree_pca$CA$eig / tree_pca$tot.chi, 2)
```

We can see from the above that PCA axis 1 captures approximately 20% of the total variance in the community matrix. Let's graph the data to get a better sense of the correlation structure.

```{r plot pca}
plot(tree_pca)
biplot(tree_pca)
cleanplot.pca(tree_pca)
# p120-121 of Numerical Ecology in R:
# Scaling 1 = distance biplot: the eigenvectors are scaled to unit length.
# (1) Distances among objects in the biplot are approximations of their
# Euclidean distances in multidimensional space. (2) The angles among
# descriptor vectors are meaningless.
# Scaling 2 = correlation biplot: each eigenvector is scaled to the square root
# of its eigenvalue. (1) Distances among objects in the biplot are not
# approximations of their Euclidean distances in multidimensional space.
# (2) The angles between descriptors in the biplot reflect their correlations.
# scaling 2 is the default for vegan ordination graphing functions
ordiplot(tree_pca, display = 'sp')
```

### Correspondence Analysis (CA), Detrended Correspondence Analysis (DCA), and NMDS

```{r, eval=FALSE}
# each of these different indirect ordination approaches
# has different strengths and weaknesses

# Correspondence analysis (CA), also known as Reciprocal Averaging (RA),
# examines differences in the weighted averages of the columns
# (i.e., the species in this case)
tree_ca <- cca(comm)

# Detrended correspondence analysis (DCA) is identical except it attempts
# to account for a well-known artifact of CA, the arch effect,
# by detrending subsequent axes from previous axes
tree_dca <- decorana(comm)

# Non-metric multidimensional scaling (NMDS) is unique in that you may
# specify one of a number of different distance metrics to use. By
# default the Bray-Curtis distance is used by metaMDS.
tree_mds <- metaMDS(comm)
```

NMDS maximizes the rank-order correlation between the original distance measures and distances in the ordination space. Points are moved to minimize "stress", a measure of the mismatch between the two kinds of distance.
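After fitting an NMDS it is worth checking the stress value and the Shepard diagram. A quick sketch (assuming `tree_mds` from the chunk above has been fit):

```{r, eval=FALSE}
# stress < 0.2 is a common, if rough, rule of thumb for an acceptable fit
tree_mds$stress
# Shepard diagram: ordination distances vs. original dissimilarities
stressplot(tree_mds)
```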
## Direct or Constrained Ordination

### Redundancy Analysis (RDA)

First let's carry out an RDA, which expects a linear response of each species to the environmental variables. RDA is the most direct analog of OLS regression for a multivariate response variable.

```{r, error=TRUE}
rda_tree <- rda(comm, env)
# the above breaks b/c we have a categorical factor in env
# vegan requires that we write out each term if we are not going to
# convert the factor to a dummy matrix
rda_tree <- rda(comm ~ env$elev + env$tci + env$streamdist +
                env$disturb + env$beers)
# alternatively we could use a shorthand approach
rda_tree <- rda(comm ~ . , data=env)
rda_tree
RsquareAdj(rda_tree)
```

The output above provides us with some useful information. Inertia is another name for variation, or variance in this case. "Total" refers to total variance, "Constrained" refers to the amount of variance explained by the explanatory variables, and "Unconstrained" refers to the residual variance. Constrained + Unconstrained = Total. An $R^2$ statistic can be derived simply as Constrained / Total. The function `RsquareAdj` computes $R^2$ and adjusted $R^2$. The variable "Rank" indicates the number of variables included. The eigenvalues are displayed for both the constrained and unconstrained axes, and in this context indicate how much variance each of the axes contributes.

We can plot our model result to get a sense of which species are correlated with which variables along which axes.

```{r}
plot(rda_tree, type='n', scaling=1)
orditorp(rda_tree, display='sp', cex=0.5, scaling=1, col='blue')
text(rda_tree, display='cn', col='red')
```

We interpret the plot above as we interpreted the previous ordination plots, with one important difference. The environmental variables are now displayed, and their placement indicates their loading on the two displayed RDA axes. `elev` is loading heavily on RDA1, indicating that this variable explains a larger portion of the variance associated with axis 1. The location of the species relative to the environmental variables indicates how strongly a species is associated with a particular environmental variable. So, for example, ABIEFRA or *Abies fraseri* increases as elevation increases.

### Canonical Correspondence Analysis (CCA)

Let's carry out a Canonical Correspondence Analysis (CCA) as well. CCA is appropriate for modeling unimodal or hump-shaped responses to explanatory variables (rather than linear as with RDA).

```{r}
cca_tree <- cca(comm ~ ., data=env)
# adjusted R2 for CCA is estimated by permutation (here 100 permutations)
RsquareAdj(cca_tree, 100)
anova(cca_tree, permutations = 999)
anova(cca_tree, by='margin', permutations = 999)
plot(cca_tree, type='n', scaling=1)
orditorp(cca_tree, display='sp', cex=0.5, scaling=1, col='blue')
text(cca_tree, display='bp', col='red')
```

The CCA model doesn't explain as much variation and its plots look slightly different, but the general take-home message has not changed much.

### Hypothesis testing

Now let's carry out hypothesis testing.

```{r}
anova(rda_tree, permutations=10)
anova(rda_tree, by='margin', permutations=10)
```

In a real analysis you would specify a much larger number of permutations (at least 1000). The first test examines overall model fit relative to a randomized or permuted matrix of data. The second test examines the partial effects of the individual variables included in the model.

### Model Comparison

Model comparison with direct ordination methods can be carried out using nested models of different degrees of complexity, as we did with univariate `lm`-type models.

```{r}
rda_tree_simple <- update(rda_tree, . ~ . - beers)
anova(rda_tree_simple, rda_tree)
```

The above test suggests that the more complex model, which includes the variable `beers`, is strongly supported. Note this specific example is identical to the output for the `beers` parameter when using `anova(rda_tree, by = 'margin')`. Model comparison can also be undertaken using adjusted $R^2$, which is often easier to discuss and communicate than a myriad of significance tests.
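For instance, the two nested models fit above can be compared side by side on adjusted $R^2$:

```{r}
# adjusted R^2 penalizes for the number of constraining variables
RsquareAdj(rda_tree_simple)$adj.r.squared
RsquareAdj(rda_tree)$adj.r.squared
```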
### Variation Partitioning

Lastly, let's carry out variation partitioning. We can use this approach to examine how much of the explained variance is due to different groups of variables. In other words, this approach is really only useful if you are interested in comparing the relative importance of one set of variables to another set.

```{r}
## variance partitioning

moisture <- env[ , c('elev', 'tci', 'beers', 'streamdist')]
# because the variable disturb is a factor we need to convert it into
# a dummy matrix using the function dummies::dummy
disturb <- dummy(env$disturb)

# examine the explanatory power of each class of variables
varpart(comm, moisture, disturb)
showvarparts(2)
```

The output indicates that the moisture group of variables has the largest individual fraction of explained variance (10%), whereas the disturbance group of variables explains only approximately 2%. We can also see that there are no really large fractions of shared variance, which indicates that the two groups of variables have somewhat independent effects.
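The testable fractions reported by `varpart` can be examined with partial ordination. For example, a sketch of testing the unique effect of the moisture variables while conditioning on disturbance (a partial RDA via vegan's three-matrix interface):

```{r, eval=FALSE}
# partial RDA: moisture variables constrained, disturbance held constant
anova(rda(comm, moisture, disturb), permutations = 999)
```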