---
title: "Applied Multivariate: Breaking multivariate data into groups. Part 1."
output: html_notebook
editor_options:
  chunk_output_type: inline
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r, message = FALSE}
library(tidyverse)
library(vegan)
library(cluster)
library(factoextra)
library(fpc)
```

## Cluster analysis

### The background

- Cluster analysis is a broad group of multivariate techniques used to identify homogeneous groups
    - maximizes between-group variation and minimizes within-group variation
    - outcome: reduction of observations into fewer groups
    - often used in data mining or exploratory approaches
    - works best when there are inherent discontinuities in the data
        - if the data are continuous, ordination techniques may be preferred
        - clustering may force groups that do not exist
- Occurs in three basic steps:
    1. A measure of similarity between observations is specified
    2. Using this distance (and a clustering rule), observations are grouped with either a hierarchical or a partitioning technique
    3. Once a new cluster is formed, distances between clusters are based on single linkage (minimum distance), complete linkage (maximum distance), or average linkage
- Hierarchical techniques are useful because they can reveal relationships in a nested fashion (e.g., a phylogenetic tree)
    - not efficient for large data sets (> 500 observations)
- Unlike hierarchical techniques, partitioning methods such as k-means do not require a dissimilarity matrix
- Partitioning methods follow four iterative steps:
    1. randomly assign cluster centroids
    2. assign each observation to the closest centroid
    3. recalculate the centroid after each observation is added
    4. repeat steps 2-3 until within-cluster variation is minimized
- Limitations of clustering
    - It is an exploratory or hypothesis-generating tool
    - Be careful when using mixed data types.
      Gower's distance should not be used in hierarchical analysis
    - Assumes distance measures follow a normal or multinomial distribution
    - Assumes the clustering variables are appropriate for group separation
    - Can be influenced by scale and units
    - Visual classifications are selective

### Now on to the doing

We are going to use non-ecological data in this exercise to illustrate the different types of data that can be incorporated into this type of analysis.

```{r}
data("USArrests")
glimpse(USArrests)
```

Let's scale the data so that each variable has a mean of 0 and a standard deviation of 1. Otherwise the variables measured in the largest units would dominate the distances.

```{r}
# Center and scale every column; the result is a numeric matrix
USArrests %>% scale() -> arrest.scale
head(arrest.scale)
```

Let's convert this to a distance matrix using the `factoextra::get_dist()` function (Euclidean distance by default).

```{r}
# upper and diag are passed on to stats::dist() and only affect how the object prints
arrest.scale %>% get_dist(upper = TRUE, diag = TRUE) -> arrest.dist
```

Visualizing the distance matrix

```{r}
# Convert the dist object to a full square data frame, keeping state names
arrest.dist.df <- as.data.frame(as.matrix(arrest.dist))
arrest.dist.df$row <- rownames(arrest.dist.df)

# Reshape to long format: one row per pair of states
arrest.dist.df %>%
  pivot_longer(-row, names_to = "col", values_to = "value") -> arrest_long
```

```{r}
# Heat map of pairwise distances; red = similar states, blue = dissimilar states
ggplot(data = arrest_long) +
  geom_raster(aes(x = col, y = row, fill = value)) +
  coord_equal(expand = FALSE) +
  scale_fill_gradient2(
    low = "red",
    mid = "white",
    high = "blue",
    midpoint = 3) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())
```
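
To tie the distance matrix back to the steps listed in the background, here is a minimal sketch of both routes using only base R (`dist()`, `hclust()`, `cutree()`, `kmeans()`). The object names (`arrest.sketch`, `sketch.dist`, and so on), the choice of `k = 4` groups, and the random seed are illustrative assumptions, not values taken from the notes above.

```{r}
# Sketch only: k = 4 and the seed are assumptions chosen for illustration
arrest.sketch <- scale(USArrests)                         # standardize the variables
sketch.dist <- dist(arrest.sketch, method = "euclidean")  # step 1: dissimilarity

# Step 2: hierarchical grouping under the three linkage rules from the background
linkages <- c("single", "complete", "average")
sketch.trees <- lapply(linkages, function(m) hclust(sketch.dist, method = m))
names(sketch.trees) <- linkages

# Cut each tree into k groups; one column of group labels per linkage rule
k <- 4
sketch.groups <- sapply(sketch.trees, cutree, k = k)
apply(sketch.groups, 2, table)   # group sizes under each linkage rule

# Partitioning route: k-means uses the scaled data directly, not the distance matrix
set.seed(123)
sketch.km <- kmeans(arrest.sketch, centers = k, nstart = 25)
table(sketch.km$cluster)
```

Comparing the group sizes across the three columns gives a quick sense of how much the linkage rule alone changes the result, before any decision about the number of clusters is made.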