--- title: "Applied Multivariate: Data standardization (continued)" output: html_notebook editor_options: chunk_output_type: inline --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ```{r message = FALSE} library(tidyverse) library(vegan) ``` ## Challenge from last week Step 1 ```{r message = FALSE} mydata <- read_csv("https://ndownloader.figshare.com/files/2292169") glimpse(mydata) ``` ```{r} mydata %>% filter(taxa == "Rodent") %>% group_by(month, year, plot_id, species_id) %>% summarise(N = n_distinct(record_id)) %>% group_by(year, plot_id, species_id) %>% summarise(MaxN = max(N)) %>% group_by(plot_id, species_id) %>% summarise(MaxN2 = max(MaxN)) %>% spread(species_id, MaxN2, fill = 0) %>% ungroup() -> mydata_wide mydata_wide ``` ```{r} mydata_wide %>% select(-plot_id) %>% as.matrix() -> mydata_wide2 mydata_rel <- decostand(mydata_wide2, method = "total", MARGIN = 1) head(mydata_rel) ``` Plot ```{r} mydata_wide %>% gather(Species, MaxN, AH:UR) %>% group_by(plot_id) %>% mutate(Total = sum(MaxN)) %>% arrange(plot_id, Species) %>% mutate(Prop = MaxN / Total) -> mydata_prop ``` ```{r} ggplot(data = mydata_prop) + geom_bar(aes(x = plot_id, y = Prop, fill = Species), stat = "identity", position = "stack") + theme_classic() ``` ## Data standardizations (continued from [last week](https://chrischizinski.github.io/SNR_R_Group/2017-09-15-Introtomultivariate) ### Maximums - can be applied to any range of x - Outputs will range from 0 to 1, largest values will scale to 1\ - Converts vlaues to a relative value (equalized the peak abundance) - Useful when you have differences in total abundance #### Rows - tells us the relative relationship within a site - Species with the greatest abundance will be 1 ```{r} rawdata <- matrix(c(1,1,1,3,3,1, 2,2,4,6,6,0, 10,10,20,30,30,0, 3,3,2,1,1,0, 0,0,0,20,0,0), ncol = 6, byrow = TRUE) colnames(rawdata) <- paste("species",toupper(letters[1:6]), sep = "_") rawdata rowsum_data <- rawdata / apply(rawdata,1,sum) rowmax_data <- rawdata / apply(rawdata,1,max) ``` Compare (visually) the sum and max standardizations for all species for sites 1, 3, and 5 ```{r} rowsum.df <- as.data.frame(rowsum_data) # convert to data frame rowsum.df$type <- "sum" # add a column of type rowsum.df$site <- 1:nrow(rowsum.df) # add column for site rowmax.df <- as.data.frame(rowmax_data) # convert to data frame rowmax.df$type <- "max" # add a column of type rowmax.df$site <- 1:nrow(rowmax.df) # add column for site rowall.df <- rbind(rowsum.df,rowmax.df) rowall.df %>% gather(Species, Number, contains("species")) -> rowall.long ggplot(data = rowall.long %>% filter(site %in% c(1,3,5))) + geom_bar(aes(Species, y = Number, fill = type), stat = "identity", position = "dodge") + facet_wrap(~site, ncol = 1, labeller = label_both) + theme_classic() ``` #### Columns ```{r} decostand(rawdata, method = "max", MARGIN = 2) ``` ### Z-score standadization - Can be applied to any value of x - Output can be any value - Converts values to a z-score (mean = 0, variance = 1) - Commonly used to put variables on an equal scaling - Tend to use this across sites (i.e., by columns) To do this transformation, we find the column mean and column standard deviation. Subtract mean from the cell value and divide by standard deviation ```{r} mvals <- apply(rawdata, 2, mean) sdvals <- apply(rawdata, 2, sd) centered <- sweep(rawdata, 2, colMeans(rawdata)) sweep(centered, 2, sdvals, '/') scale(rawdata, center = TRUE, scale = TRUE) decostand(rawdata, method = "standardize", MARGIN = 2) ``` ### Normalization - Can be applied to any value of x - Output will range between 0 and 1 - Important to use when some rows have large variances and some with small variance - Common standardization in Principal Component Analysis (PCA) ```{r} decostand(rawdata, method = "normalize") ``` ### Hellinger standardization - Can be applied to any value of x - Output can be any value but it tends to be below 1 - Similar to relativization by site - Hellinger distance has good statistical properties with respect to R2 and monotonicity. [Legendre and Gallagher (2001)](https://link.springer.com/article/10.1007/s004420100716) In this standardization, each element is divided by its row sum. After that, the square root of that relativized value is calculated ```{r} decostand(rawdata, method = "hellinger") ``` ### Wisconsin double standardization - Can be applied to any value of X > 0 - Output will range between 0 and 1 - Equalizes the emphasis among sample units and among species - Difficult to understand individual data after the standardization To do this, each element is divided by its column max and then divided by the row total ```{r} col_max <- apply(rawdata, 2, max) wtd_data.1 <- rawdata %*% diag(1/col_max) row_ttl <- apply(wtd_data.1,1,sum) wtd_data <- wtd_data.1/ row_ttl wisconsin(rawdata) ```