---
title: "Applied Multivariate:  Data standardization (continued)"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r message = FALSE}
library(tidyverse)
library(vegan)
```


## Challenge from last week

Step 1

```{r message = FALSE}
mydata <- read_csv("https://ndownloader.figshare.com/files/2292169")
glimpse(mydata)
```


```{r}
mydata %>% 
  filter(taxa == "Rodent") %>% 
  group_by(month, year, plot_id, species_id) %>% 
  summarise(N = n_distinct(record_id)) %>% 
  group_by(year, plot_id, species_id) %>% 
  summarise(MaxN = max(N)) %>% 
  group_by(plot_id, species_id) %>% 
  summarise(MaxN2 = max(MaxN)) %>% 
  spread(species_id, MaxN2, fill = 0) %>%
  ungroup()  ->  mydata_wide

mydata_wide
```

```{r}
mydata_wide %>% 
  select(-plot_id) %>% 
  as.matrix() -> mydata_wide2

mydata_rel <- decostand(mydata_wide2, method = "total", MARGIN = 1)
head(mydata_rel) 
```

Plot

```{r}
mydata_wide %>% 
  gather(Species, MaxN, AH:UR) %>% 
  group_by(plot_id) %>% 
  mutate(Total = sum(MaxN)) %>% 
  arrange(plot_id, Species) %>% 
  mutate(Prop = MaxN / Total) -> mydata_prop
```

```{r}
ggplot(data = mydata_prop) + 
  geom_bar(aes(x = plot_id, y = Prop, fill = Species), stat = "identity", position = "stack") + 
  theme_classic()
```

## Data standardizations (continued from [last week](https://chrischizinski.github.io/SNR_R_Group/2017-09-15-Introtomultivariate)

### Maximums

- can be applied to any range of x
- Outputs will range from 0 to 1, largest values will scale to 1\
- Converts vlaues to a relative value (equalized the peak abundance)
- Useful when you have differences in total abundance 

#### Rows

- tells us the relative relationship within a site
- Species with the greatest abundance will be 1

```{r}
rawdata <- matrix(c(1,1,1,3,3,1,
                    2,2,4,6,6,0,
                    10,10,20,30,30,0,
                    3,3,2,1,1,0,
                    0,0,0,20,0,0), ncol = 6, byrow = TRUE)
colnames(rawdata) <- paste("species",toupper(letters[1:6]), sep = "_")

rawdata

rowsum_data <- rawdata / apply(rawdata,1,sum)
rowmax_data <- rawdata / apply(rawdata,1,max)
```

Compare (visually) the sum and max standardizations for all species for sites 1, 3, and 5

```{r}
rowsum.df <- as.data.frame(rowsum_data) # convert to data frame
rowsum.df$type <- "sum" # add a column of type
rowsum.df$site <- 1:nrow(rowsum.df) # add column for site

rowmax.df <- as.data.frame(rowmax_data) # convert to data frame
rowmax.df$type <- "max" # add a column of type
rowmax.df$site <- 1:nrow(rowmax.df) # add column for site

rowall.df <- rbind(rowsum.df,rowmax.df)

rowall.df %>% 
  gather(Species, Number, contains("species")) -> rowall.long

ggplot(data = rowall.long %>%  filter(site %in% c(1,3,5))) +
  geom_bar(aes(Species, y = Number, fill = type), stat = "identity", position = "dodge") + 
  facet_wrap(~site, ncol = 1, labeller = label_both) + 
  theme_classic()
```

#### Columns

```{r}
decostand(rawdata, method = "max", MARGIN = 2)
```

### Z-score standadization

- Can be applied to any value of x
- Output can be any value
- Converts values to a z-score (mean = 0, variance = 1)
- Commonly used to put variables on an equal scaling 
- Tend to use this across sites (i.e., by columns)

To do this transformation, we find the column mean and column standard deviation. Subtract mean from the cell value and divide by standard deviation

```{r}
mvals <- apply(rawdata, 2, mean)
sdvals <- apply(rawdata, 2, sd)

centered <- sweep(rawdata, 2, colMeans(rawdata))
sweep(centered, 2, sdvals, '/')

scale(rawdata, center = TRUE, scale = TRUE)

decostand(rawdata, method = "standardize", MARGIN = 2)
```

### Normalization 
- Can be applied to any value of x
- Output will range between 0 and 1
- Important to use when some rows have large variances and some with small variance
- Common standardization in Principal Component Analysis (PCA)

```{r}
decostand(rawdata, method = "normalize")
```

### Hellinger standardization 
- Can be applied to any value of x
- Output can be any value but it tends to be below 1
- Similar to relativization by site
- Hellinger distance has good statistical properties with respect to R<sup>2</sup> and monotonicity.  [Legendre and Gallagher (2001)](https://link.springer.com/article/10.1007/s004420100716)

In this standardization, each element is divided by its row sum. After that, the square root of that relativized value is calculated 

```{r}
decostand(rawdata, method = "hellinger")
```

### Wisconsin double standardization

- Can be applied to any value of X > 0
- Output will range between 0 and 1
- Equalizes the emphasis among sample units and among species
- Difficult to understand individual data after the standardization 

To do this, each element is divided by its column max and then divided by the row total

```{r}
col_max <- apply(rawdata, 2, max)

wtd_data.1 <- rawdata %*% diag(1/col_max)

row_ttl <- apply(wtd_data.1,1,sum)

wtd_data <- wtd_data.1/ row_ttl

wisconsin(rawdata)
```