---
title: "Intro to Multivariate Stats"
output:
  html_notebook: default
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Difference between Transformations and Standardizations

- Transformations are applied to each element in a matrix
- Standardization adjust elements in a matrix by a row or column statistic 

### Create some data

```{r}

rawdata <- matrix(c(1,1,1,3,3,1,
                    2,2,4,6,6,0,
                    10,10,20,30,30,0,
                    3,3,2,1,1,0,
                    0,0,0,20,0,0), ncol = 6, byrow = TRUE)
colnames(rawdata) <- paste("species",toupper(letters[1:6]), sep = "_")

rawdata
```

### Calculating row and column statistics

#### Rows

```{r}
# Row sums
rowSums(rawdata)

apply(rawdata, 1, sum)

# Max values
apply(rawdata, 1, max)
```

#### Columns 

```{r}
# Sums
apply(rawdata, 2, sum)
colSums(rawdata)

# Max
apply(rawdata, 2, max)
```

### Monotonic transformations 

#### Log transformations 

- Useful for when you have a wide spread in data values

- Ir is important that you add 1 to values to account for zeros `log10(x+1)`

```{r}
logdata <- apply(rawdata , c(1,2), function(x) log10(x + 1))
```

```{r message = FALSE}
library(tidyverse)

hemlock <- read_csv("https://raw.githubusercontent.com/chrischizinski/SNR_R_Group/master/data/hemlock_cover.csv")


hemlock$logTsuga<- log10(hemlock$Tsuga.canadensis +1) # log transform 

glimpse(hemlock)

```

```{r}
ggplot(data = hemlock) + 
  geom_histogram(aes(Tsuga.canadensis), binwidth = 5, colour = "black", fill = "dodgerblue") + 
  coord_cartesian(ylim = c(0, 30), expand = FALSE) +
  theme_bw()
```

```{r}
ggplot(data = hemlock) + 
  geom_histogram(aes(logTsuga), bins = 30, colour = "black", fill = "red") + 
  coord_cartesian(ylim = c(0, 30), expand = FALSE) +
  theme_bw()
```

#### Power tranformations

- Square root transformation is most often used for Poisson type date (count data)
- Greater the power, the greater the compression of the data
- Flexible for a wide range of data
- Applied when the data is > 0

##### Write power function

```{r}
pwr_trans <- function(x, trans){
  x <- ifelse(x>0,x^(1/trans),0)
  return(x)
}

pwr_trans(25,2)

pwr_trans(0,2)

```

#### Display the effect of the power function

```{r}
newdata <- data.frame(x = 0:100, 
                      cubic = pwr_trans(x=0:100, trans = 3),
                      power10 = pwr_trans(x=0:100, trans = 10))

head(newdata)
```

```{r}
ggplot(data = newdata) +
  geom_line(aes(x = x, y = cubic), size = 1, colour = "blue") +
  geom_line(aes(x = x, y = power10), size = 1, colour = "red") +
  labs(y = "Value") +
  coord_cartesian(xlim = c(0,100.5), ylim = c(0,5), expand = F)+
  theme_classic()
  
```


#### Presence absence transformation

- Transforms quantitative data to non-quanitative (binary)
- Applicable to species data
- Most useful when there is not a lot of quantitative data available (i.e., LOTS of zeros)
- Severe transformation (i.e., loose lots of information) 

```{r}
library(vegan)
decostand(rawdata, method = "pa")
```

#### Arcsine transformation

Please NOTE: [The arcsine is asinine: the analysis of proportions in ecology](http://onlinelibrary.wiley.com/doi/10.1890/10-0340.1/abstract)

- Transformations on proportion data (0-1)
- Useful when you have a positive skew in data
    - Spreads the end of the scale while compressing the middle 

### Standardizations

### Sums 
- Can be applied to any range of x
- Output will range 0 - 1
- Converts values to a relative value (equalizes the area under the curve) 
- Useful when you have large difference in total abundance

#### Rows

```{r}
ttl_species <- apply(rawdata, 1, sum)
rowprop_data <- rawdata / ttl_species
  
rowprop_data

decostand(rawdata, margin = 1, method = "total")
```

#### Columns

```{r}
colprop_data <- rawdata %*% diag(1/apply(rawdata,2,sum))
```