---
title: "Applied Multivariate:  Distance matrices"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
.libPaths("P:/RLibrary2")
```

Load the libraries we will use today

```{r message = FALSE}
library(tidyverse)
library(vegan)
library(cluster)
library(RColorBrewer)
```

### Similarity and distances

To illustrate the concept of similarity and distance, lets envison a data matrix with 4 sites and 2 species

```{r}
hyp_data <- matrix(c(1,9,1,8,6,6,9,1), byrow = TRUE, ncol = 2)
colnames(hyp_data)<-c("SpeciesA", "SpeciesB")
hyp_data
```

Lets plot these in 2 dimensions to show the relationships

```{r}
ggplot(data = as.data.frame(hyp_data)) + 
  geom_point(aes(x = SpeciesA, y=SpeciesB),size = 3, colour = "red") +
  geom_text(aes(x = SpeciesA, y=SpeciesB, label= paste("Site",1:4))) +
  coord_cartesian(xlim = c(0,10), ylim = c(0,10), expand = F) +
  theme_classic()
```

How can we quantify that distance?  One of the simplest methods is the Euclidean distance

```{r}
euc_distance <- function(x,y){
  x = x[2] - x[1]
  y = y[2] - y[1]
  h = sqrt(x^2 + y^2)
  return(h)
}

euc_distance(hyp_data[c(2,3),"SpeciesA"], hyp_data[c(2,3),"SpeciesB"])
```

The problem with this function that we wrote is that its not easily able to calculate all the distances.

```{r}
dist(hyp_data)

dist(hyp_data, diag = TRUE, upper = TRUE)
```

```{r}
hyp_data2 <- cbind(data.frame(hyp_data), data.frame(x = 9, y = 1))

ggplot(data = as.data.frame(hyp_data)) + 
  geom_segment(data = hyp_data2, aes(x = x, y = y, xend = SpeciesA, yend = SpeciesB), linetype = "dashed") +
  geom_point(aes(x = SpeciesA, y=SpeciesB),size = 3, colour = "red") +
  geom_text(aes(x = SpeciesA, y=SpeciesB, label= paste("Site",1:4))) +
  coord_cartesian(xlim = c(0,10), ylim = c(0,10), expand = F) +
  theme_classic()
```

### Common distance measures 

There are approximately 30 similarity or distances commonly used. [Legendre and Legendre 2012](http://www.ievbras.ru/ecostat/Kiril/R/Biblio/Statistic/Legendre%20P.,%20Legendre%20L.%20Numerical%20ecology.pdf) 

The choice of which distance you are going to use depends on the data type and the type of analysis you will do. 

#### Euclidean distance

$$ ED_{ij} = \sum_{i = 1}^p \sqrt{(x_{ij} - x_{ik})^2} $$
- Most appealing measure because it has true "metric" properties
- Column standardization to remove potential issues with scale
- Applied to any data of any scale
- Used in eigenvector ordinations (e.g., PCA)
- Assume that variables are not correlated
- Emphasizes outliers 
- Loose sensitivity with heterogeneous data 
- Distances are not proportional 

```{r}
vegdist(hyp_data, method = "euclidean")
```

```{r}
euc_dist<-as.data.frame(as.matrix(vegdist(hyp_data, method = "euclidean", diag = TRUE, upper = TRUE)))

euc_dist$row <-rownames(euc_dist)

euc_dist %>% 
  gather(col, value, -row) %>% 
  mutate(col = as.numeric(col),
         row = as.numeric(row)) -> euc_long

ggplot(data = euc_long) + 
  geom_raster(aes(x = col, y = row, fill = value))+
  coord_equal(expand = F) +
  scale_fill_gradient2(low = "red", high = "blue", mid = "white", midpoint = 5) +
  theme_classic()

```

#### City block (Manhattan) distance

- Most ecologically meaningful dissimilarities are Manhattan types
- Less weight to outliers compared to Euclidean
- Retains sensitivity with heterogenous data
- Distances are not proportional 

```{r}
vegdist(hyp_data, method = "manhattan")
```

#### Proportional distances

- Manhattan distances expressed as a proportion of the max distance
- 2 communities with nothing in common would be 1

```{r}
max_dist <- max(vegdist(hyp_data, method = "manhattan"))

vegdist(hyp_data, method = "manhattan")/max_dist
```

#### Sorensen or Bray-Curtis distance

- Percent dissimilarity
- Commonly used with species abundance but it can be used with data of any scale
- Gives less weight to outliers than euclidean
- Retains sensitivity with heteregenous data
- Max when no species are in common
- NOT metric and can not be used with DA, PCA, or CCA

```{r}
vegdist(hyp_data, method = "bray")
```

Some other proportional distances exist and differ how they weigh the dissimilarity.  Two examples are 

- Jaccards distance

```{r}
vegdist(hyp_data, method = "jaccard")
```

- Kulczynski distance 

```{r}
vegdist(hyp_data, method = "kulczynski")
```

### Euclidean distances based on species profiles

#### Chord distance 

- Similar conceptually to euclidean, but data are row normalized
- Useful in species abundance because it removes differences in total abundance
- Gives low weights to variables with low counts and many zeros

```{r}
vegdist(decostand(hyp_data, method = "normalize"), method = "euclidean")
```

#### Chi-square distances

- Euclidean distances after completing a chi-square standardization
- Distance used in correspondance analysis (CA) and canonical correspondance analysis (CCA)


```{r}
vegdist(decostand(hyp_data, method = "chi.square"), method = "euclidean")
```

#### Species profile distance 

- Euclidean distances on relative abundance
- Variables with higher values and fewer zeros contribute more to the distance

```{r}

vegdist(decostand(hyp_data, method = "total", MARGIN = 1), method = "euclidean")
```

#### Hellinger distance

- Euclidean distance on the hellinger standardization
- Give low weights to variables with low counts and many zeros

```{r}
vegdist(decostand(hyp_data, method = "hellinger"), method = "euclidean")
```

### Distances on binary data

```{r}
set.seed(1234)
pa_data <- matrix(c(sample(c(0,1), 8, replace = TRUE),
                    sample(c(0,1), 8, prob = c(0.65, 0.35),replace = TRUE)),
                  byrow = TRUE, ncol = 4)
pa_data


```

```{r}
vegdist(pa_data, binary = TRUE, method = "jaccard")
```

#### Binomial

- Null hypothesis  two communites are equal

```{r}
vegdist(pa_data, binary = TRUE, method = "binomial")
```

#### Raup 
- Probablistic index based on presence/absence data
- Non-metric

```{r}
vegdist(pa_data, binary = TRUE, method = "raup")
```


#### Categorical and mixed data 

#### Gowers distance
- For each variable, a particular distance metric that works well for that data type and is used to scale between 0-1
- Then a linear combination of those user specied weights (most simply an average) is calculated to create the final distance matrix 
    - for quantitative data = range normalzed Manhattan distance
    - ordinal = variable is first ranked then Manhattan with adjustment for ties
    - nominal = variables of k categories are first converted into k binary columns and then a Dice coefficient is used
    
```{r}
set.seed(123)
# Create a nomial variable 
nom <- factor(rep(letters[1:3], each = 3))

# Create some binary data
bin <- as.matrix(replicate(2, rep(sample(c(0,1), 9, replace = TRUE))))

#Numerical variables
vars <- as.matrix(replicate(3, rnorm(9)))

df <- data.frame(nom, bin, vars)


as.matrix(daisy(df, metric = "gower", type = list(asym = c(2,3))))
```