---
title: "Correlation"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
library(tidyverse)
library(gridExtra)
library(RCurl)
library(MASS)
```
The RMarkdown file for this lesson can be found [here](https://raw.githubusercontent.com/chrischizinski/SNR_R_Group/master/Rmd/2016-10-20-Correlation.Rmd)
## Correlation analysis
Consider a study, where we are interested in the relationship between two random variables.
### Bivariate normal distribution
We need to think of our data as a population of \\( y_{i1} \\) and \\(y_{i2} \\) pairs (a joint distribution of two variables or **a bivariate distribution**).
The bivariate normal distribution is defined by the mean and standard deviation of each variable and a parameter called the correlation coefficient, which measures the strength of the relationship between the two variables. A bivariate normal distribution implies that the individual variables are also normally distributed and also implies that any relationship between the two variables is a linear one.
```{r}
set.seed(12345)
mean.x = 70 # mean of variable 1
sd.x=3 # sd of variable 1
mean.y=162 # mean of variable 2
sd.y=14 # sd of variable 1
r=0.55 # correlation between the two
z1 <- rnorm(100)
z2 <- rnorm(100)
x1 <- sqrt(1-r^2)*sd.x*z1 + r*sd.x*z2 + mean.x
y1 <- sd.y*z2 + mean.y
r=0.90 # correlation between the two
x2 <- sqrt(1-r^2)*sd.x*z1 + r*sd.x*z2 + mean.x
y2 <- sd.y*z2 + mean.y
data_for_plot<- rbind(data.frame(x = x1, y = y1, r =0.55),
data.frame(x = x2, y = y2, r = 0.90))
```
```{r}
data_for_plot %>%
gather(var, val, x:y) %>%
ggplot() +
geom_density(aes(x=val, fill = var), alpha = 0.55) +
facet_wrap(~r, ncol = 1,labeller = label_both) +
theme_bw()
ggplot(data=data_for_plot) +
geom_point(aes(x = x, y = y), color = "red", size =1) +
facet_wrap(~r, ncol = 1,labeller = label_both) +
theme_bw()
```
### Covariance and correlation
Covariance is the linear relationship between two continuous variables.
#### Covariance
$$ s_{Y1Y2} = \frac{\sum_{i =1}^n (y_{i1} - \bar{y_1})(y_{i2} - \bar{y_2})}{n-1} $$
and goes from \\( -\infty \\) to \\( +\infty \\)
```{r}
#Covariance
dev.x<- x1 - mean(x1)
dev.y<- y1 - mean(y1)
sum(dev.x * dev.y)/ (length(dev.x) - 1)
cov(x1,y1)
```
One problem with covariance is that the absolute magnitude depends on the units of the two variables
```{r}
cov(x1,y1)
cov(x1*1000,y1*100)
```
### Pearson correlation
The covariance can be standardized by dividing by the standard deviations of the two variables so that the value range between -1 and +1. This is called the Pearson (product-moment) correlation.
$$ s_{Y1Y2} = \frac{\sum_{i =1}^n (y_{i1} - \bar{y_1})(y_{i2} - \bar{y_2})}{\sqrt{\sum_{i =1}^n (y_{i1} - \bar{y_1})^2 \sum_{i =1}^n (y_{i2} - \bar{y_2})^2}} $$
The Pearson correlation measures the "strength" of the linear relationship between the two continuous variables.
```{r}
sum(dev.x * dev.y)/ (sqrt(sum(dev.x^2) * sum(dev.y^2)))
cor(x1,y1, method = "pearson")
cor.test(x1,y1, method = "pearson")
```
Remember up above when we generated `x1` and `y1` that we used a correlation value, `r`, of 0.55.
### Robust correlation (Spearman's rank correlation)
We may have a situation where the joint distribution of our two variables is not bivariate normal.
- non-normality in either variable
- monotonic relationships that are not linear
```{r}
r_x1<-rank(x1)
r_y1<-rank(y1)
cor(r_x1, r_y1)
cor(x1, y1, method ="spearman")
# Kendalls tau is based on concordant and disconcordant pairs. is more conservative than spearman
cor(x1, y1, method ="kendall")
```
### Parametric and non-parametric confidence regions
When representing a bivariate relationship with a scatterplot, it is often useful to include confidence regions. The confidence region is the region within which we would expect the observation represented by the population mean of the two variables to occur a percent of the time under repeated sampling from this population.
```{r}
crabs<-read_csv(getURL("https://raw.githubusercontent.com/chrischizinski/SNR_R_Group/master/data/ExperimentalDesignData/chpt5/green.csv"))
crabs$SITE<-factor(crabs$SITE, levels = c("LS","DS"))
ggplot(data = crabs) +
geom_point(aes(x = TOTMASS, y = BURROWS), size = 2) +
facet_wrap(~SITE, ncol = 2) +
coord_cartesian(xlim = c(0,8), ylim = c(0, 120), expand = F) +
scale_x_continuous(breaks = seq(0,8, by=2)) +
labs(x = "Total crab biomass", y = "Number of burrows") +
theme_bw()
```
#### Confidence ellipse
Assuming our two variables follow a bivariate normal distribution, the confidence band will always be an ellipse centered on the sample means of \\( y_{i1} \\) and \\(y_{i2} \\) and the orientation of the ellipse is determined by the covariance or correlation.
```{r}
p<-ggplot(data = crabs) +
geom_point(aes(x = TOTMASS, y = BURROWS), size = 2) +
stat_ellipse(geom = "polygon", aes(x = TOTMASS, y = BURROWS), type = "norm", level = 0.95, fill = "pink", alpha = 0.25, color = "black") +
facet_wrap(~SITE, ncol = 2) +
coord_cartesian(xlim = c(0,8), ylim = c(0, 120), expand = F) +
scale_x_continuous(breaks = seq(0,8, by=2)) +
labs(x = "Total crab biomass", y = "Number of burrows") +
theme_bw()
p
```
#### Kernel density
Sometimes we are not interested in the population mean of \\( y_{i1} \\) and \\(y_{i2} \\) but we just want a confidence interval based on the observed data. The kernel density for a value of *y
* is the sum of hte estimates from a series of symmetrical distributuoins fitted to groups of local observations. Note that they are not constrained to an elliptical shape.
```{r}
## For site LS
dens1 <- kde2d(crabs$TOTMASS[crabs$SITE =="LS"], crabs$BURROWS[crabs$SITE =="LS"], n = 100, lims = c(0,8,0,120))
densdf <- data.frame(expand.grid(TOTMASS = dens1$x, BURROWS =dens1$y), z = as.vector(dens1$z), SITE = "LS")
densdf$SITE<-factor(densdf$SITE, levels = c("LS","DS"))
getLevel <- function(x,y,prob=0.95) {
kk <- MASS::kde2d(x,y)
dx <- diff(kk$x[1:2])
dy <- diff(kk$y[1:2])
sz <- sort(kk$z)
c1 <- cumsum(sz) * dx * dy
approx(c1, sz, xout = 1 - prob)$y
}
L95 <- getLevel(crabs$TOTMASS[crabs$SITE=="LS"],crabs$BURROWS[crabs$SITE=="LS"])
## For site DS
dens2 <- kde2d(crabs$TOTMASS[crabs$SITE =="DS"], crabs$BURROWS[crabs$SITE =="DS"], n = 100, lims = c(0,8,0,120))
densdf2 <- data.frame(expand.grid(TOTMASS = dens2$x, BURROWS =dens2$y), z = as.vector(dens2$z), SITE = "DS")
densdf2$SITE<-factor(densdf2$SITE, levels = c("LS","DS"))
L952 <- getLevel(crabs$TOTMASS[crabs$SITE=="DS"],crabs$BURROWS[crabs$SITE=="DS"])
ggplot(data = crabs) +
geom_point(aes(x = TOTMASS, y = BURROWS), size = 2) +
geom_contour(data=densdf,aes(x = TOTMASS, y = BURROWS,z=z), breaks=L95, color="red", linetype = "dashed") +
geom_contour(data=densdf2,aes(x = TOTMASS, y = BURROWS,z=z), breaks=L952, color="blue", linetype = "dashed") +
facet_wrap(~SITE, ncol = 2) +
coord_cartesian(xlim = c(0,8), ylim = c(0, 120), expand = F) +
scale_x_continuous(breaks = seq(0,8, by=2)) +
labs(x = "Total crab biomass", y = "Number of burrows") +
theme_bw()
```