```{r, include = F}
knitr::opts_chunk$set(echo = F, comment = NA)
```
```{r, include = F}
# a data set used in some of the plots
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
learning2014 <- learning2014[learning2014$points > 0,]
```
Linear regression and model validation
========================================================
type: sub-section
For IODS by Tuomo Nieminen
Powered by R Presentation. The code for this presentation is [here](https://raw.githubusercontent.com/TuomoNieminen/Helsinki-Open-Data-Science/master/docs/regression.Rpres).
Linear regression models
========================================================
type: prompt
incremental: false
Simple regression
```{r}
library(ggplot2)
theme_set(theme_grey(base_size = 18))
p <- qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm", col = "red")
p
```
***
Multiple regression
```{r}
library(scatterplot3d)
set.seed(666)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 2*x1 + -1*x2 + rnorm(100, sd = 2)
mycol <- rgb(t(col2rgb("blue")), alpha = 150, maxColorValue = 255)
p <- scatterplot3d(x = x1, y = x2, z = y, grid = T, box = F, mar = c(3,3,3,3), tick.marks = F, angle = 20, pch = 16, color = mycol)
p$plane3d(lm(y ~ x1 + x2))
```
What is a statistical model?
========================================================
A statistical model:
- Embodies a set of assumptions and describes the generation of a sample from a population
- Represents the data generating process
- The uncertainty related to a sample of data is described using probability distributions
Linear regression models
========================================================
Linear regression is an approach for modeling the relationship between a dependent variable $\boldsymbol{y}$ and one or more explanatory variables $\boldsymbol{X}$.
There are many applications for linear models, such as:
- Prediction or forecasting
- Quantifying the strength of the relationship between $\boldsymbol{y}$ and $\boldsymbol{X}$
Simple regression
========================================================
In a simple case, the model includes one explanatory variable $\boldsymbol{x}$
$$\boldsymbol{y} = \alpha + \beta \boldsymbol{x} + \boldsymbol{\epsilon}$$
R: `lm(y ~ x)`
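A minimal sketch of this in R, using simulated data (the object names are illustrative):
```{r, echo = TRUE}
# simulated data: y depends linearly on x plus random noise
x_sim <- rnorm(50)
y_sim <- 1 + 2*x_sim + rnorm(50)
# the estimated intercept (alpha) and slope (beta)
coef(lm(y_sim ~ x_sim))
```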
***
```{r}
library(ggplot2)
x <- rnorm(100)
y <- 1.5*x + rnorm(100)
qplot(x, y) + geom_smooth(method = "lm", se = F) +
  theme(axis.ticks = element_blank(), axis.text = element_blank())
```
Multiple regression
========================================================
The model can also include more than one explanatory variable
$$\boldsymbol{y} = \alpha + \beta_1 \boldsymbol{x}_1 + \beta_2 \boldsymbol{x}_2 + \boldsymbol{\epsilon}$$
R: `lm(y ~ x1 + x2)`
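A corresponding minimal sketch with two simulated explanatory variables (true effects 2 and -1):
```{r, echo = TRUE}
x1_sim <- rnorm(50)
x2_sim <- rnorm(50)
y_sim <- 2*x1_sim - 1*x2_sim + rnorm(50)
# one coefficient is estimated for each explanatory variable
coef(lm(y_sim ~ x1_sim + x2_sim))
```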
***
```{r}
library(scatterplot3d)
set.seed(666)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 2*x1 + -1*x2 + rnorm(100, sd = 2)
mycol <- rgb(t(col2rgb("blue")), alpha = 150, maxColorValue = 255)
p <- scatterplot3d(x = x1, y = x2, z = y, grid = T, box = F, mar = c(3,3,3,3), tick.marks = F, angle = 20, pch = 16, color = mycol)
p$plane3d(lm(y ~ x1 + x2))
```
Assumptions of linear regression models
========================================================
In linear regression, it is assumed that the relationship between the target variable $\boldsymbol{y}$ and the parameters ($\alpha$, $\boldsymbol{\beta}$) is *linear*:
$$\boldsymbol{y} = \boldsymbol{\alpha} + \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$$
- The goal is to estimate the parameters $\alpha$ and $\boldsymbol{\beta}$, which describe the relationship with the explanatory variables $\boldsymbol{X}$
- An unobservable random variable ($\boldsymbol{\epsilon}$) is assumed to add noise to the observations
- Often it is reasonable to assume $\boldsymbol{\epsilon} \sim N(0, \sigma^2)$
Structure of a linear model
========================================================
In the simple linear equation $\boldsymbol{y} = \alpha + \beta \boldsymbol{x} + \boldsymbol{\epsilon}$
- $\boldsymbol{y}$ is the target variable: we wish to predict the values of $\boldsymbol{y}$ using the values of $\boldsymbol{x}$
- $\alpha + \beta \boldsymbol{x}$ is the systematic part of the model
- $\beta$ quantifies the relationship between $\boldsymbol{y}$ and $\boldsymbol{x}$
- $\boldsymbol{\epsilon}$ describes the errors (or the uncertainty) of the model
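The same structure can be used to simulate data; a minimal sketch with illustrative parameter values $\alpha = 1$ and $\beta = 2$:
```{r, echo = TRUE}
alpha <- 1; beta <- 2                  # parameters of the systematic part
x_sim <- rnorm(100)                    # explanatory variable
epsilon <- rnorm(100, sd = 0.5)        # random errors
y_sim <- alpha + beta*x_sim + epsilon  # target variable
```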
Finding the model
========================================================
The best model is found by minimizing the prediction errors that the model would make
- $\hat{\boldsymbol{y}} = \hat{\alpha} + \hat{\beta} \boldsymbol{x}$ are the predictions
- $\boldsymbol{\hat{\epsilon}} = \boldsymbol{y} - \hat{\boldsymbol{y}}$ are the prediction errors, called residuals
- The model is found by minimizing the sum of squared residuals, as sketched below
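A minimal sketch of computing the residuals and their sum of squares by hand on simulated data:
```{r, echo = TRUE}
x_sim <- rnorm(50)
y_sim <- 1 + 2*x_sim + rnorm(50)
fit_sim <- lm(y_sim ~ x_sim)
res <- y_sim - fitted(fit_sim)  # residuals: observed minus predicted
sum(res^2)                      # the quantity lm() minimizes
```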
***
```{r}
library(ggplot2)
n <- 50
x <- rnorm(n)
y <- 1.5*x + rnorm(n)
mod <- lm(y ~ x)
df <- data.frame(x = x, y = y, Fitted = fitted(mod))
ggplot(df, aes(x, y)) +
  geom_point(col = "red") +
  geom_smooth(method = "lm", se = F) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  geom_segment(aes(xend = x, yend = Fitted))
```
Interpreting the parameters
========================================================
When the model is $$\boldsymbol{y} = \alpha + \beta_1 \boldsymbol{x}_1 + \beta_2 \boldsymbol{x}_2 + \boldsymbol{\epsilon}$$
- The main interest is to estimate the $\boldsymbol{\beta}$ parameters
- Interpretation of an estimate $\hat{\beta_1} = 2$:
- When $x_1$ increases by one unit, the average change in $y$ is 2 units, given that the other variables (here $x_2$) do not change.
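A minimal sketch of this interpretation using `predict()` on simulated data where the true $\beta_1$ is 2:
```{r, echo = TRUE}
x1_sim <- rnorm(100); x2_sim <- rnorm(100)
y_sim <- 2*x1_sim - 1*x2_sim + rnorm(100)
fit_sim <- lm(y_sim ~ x1_sim + x2_sim)
# increase x1 by one unit while x2 stays fixed:
# the predicted y changes by approximately beta1-hat
new_obs <- data.frame(x1_sim = c(0, 1), x2_sim = c(0, 0))
diff(predict(fit_sim, newdata = new_obs))
```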
R linear model summary()
========================================================
class: small-code
incremental: false
For a quick rundown of interpreting R's regression summary, see the 'Calling summary' section of [this blog post](http://blog.yhat.com/posts/r-lm-summary.html) or read about coefficients and p-values [here](http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit)
```{r}
some_variable <- rnorm(30)
Y <- 2*some_variable + rnorm(30, sd = 3)
summary(lm(Y ~ some_variable))
```
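The same numbers can also be accessed programmatically; a minimal sketch reusing the variables from the chunk above:
```{r, echo = TRUE}
fit <- lm(Y ~ some_variable)
# estimates, standard errors, t values and p-values as a matrix
coef(summary(fit))
```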
Advanced example: Polynomial terms
========================================================
The linearity assumption is not as restrictive as one might first think.
It is possible to add polynomial terms to the model if the effect of a variable is non-linear:
$$\boldsymbol{y} = \alpha + \beta_1 \cdot \boldsymbol{x} + \beta_2 \cdot \boldsymbol{x}^2 + \boldsymbol{\epsilon}$$
R: `lm(y ~ x + I(x^2))`
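A minimal sketch on simulated data; note that the model is still linear in its *parameters*:
```{r, echo = TRUE}
x_sim <- rnorm(100)
y_sim <- 1.5*x_sim - x_sim^2 + rnorm(100)
coef(lm(y_sim ~ x_sim + I(x_sim^2)))
```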
***
```{r}
x <- rnorm(100)
x2 <- x^2
y <- 1.5*x + -1*x2 + rnorm(100)
qplot(x, y) + geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = F) +
  theme(axis.ticks = element_blank(), axis.text = element_blank())
```
Model validation
========================================================
type: prompt
```{r, fig.align = "center", fig.width = 14, fig.height = 8}
library(ggfortify)
lm_fit <- lm(points ~ attitude, data = learning2014)
autoplot(lm_fit, which = c(1,2,5), smooth.linetype = "blank")
```
Model assumptions
========================================================
A statistical model always includes several assumptions which describe the data generating process.
- How well the model describes the phenomenon of interest depends on how well the assumptions fit reality
- In a linear regression model an obvious assumption is linearity: The target variable is modelled as a linear combination of the model parameters.
- Usually it is assumed that the errors are normally distributed
Assumptions of linear regression models
========================================================
Analyzing the *residuals* of the model provides a way to explore the validity of the model assumptions. Several important assumptions are packed into the expression
$$\boldsymbol{\epsilon} \sim N(0, \sigma^2)$$
- The errors are normally distributed
- The errors are not correlated
- The errors have constant variance, $\sigma^2$
- The size of a given error does not depend on the explanatory variables
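A minimal sketch of producing the standard diagnostic plots from a fitted model with base R (code only; the following slides draw the same plots with ggfortify):
```{r, echo = TRUE, eval = FALSE}
fit <- lm(points ~ attitude, data = learning2014)
# residuals vs fitted values, normal QQ-plot, residuals vs leverage
plot(fit, which = c(1, 2, 5))
```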
Normality of the errors (QQ-plot)
========================================================
A QQ-plot of the residuals provides a way to explore the assumption that the errors of the model are normally distributed.
```{r, fig.width = 14, fig.height = 7}
set.seed(444)
x <- rpois(40, lambda = 10)
y1 <- x + rnorm(40)
y2 <- x + x*rnorm(40)
lm1 <- lm(y1~x)
lm2 <- lm(y2~x)
p1 <- autoplot(lm1, which = 2, size = 3, smooth.linetype = "blank") + ggtitle("Reasonable")
p2 <- autoplot(lm2, which = 2, size = 3, smooth.linetype = "blank") + ggtitle("Questionable")
# arrange the two QQ-plots side by side
gridExtra::grid.arrange(p1@plots[[1]], p2@plots[[1]], ncol = 2)
```
Constant variance of errors
========================================================
The constant variance assumption implies that the size of the errors should not depend on the explanatory variables.
- This can be explored with a simple scatter plot of residuals versus model predictions
- **Any** pattern in the scatter plot implies a problem with the model assumptions
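A minimal sketch of drawing such a residuals vs fitted values plot directly (code only):
```{r, echo = TRUE, eval = FALSE}
fit <- lm(points ~ attitude, data = learning2014)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```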
***
```{r, fig.height = 8}
library(gridExtra)
set.seed(444)
n <- 100
x <- abs(rnorm(n, sd = 1.5))
y1 <- x + rnorm(n)
y2 <- sort(x) + sapply(sort(x), function(s) rnorm(n = 1, sd = s))
lm1 <- lm(y1~x)
lm2 <- lm(y2~sort(x))
p1 <- ggplot(fortify(lm1), aes(.fitted, .resid)) + geom_point(size = 3) + ggtitle("Reasonable") + ylim(c(-5,5))
p2 <- ggplot(fortify(lm2), aes(.fitted, .resid)) + geom_point(size = 3) + ggtitle("Questionable") + ylim(c(-5,5))
grid.arrange(p1,p2, ncol = 1)
```
Leverage of observations (1)
========================================================
Leverage measures how much impact a single observation has on the model.
- A residuals vs leverage plot can help identify which observations have an unusually high impact.
- The next two slides show four examples.
- Each row of two plots defines a *data - model validation* pair.
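Leverage and influence can also be inspected numerically; a minimal sketch using the same data set as in the model validation example:
```{r, echo = TRUE}
fit <- lm(points ~ attitude, data = learning2014)
# hat values measure leverage, Cook's distance measures influence
head(sort(hatvalues(fit), decreasing = TRUE), 3)
head(sort(cooks.distance(fit), decreasing = TRUE), 3)
```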
Leverage of observations (2)
========================================================
```{r}
library(ggplot2)
library(broom)
library(gridExtra)
set.seed(666)
n <- 30
# regular relationship
x1 <- rnorm(n, mean = 2)
y1 <- x1 + rnorm(n)
df1 <- augment(lm(y1 ~ x1))
# unexpectedly high x value with no error
x2 <- c(x1[-n], 2*max(x1))
y2 <- x2 + c(rnorm(n-1),0)
df2 <- augment(lm(y2 ~ x2))
# unexpectedly high y value at the mean of x
x3 <- c(x1[-n], mean(x1))
y3 <- c(y1[-n], 2*max(y1))
df3 <- augment(lm(y3 ~ x3))
# high error high leverage
x4 <- x2
y4 <- c(y1[-which.max(x4)], min(y1))
df4 <- augment(lm(y4 ~ x4))
# plots
size <- 3
my_plot <- function(x, y, size = 3) {
  # build a data frame from the two vectors and draw a scatter plot with a regression line
  df <- data.frame(x = x, y = y)
  ggplot(df, aes(x = x, y = y)) +
    geom_point(size = size) +
    geom_smooth(method = "lm", se = F)
}
sc1 <- my_plot(x1, y1) + ggtitle("data: regular errors")
sc2 <- my_plot(x2, y2) + ggtitle("data: regular errors with an outlier")
sc3 <- my_plot(x3, y3) + ggtitle("data: high error with an outlier at the mean")
sc4 <- my_plot(x4, y4) + ggtitle("data: high error with an outlier")
lv1 <- ggplot(data = df1, aes(.cooksd, .std.resid)) + geom_point(size = size) +
ggtitle("diagnostic: regular leverage")
lv2 <- ggplot(data = df2, aes(.cooksd, .std.resid)) + geom_point(size = size) +
ggtitle("diagnostic: high leverage")
lv3 <- ggplot(data = df3, aes(.cooksd, .std.resid)) + geom_point(size = size) +
ggtitle("diagnostic: regular leverage")
lv4 <- ggplot(data = df4, aes(.cooksd, .std.resid)) + geom_point(size = size) +
ggtitle("diagnostic: high leverage")
```
```{r, fig.width = 14, fig.height = 9}
grid.arrange(sc1, lv1, sc2, lv2, nrow = 2)
```
Leverage of observations (3)
========================================================
```{r, fig.width = 14, fig.height = 9}
grid.arrange(sc3,lv3, sc4, lv4, nrow = 2)
```