```{r, include = F}
knitr::opts_chunk$set(echo = F, comment = NA)
```

```{r, include = F}
# a data set used on some plots
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
learning2014 <- learning2014[learning2014$points > 0, ]
```

Linear regression and model validation
========================================================
type: sub-section

For IODS by Tuomo Nieminen



Powered by Rpresentation. The code for this presentation is [here.](https://raw.githubusercontent.com/TuomoNieminen/Helsinki-Open-Data-Science/master/docs/regression.Rpres)

Linear regression models
========================================================
type: prompt
incremental: false

Simple regression

```{r}
library(ggplot2)
theme_set(theme_grey(base_size = 18))
p <- qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm", col = "red")
p
```

***

Multiple regression

```{r}
library(scatterplot3d)
set.seed(666)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 2*x1 + -1*x2 + rnorm(100, sd = 2)
mycol <- rgb(t(col2rgb("blue")), alpha = 150, maxColorValue = 255)
p <- scatterplot3d(x = x1, y = x2, z = y, grid = T, box = F, mar = c(3,3,3,3),
                   tick.marks = F, angle = 20, pch = 16, color = mycol)
p$plane3d(lm(y ~ x1 + x2))
```

What is a statistical model?
========================================================

A statistical model:

- Embodies a set of assumptions and describes the generation of a sample from a population
- Represents the data generating process
- Describes the uncertainty related to a sample of data using probability distributions

Linear regression models
========================================================

Linear regression is an approach for modeling the relationship between a dependent variable $\boldsymbol{y}$ and one or more explanatory variables $\boldsymbol{X}$.

There are many applications for linear models, such as:

- Prediction or forecasting
- Quantifying the strength of the relationship between $\boldsymbol{y}$ and $\boldsymbol{X}$

Simple regression
========================================================

In a simple case, the model includes one explanatory variable $\boldsymbol{x}$:

$$\boldsymbol{y} = \alpha + \beta \boldsymbol{x} + \boldsymbol{\epsilon}$$

R: `lm(y ~ x)`

***

```{r}
library(ggplot2)
x <- rnorm(100)
y <- 1.5*x + rnorm(100)
qplot(x, y) + geom_smooth(method = "lm", se = F) +
  theme(axis.ticks = element_blank(), axis.text = element_blank())
```

Multiple regression
========================================================

The model can also include more than one explanatory variable:

$$\boldsymbol{y} = \alpha + \beta_1 \boldsymbol{x}_1 + \beta_2 \boldsymbol{x}_2 + \boldsymbol{\epsilon}$$

R: `lm(y ~ x1 + x2)`

***

```{r}
library(scatterplot3d)
set.seed(666)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 2*x1 + -1*x2 + rnorm(100, sd = 2)
mycol <- rgb(t(col2rgb("blue")), alpha = 150, maxColorValue = 255)
p <- scatterplot3d(x = x1, y = x2, z = y, grid = T, box = F, mar = c(3,3,3,3),
                   tick.marks = F, angle = 20, pch = 16, color = mycol)
p$plane3d(lm(y ~ x1 + x2))
```

Assumptions of linear regression models
========================================================

In linear regression, it is assumed that the relationship between the target variable $\boldsymbol{y}$ and the parameters ($\alpha$, $\boldsymbol{\beta}$) is *linear*:

$$\boldsymbol{y} = \alpha + \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$$

- The goal is to estimate the parameters $\alpha$ and $\boldsymbol{\beta}$, which describe the relationship with the explanatory variables $\boldsymbol{X}$
- An unobservable random variable ($\boldsymbol{\epsilon}$) is assumed to add noise to the observations
- Often it is reasonable to assume $\boldsymbol{\epsilon} \sim N(0, \sigma^2)$ (see the sketch below)
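
To make the assumptions concrete, here is a minimal sketch: data are simulated from a model with made-up parameter values (all numbers below are illustrative, not from the course data) and the parameters are then estimated with `lm()`.

```{r, echo = TRUE}
# a minimal sketch: simulate data from y = alpha + beta1*x1 + beta2*x2 + epsilon
set.seed(2017)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
epsilon <- rnorm(n, mean = 0, sd = 1)  # normally distributed errors
y <- 0.5 + 2*x1 - 1*x2 + epsilon       # true alpha = 0.5, beta1 = 2, beta2 = -1

# the estimates should be close to the true parameter values
coef(lm(y ~ x1 + x2))
```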

Structure of a linear model
========================================================

In the simple linear equation $\boldsymbol{y} = \alpha + \beta \boldsymbol{x} + \boldsymbol{\epsilon}$

- $\boldsymbol{y}$ is the target variable: we wish to predict the values of $\boldsymbol{y}$ using the values of $\boldsymbol{x}$
- $\alpha + \beta \boldsymbol{x}$ is the systematic part of the model
- $\beta$ quantifies the relationship between $\boldsymbol{y}$ and $\boldsymbol{x}$
- $\boldsymbol{\epsilon}$ describes the errors (or the uncertainty) of the model

Finding the model
========================================================

The best model is found by minimizing the prediction errors that the model would make

- $\hat{\boldsymbol{y}} = \hat{\alpha} + \hat{\beta} \boldsymbol{x}$ are the predictions
- $\boldsymbol{\hat{\epsilon}} = \boldsymbol{y} - \hat{\boldsymbol{y}}$ are the prediction errors, called residuals
- The model is found by minimizing the sum of squared residuals

***

```{r}
library(ggplot2)
n <- 50
x <- rnorm(n)
y <- 1.5*x + rnorm(n)
mod <- lm(y ~ x)
df <- data.frame(x = x, y = y, Fitted = fitted(mod))
ggplot(df, aes(x, y)) + geom_point(col = "red") +
  geom_smooth(method = "lm", se = F) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  geom_segment(aes(x = x, y = y, xend = x, yend = Fitted))
```

Interpreting the parameters
========================================================

When the model is

$$\boldsymbol{y} = \alpha + \beta_1 \boldsymbol{x}_1 + \beta_2 \boldsymbol{x}_2 + \boldsymbol{\epsilon}$$

- The main interest is to estimate the $\boldsymbol{\beta}$ parameters
- Interpretation of an estimate $\hat{\beta_1} = 2$:
  - When $x_1$ increases by one unit, the average change in $y$ is 2 units, given that the other variables (here $x_2$) do not change

R linear model summary()
========================================================
class: small-code
incremental: false

For a quick rundown of interpreting R's regression summary, see the 'Calling summary' section of [this blog post](http://blog.yhat.com/posts/r-lm-summary.html) or read about coefficients and p-values [here](http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit).

```{r}
some_variable <- rnorm(30)
Y <- 2*some_variable + rnorm(30, sd = 3)
summary(lm(Y ~ some_variable))
```

Advanced example: Polynomial terms
========================================================

The linearity assumption is not as restrictive as one might think. It is possible to add polynomial terms to the model if the effect of a variable is non-linear:

$$\boldsymbol{y} = \alpha + \beta_1 \cdot \boldsymbol{x} + \beta_2 \cdot \boldsymbol{x}^2 + \boldsymbol{\epsilon}$$

R: `lm(y ~ x + I(x^2))`

***

```{r}
x <- rnorm(100)
x2 <- x^2
y <- 1.5*x + -1*x2 + rnorm(100)
qplot(x, y) + geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = F) +
  theme(axis.ticks = element_blank(), axis.text = element_blank())
```

Model validation
========================================================
type: prompt

```{r, fig.align = "center", fig.width = 14, fig.height = 8}
library(ggfortify)
lm_fit <- lm(points ~ attitude, data = learning2014)
autoplot(lm_fit, which = c(1, 2, 5), smooth.linetype = "blank")
```

Model assumptions
========================================================

A statistical model always includes several assumptions which describe the data generating process.

- How well the model describes the phenomenon of interest depends on how well the assumptions fit reality
- In a linear regression model, an obvious assumption is linearity: the target variable is modelled as a linear combination of the model parameters
- Usually it is assumed that the errors are normally distributed (see the sketch below)
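
The residual-based checks on the following slides rely on quantities that are easy to extract from a fitted model. Here is a minimal sketch using simulated data (the object names are illustrative):

```{r, echo = TRUE}
# a minimal sketch: fitted values and residuals of an lm object
set.seed(123)
x <- rnorm(50)
y <- 2 + 1.5*x + rnorm(50)
fit <- lm(y ~ x)

# residuals are the observed values minus the fitted values
all.equal(residuals(fit), y - fitted(fit), check.attributes = FALSE)

# the least squares estimates minimize this sum of squared residuals
sum(residuals(fit)^2)
```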

Assumptions of linear regression models
========================================================

Analyzing the *residuals* of the model provides a method to explore the validity of the model assumptions. Many interesting assumptions are included in the expression

$$\boldsymbol{\epsilon} \sim N(0, \sigma^2)$$

- The errors are normally distributed
- The errors are not correlated
- The errors have constant variance, $\sigma^2$
- The size of a given error does not depend on the explanatory variables

Normality of the errors (QQ-plot)
========================================================

A QQ-plot of the residuals provides a method to explore the assumption that the errors of the model are normally distributed.

```{r, fig.width = 14, fig.height = 7}
set.seed(444)
x <- rpois(40, lambda = 10)
y1 <- x + rnorm(40)
y2 <- x + x*rnorm(40)
lm1 <- lm(y1 ~ x)
lm2 <- lm(y2 ~ x)
p1 <- autoplot(lm1, which = 2, size = 3, smooth.linetype = "blank") + ggtitle("Reasonable")
p2 <- autoplot(lm2, which = 2, size = 3, smooth.linetype = "blank") + ggtitle("Questionable")
p1 + p2
```

Constant variance of errors
========================================================

The constant variance assumption implies that the size of the errors should not depend on the explanatory variables.

- This can be explored with a simple scatter plot of residuals versus model predictions
- **Any** pattern in the scatter plot implies a problem with the assumptions

***

```{r, fig.height = 8}
library(gridExtra)
set.seed(444)
n <- 100
x <- abs(rnorm(n, sd = 1.5))
y1 <- x + rnorm(n)
y2 <- sort(x) + sapply(sort(x), function(s) rnorm(n = 1, sd = s))
lm1 <- lm(y1 ~ x)
lm2 <- lm(y2 ~ sort(x))
p1 <- ggplot(fortify(lm1), aes(.fitted, .resid)) + geom_point(size = 3) +
  ggtitle("Reasonable") + ylim(c(-5, 5))
p2 <- ggplot(fortify(lm2), aes(.fitted, .resid)) + geom_point(size = 3) +
  ggtitle("Questionable") + ylim(c(-5, 5))
grid.arrange(p1, p2, ncol = 1)
```

Leverage of observations (1)
========================================================

Leverage measures how much impact a single observation has on the model.

- A residuals vs leverage plot can help identify which observations have an unusually high impact (see the sketch below for computing these quantities directly)
- The next two slides show four examples
- Each row of two plots defines a *data - model validation* pair
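
A minimal sketch of how these quantities can be computed directly (on simulated data; the variable names are illustrative): `hatvalues()` gives the leverage of each observation and `cooks.distance()` measures its overall influence on the fit.

```{r, echo = TRUE}
# a minimal sketch: leverage and influence of individual observations
set.seed(42)
x <- c(rnorm(29), 6)   # the last observation has an extreme x value
y <- 2*x + rnorm(30)
fit <- lm(y ~ x)

# hat values measure leverage; Cook's distance combines leverage and residual size
summary(hatvalues(fit))
which.max(cooks.distance(fit))
```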

Leverage of observations (2)
========================================================

```{r}
library(ggplot2)
library(broom)
library(gridExtra)
set.seed(666)
n <- 30

# regular relationship
x1 <- rnorm(n, mean = 2)
y1 <- x1 + rnorm(n)
df1 <- augment(lm(y1 ~ x1))

# unexpectedly high x value with no error
x2 <- c(x1[-n], 2*max(x1))
y2 <- x2 + c(rnorm(n-1), 0)
df2 <- augment(lm(y2 ~ x2))

# unexpectedly high y value at the mean of x
x3 <- c(x1[-n], mean(x1))
y3 <- c(y1[-n], 2*max(y1))
df3 <- augment(lm(y3 ~ x3))

# high error, high leverage
x4 <- x2
y4 <- c(y1[-which.max(x4)], min(y1))
df4 <- augment(lm(y4 ~ x4))

# plots
size <- 3
my_plot <- function(x, y, size = 3) {
  df <- data.frame(x = x, y = y)
  p <- ggplot(df, aes(x = x, y = y))
  p <- p + geom_point(size = size)
  p + geom_smooth(method = "lm", se = F)
}
sc1 <- my_plot(x1, y1) + ggtitle("data: regular errors")
sc2 <- my_plot(x2, y2) + ggtitle("data: regular errors with an outlier")
sc3 <- my_plot(x3, y3) + ggtitle("data: high error with an outlier at the mean")
sc4 <- my_plot(x4, y4) + ggtitle("data: high error with an outlier")
lv1 <- ggplot(data = df1, aes(.cooksd, .std.resid)) + geom_point(size = size) + ggtitle("diagnostic: regular leverage")
lv2 <- ggplot(data = df2, aes(.cooksd, .std.resid)) + geom_point(size = size) + ggtitle("diagnostic: high leverage")
lv3 <- ggplot(data = df3, aes(.cooksd, .std.resid)) + geom_point(size = size) + ggtitle("diagnostic: regular leverage")
lv4 <- ggplot(data = df4, aes(.cooksd, .std.resid)) + geom_point(size = size) + ggtitle("diagnostic: high leverage")
```

```{r, fig.width = 14, fig.height = 9}
grid.arrange(sc1, lv1, sc2, lv2, nrow = 2)
```

Leverage of observations (3)
========================================================

```{r, fig.width = 14, fig.height = 9}
grid.arrange(sc3, lv3, sc4, lv4, nrow = 2)
```
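
To connect these diagnostics back to the learning2014 data, the sketch below refits the model from the validation slide and draws the residuals vs leverage plot with base R's `plot()` method for `lm` objects (an alternative to the `ggfortify::autoplot()` call used earlier).

```{r, echo = TRUE, fig.width = 7, fig.height = 7}
# a minimal sketch: residuals vs leverage for the learning2014 model
lm_fit <- lm(points ~ attitude, data = learning2014)
plot(lm_fit, which = 5)  # 5 = residuals vs leverage
```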