This page was last updated on November 19, 2018.


Getting started

In this tutorial you will:

  • learn about using least-squares linear regression to model the relationship between two numeric variables, and to make predictions

Required packages

  • tigerstats
  • visreg

The latter package is likely new to you, so install it first by typing the following in the command console:

install.packages("visreg")

Load the packages:

library(tigerstats)
library(visreg)

Required data

The “plantbiomass” dataset:

plantbiomass <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/plantbiomass.csv"), header = TRUE)

Data exploration

The plantbiomass (see p. 551 in text) dataset includes data describing plant biomass measured after 10 years within each of 5 experimental “species richness” treatments, wherein plants were grown in groups of 1, 2, 4, 8, or 16 species. The treatment variable is called nSpecies, and the response variable biomassStability is a measure of ecosystem stability. The research hypothesis was that increasing diversity (species richness) would lead to increased ecosystem stability.

head(plantbiomass)
##   nSpecies biomassStability
## 1        1             7.47
## 2        1             6.74
## 3        1             6.61
## 4        1             6.40
## 5        1             5.67
## 6        1             5.26
inspect(plantbiomass)
## 
## quantitative variables:  
##               name   class  min   Q1 median  Q3   max     mean       sd
## 1         nSpecies integer 1.00 2.00      4 8.0 16.00 6.316770 5.639617
## 2 biomassStability numeric 1.34 3.07      4 5.2 15.76 4.421801 1.947736
##     n missing
## 1 161       0
## 2 161       0

We see that both variables are numeric, and that there are 161 observations overall.


Regression analysis

When we are only interested in the strength of a linear association between two numerical variables, we use a correlation analysis.

When we are interested in making predictions, we use regression analysis.

Often in the literature you see authors reporting regression results when in fact there was no rationale for using regression; a correlation analysis would have been more appropriate.

The two analyses are mathematically related.
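For instance, the squared Pearson correlation coefficient between two numeric variables equals the R2 value from the corresponding least-squares regression. A minimal sketch, using the plantbiomass data loaded above:

cor(plantbiomass$nSpecies, plantbiomass$biomassStability)^2  # squared correlation coefficient
summary(lm(biomassStability ~ nSpecies, data = plantbiomass))$r.squared  # same value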

Regression analysis is extremely common in biology, and unfortunately it is also common to see incorrect implementation of regression analysis. In particular, one often sees scatterplots that clearly show violations of the assumptions of regression analysis (see below). Failure to appropriately check assumptions can lead to misleading and incorrect conclusions.


Hypothesis testing - not really!

Recall from your high-school training that the equation for a straight line is:

Y = a + bX

In regression analysis, the “least-squares line” is the line for which the sum of all the squared deviations in Y is smallest.

In a regression context, a is the Y-intercept and b is the slope of the regression line.
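In symbols, the least-squares criterion chooses a and b so as to minimize the residual sum of squares:

\(\sum_{i=1}^{n}\left(Y_i - (a + bX_i)\right)^2\)

where \(n\) is the number of observations, and \(X_i\) and \(Y_i\) are the values for the \(i\)th observation.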

Regression analysis falls under the heading of inferential statistics, which means we use it to draw inferences about a linear relationship between Y and X within a population of interest. So in the actual population, the true least-squares line is represented as follows:

Y = \(\alpha\) + \(\beta\)X

In practice, we estimate the “parameters” \(\alpha\) and \(\beta\) using a random sample from the population, and by calculating a and b using regression analysis.

NOTE: the \(\alpha\) in the regression equation has no relation to the \(\alpha\) that sets the significance level for a test!

In least-squares regression analysis, the null hypothesis is that the slope of the relationship between Y and X is equal to zero. That is:

H0: \(\beta\) = 0
HA: \(\beta\) \(\neq\) 0

However, one does not typically state the null and alternative hypotheses explicitly when conducting a regression analysis; rather, they are implied.

You must, however, know what is being tested: the null hypothesis that the slope is equal to zero.


Steps to conducting regression analysis

In regression analysis one must conduct the actual analysis before one can check the assumptions. Thus, the order of operations is a bit different from that of other tests.

Follow these steps when conducting regression analysis:

  • Set an \(\alpha\) level (default is 0.05)
  • Provide an appropriate figure, including figure caption, to visualize the data
  • Provide a line or two interpreting your figure; this may inform your concluding statement
  • Conduct the regression analysis, save the output, and use the residuals from the analysis to check assumptions
  • Assumptions
    • state the assumptions of the test
    • use appropriate figures and / or tests to check whether the assumptions of the statistical test are met
    • transform data to meet assumptions if required
    • if assumptions can’t be met (e.g. after transformation), use alternative methods (you are not required to know these alternative methods)
    • if data transformation is required, then do the transformation and provide another appropriate figure, including figure caption, to visualize the transformed data, and a line or two interpreting this new figure
  • Calculate the confidence interval for the slope and the R2 value, and report these in your concluding statement
  • If the regression is significant, provide a scatterplot that includes the regression line and confidence bands, with an appropriate figure caption
  • Use the output from the regression analysis (above) to draw an appropriate conclusion and communicate it clearly

Plant biomass example

Using the plant biomass data, let’s test the null hypothesis that ecosystem stability cannot be predicted from plant species richness.


Visualize the data

We learned in an earlier tutorial that the best way to visualize an association between two numeric variables is with a scatterplot, and that we can create a scatterplot using the plot function:

plot(biomassStability ~ nSpecies, data = plantbiomass, 
           xlab = "Species number treatment",
           ylab = "Biomass stability",
           pch = 1,  # pch changes the symbol type
           col = "firebrick",
           las = 1)  # orients y-axis tick labels properly
Figure 1: Stability of biomass production over 10 years in 161 plots and the initial number of plant species assigned to plots.

NOTE: This figure that you first produce for visualizing the data may or may not be necessary to present with your results. If the regression analysis ends up being significant, then you’ll create a new figure to reference in your concluding statement (see below). If your regression analysis is non-significant, then you should present and refer to this figure (which does not include a best-fit regression line).

In Figure 1 we do not report units for the response variable, because it is unitless.


Interpreting a scatterplot

In an earlier tutorial, we learned how to properly interpret a scatterplot, and what information to include in your interpretation. Be sure to consult that tutorial.

We see in Figure 1 that there is a weak, positive, and somewhat linear association between biomass stability and species number treatment. There appears to be increasing spread of the response data with increasing values of the independent (x) variable, and perhaps outliers in the highest group of species richness. We should keep this in mind.


Assumptions of regression analysis

Regression analysis assumes that:

  • the true relationship between X and Y is linear
  • for every value of X the corresponding values of Y are normally distributed
  • the variance of Y-values is the same at all values of X
  • at each value of X, the Y measurements represent a random sample from the population of possible Y values

Checking the assumptions of regression analysis

Strangely, we must conduct the regression analysis before checking the assumptions. This is because we check the assumptions using the residuals from the regression analysis.

We therefore implement the regression analysis now, being sure to store the output for later use.

We’ll use an \(\alpha\) level of 0.05.

The function to use is the lm function, which we saw in tutorial 15, but here our independent (explanatory) variable is a numeric variable rather than a categorical variable.

Let’s run the regression analysis and assign the output to an object as follows:

biomass.lm <- lm(biomassStability ~ nSpecies, data = plantbiomass)

Before we do anything else (e.g. look at the results of the regression), we need to first check the assumptions!

TIP: The plot.lm function in the base stats package can be used on the regression output object to produce a series of so-called “regression diagnostic plots”. See the help file for info.
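For example, a minimal sketch (shown for reference only; we’ll check the assumptions manually below):

par(mfrow = c(2, 2))  # arrange the four default diagnostic plots in a 2 x 2 grid
plot(biomass.lm)
par(mfrow = c(1, 1))  # restore the default single-plot layout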

To get the residuals, which are required to check the assumptions, you simply use the residuals function (from the base stats package) on the object holding the regression output from above (biomass.lm). We’ll create a new variable (biomass.lm.resids) in the dataframe plantbiomass to store these residuals:

plantbiomass$biomass.lm.resids <- residuals(biomass.lm)

“Residuals” represent the vertical difference between each observed value of Y (here, biomass stability) and the least-squares line (the predicted value of Y) at each value of X.
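You can convince yourself of this with a quick sketch: the residuals equal the observed Y values minus the values predicted (fitted) by the model.

head(residuals(biomass.lm))  # first few residuals
head(plantbiomass$biomassStability - fitted(biomass.lm))  # same values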

You might think that one could check the normality assumption by simply conducting the usual diagnostics on the raw Y values. However, the appropriate method is to use the residuals from the regression, and check that these are normally distributed.

We can check the normality of residuals assumption with a normal quantile plot, which you’ve seen previously.

{
qqnorm(plantbiomass$biomass.lm.resids, las = 1)
qqline(plantbiomass$biomass.lm.resids)
}
Figure 2: Normal quantile plot of the residuals from a regression of biomass stability on species number for 161 plots

TIP: The outer curly brackets in the code above ensure that the qqline function is implemented properly and added to the main quantile plot in the R chunk (it avoids an error).

Figure 2 shows that the residuals don’t fall consistently near the line in the normal quantile plot; there is curvature, and increasing values tend to fall further from the line. This suggests the need to log-transform the data.

Before trying a transformation, let’s also check the assumptions that (i) the variance of Y is the same at all values of X, and (ii) the association is linear. To do this, we plot the residuals against the original “X” variable, which here is the number of species initially used in the treatment. Note that we’ll add a horizontal line (segment) at zero for reference:

{
plot(biomass.lm.resids ~ nSpecies, data = plantbiomass, 
           ylab = "Residual",
           xlab = "Species number treatment",
           pch = 1,  # pch changes the symbol type
           col = "firebrick",
           las = 1)  # orients y-axis tick labels properly
abline(0, 0, lty = 2) # add horizontal dashed line 
}
Figure 3: Residual plot from a regression of biomass stability on species number for 161 plots.

Here we are watching out for:

  • a “funnel” shape, or substantial differences in the spread of Y at each X
  • outliers to the general trend
  • a curved pattern

Figure 3 shows that the variance in biomass stability is greatest within the largest species richness treatment. This again suggests the need for a log transformation.

TIP: If ever you see a clear outlier when checking assumptions, then a safe approach is to report the regression with and without the outlier point(s) included. If the exclusion of the outlier changes your conclusions, then you should make this clear.

Transform the data

Let’s log-transform the response variable, creating a new variable log.biomass:

plantbiomass$log.biomass <- log(plantbiomass$biomassStability)

Now let’s re-run the regression analysis and save the residuals again:

log.biomass.lm <- lm(log.biomass ~ nSpecies, data = plantbiomass)
plantbiomass$log.biomass.lm.resids <- residuals(log.biomass.lm)

And re-plot the residual diagnostic plots:

{
qqnorm(plantbiomass$log.biomass.lm.resids, las = 1)
qqline(plantbiomass$log.biomass.lm.resids)
}
Figure 4: Normal quantile plot of the residuals from a regression of biomass stability (log transformed) on species number for 161 plots

{
plot(log.biomass.lm.resids ~ nSpecies, data = plantbiomass, 
           ylab = "Residual",
           xlab = "Species number treatment",
           pch = 1,  # pch changes the symbol type
           col = "firebrick",
           las = 1)  # orients y-axis tick labels properly
abline(0, 0, lty = 2) # add horizontal dashed line 
}
Figure 5: Residual plot from a regression of biomass stability (log transformed) on species number for 161 plots.

Figure 4 shows that the residuals are reasonably normally distributed, and Figure 5 shows no strong pattern of changing variance in Y along X, nor an obvious curved pattern to the residuals (which would indicate a non-linear association). We therefore proceed with the analyses using the log-transformed response variable.


Conduct the regression analysis

We have already conducted the regression analysis above, using the lm function. We stored the output in the log.biomass.lm object. Now, we use the summary function on the regression output to view the results:

summary(log.biomass.lm)
## 
## Call:
## lm(formula = log.biomass ~ nSpecies, data = plantbiomass)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.97148 -0.25984 -0.00234  0.23100  1.03237 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.198294   0.041298  29.016  < 2e-16 ***
## nSpecies    0.032926   0.004884   6.742 2.73e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3484 on 159 degrees of freedom
## Multiple R-squared:  0.2223, Adjusted R-squared:  0.2174 
## F-statistic: 45.45 on 1 and 159 DF,  p-value: 2.733e-10

What do we need to focus on in this output?

Under the “Estimate” heading, you see the estimated value of the Intercept (a) and of the slope (b), which is reported to the right of the predictor variable name.

NOTE: There are two values of t reported in the table; one associated with the intercept, and one with the slope. These are testing the implied hypotheses that (i) the intercept equals zero and (ii) the slope equals zero. We are typically only interested in the slope.

At the bottom of the output you’ll see a value of F, just like you saw with the ANOVA output in Tutorial 15. It is this value of F that you will report in statements about regression (see below).

We also see the P-value associated with the test of the zero slope null hypothesis, which is identical to the P-value reported for the overall regression (at the bottom of the output).

NOTE: It is important to recognize that an overall significant regression (i.e. slope significantly different from zero) does not necessarily mean that the regression will yield accurate predictions. For this we need to assess the “coefficient of determination”, or R2 value (called “multiple R-squared” in the output), which here is 0.22, and represents the fraction of the variation in Y that is accounted for by the linear regression model. A value of this magnitude is indicative of a comparatively weak regression, i.e. one that has some predictive power, but not particularly good predictive power (provided assumptions are met, and we’re not extrapolating beyond the range of our X values).
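TIP: If you wish to pull these quantities out of the output programmatically (e.g. for use in a report), a minimal sketch:

coef(log.biomass.lm)  # the estimates of the intercept (a) and slope (b)
summary(log.biomass.lm)$r.squared  # the multiple R-squared value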

Before getting to our concluding statement, we need to additionally calculate the 95% confidence interval for the slope, \(\beta\).


Confidence interval for the slope

To calculate the confidence interval for the true slope, \(\beta\), we use the confint function from the base stats package:

?confint

We run this function on the lm model output:

confint(log.biomass.lm)
##                  2.5 %     97.5 %
## (Intercept) 1.11673087 1.27985782
## nSpecies    0.02328063 0.04257117

This provides the lower and upper limits of the 95% confidence interval for the intercept (top row) and slope (bottom row).
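If you only want the limits for the slope, you can index the output by row name; the level argument changes the confidence level. A minimal sketch:

confint(log.biomass.lm)["nSpecies", ]  # just the slope's 95% confidence limits
confint(log.biomass.lm, level = 0.99)  # 99% limits instead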


Scatterplot with regression confidence bands

If your regression is significant, then it is recommended to accompany your concluding statement with a scatterplot that includes so-called confidence bands around the regression line.

If your regression is not significant, do not include a regression line in your scatterplot!

We can calculate two types of confidence intervals to show on the scatterplot:

  • the 95% “confidence bands”
  • the 95% “prediction intervals”

Most of the time we’re interested in showing confidence bands, which show the 95% confidence limits for the mean value of Y in the population at each value of X. In other words, we’re more interested in predicting an average value in the population (easier), rather than the value of an individual in the population (much more difficult).
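The distinction can also be seen numerically using the predict function (covered below), which can return either type of interval. A minimal sketch, using an illustrative richness value of 8:

predict(log.biomass.lm, newdata = data.frame(nSpecies = 8),
        interval = "confidence")  # 95% CI for the mean of Y at X = 8
predict(log.biomass.lm, newdata = data.frame(nSpecies = 8),
        interval = "prediction")  # wider 95% interval for an individual Y at X = 8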

To display the confidence bands, we use the visreg function from the visreg package:

?visreg

The function has a lot of options that we’re going to ignore, because the default arguments are almost all suitable for our purposes.

The visreg function takes the regression output object, here log.biomass.lm, and plots the regression line with confidence bands included. The argument “alpha” sets the level of confidence; the default is alpha = 0.05, i.e. 95% confidence bands. Note that you provide two arguments to the function: the regression object, and the name (in quotes) of the “X” variable.

Let’s give it a try, making sure to spruce it up a bit with some additional options, and including a good figure caption, shown here for your reference (this is the code you’d put after the “r” in the header of the R chunk):

fig.cap = "Figure 6. Stability of biomass production (log transformed) over 10 years in 161 plots and the initial number of plant species assigned to plots. Also shown is the significant least-squares regression line (black solid line; see text for details) and the 95% confidence bands (grey shading)."
visreg(log.biomass.lm, 
       "nSpecies",
       alpha = 0.05, 
       line = list(col = "black"), 
       points = list(cex = 1.5, pch = 1,col = "black", lwd = 1.75), 
       las = 1, 
       xlab = "Species number treatment", 
       ylab = "Biomass stability (log transformed)")
Figure 6: Stability of biomass production (log transformed) over 10 years in 161 plots and the initial number of plant species assigned to plots. Also shown is the significant least-squares regression line (black solid line; see text for details) and the 95% confidence bands (grey shading).

Here we can interpret the figure as follows:

Figure 6 shows that biomass stability (log-transformed) is positively and linearly related to the initial number of plant species assigned to experimental plots, but there is considerable variation that remains unexplained by the least-squares regression model.

NOTE: Although the Guidelines for Data Presentation document indicates that you should report key information about the regression within the figure caption (e.g. the equation of the line), it is sufficient to refer the reader to the text.


Concluding statement

Now that we have our new figure including confidence bands (because our regression was significant), and all our regression output (including confidence intervals for the slope), we’re ready to provide a concluding statement.

NOTE: We simply report the sample size rather than the degrees of freedom in the parentheses.

As seen in Figure 6, species number is a significant predictor of biomass stability: Biomass stability (log) = 1.2 + 0.03(Species number treatment) (F = 45.45; n = 161; 95% confidence limits for the slope: 0.023, 0.043; R2 = 0.22; P < 0.001). Given the relatively low R2 value, predictions from our regression model will not be particularly accurate.

NOTE: It is OK to report the concluding statement and associated regression output using the log-transformed data. However, if you wanted to make predictions using the regression model, it is often desirable to report the back-transformed predicted values (see below).


Model-I versus Model-II regression

Model-I regression, which we use here, assumes that there is negligible error or uncertainty associated with the “X” variable measurements, relative to that associated with the “Y” variable measurements. In other words, it effectively assumes that we have control over the “X” values (like we do in this biomass experiment). Model-II regression would be used in a situation where both X and Y have similar error or uncertainty, such as using measurements of height to predict values of weight among individuals.


Making predictions

Even though the R2 value from our regression was rather low, we’ll proceed with using the model to make predictions.

It is not advisable to make predictions beyond the range of X-values upon which the regression was built; such “extrapolations” can be misleading. So in our case (see Figure 6), we would not make predictions using species richness values beyond 16.
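A simple safeguard is to check the range of X values used to fit the model before making any predictions:

range(plantbiomass$nSpecies)  # predictions are only defensible within this range
## [1]  1 16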

To make a prediction using our regression model, use the predict.lm function from the base stats package:

?predict.lm

If you do not provide new values of nSpecies (note that the variable name must be the same as was used when building the model), then it will simply use the old values that were supplied when building the regression model.

Here, let’s create a new set of nSpecies data to make predictions with. For example, perhaps we wish to know what the predicted value of biomass stability would be if we were to grow plants in mixtures of 1 through 16 species, rather than just the five different values used in the experiment.

Note that predict.lm does not actually require the same number of X values that were used to build the model; we supply 161 values here only so that the predictions can be stored as a new column in the plantbiomass data frame. We do, however, need to use the exact same variable name for X.

So, let’s use 161 values, all randomly selected from the integers 1 through 16. Here’s the code that creates a new data frame (new.X.data) with a variable nSpecies (it must have the same name as the original X variable) that holds our new values of “X” to use for predicting “Y”.

set.seed(123)
new.X.data <- data.frame(nSpecies = sample(1:16, 161, replace = T)) # sampling with replacement 

Let’s make the predictions, and we’ll store the predicted values in our plantbiomass data frame, in a new variable:

plantbiomass$predicted.vals <- predict.lm(log.biomass.lm, newdata = new.X.data)
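If instead you only wanted predictions at a few specific richness values (with no need to store them alongside the original data), a smaller data frame works fine; a minimal sketch, using a few illustrative values:

predict.lm(log.biomass.lm, newdata = data.frame(nSpecies = c(1, 8, 16)))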

Let’s superimpose these predicted values over the original scatterplot:

{
plot(log.biomass ~ nSpecies, data = plantbiomass, 
           ylab = "Biomass stability (log)",
           xlab = "Species number treatment",
           pch = 1,  # pch changes the symbol type
           col = "firebrick",
           las = 1)  # orients y-axis tick labels properly
points(new.X.data$nSpecies, plantbiomass$predicted.vals, pch = 16) # add predicted points 
}
Figure 7: Stability of biomass production (log transformed) over 10 years in 161 plots and the initial number of plant species assigned to plots. Also shown are predicted values of Y across values of X.

Figure 7 shows pretty clearly the overall positive effect of species richness on biomass stability, and it also shows the considerable scatter in the relationship, hence the relatively low R2 value.

Back-transforming regression predictions

Now remember that the predicted values above are log-transformed values, and it is often desirable to report predicted values back in their original non-transformed state. To do this, follow the instructions in the earlier tutorial dealing with data transformations.
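In brief: because the log function uses the natural logarithm by default, the exp function back-transforms the predicted values to the original scale. A minimal sketch:

exp(plantbiomass$predicted.vals)  # predicted biomass stability on the original scale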


List of functions

Getting started:

  • read.csv
  • url
  • library

Data management / manipulation:

  • inspect (tigerstats / mosaic packages)
  • head

Data visualization:

  • plot
  • jitter

Assumptions:

  • qqnorm
  • qqline
  • residuals
  • shapiro.test
  • abline

Regression:

  • lm
  • summary
  • confint
  • visreg (from the visreg package)

Prediction:

  • predict.lm
  • seq
  • data.frame
  • sample
  • points

Markdown File

You can access the Rmd file that created this page here. For it to work, you also need to download a “css” file linked here (called “tutorial.css”), and place it in a directory one down from your working directory.