This page was last updated on November 19, 2018.
In this tutorial you will use the following packages:
tigerstats
visreg
The latter package is likely new to you, so install it first by typing the following in the command console:
install.packages("visreg")
Load the packages:
library(tigerstats)
library(visreg)
The “plantbiomass” dataset:
plantbiomass <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/plantbiomass.csv"), header = TRUE)
The plantbiomass dataset (see p. 551 in text) includes data describing plant biomass measured after 10 years within each of 5 experimental “species richness” treatments, wherein plants were grown in groups of 1, 2, 4, 8, or 16 species. The treatment variable is called nSpecies, and the response variable biomassStability is a measure of ecosystem stability. The research hypothesis was that increasing diversity (species richness) would lead to increased ecosystem stability.
head(plantbiomass)
## nSpecies biomassStability
## 1 1 7.47
## 2 1 6.74
## 3 1 6.61
## 4 1 6.40
## 5 1 5.67
## 6 1 5.26
inspect(plantbiomass)
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd
## 1 nSpecies integer 1.00 2.00 4 8.0 16.00 6.316770 5.639617
## 2 biomassStability numeric 1.34 3.07 4 5.2 15.76 4.421801 1.947736
## n missing
## 1 161 0
## 2 161 0
We see that both variables are numeric, and that there are 161 observations overall.
When we are only interested in the strength of a linear association between two numerical variables, we use a correlation analysis.
When we are interested in making predictions, we use regression analysis.
Often in the literature you see authors reporting regression results when in fact there was no rationale for using regression; a correlation analysis would have been more appropriate.
The two analyses are mathematically related.
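One concrete connection: for a simple linear regression, the “Multiple R-squared” value reported by R equals the square of the Pearson correlation coefficient between the two variables. Here is a quick sketch using the plantbiomass data loaded above (this is an aside, not a required step; it uses the lm function that we introduce later in this tutorial):
cor(plantbiomass$nSpecies, plantbiomass$biomassStability)^2 # squared Pearson correlation
summary(lm(biomassStability ~ nSpecies, data = plantbiomass))$r.squared # the identical R-squared value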
Regression analysis is extremely common in biology, and unfortunately it is also common to see incorrect implementation of regression analysis. In particular, one often sees scatterplots that clearly show violations of the assumptions of regression analysis (see below). Failure to appropriately check assumptions can lead to misleading and incorrect conclusions.
Recall from your high-school training that the equation for a straight line is:
\(Y = a + bX\)
In regression analysis, the “least-squares line” is the line for which the sum of all the squared deviations in Y is smallest.
In a regression context, a is the Y-intercept and b is the slope of the regression line.
Regression analysis falls under the heading of inferential statistics, which means we use it to draw inferences about a linear relationship between Y and X within a population of interest. In the actual population, the true regression line is represented as follows:
\(Y = \alpha + \beta X\)
In practice, we estimate the “parameters” \(\alpha\) and \(\beta\) using a random sample from the population, and by calculating a and b using regression analysis.
NOTE: the \(\alpha\) in the regression equation has no relation to the \(\alpha\) that sets the significance level for a test!
In least-squares regression analysis, the null hypothesis is that the slope of the relationship between Y and X is equal to zero. That is:
\(H_0\): \(\beta = 0\)
\(H_A\): \(\beta \neq 0\)
However, one does not typically state the null and alternative hypotheses when conducting a regression analysis; rather, they are implied, so there is no need to state them explicitly. You must, however, know what is being tested: the null hypothesis that the slope is equal to zero.
In regression analysis one must conduct the actual analysis before one can check the assumptions. Thus, the order of operations is a bit different from that of other tests.
Follow these steps when conducting regression analysis:
Visualize the data in a scatterplot.
Conduct the regression analysis and store the output.
Check the assumptions using the residuals from the analysis (if they are violated, transform the data and re-run the analysis).
Examine the regression output (slope, F, \(R^2\), and P-value), and calculate the 95% confidence interval for the slope.
If the regression is significant, produce a scatterplot that includes the regression line and confidence bands.
Provide a concluding statement that reports the key regression results.
Using the plant biomass data, let’s test the null hypothesis that ecosystem stability cannot be predicted from plant species richness.
We learned in an earlier tutorial that the best way to visualize an association between two numeric variables is with a scatterplot, and that we can create a scatterplot using the plot function:
plot(biomassStability ~ nSpecies, data = plantbiomass,
xlab = "Species number treatment",
ylab = "Biomass stability",
pch = 1, # pch changes the symbol type
col = "firebrick",
las = 1) # orients y-axis tick labels properly
Figure 1: Stability of biomass production over 10 years in 161 plots and the initial number of plant species assigned to plots.
NOTE: This figure that you first produce for visualizing the data may or may not be necessary to present with your results. If the regression analysis ends up being significant, then you’ll create a new figure to reference in your concluding statement (see below). If your regression analysis is non-significant, then you should present and refer to this figure (which does not include a best-fit regression line).
In Figure 1 we do not report units for the response variable, because it does not have units.
In an earlier tutorial, we learned how to properly interpret a scatterplot, and what information to include in your interpretation. Be sure to consult that tutorial.
We see in Figure 1 that there is a weak, positive, and somewhat linear association between biomass stability and species number treatment. There appears to be increasing spread of the response data with increasing values of the independent (x) variable, and perhaps outliers in the highest group of species richness. We should keep this in mind.
Regression analysis assumes that:
the true relationship between Y and X is linear
at each value of X, the Y values represent a random sample from the population
the residuals are normally distributed
the variance of Y (and thus of the residuals) is the same at all values of X
Strangely, we must conduct the regression analysis before checking the assumptions. This is because we check the assumptions using the residuals from the regression analysis.
We therefore implement the regression analysis now, being sure to store the output for future use.
We’ll use an \(\alpha\) level of 0.05.
The function to use is the lm function, which we saw in tutorial 15, but here our independent (explanatory) variable is a numeric variable rather than a categorical variable.
Let’s run the regression analysis and assign the output to an object as follows:
biomass.lm <- lm(biomassStability ~ nSpecies, data = plantbiomass)
Before we do anything else (e.g. look at the results of the regression), we need to first check the assumptions!
TIP: The plot.lm function in the base stats package can be used on the regression output object to produce a series of so-called “regression diagnostic plots”. See the help file for info.
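For example, the following (an optional sketch; in this tutorial we construct the diagnostic plots manually instead) would draw the default diagnostic plots for our model:
par(mfrow = c(2, 2)) # arrange the four diagnostic plots in a 2 x 2 grid
plot(biomass.lm) # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1)) # restore the default single-panel layout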
To get the residuals, which are required to check the assumptions, you simply use the residuals function (from the base stats package) on the object holding the regression output from above (biomass.lm). We’ll create a new variable (biomass.lm.resids) in the dataframe plantbiomass to store these residuals:
plantbiomass$biomass.lm.resids <- residuals(biomass.lm)
“Residuals” represent the vertical difference between each observed value of Y (here, biomass stability) and the least-squares line (the predicted value of Y) at each value of X.
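In symbols, the residual for observation \(i\) is \(e_i = Y_i - \hat{Y}_i\), where \(\hat{Y}_i = a + bX_i\) is the value of Y predicted by the least-squares line at \(X_i\).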
You might think that one could check the normality assumption by simply conducting usual diagnostics on the raw Y values. However, the appropriate method is to use the residuals from the regression, and check that these are normally distributed.
We can check the normality of residuals assumption with a normal quantile plot, which you’ve seen previously.
{
qqnorm(plantbiomass$biomass.lm.resids, las = 1)
qqline(plantbiomass$biomass.lm.resids)
}
Figure 2: Normal quantile plot of the residuals from a regression of biomass stability on species number for 161 plots.
TIP: The outer curly brackets in the code above ensure that the qqline function is implemented properly and added to the main quantile plot in the R chunk (it avoids an error).
Figure 2 shows that the residuals don’t fall consistently near the line in the normal quantile plot; there is curvature, and larger values tend to fall farther from the line. This suggests the need to log-transform the data.
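If you would like a numeric check to complement the quantile plot, the shapiro.test function (from the base stats package; it appears in the function list at the end of this tutorial) can be run on the residuals; a small P-value indicates a departure from normality:
shapiro.test(plantbiomass$biomass.lm.resids) # Shapiro-Wilk test of normality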
Before trying a transformation, let’s also check (i) the assumption that the variance of Y is the same at all values of X, and (ii) the assumption that the association is linear. To do this, we plot the residuals against the original “X” variable, which here is the number of species initially used in the treatment. Note that we’ll add a horizontal line (“segment”) at zero for reference:
{
plot(biomass.lm.resids ~ nSpecies, data = plantbiomass,
ylab = "Residual",
xlab = "Species number treatment",
pch = 1, # pch changes the symbol type
col = "firebrick",
las = 1) # orients y-axis tick labels properly
abline(0, 0, lty = 2) # add horizontal dashed line
}
Figure 3: Residual plot from a regression of biomass stability on species number for 161 plots.
Here we are watching out for:
a “funnel” shape, i.e. the spread of the residuals changing along X (indicating unequal variance)
a curved pattern in the residuals (indicating a non-linear association)
obvious outliers
Figure 3 shows that the variance in biomass stability is greatest within the largest species richness treatment. This again suggests the need for a log transformation.
TIP: If ever you see a clear outlier when checking assumptions, then a safe approach is to report the regression with and without the outlier point(s) included. If the exclusion of the outlier changes your conclusions, then you should make this clear.
Let’s log-transform the response variable, creating a new variable log.biomass:
plantbiomass$log.biomass <- log(plantbiomass$biomassStability) # log() is the natural logarithm
Now let’s re-run the regression analysis and save the residuals again:
log.biomass.lm <- lm(log.biomass ~ nSpecies, data = plantbiomass)
plantbiomass$log.biomass.lm.resids <- residuals(log.biomass.lm)
And re-plot the residual diagnostic plots:
{
qqnorm(plantbiomass$log.biomass.lm.resids, las = 1)
qqline(plantbiomass$log.biomass.lm.resids)
}
Figure 4: Normal quantile plot of the residuals from a regression of biomass stability (log transformed) on species number for 161 plots.
{
plot(log.biomass.lm.resids ~ nSpecies, data = plantbiomass,
ylab = "Residual",
xlab = "Species number treatment",
pch = 1, # pch changes the symbol type
col = "firebrick",
las = 1) # orients y-axis tick labels properly
abline(0, 0, lty = 2) # add horizontal dashed line
}
Figure 5: Residual plot from a regression of biomass stability (log transformed) on species number for 161 plots.
Figure 4 shows that the residuals are reasonably normally distributed, and Figure 5 shows no strong pattern of changing variance in Y along X, nor is there an obvious curved pattern to the residuals (which would indicate a non-linear association). We therefore proceed with the analyses using the log-transformed response variable.
We have already conducted the regression analysis above, using the lm function, and we stored the output in the log.biomass.lm object. Now we use the summary function on the regression output to view the results:
summary(log.biomass.lm)
##
## Call:
## lm(formula = log.biomass ~ nSpecies, data = plantbiomass)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.97148 -0.25984 -0.00234 0.23100 1.03237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.198294 0.041298 29.016 < 2e-16 ***
## nSpecies 0.032926 0.004884 6.742 2.73e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3484 on 159 degrees of freedom
## Multiple R-squared: 0.2223, Adjusted R-squared: 0.2174
## F-statistic: 45.45 on 1 and 159 DF, p-value: 2.733e-10
What do we need to focus on in this output?
Under the “Estimate” heading, you see the estimated value of the Intercept (a) and of the slope (b), which is reported to the right of the predictor variable name.
NOTE: There are two values of t reported in the table; one associated with the intercept, and one with the slope. These are testing the implied hypotheses that (i) the intercept equals zero and (ii) the slope equals zero. We are typically only interested in the slope.
At the bottom of the output you’ll see a value of F, just like you saw with the ANOVA output in Tutorial 15. It is this value of F that you will report in statements about regression (see below).
We also see the P-value associated with the test of the zero slope null hypothesis, which is identical to the P-value reported for the overall regression (at the bottom of the output).
NOTE: It is important to recognize that an overall significant regression (i.e. a slope significantly different from zero) does not necessarily mean that the regression will yield accurate predictions. For this we need to assess the “coefficient of determination”, or \(R^2\) value (called “Multiple R-squared” in the output), which here is 0.22, and which represents the fraction of the variation in Y that is accounted for by the linear regression model. A value of this magnitude indicates a comparatively weak regression, i.e. one that has some predictive power, but not particularly good predictive power (provided assumptions are met, and we’re not extrapolating beyond the range of our X values).
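If you want to extract the \(R^2\) value programmatically rather than reading it off the printed output (a convenience, not a required step), you can pull it from the summary object:
summary(log.biomass.lm)$r.squared # the "Multiple R-squared" value, 0.2223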
Before getting to our concluding statement, we need to additionally calculate the 95% confidence interval for the slope, \(\beta\).
To calculate the confidence interval for the true slope, \(\beta\), we use the confint function from the base stats package:
?confint
We run this function on the lm model output:
confint(log.biomass.lm)
## 2.5 % 97.5 %
## (Intercept) 1.11673087 1.27985782
## nSpecies 0.02328063 0.04257117
This provides the lower and upper limits of the 95% confidence interval for the intercept (top row) and slope (bottom row).
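Incidentally, confint computes a 95% interval by default; should you ever need a different level of confidence, supply the level argument:
confint(log.biomass.lm, level = 0.99) # 99% confidence interval instead of the default 95%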
If your regression is significant, then it is recommended to accompany your concluding statement with a scatterplot that includes so-called confidence bands around the regression line.
If your regression is not significant, do not include a regression line in your scatterplot!
We can calculate two types of confidence intervals to show on the scatterplot: confidence bands, which show the 95% confidence limits for the mean value of Y in the population at each value of X, and prediction intervals, which show the 95% limits for individual values of Y. Most of the time we’re interested in showing confidence bands; in other words, we’re more interested in predicting an average value in the population (easier), rather than the value of an individual in the population (much more difficult).
To display the confidence bands, we use the visreg function from the visreg package:
?visreg
The function has a lot of options that we’re going to ignore, because the default arguments are almost all good for our purposes.
The visreg function takes the regression output object, here log.biomass.lm, and plots the regression line with confidence bands included. The “alpha” argument is used to set the level of confidence; the default is alpha = 0.05, i.e. 95% confidence bands. Note that you provide 2 arguments to the function: the regression object, and the name (in quotes) of the “X” variable.
Let’s give it a try, making sure to spruce it up a bit with some additional options, and including a good figure caption, shown here for your reference (this is the code you’d put after the “r” in the header of the R chunk):
fig.cap = "Figure 6. Stability of biomass production (log transformed) over 10 years in 161 plots and the initial number of plant species assigned to plots. Also shown is the significant least-squares regression line (black solid line; see text for details) and the 95% confidence bands (grey shading)."
visreg(log.biomass.lm,
"nSpecies",
alpha = 0.05,
line = list(col = "black"),
points = list(cex = 1.5, pch = 1, col = "black", lwd = 1.75),
las = 1,
xlab = "Species number treatment",
ylab = "Biomass stability (log transformed)")
Figure 6: Stability of biomass production (log transformed) over 10 years in 161 plots and the initial number of plant species assigned to plots. Also shown is the significant least-squares regression line (black solid line; see text for details) and the 95% confidence bands (grey shading).
Here we can interpret the figure as follows:
Figure 6 shows that biomass stability (log-transformed) is positively and linearly related to the initial number of plant species assigned to experimental plots, but there is considerable variation that remains unexplained by the least-squares regression model.
NOTE: Although the Guidelines for Data Presentation document indicates that you should report key information about the regression within the figure caption (e.g. the equation of the line), it is sufficient to refer the reader to the text.
Now that we have our new figure including confidence bands (because our regression was significant), and all our regression output (including confidence intervals for the slope), we’re ready to provide a concluding statement.
NOTE: We simply report the sample size rather than the degrees of freedom in the parentheses.
As seen in Figure 6, species number is a significant predictor of biomass stability: Biomass stability (log) = 1.2 + 0.03(Species number treatment) (F = 45.45; n = 161; 95% confidence limits for the slope: 0.023, 0.043; \(R^2\) = 0.22; P < 0.001). Given the relatively low \(R^2\) value, predictions from our regression model will not be particularly accurate.
NOTE: It is OK to report the concluding statement and associated regression output using the log-transformed data. However, if you wanted to make predictions using the regression model, it is often desirable to report the back-transformed predicted values (see below).
Model-I regression, which we use here, assumes that there is negligible error or uncertainty associated with the “X” variable measurements, relative to that associated with the “Y” variable measurements. In other words, it effectively assumes that we have control over the “X” values (like we do in this biomass experiment). Model-II regression would be used in a situation where both X and Y have similar error or uncertainty, such as using measurements of height to predict values of weight among individuals.
Even though the \(R^2\) value from our regression was rather low, we’ll proceed with using the model to make predictions.
It is not advisable to make predictions beyond the range of X values upon which the regression was built; such “extrapolation” is unreliable. So in our case (see Figure 6), we would not wish to make predictions using species richness values beyond 16.
To make a prediction using our regression model, use the predict.lm function from the base stats package:
?predict.lm
If you do not provide new values of nSpecies (note that the variable name must be the same as was used when building the model), then predict.lm will simply use the old values that were supplied when building the regression model.
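For example, to get the predicted (log-transformed) biomass stability for a single hypothetical plot seeded with 10 species (a value chosen here purely for illustration; it lies within the 1 through 16 range of the original X values):
predict.lm(log.biomass.lm, newdata = data.frame(nSpecies = 10)) # one predicted value of Y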
Here, let’s create a new set of nSpecies data to make predictions with. For example, perhaps we wish to know what the predicted value of biomass stability would be if we were to grow plants in mixtures of 1 through 16 species, rather than just the five different values used in the experiment.
When supplying new data to the predict.lm function, the data frame must use the exact same variable name for X that was used when building the model. The number of new values need not match the original sample size, but because we want to store the predictions as a new column in the plantbiomass data frame (which has 161 rows), we’ll use 161 values, all randomly selected from integers 1 through 16. Here’s the code that creates a new data frame (new.X.data) and a variable nSpecies (it must be the same name as the original X variable) that holds our new values of “X” to use for predicting “Y”:
set.seed(123) # make the random sampling reproducible
new.X.data <- data.frame(nSpecies = sample(1:16, 161, replace = TRUE)) # sampling with replacement
Let’s make the predictions, and we’ll store the predicted values in a new variable in our plantbiomass data frame:
plantbiomass$predicted.vals <- predict.lm(log.biomass.lm, newdata = new.X.data)
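As an aside, predict.lm can also return the 95% confidence interval around each predicted mean value via the interval argument; a quick sketch:
head(predict.lm(log.biomass.lm, newdata = new.X.data, interval = "confidence")) # columns: fit, lwr, upr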
Let’s superimpose these predicted values over the original scatterplot:
{
plot(log.biomass ~ nSpecies, data = plantbiomass,
ylab = "Biomass stability (log)",
xlab = "Species number treatment",
pch = 1, # pch changes the symbol type
col = "firebrick",
las = 1) # orients y-axis tick labels properly
points(new.X.data$nSpecies, plantbiomass$predicted.vals, pch = 16) # add predicted points
}
Figure 7: Stability of biomass production (log transformed) over 10 years in 161 plots and the initial number of plant species assigned to plots. Also shown are predicted values of Y across values of X.
Figure 7 shows pretty clearly the overall positive effect of species richness on biomass stability, and it also shows the considerable scatter in the relationship, hence the relatively low \(R^2\) value.
Now remember that the predicted values above are log-transformed values, and it is often desirable to report predicted values back in their original non-transformed state. To do this, follow the instructions in the earlier tutorial dealing with data transformations.
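For instance, because the log function used above is the natural logarithm, the exp function back-transforms the predicted values to the original scale (a quick sketch; the predicted.orig variable name is ours):
plantbiomass$predicted.orig <- exp(plantbiomass$predicted.vals) # back-transform from log scale
head(plantbiomass$predicted.orig)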
Getting started: read.csv, url, library
Data management / manipulation: inspect (tigerstats / mosaic packages), head
Data visualization: plot, jitter
Assumptions: qqnorm, qqline, residuals, shapiro.test, abline
Regression: lm, summary, confint, visreg (from the visreg package)
Prediction: predict.lm, seq, data.frame, sample, points