The correlation coefficient (\(\boldsymbol{r}\) or Pearson's Product Moment Correlation Coefficient) measures the strength and direction of the linear relationship between two quantitative variables. Regression models estimate the impacts of changes in \(x\) on the mean of the response variable \(y\). The direction of the assumed relationship (which variable explains or causes the other) matters for regression models but does not matter for correlation. Regression lines only describe the pattern of the relationship; in regression, we use the coefficient of determination (\(R^2\)) to describe the strength of the relationship between the variables as the percentage of the variation in the response variable that is explained by the model. If we are choosing between models, we prefer those with higher \(R^2\) values for obvious reasons, but we will discover in Chapter ?? that maximizing the coefficient of determination is not a good way to pick a model when we have multiple candidate options.
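To make the connection between \(r\) and \(R^2\) concrete, here is a minimal R sketch on simulated data (the variable names and values are made up for illustration): in simple linear regression, the coefficient of determination is just the square of Pearson's correlation.

```r
# Minimal sketch with simulated (made-up) data: in simple linear
# regression, R^2 is the square of Pearson's correlation coefficient r.
set.seed(123)
x <- runif(50, 0, 10)               # hypothetical explanatory variable
y <- 3 + 2 * x + rnorm(50, sd = 2)  # hypothetical response
cor(x, y)^2                   # square of Pearson's r
summary(lm(y ~ x))$r.squared  # R^2 reported by the regression model
```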
This chapter explored a wide variety of potential problems that arise when using regression models, including a discussion of the conditions that will be required for using the models to perform trustworthy inferences in the remaining chapters. It is important to remember that correlation and regression models only measure the linear association between variables, which can be misleading if a nonlinear relationship is present. Similarly, influential observations can completely distort the apparent relationship between variables and should be assessed before trusting any regression output. It is also important to remember that regression lines should not be used outside the scope of the original observations – extrapolation should be checked for and avoided whenever possible, or at least acknowledged when it is being performed.
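As a rough illustration of these checks, the sketch below (again on simulated data with made-up names) draws the standard `lm()` diagnostic plots, lists the largest Cook's distances as a screen for influential observations, and reports the range of \(x\) that defines the scope of the model.

```r
# A sketch of basic regression diagnostics on simulated (made-up) data.
set.seed(123)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50, sd = 2)
model <- lm(y ~ x)

par(mfrow = c(2, 2))
plot(model)    # residuals vs fitted, Q-Q, scale-location, leverage plots
par(mfrow = c(1, 1))

sort(cooks.distance(model), decreasing = TRUE)[1:3]  # largest Cook's distances
range(x)       # scope of the data: predicting outside this interval
               # would be extrapolation
```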
Regression models look like they estimate the changes in \(y\) that are caused by changes in \(x\), especially when you use \(x\) to predict \(y\). This is not true unless the levels of \(x\) are randomly assigned; only then can we make causal inferences. Since this is not generally the case, you should initially assume that any regression equation simply describes the relationship – if you observe two subjects that are 1 unit of \(x\) apart, you can expect their mean responses to differ by \(b_1\) – but you should not say that changing \(x\) causes a change in the mean of the responses.

Despite all these cautions, regression models are very popular statistical methods. They provide detailed descriptions of relationships between variables and can be extended to situations where we are interested in multiple predictor variables. They also share ties to the ANOVA models discussed previously. When you are running R code, you will note that all of the ANOVAs and regression models are estimated using `lm()`.
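As a quick, hypothetical demonstration of this point, the sketch below fits a one-way ANOVA (categorical predictor) and a simple linear regression (quantitative predictor) with the same `lm()` function; all data and names here are simulated for illustration.

```r
# One-way ANOVA and simple linear regression via the same lm() call.
set.seed(456)
group <- factor(rep(c("A", "B", "C"), each = 20))   # categorical predictor
y1 <- rnorm(60, mean = rep(c(5, 7, 6), each = 20))
x <- runif(60, 0, 10)                               # quantitative predictor
y2 <- 3 + 2 * x + rnorm(60, sd = 2)

anova_fit <- lm(y1 ~ group)    # the ANOVA model is a linear model
anova(anova_fit)               # the familiar ANOVA table
reg_fit <- lm(y2 ~ x)          # regression with the same function
summary(reg_fit)$coefficients  # estimated intercept and slope
```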
The assumptions and diagnostic plots are quite similar, and in the next chapter we will see that the inference techniques look similar as well. People still like to distinguish among the different types of situations, but the underlying linear models are actually exactly the same…