6.2 Estimating the correlation coefficient

To quantify relationships between variables, we start with the correlation coefficient, a measure that is the same regardless of which variable you choose as explanatory and which as response. We measure the strength and direction of linear relationships between two quantitative variables using Pearson’s r, or Pearson’s Product Moment Correlation Coefficient. For those who really like acronyms, Wikipedia even suggests calling it the PPMCC. However, its use is so ubiquitous that the lower-case r or just “correlation coefficient” is often sufficient to identify that you have used the PPMCC. The extra distinctions arise because there are other ways of measuring correlations in other situations (for example, between two categorical variables), but we will not consider them here.

The correlation coefficient, r, is calculated as

\[r=\frac{1}{n-1}\sum^n_{i=1}\left(\frac{x_i-\bar{x}}{s_x}\right) \left(\frac{y_i-\bar{y}}{s_y}\right),\]

where \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\). This formula can also be written as

\[r=\frac{1}{n-1}\sum^n_{i=1}z_{x_i}z_{y_i}\]

where \(z_{x_i}\) is the z-score (observation minus mean, divided by standard deviation) for the \(i^{th}\) observation on \(x\) and \(z_{y_i}\) is the z-score for the \(i^{th}\) observation on \(y\). We won’t directly use this formula, but its contents inform the behavior of r. First, because it is a sum divided by (\(n-1\)), it is a bit like an average: it combines information across all observations and, like the mean, is sensitive to outliers. Second, it is a dimensionless measure, meaning that it has no units attached to it. It is based on z-scores, which are unit-free because each deviation from the mean is divided by the standard deviation of \(x\) or \(y\) in the same units, so the original units of measurement cancel out of the calculation. This also means that changing the original units of measurement, say from Fahrenheit to Celsius or from miles to km, for one or the other variable will have no impact on the correlation. Less obviously, the formula guarantees that r is between -1 and 1. It will attain -1 for a perfect negative linear relationship, 1 for a perfect positive linear relationship, and 0 for no linear relationship. We are being careful here to say linear relationship because you can have a strong nonlinear relationship with a correlation of 0. For example, consider Figure 2.101.
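To see the z-score form of the formula in action, here is a quick sketch in base R on some made-up numbers (the data are purely illustrative):

```r
# Quick check of the z-score form of r on made-up data
x <- c(1, 3, 4, 6, 8)
y <- c(2, 3, 6, 7, 9)
n <- length(x)
zx <- (x - mean(x)) / sd(x)  # z-scores for x
zy <- (y - mean(y)) / sd(y)  # z-scores for y
sum(zx * zy) / (n - 1)       # same value as...
cor(x, y)                    # ...the built-in calculation
```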

Figure 2.101: Scatterplot of an amusing (and strong) relationship that has \(r=0\).

There are some conditions for trusting the results that the correlation coefficient provides:

  1. Two quantitative variables measured.

    • This might seem silly, but categorical variables can be coded numerically and a meaningless correlation can be estimated if you are not careful what you correlate.
  2. The relationship between the variables is relatively linear.

    • If the relationship is nonlinear, the correlation is meaningless since it only measures linear relationships and can be misleading if applied to a nonlinear relationship.
  3. There should be no outliers.

    • The correlation is very sensitive (technically not resistant) to the impacts of certain types of outliers and you should generally avoid reporting the correlation when they are present.

    • One option in the presence of outliers is to report the correlation with and without outliers to see how they influence the estimated correlation, as sketched in the example after this list.
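As a small illustration of that sensitivity, here is a sketch with made-up data where a single outlier flips a strong positive correlation into a strong negative one:

```r
# Made-up data: five nearly linear points plus one outlier (last point)
x <- c(1, 2, 3, 4, 5, 20)
y <- c(1.1, 1.9, 3.2, 3.8, 5.1, -10)
cor(x, y)          # with the outlier: strongly negative
cor(x[-6], y[-6])  # without it: strongly positive
```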

The correlation coefficient is dimensionless, but larger magnitude values (closer to -1 or 1) mean stronger linear relationships. A rough interpretation scale based on experience working with correlations follows, but this varies between fields, types of research, and variables measured; it depends on the levels of correlation researchers become used to obtaining, so it can even vary within fields. Use this scale until you develop your own experience:

  • \(\left|\boldsymbol{r}\right|<0.3\): weak linear relationship,

  • \(0.3 < \left|\boldsymbol{r}\right|<0.7\): moderate linear relationship,

  • \(0.7 < \left|\boldsymbol{r}\right|<0.9\): strong linear relationship, and

  • \(0.9 < \left|\boldsymbol{r}\right|<1.0\): very strong linear relationship.

And again note that this scale only relates to the linear aspect of the relationship between the variables.

When we have linear relationships between two quantitative variables, \(x\) and \(y\), we can obtain estimated correlations from the cor function, either using y ~ x or by running the cor function[1] on the entire data set. When you run the cor function on a data set, it produces a correlation matrix, where you can triangulate the variables being correlated by the row and column names, noting that the correlation between a variable and itself is 1. A matrix of correlations is useful for comparing more than two variables, as discussed below.
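A sketch of both ways of calling cor, assuming the beer consumption and blood alcohol data are in a data frame called BB (that name is an assumption here) with variables Beers and BAC:

```r
library(mosaic)  # needed for the formula interface (see note 1)
cor(BAC ~ Beers, data = BB)  # single correlation using y ~ x
cor(BB)                      # correlation matrix for the whole data set
```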

## [1] 0.8943381
##           Beers       BAC
## Beers 1.0000000 0.8943381
## BAC   0.8943381 1.0000000

Using either version of the function, we find that the correlation between Beers and BAC is estimated to be 0.89. This suggests a strong linear relationship between the two variables. Examples are about the only way to build up enough experience to become skillful in using the correlation coefficient. Some additional complications arise in more complicated studies, as the next example demonstrates.

Gude et al. (2009) explored the relationship between average summer temperature (degrees F) and area burned (natural log of hectares[2] = log(hectares)) by wildfires in Montana from 1985 to 2007. The log transformation is often used to reduce the impacts of really large observations in strictly positive variables (more on transformations and their impacts on regression models in a later chapter). Based on your experiences with the wildfire “season” and before analyzing the data, I’m sure you would assume that summer temperature explains the area burned by wildfires. But could it be the other way around, with the fires themselves driving warmer summers? That second direction is unlikely on a state-wide scale but could apply at a particular weather station that is near a fire. There is a third option: some other variable is affecting both. For example, drier summers might be the real explanatory variable, related to having both warm summers and lots of fires. These variables are also measured over time, making them examples of time series, and if there are changes over time, they might be attributed to climate change. So there are really three relationships to explore with the variables measured here (remembering that the full story might require measuring even more!): log-area burned versus temperature, temperature versus year, and log-area burned versus year.

With more than two variables, we can use the cor function on all the variables and end up with a matrix of correlations or, simply, the correlation matrix. If you triangulate the row and column labels, that cell provides the correlation between that pair of variables. For example, in the first row (Year) and the last column (loghectares), you can find that the correlation coefficient is \(r = 0.362\). Note the symmetry in the matrix around the diagonal of 1’s; this further illustrates that the correlation between \(x\) and \(y\) does not depend on which variable is viewed as the “response”. The estimated correlation between Temperature and Year is -0.004 and the correlation between loghectares (log-hectares burned) and Temperature is 0.81. So Temperature shows almost no linear change over time, while there is a strong linear relationship between loghectares and Temperature. It appears that temperatures may be related to log-area burned, but the trend over time in each variable is less clear (at least the linear trends).
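A sketch of the call, assuming the fire data are stored in a data frame called mtfires (the name is an assumption) with variables Year, Temperature, and loghectares:

```r
cor(mtfires)  # correlation matrix of Year, Temperature, and loghectares
```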

##                   Year Temperature loghectares
## Year         1.0000000  -0.0037991   0.3617789
## Temperature -0.0037991   1.0000000   0.8135947
## loghectares  0.3617789   0.8135947   1.0000000

The correlation matrix alone is misleading: we need to explore scatterplots to check for nonlinear relationships, outliers, and clustering of observations that may be distorting the numerical measure of the linear relationship. The pairs.panels function from the psych package (Revelle 2019) combines the numerical correlation information and scatterplots in one display. There are some options we turn off for the moment, but it is an easy function to use to get lots of information in one place. As in the correlation matrix, you triangulate the variables for the pairwise relationship. The upper right panel of Figure 2.102 displays a correlation of 0.36 for Year and loghectares, and the lower left panel contains the scatterplot with Year on the x-axis and loghectares on the y-axis. The correlation between Year and Temperature is really small, both in magnitude and in the display, but the relationship appears to be nonlinear (it goes down between 1985 and 1995 and then goes back up), so the correlation coefficient doesn’t mean much here since it only measures the overall linear relationship. We might say that this is a moderate strength (moderately “clear”) curvilinear relationship. In terms of the underlying climate process, it suggests a decrease in summer temperatures between 1985 and 1995 and then an increase in the second half of the data set.
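A sketch of a call that could produce a display like Figure 2.102 (the exact options behind the figure are not shown in this text, so the arguments below simply turn off some default decorations):

```r
library(psych)
# Scatterplot matrix with correlations in the upper panels; the extra
# smoothers and ellipses are turned off to keep the display simple
pairs.panels(mtfires, smooth = FALSE, ellipses = FALSE, lm = FALSE)
```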

Figure 2.102: Scatterplot matrix of Montana fires data.

As one more example, the Australian Institute of Sport collected data on 102 male and 100 female athletes that are available in the ais data set from the alr3 package (Weisberg 2018, 2005). They measured a variety of variables, including the athlete’s Hematocrit (Hc, units of percentage of red blood cells in the blood), Body Fat Percentage (Bfat, units of percentage of total body weight), and height (Ht, units of cm). Eventually we might be interested in predicting Hc based on the other variables, but for now the associations are of interest.
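A sketch of code that could produce the summary, scatterplot matrix, and correlation matrix below; the subset name aisR is our choice here, not from the original source:

```r
library(alr3)   # provides the ais data set
library(psych)
data(ais)
aisR <- ais[, c("Ht", "Hc", "Bfat")]  # keep just the three variables
summary(aisR)
pairs.panels(aisR, ellipses = FALSE)  # Figure 2.103
cor(aisR)
```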

##        Ht              Hc             Bfat       
##  Min.   :148.9   Min.   :35.90   Min.   : 5.630  
##  1st Qu.:174.0   1st Qu.:40.60   1st Qu.: 8.545  
##  Median :179.7   Median :43.50   Median :11.650  
##  Mean   :180.1   Mean   :43.09   Mean   :13.507  
##  3rd Qu.:186.2   3rd Qu.:45.58   3rd Qu.:18.080  
##  Max.   :209.4   Max.   :59.70   Max.   :35.520

Figure 2.103: Scatterplot matrix of athlete data.

##              Ht         Hc       Bfat
## Ht    1.0000000  0.3711915 -0.1880217
## Hc    0.3711915  1.0000000 -0.5324491
## Bfat -0.1880217 -0.5324491  1.0000000

Ht (Height) and Hc (Hematocrit) have a moderate positive relationship that may contain a slight nonlinearity. It also contains one clear outlier for a middle-height athlete (around 175 cm) with an Hc close to 60% (a result that is extremely high). One might wonder whether this athlete has been doping or whether that measurement involved a recording error. We should consider removing that observation to see how the results change without it. For the relationship between Bfat (body fat) and Hc (hematocrit), that same high Hc value is a clear outlier. There is also a high Bfat (body fat) athlete (35%) with a somewhat low Hc value. This also might be influencing our impressions, so we will remove both “unusual” values and remake the plot. The two offending observations were found for individuals numbered 56 and 166 in the data set:
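Continuing with the (assumed) aisR subset from above, the two rows can be pulled out for inspection with:

```r
aisR[c(56, 166), ]  # inspect the two flagged observations
```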

## # A tibble: 2 x 3
##      Ht    Hc  Bfat
##   <dbl> <dbl> <dbl>
## 1  180.  37.6 35.5 
## 2  175.  59.7  9.56

We can create a reduced version of the data (aisR2) by removing those two rows using [-c(56, 166),] and then remake the plot:
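A minimal sketch, again using the aisR name assumed above:

```r
aisR2 <- aisR[-c(56, 166), ]           # drop rows 56 and 166
pairs.panels(aisR2, ellipses = FALSE)  # Figure 2.104
```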

Figure 2.104: Scatterplot matrix of athlete data with two potential outliers removed.

After removing these two unusual observations, the relationships between the variables are more obvious (Figure 2.104). There is a moderate strength, relatively linear relationship between Height and Hematocrit. There is almost no relationship between Height and Body Fat % \((\boldsymbol{r}=-0.20)\). There is a negative, moderate strength, somewhat curvilinear relationship between Hematocrit and Body Fat % \((\boldsymbol{r}=-0.54)\). As hematocrit increases initially, the body fat percentage decreases but at a certain level (around 45% for Hc), the body fat percentage seems to level off. Interestingly, it ended up that removing those two outliers had only minor impacts on the estimated correlations – this will not always be the case.

Sometimes we just want to focus on the correlations, assuming we trust that the correlation is a reasonable description of the relationship between the variables. To make it easier to see patterns of positive and negative correlations, we can employ a different version of the same display from the corrplot package (Wei and Simko 2017) with the corrplot.mixed function. In this case (Figure 2.105), it tells much the same story but also allows the viewer to easily distinguish both size and direction and read off the numerical correlations if desired.
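A sketch of the call using the default color scheme (the orange/black palette in Figure 2.105 reflects options not shown here):

```r
library(corrplot)
# corrplot.mixed takes a correlation matrix, not the raw data
corrplot.mixed(cor(aisR2), upper = "circle", lower = "number")
```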

Figure 2.105: Correlation plot of the athlete data with two potential outliers removed. Lighter (orange) circles for positive correlations and black for negative correlations.

References

Gude, Patricia H., J. Anthony Cookson, Mark C. Greenwood, and Mark Haggerty. 2009. “Homes in Wildfire-Prone Areas: An Empirical Analysis of Wildfire Suppression Costs and Climate Change.” www.headwaterseconomics.org.

Revelle, William. 2019. Psych: Procedures for Psychological, Psychometric, and Personality Research. https://CRAN.R-project.org/package=psych.

Wei, Taiyun, and Viliam Simko. 2017. Corrplot: Visualization of a Correlation Matrix. https://CRAN.R-project.org/package=corrplot.

Weisberg, Sanford. 2005. Applied Linear Regression, Third Edition. Hoboken, NJ: Wiley.

Weisberg, Sanford. 2018. Alr3: Data to Accompany Applied Linear Regression 3rd Edition. https://CRAN.R-project.org/package=alr3.


  1. This interface with the cor function only works after you load the mosaic package.

  2. The natural log (\(\log_e\) or \(\ln\)) is used so much in statistics that R’s log function takes the natural log by default; if you want \(\log_{10}\), you have to use the log10 function. When statisticians say log, we mean the natural log.