Correlation is a number that describes how strong of a relationship there is between two variables.

In digital analytics terms, you can use it to explore relationships between web metrics to see if an influence can be inferred, but be careful to not hastily jump to conclusions that do not account for other factors.

For instance, a high correlation between social shares and SEO position could mean:

It is, unfortunately, pretty common to see something like the first bullet used as the sole interpretation of a correlation, which is problematic for two reasons:

But, still, correlation can be very useful: identifying that a relationship exists can be a great place to start looking for the underlying drivers of that relationship which, ultimately, can lead to an insight than can drive an action!

Performing correlation analysis in R

The base function cor() will perform correlations on a data.frame. Let’s give this a go with some data. We’ve got two options here, so choose wisely, grasshopper! Option 1 is preferred, but Option 2 is a perfectly serviceable fallback.

Option 1: YOUR Data

To make this interesting, how about we use your data? If you have a Google account with access to Google Analytics, then follow the instructions to authenticate on the Analytics APIs page. Once you’ve done that, then the function below from the googleAnalyticsR package should give you the last 60 days of data for the viewID you enter:

start_date <- Sys.Date()-60
end_date <- Sys.Date()-1

web_data <- google_analytics_4([insert your view ID],
            date_range = c(start_date,end_date),
            metrics = c("sessions","pageviews","entrances","bounces"),
            dimensions = c("date","channelGrouping","deviceCategory"),
            anti_sample = TRUE)

head(web_data)

Structurally, the output should look very similar to the table shown in Option 2 below (note that you didn’t have to remove the first column!).

Option 2: Sample Data

If you don’t have a Google Analytics account or are not able to access it, then you can use the sample data set already downloaded in earlier exercises:

## used for kable that makes nice RMarkdown tables
library(knitr)
web_data <- read.csv("./data/gadata_example_2.csv", stringsAsFactors = FALSE)

## get rid of first colum as it's just row numbers
web_data <- web_data[, -1]

kable(head(web_data))
date channelGrouping deviceCategory sessions pageviews entrances bounces
2016-01-01 (Other) desktop 19 23 19 15
2016-01-01 (Other) mobile 112 162 112 82
2016-01-01 (Other) tablet 24 41 24 19
2016-01-01 Direct desktop 133 423 133 61
2016-01-01 Direct mobile 345 878 344 172
2016-01-01 Direct tablet 126 237 126 77

Quick Review: What all is going on with the simple code above?

  • web_data is an object (data frame) in this case that we’re creating
  • We’re using the stringsAsFactors parameter because…we usually just want to do that when reading in data.
  • kable() is a function that is really just used here so that the data displays a little more cleanly on this page. If you’re trying this simply within RStudio…you can omit it.
  • head() is a function that (by default) returns only the first six rows of the object.
  • web_data[, -1] is “the data.” the -1 is saying “don’t show the first column” (you could think of this as web_data[,-1], too)

Got it? Good. Let’s continue with the lesson at hand!

Let’s correlate!

Correlations will only work with numeric data, so we subset to just those columns and then run the base R function cor() to see a correlation table:

web_data_metrics <- web_data[,c("sessions","pageviews","entrances","bounces")]
## see correlation between all metrics
kable(cor(web_data_metrics))
sessions pageviews entrances bounces
sessions 1.0000000 0.8384321 0.9999923 0.9411201
pageviews 0.8384321 1.0000000 0.8377078 0.6364753
entrances 0.9999923 0.8377078 1.0000000 0.9416535
bounces 0.9411201 0.6364753 0.9416535 1.0000000

The table is mirrored in the diagonal and provides the correlation coefficient (aka, “r”) between each pair of values that intersect in the cell. 1 means a perfect correlation, 0 means no correlation and -1 means a perfect negative correlation.

Does the R that we’re working on learning today have anything to do with the correlation coefficient r? Well…no. Or, at least, only to the extent that you can use R-the-platform to calculate r-the-correlation-coefficient. Good question, though!

When working with correlations, its always a good idea to view an exploratory plot. A handy function for this is pairs() which creates a scatter plot of all the metrics passed in combination:

pairs(web_data_metrics)

Here you can see the correlation numbers in graphical form. For instance, the high correlation of 0.9999923 between sessions and entrances results in an almost perfect straight line. Since a session starts with an entrance, this makes perfect sense! A correlation of less than 1 may be a quick diagnostic that something is wrong with the tracking.

How do web channels correlate?

One useful piece of analysis is seeing how web channels possibly interact.

Data Prep

To get the data in the right format, the below code pivots via the reshape2 package:

## Use tidyverse to pivot the data
library(dplyr)
library(tidyr)

## Get only desktop rows, and the date, channelGrouping and sessions columns
pivoted <- web_data %>% 
  filter(deviceCategory == "desktop") %>% 
  select(date, channelGrouping, sessions) %>%
  spread(channelGrouping, sessions)

## Get rid of any NA's and replace with 0
pivoted[is.na(pivoted)] <- 0

kable(head(pivoted))
date (Other) Direct Display Email Organic Search Paid Search Referral Social Video
2016-01-01 19 133 307 17 431 555 131 68 0
2016-01-02 156 1003 196 43 1077 1060 226 158 3
2016-01-03 35 1470 235 29 696 489 179 66 90
2016-01-04 31 1794 321 70 1075 558 235 46 898
2016-01-05 27 1899 309 74 1004 478 218 47 461
2016-01-06 21 1972 204 299 974 494 246 47 418

Take a minute to examine what the pivoted data looks like? Is it tidy data? Not exactly. But, that’s good! In one since, we’ve got separate “metrics” for each day now – the channelGrouping-sessions combination.

Examining the Data

We can plot and correlate all the metrics for an overview. Because we don’t want to do exactly the same thing as we did earlier (where’s the fun in that?!), let’s go ahead and round the correlation coefficients to two decimal places using the round() function. Other than that, we’ll do exactly what we already did when we were simply correlating the metrics in our data set:

## can't include the date as its not numeric, so remove
cor_data <- pivoted[, -1]
## not including first column, so -1 subset
cor_table <- round(cor(cor_data),2)
kable(cor_table)
(Other) Direct Display Email Organic Search Paid Search Referral Social Video
(Other) 1.00 0.13 0.12 0.09 0.17 0.03 -0.04 0.46 0.37
Direct 0.13 1.00 0.04 0.07 0.22 -0.02 0.09 -0.05 0.01
Display 0.12 0.04 1.00 0.01 -0.05 0.05 -0.10 0.19 0.27
Email 0.09 0.07 0.01 1.00 0.26 0.03 0.13 0.03 0.13
Organic Search 0.17 0.22 -0.05 0.26 1.00 0.51 0.19 -0.04 0.08
Paid Search 0.03 -0.02 0.05 0.03 0.51 1.00 0.08 -0.01 0.22
Referral -0.04 0.09 -0.10 0.13 0.19 0.08 1.00 -0.11 -0.25
Social 0.46 -0.05 0.19 0.03 -0.04 -0.01 -0.11 1.00 0.50
Video 0.37 0.01 0.27 0.13 0.08 0.22 -0.25 0.50 1.00
pairs(cor_data)

Analysis

Now, when we compare channels, we see much looser correlations for this dataset, which makes sense, right? Correlations under 0.3 are, as a rule-of-thumb, not worth considering, so the standouts look to be Social vs. Video* and Paid** vs. Organic Search.

Plotting those channels, we can examine the trends to see the shape of the data

Correlation has help us zero in on possibly interesting relationships

library(ggplot2)
gg <- ggplot(data = pivoted) + 
      theme_minimal() + 
      ggtitle("Paid (blue) vs Organic (green) search")
gg <- gg + 
      geom_line(aes(x = as.Date(date), y = `Paid Search`), col = "blue")

gg + geom_line(aes(x = as.Date(date), y = `Organic Search`), col = "green")

We can see here the trends do look similar, but with a paid search peak in the first quarter (as we look at this, we might want to consider these spikes as outliers – either simply by the visual or using a more defined method for detecting outliers; it would be quite simple to remove this data from the data set and run the analysis again…but we’re not going to go down that particular rabbit hole right now).

library(ggplot2)
gg <- ggplot(data = pivoted) + 
              theme_minimal() + 
              ggtitle("Social (red) vs Video (orange)")
gg <- gg + 
      geom_line(aes(x = as.Date(date), y = Social), col = "red")
gg + geom_line(aes(x = as.Date(date), y = Video), col = "orange")

Here, a peak in Social late in the year looks to have coincided with a peak in Video: possibly a campaign driving video views?

Cross correlation

The correlations, above all, compare the same date point, but what if you expect a lagged effect? Perhaps the video traffic drove social traffic later on due to client advocacy?

Cross correlations are useful when dealing with time-series data and can examine if a metric has an influence on itself or another after some time has passed.

This can be a powerful way to find if, say, a TV or Display campaign increased SEO traffic over the course of the few weeks following the campaign.

The below compares paid search on SEO using the ccf() function. The result is the correlation for different lags of days. We can see a correlation at 0 lag at around 0.5, but the correlation increases if you lag the Social trend up to 10 days before.

ccf(pivoted$Social, pivoted$Video)

You could then conclude that Video was having a lagged effect on Social traffic up to 10 days beforehand. But, beware! The nature of cross-correlation is that if both datasets have a similar looking spike, cross-correlation will highlight it. Careful examination of the raw data trends should be performed to verify it. In some cases a smoother line will help get rid of spikes that affect the data (e.g., do the analysis on weekly or monthly data instead of daily). And, of course, there is no substitue for rational thought: if you find a relationship like this, can you explain it rationally? And, if so, can you conduct further analysis to validate that rationalization? (If only R had an Easy Button to do the actual thinking, too. Alas!)