Exercise 4 – Tidy / dplyr()

This exercise is dependent on the successful completion of Exercise 2.

Our web_data data is already in a tidy format. When data is not in a tidy format when you get it, the tidyr package has functions to help make it that way. Some day, perhaps we’ll have an exercise to do this!

For this exercise, complete the following tasks with web_data:

Calculate the average pageviews per day by channel using the summarise() function in the dplyr package
Add two columns to web_data using the mutate() function: pvs_per_session and bounce_rate
Calculate the average bounce rate by deviceCategory (be careful with the order of your calculations!).

Start by loading the dplyr package:

library(dplyr)

Now, dive into the exercises!

1. Average Pageviews

To get the averages for each combination of deviceCategory and channelGrouping we simply use summarise() with group_by:

summarise(group_by(web_data, deviceCategory, channelGrouping),mean(pageviews))

## Source: local data frame [27 x 3]
## Groups: deviceCategory [?]
## 
##    deviceCategory channelGrouping `mean(pageviews)`
##             <chr>           <chr>             <dbl>
## 1         desktop         (Other)          108.4413
## 2         desktop          Direct         2220.2394
## 3         desktop         Display          576.0329
## 4         desktop           Email          246.8169
## 5         desktop  Organic Search         2902.8826
## 6         desktop     Paid Search         1927.7136
## 7         desktop        Referral         1401.0094
## 8         desktop          Social          368.0798
## 9         desktop           Video          584.2358
## 10         mobile         (Other)          193.7981
## # ... with 17 more rows

This function uses the group_by function to take the data set (web_data) and specify that it should be grouped together by two fields: deviceCategory and channelGrouping. That, in and of itself, isn’t enough. If we’re going to group by those, then we need to also specify how we want to handle the multiple rows that get collapsed with that grouping.

For that, we put the group_by() function inside a summarise() function and tell the summarise() function that we want to include pageviews as a metric, and we want to collapse the grouped rows by averaging (mean()) pageviews.

We could include multiple metrics in the summarise() function. For instance, if we wanted to t total visits and the average pageviews, we could alter the code slightly:

summarise(group_by(web_data, deviceCategory, channelGrouping), mean(pageviews),sum(sessions))

## Source: local data frame [27 x 4]
## Groups: deviceCategory [?]
## 
##    deviceCategory channelGrouping `mean(pageviews)` `sum(sessions)`
##             <chr>           <chr>             <dbl>           <int>
## 1         desktop         (Other)          108.4413           10820
## 2         desktop          Direct         2220.2394          297580
## 3         desktop         Display          576.0329           88688
## 4         desktop           Email          246.8169           22350
## 5         desktop  Organic Search         2902.8826          156144
## 6         desktop     Paid Search         1927.7136          154535
## 7         desktop        Referral         1401.0094           75253
## 8         desktop          Social          368.0798           22434
## 9         desktop           Video          584.2358           92885
## 10         mobile         (Other)          193.7981           29115
## # ... with 17 more rows

2. Add Two Columns

In this case, we’re actually going to modify the web_data object by adding a couple of calculated columns. This is, really, just like working with an Excel Table and adding columns that are based on existing columns in the table. The mutate() function takes a data set and then adds new columns as specified in the remaining parameters:

web_data <- mutate(web_data, pvs_per_session = pageviews / sessions, 
                  bounce_rate = bounces / entrances)

# Display the results: truncated here so that it fits nicely on the screen
head(web_data[c("date","channelGrouping","deviceCategory",
               "pvs_per_session","bounce_rate")])

##         date channelGrouping deviceCategory pvs_per_session bounce_rate
## 1 2016-01-01         (Other)        desktop        1.210526   0.7894737
## 2 2016-01-01         (Other)         mobile        1.446429   0.7321429
## 3 2016-01-01         (Other)         tablet        1.708333   0.7916667
## 4 2016-01-01          Direct        desktop        3.180451   0.4586466
## 5 2016-01-01          Direct         mobile        2.544928   0.5000000
## 6 2016-01-01          Direct         tablet        1.880952   0.6111111

3. Bounce Rate

This is a little bit of a trick question. Since we’ve already calculated the bounce rate and added that as a new column in our web_data data frame, we may be tempted to simply average those values:

summarise(group_by(web_data,deviceCategory), mean(bounce_rate))

## # A tibble: 3 × 2
##   deviceCategory `mean(bounce_rate)`
##            <chr>               <dbl>
## 1        desktop           0.5614256
## 2         mobile           0.6041581
## 3         tablet           0.5981690

This would be incorrect, though, wouldn’t it? This would be the daily average bounce rate, but we want just the average bounce rate. Note that R is not going to point this out. We still have to use our analytical brains.

What we need to do is to first summarise the data and sum the metrics we need to use for the bounce rate calculation and then perform the bounce rate calculation. This is where the pipe really starts to come in handy for code readability:

summarise(group_by(web_data,deviceCategory), 
          entrances = sum(entrances), 
          bounces = sum(bounces)) %>%
  mutate(bounce_rate = bounces / entrances)

## # A tibble: 3 × 4
##   deviceCategory entrances bounces bounce_rate
##            <chr>     <int>   <int>       <dbl>
## 1        desktop    914621  565252   0.6180177
## 2         mobile    789386  479515   0.6074531
## 3         tablet    204921  122656   0.5985526

Note how the results from the second calculation differ from those in the first calculation.

Also, a note on why we included “entrances =” in the summarise() function. If we hadn’t, then the column heading would have been sum(entrances) and sum(bounces). That would have been fine, but we then would have needed to change the mutate() function:

mutate(bounce_Rate = sum(bounces) / sum(entrances))

Either would work, but it’s sometimes easier to clean up names as we go along. If anything, this should be reminiscent of working with pivot tables in Excel: if we added sessions as a value summarised by summing it, then the value would show as SUM of sessions in the pivot table. Right?