--- title: "Multilevel Modeling in R with NHL Power Play Data" author: "Ekarin Pongpipat & Matthew J. Kmiecik" date: "11 February 2018" output: html_document: toc: yes toc_depth: 4 toc_float: collapsed: yes smooth_scroll: yes --- ```{r setup, include=F} knitr::opts_chunk\$set(echo = TRUE, fig.align = 'center') options(knitr.table.format = 'html') # For the html tables ``` ## Introduction
In a [previous post](https://mattkmiecik.com/post-The-Last-Decade-NHLs-Best-and-Worst-Power-Play-Teams.html), I was interested in determining if there were any fluctuations in the frequency of powerplay goals (5 on 4) in the NHL across one decade (2007-2017). The statistial technique of choice was linear regression. However, there were a few issues with this statistical modeling procedure: 1. This model did not account for the variability within each team. In other words, the model did not appreciate each team's change in powerplay goals across time individually. When we do not account for this variability (within-team variance), the unsystematic error is seemingly reduced, and subsequently an increase in our test statistic, decrease in p-value, and ultimately an increase in Type I error rate 2. Time was treated as a factor (i.e., categorical variable). However, time is actually a continuous construct and may reveal more informaton when treated as such > I had a feeling something was off here. So, I turned to one of my fellow classmates, Ekarin Pongpipat, to teach me a technique that could remedy this situation. The following is what we produced after two Saturday afternoons! Thanks again Ekarin! To improve this model of NHL 5v4 powerplay goals, we utilized a multi-level linear regression model. The purpose of this post is three-fold: 1. Create a more accurate model of NHL powerplay goal fluctuations across the 2007-2017 decade 2. Explore these data with plenty of plots 3. Provide a tutorial of multi-level linear modeling in R ## Setup
Let's first load some useful packages for analysis and plotting: ```{r packages, message=F} library(tidyverse); library(RCurl); library(broom); library(gridExtra); library(knitr); library(kableExtra); library(RColorBrewer) ``` These data were retrieved via [Corsica](http://corsica.hockey/) shortly after the end of the 2017 regular season and are also available on this [website's Github repository](https://github.com/mkmiecik14/mkmiecik14.github.io). Feel free to pull these data from the repository like this: ```{r data} link <- 'https://raw.githubusercontent.com/mkmiecik14/mkmiecik14.github.io/master/data/corsicaData_5v4.csv' data5v4 <- read.table(text = getURL(link), sep = ',', header = T) ``` The Atlanta Thrashers moved to Winnipeg (Jets) after the 2011 season. Therefore, we combined the entries of the Atlanta Thrashers and the Winnipeg Jets. **The Thrashers and the Jets are considered the same team in this analysis.** ```{r cleanup} # Combining ATL and WPG as ATL.WPG data5v4\$Team <- plyr::revalue(data5v4\$Team, c(ATL = 'ATL.WPG', WPG = 'ATL.WPG')) ``` For the regression models, it is more desirable to have the year be represented as a continuous variable. This way an increase of 1 year is the equivalent of 1 year of hockey played. This will translate to more interpretable regression coefficents. Year was recoded as the starting year of the season (e.g., 20162017 = 2016): ```{r year} data5v4 <- data5v4 %>% separate(Season, c('StartYear', 'EndYear'), sep = 4, remove = F) %>% mutate_at(c('StartYear', 'EndYear'), funs(as.numeric)) ``` Next, we coded each team into its division following the 2013-2014 realignment scheme. **Note: This has serious implications for our later analysis of conference and divisions. This realignment took place 6 years from the first season in this analysis. Therefore, identifying the teams using this realignment scheme is used for illustrative purposes.** ```{r div} # Initializing variables to store team names pacific <- c('S.J', 'CGY', 'L.A', 'ANA', 'EDM', 'VAN', 'ARI') central <- c('ATL.WPG', 'NSH', 'STL', 'DAL', 'COL', 'MIN', 'CHI') metropolitan <- c('WSH', 'N.J', 'PHI', 'CBJ', 'PIT', 'NYR', 'NYI', 'CAR') atlantic <- c('T.B', 'BOS', 'TOR', 'DET', 'MTL', 'FLA', 'OTT', 'BUF') # Creating a column to identify each team's division data5v4 <- data5v4 %>% mutate(division = case_when(data5v4\$Team %in% pacific ~ 'Pacific', data5v4\$Team %in% central ~ 'Central', data5v4\$Team %in% metropolitan ~ 'Metropolitan', data5v4\$Team %in% atlantic ~ 'Atlantic')) ``` Here is a table detailing the 2013 realignment: ```{r divTable} # Creates the data frame for html table divisions <- data.frame(Pacific = c(pacific, ''), Central = c(central, ''), Metropolitan = metropolitan, Atlantic = atlantic) # Prints table kable(divisions) %>% kable_styling(bootstrap_options = c('striped', 'hover', 'responsive')) ``` ## Data Exploration
Before analysis, let's first explore these data via various plots. One question we are interested in is, "Are the amount of standard 5 vs. 4 regular season power play goals changing over time across 10 years (2007-2017) in the NHL?" ```{r vis1} ggplot(data5v4, aes(StartYear, GF, group = 1)) + geom_line(aes(group = Team, color = division), alpha = 2/3) + geom_line(stat = 'summary', fun.y = 'mean', size = .9, color = 'red') + scale_color_brewer(palette = 'Blues', direction = -1) + labs(x = 'Season (Start Year)', y = 'Total Season Power Play Goals (5v4)', caption = 'Each team is a separate line; red line is the average') + guides(color = guide_legend(title = 'Division')) + scale_x_continuous(limits = c(2007, 2016), breaks = c(2007:2016)) + theme_minimal() ``` The plot above demonstrates a dramatic decrease in power play goals, league-wide, during the 2012-2013 season. This is clearly a result of the shortened season, which would greatly reduce the amount of powerplay goal attempts, due to the 2012-2013 NHL lockout. Let's control for this shortended season by dividing the powerplay goals (GF) by the amount of games played (GP) per season: ```{r vis2} ggplot(data5v4, aes(StartYear, GF/GP, group = 1)) + geom_line(aes(group = Team, color = division), alpha = 2/3) + geom_line(stat = 'summary', fun.y = 'mean', size = .9, color = 'red') + scale_color_brewer(palette = 'Blues', direction = -1) + labs(x = 'Season (Start Year)', y = 'Total Season Power Play Goals/Games Played (5v4)', caption = 'Each team is a separate line; red line is the average') + guides(color = guide_legend(title = 'Division')) + scale_x_continuous(limits = c(2007, 2016), breaks = c(2007:2016)) + theme_minimal() ``` Now that we've controlled for the number of games played per season, we can see that there seems to be a decrease in the average powerplay goals (red line) scored per season for each subsequent year of play. Further ways to plot these data include a boxplot and an average trend line: ```{r vis3} yaxlim <- c(.25,1) # Locks y-axis # Box plot boxplot <- ggplot(data5v4, aes(factor(StartYear), GF/GP, group = StartYear)) + geom_boxplot() + labs(x = 'Season (Start Year)', y = 'Total Season Power Play Goals/Games Played (5v4)', caption = ' ') + coord_cartesian(ylim = yaxlim) + theme_minimal() # Summary for line plot data5v4_sum <- data5v4 %>% group_by(StartYear) %>% summarise(mean = mean(GF/GP), sd = mean(GF/GP), n = n(), sem = sd/sqrt(n)) # Line plot lineplot <- ggplot(data5v4_sum, aes(factor(StartYear), mean, group = 1)) + geom_line(alpha = 1/3) + geom_pointrange(aes(ymax = mean + sem, ymin = mean - sem)) + coord_cartesian(ylim = yaxlim) + labs(x = 'Season (Start Year)', caption = 'Error bars are SEM') + theme_minimal() + theme(axis.title.y = element_blank()) # Arranges plots side-by-side grid.arrange(boxplot, lineplot, ncol = 2) ``` ## Multi-level Linear Modeling