--- title: "Mixed Effects Models Case Study" output: html_notebook editor_options: chunk_output_type: inline --- ```{r} suppressPackageStartupMessages(library(tidyverse)) suppressPackageStartupMessages(library(lme4)) suppressPackageStartupMessages(library(knitr)) suppressPackageStartupMessages(library(broom)) suppressPackageStartupMessages(library(Lahman)) ``` We will use the Baseball dataset `Teams` available in the `Lahman` package in R, with teams having at least 5 years on record. ```{r} dat <- Teams %>% count(teamID) %>% right_join(Teams, by = "teamID") %>% filter(n >= 5) %>% select(team = teamID, runs = R, hits = H, walks = BB) %>% mutate(hits = scale(hits), walks = scale(walks)) head(dat) ``` We will investigate how hits and walks affects total number of runs (data are scaled to improve convergence of model-fitting). Here is a plot on the effect of walks on runs: ```{r, warning=FALSE, fig.width=10} ggplot(dat, aes(walks, runs)) + facet_wrap(~team, ncol=7) + geom_point(alpha=0.5) + geom_smooth(method="lm", se=FALSE, fullrange=TRUE, size=1) + theme_bw() ``` ### Two extremes: pooling and separating Use a linear model to determine the effect of `hits` and `walks` on `runs`, on an average `team`. Best to allow each team to have its own intercept. Note the standard error on `hits` and `walks`. ```{r} lm(runs ~ walks + hits + team, data = dat) %>% tidy() %>% head() ``` How would you determine the effect of `hits` and `walks` _for each individual team_? Note the standard error on `hits` on a representative team. ```{r} lm(runs ~ walks*team + hits*team, data = dat) %>% tidy() %>% filter(str_detect(term, "hits")) ``` Which model yields a smaller SE? "Pooled" model: artificially small, since data are actually correlated. "Separate" model: artificially large, since we're only using data within a particular group. ### An intermediate: mixed effects models Fit a mixed effects model with random effects on both `hits` and `walks`. Note the SE's are _in between_ the "pooled" and "separate" linear models. ```{r} fit_lme <- lmer(runs ~ walks + hits + (walks + hits | team), data = dat) tidy(fit_lme) ``` Obtain estimates on individual teams: ```{r} coef(fit_lme)[[1]] ``` Extract the covariance matrix of the fixed effects: ```{r} vcov(fit_lme) ``` Extract the covariance matrix of the random effects: ```{r} summary(fit_lme)$varcor[[1]] ``` ### Questions 1. Order the models in terms of increasing bias (of regression coefficients) separate model -> LME -> pooled model 2. Order the models in terms of increasing (reported) variance (of regression coefficients). pooled model -> LME -> separate model 3. What is the effect of `hits` on `runs` for team ARI, using the LME? ```{r} coef(fit_lme)[[1]] %>% rownames_to_column("team") %>% filter(team == "ARI") %>% .[["hits"]] ``` 4. What is the exact distribution of effects (slopes) of `hits` on `runs`? > Normal with a mean equal to the fixed effect: ```{r, warning=FALSE} fit_lme %>% tidy() %>% filter(term == "hits") %>% .[["estimate"]] ``` > and a variance equal to the squared sd of the random effect: ```{r, warning=FALSE} fit_lme %>% tidy() %>% filter(term == "sd_hits.team") %>% .[["estimate"]] %>% .^2 ``` 5. What is the SE of the fixed effect of `hits`? ```{r, warning=FALSE} fit_lme %>% tidy() %>% filter(term == "hits") %>% .[["std.error"]] ``` 6. A new team has an average number of hits and walks. Their expected number of runs has what exact distribution? > Since we centered both predictors, this is the same as just taking the intercept (both fixed and random effects). This means that this expectation follows a Normal distribution with the following mean: ```{r, warning=FALSE} fit_lme %>% tidy() %>% filter(term == "(Intercept)") %>% .[["estimate"]] ``` > and the following variance: ```{r, warning=FALSE} fit_lme %>% tidy() %>% filter(term == "sd_(Intercept).team") %>% .[["estimate"]] %>% .^2 ```