Understanding (almost) everything as a Linear Model

The ease of reporting offered by automated tools has to be balanced against a thorough understanding of the models themselves.

Dependencies

Code
library(faux)
library(lme4)
library(ggplot2)
library(dplyr)
library(tidyr)
library(scales)
library(modelbased)
library(see)
library(lavaan)
library(semPlot)

Wilkinson notation

Simple regression

~ regresses the variable on its left onto the variable(s) on its right. E.g., y ~ x means “x causes/predicts/is associated with y”.

Dependent variables (the thing caused or predicted) are to the left of ~ whereas independent variables are to the right.

Intercepts are specified as 1. E.g., y ~ 1 + x. Note that intercepts are often implicit, and are usually calculated even if not specified. i.e., y ~ x == y ~ 1 + x

Understand regression beta coefficients by going back to your high school math for the slope of a line: \(y = m \times x + c\). Rewriting with the intercept first, relabeling the coefficients, and adding an error term gives the regression equation: \(y = \beta_0 + \beta_1 x + e\), where \(\beta_0\) is the intercept (c) and \(\beta_1\) is the slope (m).
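A minimal sketch of the implicit intercept (the data and variable names here are made up for illustration):

```r
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)

# identical fits: the intercept is included whether or not it is written
m1 <- lm(y ~ x)
m2 <- lm(y ~ 1 + x)
all.equal(coef(m1), coef(m2))  # TRUE

# the fitted coefficients map onto y = m*x + c:
coef(m1)["(Intercept)"]  # c, the intercept
coef(m1)["x"]            # m, the slope
```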

Multilevel models: Fixed vs random effects

When categorical, fixed effects variables are typically “exhaustive”, in that the levels present in the data are all the possible options (they ‘exhaust’ the possible options). In contrast, random effects variables are “non-exhaustive”, in that other levels might exist in the world. For example, I would be more likely to use ‘journal’ as a random effect variable because I only have 5 journals per subfield in my data, but many other journals exist in these subfields in the real world.

Additionally, random effects can be used to acknowledge dependencies in data. For example, if participants’ positive affect was measured 5 times a day for 1 week, you would have 35 data points per participant. Simple regressions are fixed effects only and assume independence of the data. Mixed effects models allow you to include random effects to acknowledge that the same participant produced these 35 data points.

Separately, fixed effects assume that there is a single true value for each parameter in the population, whereas random effects assume that there is a distribution of true parameter values in the population. Random effects estimate the hyperparameters of those distributions (e.g., their mean and standard deviation).
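The repeated-measures example above can be sketched with {lme4}’s random-intercept syntax (the simulated data and variable names are assumptions for illustration):

```r
library(lme4)

set.seed(123)
# hypothetical data: 20 participants, each measured 35 times (5x/day for a week)
n_participants <- 20
n_obs <- 35
dat_affect <- data.frame(
  participant = factor(rep(seq_len(n_participants), each = n_obs)),
  occasion    = rep(seq_len(n_obs), times = n_participants)
)
# each participant gets their own true intercept, drawn from a distribution
participant_intercepts <- rnorm(n_participants, mean = 5, sd = 1)
dat_affect$affect <- participant_intercepts[dat_affect$participant] +
  0.02 * dat_affect$occasion + rnorm(nrow(dat_affect), sd = 0.5)

# (1 | participant) adds a random intercept per participant, acknowledging
# that each participant contributed 35 non-independent observations
fit <- lmer(affect ~ 1 + occasion + (1 | participant), data = dat_affect)
summary(fit)
```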

Packages like {lme4} and {brms} are excellent for fitting multilevel models. Equally importantly, packages such as those in the {easystats} universe, e.g., {modelbased}, {parameters}, and {see}, are very useful for extracting, interpreting, and plotting results from multilevel models, as is the {marginaleffects} package.

Latent variable modelling/CFA/SEM

Packages like {lavaan} are excellent for fitting Confirmatory Factor Analyses, Path Analyses, Structural Equation Models, and also simple regressions.

  • ~ to specify regressions, as in other Wilkinson notation.
  • =~ to specify measurement models, i.e., unobserved latent variables defined by observed variables.
  • ~~ to specify correlated variances, e.g., for items in a measurement model that are known to be correlated or to acknowledge non-independence in data (e.g., between timepoints)
Code
model <- '
         # measurement model
         latent_x =~ x1 + x2 + x3   # Latent x is measured by observed variables x1, x2, and x3
         latent_m =~ m1 + m2 + m3 + m4 + m5 # Latent m is measured by observed variables m1 to m5
         latent_y =~ y1 + y2 + y3 + y4  # Latent y is measured by observed variables y1, y2, y3 and y4
         
         # correlated variances (ie without specifying causality)
         x1 ~~ x2  # Residuals of x1 and x2 are allowed to correlate
         
         # structural model: specify regressions
         latent_m ~ latent_x        
         latent_y ~ latent_x + latent_m  
         '
Code
# give the population model explicit parameter values, so that the simulated
# data actually contain the hypothesised effects and the model can be estimated
pop_model <- '
             latent_x =~ 0.8*x1 + 0.7*x2 + 0.6*x3
             latent_m =~ 0.8*m1 + 0.7*m2 + 0.7*m3 + 0.6*m4 + 0.6*m5
             latent_y =~ 0.8*y1 + 0.7*y2 + 0.7*y3 + 0.6*y4
             x1 ~~ 0.2*x2
             latent_m ~ 0.5*latent_x
             latent_y ~ 0.3*latent_x + 0.5*latent_m
             '

dat <- lavaan::simulateData(model = pop_model, sample.nobs = 500)
Code
res <- sem(model = model, data = dat)
Code
semPaths(res, 
         whatLabels = "diagram", 
         edge.label.cex = 1.2, 
         sizeMan = 5)

More advanced lavaan: := is used to define user-specified parameters (e.g., indirect effects) from labeled model parameters.

Code
# specify mediation model which extracts custom parameters of interest
model <-  '
          M ~ a*X
          Y ~ b*M + c*X
          indirect := a*b
          direct := c
          total := c + (a*b)
          proportion_mediated := indirect / total
          '
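A minimal end-to-end sketch of fitting this mediation model (the simulated data and population values are assumptions for illustration):

```r
library(lavaan)

set.seed(42)
# simulate data with a known indirect effect (population values are assumptions)
pop <- '
       M ~ 0.5*X
       Y ~ 0.4*M + 0.2*X
       '
dat_med <- simulateData(model = pop, sample.nobs = 500)

model <- '
         M ~ a*X
         Y ~ b*M + c*X
         indirect := a*b
         direct := c
         total := c + (a*b)
         proportion_mediated := indirect / total
         '
fit_med <- sem(model = model, data = dat_med)

# the defined parameters appear alongside the estimated ones, as op == ":="
subset(parameterEstimates(fit_med), op == ":=")
```

The rows with op == ":=" in parameterEstimates() carry point estimates, standard errors, and confidence intervals for the user-defined parameters.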

Practicing Wilkinson notation

Generate data

The code below generates some data; understanding it is not the point of the lesson.

Code
set.seed(42)

n_articles_per_journal <- 15
n_journals_per_subfield <- 5
n_subfields <- 6
total_n <- n_articles_per_journal * n_journals_per_subfield * n_subfields

# adjust the error term to ensure the variance of y is 1
beta_year <- 0.5
error_sd <- sqrt(1 - beta_year^2)  # adjusted to maintain total variance of 1

dat_fe <- tibble(year = sample(0:9, size = total_n, replace = TRUE)) |>
  mutate(count_selfreport_measures = beta_year*year + rnorm(n = total_n, mean = 3, sd = error_sd)) 

dat <- 
  bind_cols(
    dat_fe,
    add_random(subfield = 6) |>
      add_random(journal = 5, .nested_in = "subfield") |>
      add_random(article = 15, .nested_in = "journal")
  ) 

Fixed effects only model

Code
res <- lm(formula = count_selfreport_measures ~ 1 + year,
          data = dat)

summary(res)

Call:
lm(formula = count_selfreport_measures ~ 1 + year, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.53835 -0.54512 -0.00374  0.59542  2.88762 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  3.02020    0.07511   40.21 <0.0000000000000002 ***
year         0.48409    0.01349   35.88 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8523 on 448 degrees of freedom
Multiple R-squared:  0.7419,    Adjusted R-squared:  0.7413 
F-statistic:  1288 on 1 and 448 DF,  p-value: < 0.00000000000000022

Model-predicted values across year

Code
predictions <- estimate_expectation(res)

# plot(predictions) +
#   theme_linedraw()

Multi-level model

Code
res <- lmer(formula = count_selfreport_measures ~ 1 + year + (1 | subfield/journal),
            data = dat)
boundary (singular) fit: see help('isSingular')
Code
summary(res)
Linear mixed model fit by REML ['lmerMod']
Formula: count_selfreport_measures ~ 1 + year + (1 | subfield/journal)
   Data: dat

REML criterion at convergence: 1139

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.8853 -0.6325 -0.0030  0.6870  3.2220 

Random effects:
 Groups           Name        Variance Std.Dev.
 journal:subfield (Intercept) 0.02859  0.1691  
 subfield         (Intercept) 0.00000  0.0000  
 Residual                     0.69876  0.8359  
Number of obs: 450, groups:  journal:subfield, 30; subfield, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)  3.01578    0.08065   37.39
year         0.48503    0.01344   36.08

Correlation of Fixed Effects:
     (Intr)
year -0.784
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')

Model-predicted values across year

Code
predictions <- estimate_expectation(res, at = c("subfield", "journal"))

# plot(predictions) +
#   theme_linedraw()

Facet for each level of the random intercept

Code
ggplot(predictions, aes(x = year, y = Predicted)) +
  geom_line() +
  labs(
    title = "Predicted Self-Report Measures by Subfield and Journal",
    x = "Year",
    y = "Predicted Count of Self-Report Measures"
  ) +
  scale_x_continuous(breaks = scales::breaks_pretty(n = 10)) +
  theme_linedraw() +
  facet_wrap(subfield ~ journal)