The ease of automated reporting has to be balanced against a thorough understanding of the models themselves.
- linear model
- (almost) everything is a linear model
  - even means and SDs
  - even correlations
  - even named tests
  - even CFA and path analysis
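As a quick illustration of the "even correlations" point: a Pearson correlation is the slope of a regression fitted to standardized variables. A minimal sketch on simulated data (the numbers are illustrative only):

```r
set.seed(42)
x <- rnorm(100)
y <- 0.6 * x + rnorm(100)

cor(x, y)                                    # Pearson correlation
coef(lm(scale(y) ~ scale(x)))[["scale(x)"]]  # standardized regression slope: same value
```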
Notes on what to add from last year:

- it's all a linear model - resources and code
- slides on relating y = mx + c to regression
- the estimate for an intercept-only model is the mean; that's what a mean is. The error term is the standard deviation; that's what an SD is.
- meta-analyses are intercept-only models with fixed or random effects for site; add code for this.
- use meta-analysis as the entry point for random-effects models and the Wilkinson notation for it.
- sum scores are a specific factor model with equal loadings.
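The intercept-only point above can be sketched in a few lines of R (simulated data; the exact numbers are illustrative):

```r
# An intercept-only model: the estimate is the mean, the residual SD is the SD
set.seed(123)
y <- rnorm(100, mean = 5, sd = 2)

fit <- lm(y ~ 1)
coef(fit)[["(Intercept)"]]  # identical to mean(y)
sigma(fit)                  # identical to sd(y)
```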
Dependencies
Code

library(faux)
library(lme4)
library(ggplot2)
library(dplyr)
library(tidyr)
library(scales)
library(modelbased)
library(see)
library(lavaan)
library(semPlot)
Wilkinson notation
Simple regression
~ regresses the variable on its left onto the variable(s) on its right. E.g., y ~ x is "x causes/predicts/is associated with y".
Dependent variables (the thing caused or predicted) are to the left of ~ whereas independent variables are to the right.
Intercepts are specified as 1. E.g., y ~ 1 + x. Note that intercepts are often implicit, and are usually calculated even if not specified. i.e., y ~ x == y ~ 1 + x
Understand regression coefficients (betas) by going back to your high school math for the equation of a line: \(y = m \times x + c\). This can be rewritten by swapping the places of the intercept (c) and slope (m), relabeling them as betas, and adding an error term: \(y = \beta_i + \beta_x x + e\).
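To see the implicit intercept in action, a short sketch on hypothetical simulated data:

```r
set.seed(1)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)

coef(lm(y ~ x))      # intercept is fitted even though not written
coef(lm(y ~ 1 + x))  # identical estimates with the intercept made explicit
```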
Multilevel models: Fixed vs random effects
When categorical, fixed effects variables are typically “exhaustive”, in that the levels present in the data are all the possible options (they ‘exhaust’ the possible options). In contrast, random effects variables are “non-exhaustive”, in that other levels might exist in the world. For example, I would be more likely to use ‘journal’ as a random effect variable because I only have 5 journals per subfield in my data, but many other journals exist in these subfields in the real world.
Additionally, random effects can be used to acknowledge dependencies in data. For example, if participants’ positive affect was measured 5 times a day for 1 week, you would have 35 data points per participant. Simple regressions are fixed effects only and assume independence of the data. Mixed effects models allow you to include random effects to acknowledge that the same participant produced these 35 data points.
Separately, fixed effects variables assume that there is a single true value for each parameter in the population, whereas random effects assume that there is a distribution of true parameter values in the population. Random effects models estimate the hyperparameters (e.g., the mean and SD) of those distributions.
Packages like {lme4} and {brms} are excellent for fitting multilevel models. Equally importantly, packages such as those in the {easystats} universe of packages, e.g., {modelbased}, {parameters}, and {see}, are very useful for extracting, interpreting, and plotting results from multilevel models, as is the {marginaleffects} package.
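One way to see the "distribution of true parameters" idea is to simulate group means from a known distribution and check that a random intercept model recovers its SD. A sketch on simulated data (all values here are invented for illustration):

```r
library(lme4)

set.seed(10)
n_groups <- 30
n_per_group <- 20

# the 'distribution of true population parameters': group means drawn from N(0, 1)
group_means <- rnorm(n_groups, mean = 0, sd = 1)

d <- data.frame(
  g = factor(rep(seq_len(n_groups), each = n_per_group)),
  y = rep(group_means, each = n_per_group) + rnorm(n_groups * n_per_group)
)

fit <- lmer(y ~ 1 + (1 | g), data = d)
VarCorr(fit)  # the estimated SD for g approximates the simulated hyperparameter (1)
```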
Latent variable modelling/CFA/SEM
Packages like {lavaan} are excellent for fitting Confirmatory Factor Analyses, Path Analyses, Structural Equation Models, and also simple regressions.
~ to specify regressions, as in other Wilkinson notation.
=~ to specify measurement models, i.e., unobserved latent variables defined by observed variables.
~~ to specify correlated variances, e.g., for items in a measurement model that are known to be correlated or to acknowledge non-independence in data (e.g., between timepoints)
Code
model <- '
  # measurement model
  latent_x =~ x1 + x2 + x3            # Latent x is measured by observed variables x1, x2, and x3
  latent_m =~ m1 + m2 + m3 + m4 + m5  # Latent m is measured by observed variables m1 to m5
  latent_y =~ y1 + y2 + y3 + y4       # Latent y is measured by observed variables y1, y2, y3, and y4

  # correlated variances (i.e., without specifying causality)
  x1 ~~ x2                            # Residuals of x1 and x2 are allowed to correlate

  # structural model: specify regressions
  latent_m ~ latent_x
  latent_y ~ latent_x + latent_m
'
Code
dat <- lavaan::simulateData(model = model, sample.nobs = 100)
Warning: lavaan->simulateData():
some regression coefficients are unspecified and will be set to zero
Code
res <- sem(model = model, data = dat)
semPaths(res, whatLabels = "diagram", edge.label.cex = 1.2, sizeMan = 5)
Warning: lavaan->lav_lavaan_step11_estoptim():
Model estimation FAILED! Returning starting values.
More advanced lavaan: := is used to define user parameters.
Code
# specify mediation model which extracts custom parameters of interest
model <- '
  M ~ a*X
  Y ~ b*M + c*X

  indirect := a*b
  direct   := c
  total    := c + (a*b)
  proportion_mediated := indirect / total
'
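A hedged sketch of fitting this kind of mediation model on simulated data. The coefficients, sample size, and variable names below are invented for illustration, not part of the lesson:

```r
library(lavaan)

# same mediation spec as above, with user-defined parameters via :=
med_model <- '
  M ~ a*X
  Y ~ b*M + c*X

  indirect := a*b
  direct   := c
  total    := c + (a*b)
  proportion_mediated := indirect / total
'

# simulate X -> M -> Y data (hypothetical effect sizes)
set.seed(7)
n <- 200
X <- rnorm(n)
M <- 0.5 * X + rnorm(n)
Y <- 0.4 * M + 0.2 * X + rnorm(n)

res_med <- sem(model = med_model, data = data.frame(X, M, Y))
parameterEstimates(res_med)  # includes rows labeled indirect, direct, total, proportion_mediated
```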
Practicing Wilkinson notation
Generate data
The code below generates some data; understanding it is not the point of the lesson.
Code
set.seed(42)

n_articles_per_journal <- 15
n_journals_per_subfield <- 5
n_subfields <- 6
total_n <- n_articles_per_journal * n_journals_per_subfield * n_subfields

# adjust the error term to ensure the variance of y is 1
beta_year <- 0.5
error_sd <- sqrt(1 - beta_year^2) # adjusted to maintain total variance of 1

dat_fe <- tibble(year = sample(0:9, size = total_n, replace = TRUE)) |>
  mutate(count_selfreport_measures = beta_year * year + rnorm(n = total_n, mean = 3, sd = error_sd))

dat <- bind_cols(
  dat_fe,
  add_random(subfield = 6) |>
    add_random(journal = 5, .nested_in = "subfield") |>
    add_random(article = 15, .nested_in = "journal")
)
Fixed effects only model
Code
res <- lm(
  formula = count_selfreport_measures ~ 1 + year,
  data = dat
)
summary(res)
Call:
lm(formula = count_selfreport_measures ~ 1 + year, data = dat)
Residuals:
Min 1Q Median 3Q Max
-2.53835 -0.54512 -0.00374 0.59542 2.88762
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.02020 0.07511 40.21 <0.0000000000000002 ***
year 0.48409 0.01349 35.88 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8523 on 448 degrees of freedom
Multiple R-squared: 0.7419, Adjusted R-squared: 0.7413
F-statistic: 1288 on 1 and 448 DF, p-value: < 0.00000000000000022
Multi-level model

Code

res <- lmer(
  formula = count_selfreport_measures ~ 1 + year + (1 | subfield/journal),
  data = dat
)
boundary (singular) fit: see help('isSingular')
Code
summary(res)
Linear mixed model fit by REML ['lmerMod']
Formula: count_selfreport_measures ~ 1 + year + (1 | subfield/journal)
Data: dat
REML criterion at convergence: 1139
Scaled residuals:
Min 1Q Median 3Q Max
-2.8853 -0.6325 -0.0030 0.6870 3.2220
Random effects:
Groups Name Variance Std.Dev.
journal:subfield (Intercept) 0.02859 0.1691
subfield (Intercept) 0.00000 0.0000
Residual 0.69876 0.8359
Number of obs: 450, groups: journal:subfield, 30; subfield, 6
Fixed effects:
Estimate Std. Error t value
(Intercept) 3.01578 0.08065 37.39
year 0.48503 0.01344 36.08
Correlation of Fixed Effects:
(Intr)
year -0.784
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
Average Marginal Effect for year
Code
predictions <- estimate_expectation(res, at = c("subfield", "journal"))

# plot(predictions) +
#   theme_linedraw()
Facet for each level of the random intercept
Code
ggplot(predictions, aes(x = year, y = Predicted)) +
  geom_line() +
  labs(
    title = "Predicted Self-Report Measures by Subfield and Journal",
    x = "Year",
    y = "Predicted Count of Self-Report Measures"
  ) +
  scale_x_continuous(breaks = scales::breaks_pretty(n = 10)) +
  theme_linedraw() +
  facet_wrap(subfield ~ journal)