Frequency and Response Latencies with Regression
Erin M. Buchanan
Last Knitted: 2021-12-22
Language Topics Discussed
- English Lexicon Project
- The Subtitle Projects
- Extension to Priming Projects
English Lexicon Project
- http://elexicon.wustl.edu/
- ELP was a large undertaking that helped kick start the recent uptick in standardized behavioral data for language researchers
- Corpora have been around, so have databases, but in the last 10 years, this field has grown exponentially
- The data contains 40K + words and 40K + nonwords with many characteristics
- Lexical Decision and Naming Tasks included
English Lexicon Project
- https://www.psytoolkit.org/experiment-library/ldt.html
- However, you would only see one word at a time (in this example you see two)
- In the naming task, you would be asked to read those words aloud, one at a time
- Output we are interested in is how the lexical variables might predict the response latencies
Subtitle Projects
- Starting with Brysbaert and New (several papers), there was a movement to rethink frequency and its relation to predicting language results
- Traditionally, two sources of frequency were used:
- The Brown Corpus: Kucera and Francis (1967)
- HAL Corpus: Burgess and Livesay (1998)
- However, we weren’t sure these were the best estimators of frequency
Subtitle Projects
- http://subtlexus.lexique.org/
- Downloaded subtitles from www.opensubtitles.org with 50 million words+
- Provided both estimates for subtitles and music lyrics
- Models estimating lexical decision and naming times indicate these estimations of frequency are better predictors
- Expanded to 15-20 different languages
The Semantic Priming Project
- Priming occurs when cognitive processing is speeded because of a previous event
- Generally, we measure priming using lexical decision and naming tasks
- Let’s say you have two trials:
- DOCTOR –> TREE (unrelated)
- DOCTOR –> NURSE (related)
The Semantic Priming Project
- Priming is thought to occur by several different mechanisms: spreading activation, deliberate cognitive processes such as expectancy generation, etc.

The Semantic Priming Project
- Contains lexical decision and naming response latencies for related, unrelated, and nonword trials
- Is paired with the ELP and SUBTLEX projects
- Gives us more data to predict either response latencies or priming latencies
Regression
- Simple regression is the relationship between one independent and one dependent variable (also correlation)
- Multiple regression is the relationship between two or more independent variables and one dependent variable
- Useful because it allows you to examine the predictive ability of each variable adjusting for the other variables
- We can fit parametric (linear) models or nonparametric models, depending on the expectation of linearity, as well as the type of dependent variable
Understand Regression Models
\[{\hat{y_i}} = b_0 + b_1x_{1i} + b_2x_{2i} ... + \epsilon_i\]
- Y-hat is the predicted score for each person (i) on the dependent variable
- B-zero is the y-intercept
- B-1+ are the slope values for each predictor
- Slopes are interpreted as for every one unit increase in X, we see B unit increases in Y
- X is each individual predictor
- Error for each individual person, as we never get their predicted score exactly right
Understand Regression Models
- Least Squares estimation
- Creates the line of best fit by minimizing the residual error \(\epsilon\)

Understand Regression Models
- For the overall model including all variables:
- Determine statistical significance by using p values from an F-test for linear models
- Determine practical significance by using \(R^2\)
- For the individual predictors:
- Determine statistical significance by using p values from a t-test
- Determine practical significance by using partial correlation \(pr^2\)
Examples Using ELP
- Word is the word presented to the participant
- Length is the number of characters in each word
- SUBTLWF is the subtitle word frequency estimate
- POS is part of speech
- Mean_RT is the mean response latency in milliseconds
library(Rling)
data(ELP)
head(ELP)
## Word Length SUBTLWF POS Mean_RT
## 1 rackets 7 0.96 NN 790.87
## 2 stepmother 10 4.24 NN 692.55
## 3 delineated 10 0.04 VB 960.45
## 4 swimmers 8 1.49 NN 771.13
## 5 umpire 6 1.06 NN 882.50
## 6 cobra 5 3.33 NN 645.85
Dealing with Categorical Predictors
- How can we interpret and use categorical predictors?
- When X is continuous, the interpretation is that for every one unit increase in X, we see B unit increases in Y
- That doesn’t work as well for categorical predictors…
- Instead, we have to use Dummy Coding (well, R does it for us)
Dummy Coding
- A way to represent categorical data for regression/least squares analyses
- You will get (categories - 1) predictors
- How to interpret these predictors?
- Each predictor represents the difference in means between the coded group (the one with the 1) and the group coded as all zeroes (the “control” group)

Dummy Coding
- POS is a categorical predictor we want to use
- Three categories:
- JJ: Adjective
- NN: Noun
- VB: Verb
- Default is to make the first category the comparison category
##
## JJ NN VB
## 159 532 189
Dummy Coding
- Generally, nouns would be considered the comparison group, so let’s rearrange them so they are “first”.
ELP$POS <- factor(ELP$POS, #the column you want to update
#the values in the data in the order you want
levels = c("NN", "JJ", "VB"),
#give them better labels if you want
labels = c("Noun", "Adjective", "Verb"))
table(ELP$POS)
##
## Noun Adjective Verb
## 532 159 189
Dealing with Non-Normal Data
- One issue in language research is that often we have non-normal data
- Especially when working with frequency (as it is distributed by Zipf’s law)
Dealing with Non-Normal Data
hist(ELP$SUBTLWF, breaks = 100)

Dealing with Non-Normal Data
- The simplest solution is to take the
log of the variable.
- Does make interpretation a bit more difficult, but helps with the distribution of the data.
Dealing with Non-Normal Data
ELP$Log_SUB <- log(ELP$SUBTLWF)
hist(ELP$Log_SUB)

Build the Linear Model
- To be able to use our output for several purposes, we want to save it
- You can call it whatever you want, I like
model.
- Format for
lm function is:
- Y ~ X + X + …
- data = name of data frame
model <- lm(Mean_RT ~ Length + Log_SUB + POS,
data = ELP)
Summarize the Linear Model
##
## Call:
## lm(formula = Mean_RT ~ Length + Log_SUB + POS, data = ELP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -213.70 -62.55 -9.71 53.87 389.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 616.351 12.233 50.385 < 2e-16 ***
## Length 19.555 1.433 13.645 < 2e-16 ***
## Log_SUB -29.288 1.784 -16.420 < 2e-16 ***
## POSAdjective 6.115 8.506 0.719 0.47238
## POSVerb -23.069 7.918 -2.913 0.00367 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 93.29 on 875 degrees of freedom
## Multiple R-squared: 0.4565, Adjusted R-squared: 0.454
## F-statistic: 183.7 on 4 and 875 DF, p-value: < 2.2e-16
Residuals
- A summary of the residuals - remember that residuals are the error terms or how far off we were at predicting the Mean_RT
- We will use this information as part of our assumptions diagnostics for data screening
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -213.699 -62.551 -9.714 0.000 53.874 388.998
Coefficients
- The coefficients table shows you the individual predictor significance levels
- If you use p < .05 as a criterion, we see that:
- Intercept is the average RL
- Length is a positive predictor: long words take longer to react to
- Frequency is a negative predictor: more frequent words are faster (i.e. low freq = high RL)
- Adjectives and Nouns have the same RL
- Verbs and Nouns have different RL
Coefficients
options(scipen = 999)
round(summary(model)$coefficients, 3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 616.351 12.233 50.385 0.000
## Length 19.555 1.433 13.645 0.000
## Log_SUB -29.288 1.784 -16.420 0.000
## POSAdjective 6.115 8.506 0.719 0.472
## POSVerb -23.069 7.918 -2.913 0.004
Coefficients
- To interpret categorical predictors, it can help to make a means table
- Now I can see that verbs are responded to faster than nouns, interpreting the categorical predictor
tapply(ELP$Mean_RT, #dv
ELP$POS, #iv group variable
mean) #function
## Noun Adjective Verb
## 787.5959 822.9145 754.3316
Coefficient Confidence Intervals
- We can calculate the confidence intervals for the coefficients, to help understand their precision
## 2.5 % 97.5 %
## (Intercept) 592.34193 640.36007
## Length 16.74194 22.36757
## Log_SUB -32.78872 -25.78704
## POSAdjective -10.57915 22.80935
## POSVerb -38.61021 -7.52737
Coefficient Practical Importance
- Calculate \(pr^2\): variance accounted for in the DV by that IV after removing all variance due to other IVs
\[ \frac{t_{x_i}}{\sqrt{t_{x_i}^2 + df_{res}}} \]
t <- summary(model)$coefficients[-1 , 3]
pr <- t / sqrt(t^2 + model$df.residual)
pr^2
## Length Log_SUB POSAdjective POSVerb
## 0.1754422571 0.2355448602 0.0005903474 0.0096065265
Overall Model
- So how much better than a random guess are we at predicting?
- A good random guess is always the y-intercept or the mean of y.
- The F-statistic represents the difference of the model from zero
summary(model)$fstatistic
## value numdf dendf
## 183.7024 4.0000 875.0000
Overall Model
- \(R^2\) represents the overlap in all IV variance with the DV variance
## [1] 0.4564574
Diagnostic Tests
- Outliers and influential observations: data points that have large residuals or are otherwise odd in relation to the rest of the data
- Assumptions of parametric regression:
- Independence
- DV is at least interval or better yet ratio scale
- Additivity (no multicollinearity)
- Linearity
- Normality
- Homoscedasticity/Homogeneity
Outliers
- Hat values (or leverage): indicates how much influence on the slope a data point has
- Studentized residuals: the normalized (z-scored) difference between a participant’s predicted and actual score
- Cook’s values: a measure of influence (both leverage and discrepancy)
Outliers
library(car)
influencePlot(model)

## StudRes Hat CookD
## 16 1.0366190 0.027128347 0.005992376
## 207 -0.5110465 0.030963411 0.001670423
## 331 2.9961984 0.020366762 0.036990327
## 411 3.8500613 0.002244803 0.006566174
## 498 4.2181812 0.004047706 0.014190408
## 660 3.7238861 0.008750676 0.024129118
Outliers
ELP[c(331,660,498,411), ]
## Word Length SUBTLWF POS Mean_RT Log_SUB
## 331 interdepartmental 17 0.04 Adjective 1324.57 -3.2188758
## 660 sacrilegious 12 0.39 Adjective 1228.06 -0.9416085
## 498 whippet 7 0.10 Noun 1209.67 -2.3025851
## 411 archenemy 9 0.25 Noun 1188.91 -1.3862944
Assumptions
- Independence - the data is independent from each other (i.e. each data point is from a different “person”)
- Interval scale dependent variable: check!
Additivity
- No correlation between predictors above .9 (but .7 is actually not good either)
summary(model, correlation = T)$correlation[ , -1]
## Length Log_SUB POSAdjective POSVerb
## (Intercept) -0.94238493 -0.272940500 -0.09904041 -0.232585764
## Length 1.00000000 0.340416664 -0.05527619 0.066688453
## Log_SUB 0.34041666 1.000000000 0.09363100 0.006852748
## POSAdjective -0.05527619 0.093631001 1.00000000 0.237177357
## POSVerb 0.06668845 0.006852748 0.23717736 1.000000000
Additivity
- Small Variance Inflation Scores (VIF) values (less than 5 to 10)
## GVIF Df GVIF^(1/(2*Df))
## Length 1.151054 1 1.072872
## Log_SUB 1.150140 1 1.072446
## POS 1.026925 2 1.006664
Linearity

Normality
hist(scale(residuals(model)))

Homoscedasticity/Homogeneity

Homoscedasticity/Homogeneity
{plot(scale(residuals(model)), scale(model$fitted.values))
abline(v = 0, h = 0)}

One Solution to Bad Assumptions
- First, build a function that saves the numbers you want:
bootcoef <- function(formula, data, indices){
d = data[indices, ] #randomize the data by row
model = lm(formula, data = d) #run our model
return(coef(model)) #give back coefficients
}
Bootstrapping
- Next, use the
boot library to run the bootstraps (lots of runs on randomly sampled data)
library(boot)
model.boot <- boot(formula = Mean_RT ~ Length + Log_SUB + POS,
data = ELP,
statistic = bootcoef,
R = 1000)
Bootstrapping
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = ELP, statistic = bootcoef, R = 1000, formula = Mean_RT ~
## Length + Log_SUB + POS)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 616.350998 -0.25564332 12.557699
## t2* 19.554757 0.05627581 1.548970
## t3* -29.287879 0.06915177 1.752018
## t4* 6.115101 -0.01991016 9.160950
## t5* -23.068792 -0.10379626 7.039374
CIs for Bootstrapped Estimates
boot.ci(model.boot, index = 3)
## Warning in boot.ci(model.boot, index = 3): bootstrap variances needed for
## studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = model.boot, index = 3)
##
## Intervals :
## Level Normal Basic
## 95% (-32.79, -25.92 ) (-32.83, -25.81 )
##
## Level Percentile BCa
## 95% (-32.77, -25.74 ) (-32.82, -25.81 )
## Calculations and Intervals on Original Scale
Summary
- You learned about some linguistic projects that have provided useful data to test hypotheses
- … how we can use linear regression to look at predictions
- … how to examine assumptions of linear regression
- … how to bootstrap in case those assumptions are not met