---
title: "CONJ620: CM 4.2"
subtitle: "The General Linear Model"
author: "Alison Hill"
output:
  xaringan::moon_reader:
    css: ["default", "css/ohsu.css", "css/ohsu-fonts.css"]
    lib_dir: libs
    nature:
      highlightStyle: atelier-lakeside-light
      highlightLines: true
      countIncrementalSlides: false
---
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
options(knitr.table.format = "html")
knitr::opts_chunk$set(warning = FALSE, message = FALSE,
                      comment = NA, dpi = 300,
                      fig.align = "center", cache = FALSE)
library(tidyverse)
library(infer)
library(moderndive)
```
## Family of *t*-tests
- One-sample *t*-test
- Independent samples *t*-test
- Dependent samples *t*-test (also known as paired)
---
## What do they all have in common?
We want to compare two things: mean 1 and mean 2!
---
## Ways to get independent samples
* Random assignment of participants to two different experimental conditions
  * Scream versus Scream 2
  * Classical music versus silence
* Naturally occurring assignment of participants to two different groups
  * Male versus female
  * Young versus old
---
## Formula for the independent groups *t*-test
$$t=\frac{(\bar{y}_1-\bar{y}_2)-(\mu_1-\mu_2)}{SE}$$
Generally,
> $H_0: \mu_1-\mu_2=0$ and $H_1: \mu_1-\mu_2\neq0$
So $\mu_1-\mu_2$ is often excluded from the formula. Since we now have two sample means and therefore two sample variances, we need some way to combine these two variances in a logical way. The answer is to __pool__ the variances to estimate the SE of the difference, $(\bar{y}_1-\bar{y}_2)$:
$$s_{\bar{y}_1-\bar{y}_2}=s_{pooled}\times{\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$
and $s_{pooled}$ is:
$$s_{pooled}=\sqrt{\frac{(n_1-1)s_{y_1}^2+(n_2-1)s_{y_2}^2}{n_1+n_2-2}}$$
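Here is a minimal sketch of these formulas in R, using two small made-up samples (`y1` and `y2` are hypothetical values, just for illustration):
```{r}
# Hypothetical data: two small independent samples
y1 <- c(41, 45, 52, 48, 50)
y2 <- c(55, 60, 58, 62, 57)
n1 <- length(y1); n2 <- length(y2)

# Pooled standard deviation
s_pooled <- sqrt(((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2))

# Standard error of the difference in means, then t (under H0: mu1 - mu2 = 0)
se_diff <- s_pooled * sqrt(1 / n1 + 1 / n2)
(t_stat <- (mean(y1) - mean(y2)) / se_diff)

# Should match the built-in pooled-variance t-test
t.test(y1, y2, var.equal = TRUE)$statistic
```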
---
## The *t*-test as a general linear model (GLM)
All statistical procedures are basically the same thing:
$$outcome_i=(model)+error_i$$
In my simple linear regression from the past few labs, this has been:
$$prestige_i=(intercept + education_i)+error_i$$
Or more generally:
$$y_i=(b_0 + b_1x_i)+error_i$$
> __N.B.__ $Error_i$ is an unknowable, incalculable quantity: it is the deviation of the $i^{th}$ observed value from the (unobservable) true value. You can think of it as measurement error. This is different from the error term estimated in a fitted regression, which is the *residual* for the $i^{th}$ case, calculated as the difference between the observed $(y_i)$ and predicted $(\hat{y_i})$ scores.
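To make the residual part concrete, here is a quick sketch using the prestige-on-education regression from the labs (assuming the `Prestige` data from the `car` package, which appears later in these slides; `lab_fit` is just an illustrative name):
```{r}
library(car)  # provides the Prestige data

# The simple linear regression from the labs: prestige ~ education
lab_fit <- lm(prestige ~ education, data = Prestige)

# Residuals are observed minus predicted values: y_i - y_hat_i
head(residuals(lab_fit))
head(Prestige$prestige - fitted(lab_fit))
```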
---
## Sidebar: The GLM in matrix notation
The GLM we have been dealing with thus far includes just one independent variable and thus just one $b_1$. However, the full GLM is better represented in matrix form:
$$\boldsymbol{Y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}$$
where...
* $\boldsymbol{Y}$ is the *response vector* of length *N*;
* $\boldsymbol{\epsilon}$ is the *error vector* of length *N*;
* $\boldsymbol{\beta}$ is the vector of parameters of length *p*+1 where *p* is the number of IVs and the 1 accounts for the intercept;
* and $\boldsymbol{X}$ is called the *design matrix* consisting of a matrix of *N* rows and *p+1* columns
---
# Design matrices with one independent variable
.pull-left[
In both simple linear regression and the independent samples *t*-test, $\boldsymbol{X}$ is a matrix of *N* rows and *2* columns. Note that the number of columns in $\boldsymbol{X}$ must always equal the number of rows in $\boldsymbol{\beta}$.
![here](latex-image-2.pdf)
]
.pull-right[
In simple linear regression, the predictor column of $\boldsymbol{X}$ can take on any value. In the independent samples *t*-test, this column contains only 0's and 1's. Below, *n* = 4 with 2 in each group to illustrate (an R sketch follows the two panels).
![here](latex-image-1.pdf)
]
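Here is a minimal sketch of both kinds of design matrix in R, using a made-up data frame with *n* = 4 (the names `toy`, `x_cont`, and `grp` are just illustrative):
```{r}
# Hypothetical n = 4 example: one continuous predictor and one 2-level factor
toy <- data.frame(x_cont = c(2.1, 3.5, 4.0, 5.2),
                  grp    = factor(c("bc", "bc", "wc", "wc")))

# Simple linear regression: intercept column plus a column of any values
model.matrix(~ x_cont, data = toy)

# Two-group (t-test) case: intercept column plus a 0/1 dummy column
model.matrix(~ grp, data = toy)
```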
---
## Dummy variables
.pull-left[
Now, let's switch IVs in our example from education (a continuous variable) to a group variable:
$$prestige_i=(intercept + group_i)+error_i$$
where group is a dummy or indicator variable that can only take two values: 0 or 1.
This is easy to do in R. Remember that my Prestige dataset has a categorical variable called "type." Let's create a dummy variable `group` coded as:
> 0 for blue collar (type="bc") and
> 1 for white collar (type="wc")
]
.pull-right[
]
---
# Creating a dummy variable in R
```{r warning = FALSE, message = FALSE}
library(car)
table(Prestige$type)
```
```{r}
Prestige$group <- ifelse(Prestige$type == "bc", 0,
                         ifelse(Prestige$type == "wc", 1, NA))
table(Prestige$type, Prestige$group)
```
---
# Cleaning up the new dataset
```{r}
Prestige.2 <- subset(Prestige, group %in% c(0,1))
Prestige.2$type <- droplevels(Prestige.2$type) #get rid of type=prof
```
---
# OK, we are all set now with two groups
```{r}
table(Prestige.2$type, Prestige.2$group)
```
```{r}
head(Prestige.2)
```
---
# Plotting data for two groups
Not surprisingly, mean prestige ratings appear to be higher among white collar workers than blue collar workers.
```{r warning=FALSE, message=FALSE, echo = FALSE}
library(ggplot2)
Prestige.2$group.l <- factor(Prestige.2$group, labels=c("bc","wc"))
ggplot(Prestige.2, aes(x = factor(group.l), y = prestige)) +
  geom_dotplot(stackdir = "centerwhole", binwidth = 1, binaxis = 'y', alpha = .5) +
  theme_bw() +
  geom_segment(aes(x = 0.75, y = mean(Prestige.2$prestige[Prestige.2$group == 0]),
                   xend = 1.25, yend = mean(Prestige.2$prestige[Prestige.2$group == 0]))) +
  geom_segment(aes(x = 1.75, y = mean(Prestige.2$prestige[Prestige.2$group == 1]),
                   xend = 2.25, yend = mean(Prestige.2$prestige[Prestige.2$group == 1])))
```
---
# But are the two groups different?
Let's do an independent samples *t*-test to find the answer:
```{r}
t.test(prestige ~ group, data = Prestige.2, var.equal = TRUE)
```
---
# What happens if I now run a linear regression?
---
# Kidding. Here's the linear regression summary...
```{r}
fit <- lm(formula = prestige ~ group, data = Prestige.2)
summary(fit)
```
---
# But what does it MEAN?
Let's do a quick review of the linear regression formula:
$$y_i=(b_0 + b_1x_i)+error_i$$
To solve for the intercept term, we apply some summation algebra, summing both sides and dividing by *n*:
$$\frac{1}{n}\sum_i^n{y_i}=\frac{1}{n}\sum_i^n{b_0} + \frac{1}{n}\sum_i^n{b_1x_i}+\frac{1}{n}\sum_i^n{error_i}$$
A few reminders about summation algebra in this context:
- The mean is defined as: $\frac{1}{n}\sum_i^n{y_i}$
- The sum of a constant is just *n* times the constant: $\sum_i^n{b_0}=n\times{b_0}$
- The sum of a constant times a random variable is the constant times the sum of the variable: $\sum_i^n{b_1x_i}=b_1\sum_i^n{x_i}$
- By definition, in GLM, we assume the mean of the error is 0: $\frac{1}{n}\sum_i^n{error_i}=0$
---
# Solving for the regression coefficients
Applying the summation algebra from the previous slide, we get:
$$\frac{1}{n}\sum_i^n{y_i}=\frac{1}{n}nb_0 + \frac{1}{n}b_1\sum_i^n{x_i}$$
$$\bar{y}=b_0 + b_1\bar{x}$$
This should look familiar! The formula for the regression intercept term, $b_0$, is:
$$b_0=\bar{y}-b_1\bar{x}$$
But look again: we also have the formula for $\bar{y}$. The intercept term in linear regression is the expected mean value of $y$ when $x_i$=0.
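We can check this identity against the fitted model `fit` from a few slides back, where `group` is the 0/1 dummy predictor:
```{r}
b1 <- unname(coef(fit)["group"])

# b0 = ybar - b1 * xbar
mean(Prestige.2$prestige) - b1 * mean(Prestige.2$group)
unname(coef(fit)["(Intercept)"])
```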
---
# Solving for $\bar{x}$
.pull-left[
$$\bar{y}=b_0 + b_1\bar{x}$$
But, you protest, how do we calculate $\bar{x}$? Is it the mean of the 0/1 values in *x*? What is happening?
]
.pull-right[
]
---
# Solving for $\bar{x}$
.pull-left[
$$t=\frac{(\bar{y}_1-\bar{y}_2)-(\mu_1-\mu_2)}{SE}$$
Remember the original formula I gave you for the *t* statistic. There is no *x*! But implicitly, this formula is actually:
$$t=\frac{(\bar{y}_{x=1}-\bar{y}_{x=2})-(\mu_{x=1}-\mu_{x=2})}{SE}$$
So what we are actually interested in is the mean of *y* when *x* takes on one value versus another (in my current example, *x* = 0 or 1).
]
.pull-right[
]
---
# The intercept
For the *t*-test, we can solve for the intercept just as we can for the simple linear regression. Remember our formula for $\bar{y}$:
$$\bar{y}=b_0 + b_1\bar{x}$$
Let's re-write it two ways:
$$\bar{y}_{x=0}=b_0 + b_1\bar{x_0}$$
$$\bar{y}_{x=1}=b_0 + b_1\bar{x_1}$$
Start with the top formula: what is $\bar{x_0}$? This is simple to think about in matrix notation: the mean of a vector of 0's is 0. So, when the group variable is equal to zero (blue collar)...
$$\bar{y}_{bc}=(b_0 + b_1\times0)=b_0$$
Therefore, $b_0$ (the intercept term) is equal to the mean prestige score of the blue collar group (the one coded as 0).
---
# The slope
Now, let's tackle the second formula:
$$\bar{y}_{x=1}=b_0 + b_1\bar{x_1}$$
When the group variable is equal to 1 (white collar), $\bar{x_1}=1$ because the mean of a vector of 1's is 1.
$$\bar{y}_{wc}=(b_0 + b_1\times1)$$
$$\bar{y}_{wc}=b_0+ b_1$$
$$\bar{y}_{wc}=\bar{y}_{bc}+ b_1$$
Solving for $b_1$:
$$b_1=\bar{y}_{wc}-\bar{y}_{bc}$$
Therefore, $b_1$ (the slope) is equal to the difference between group means in prestige scores.
---
# What does this mean?
We could represent a two-group experiment as a regression equation in which the regression coefficient $b_1$ is equal to the difference between group means and the intercept term $b_0$ is the mean of the group coded as 0.
Our independent samples *t*-test would take the form:
$$y_i=\bar{y}_{bc} + (\bar{y}_{wc}-\bar{y}_{bc})x_i+error_i$$
Think of it this way: the regression line must pass through these two points:
- (0, $\bar{y}_{bc}$)
- (1, $\bar{y}_{wc}$)
---
# Trust but verify
```{r}
y_bc <- mean(Prestige.2$prestige[Prestige.2$group==0])
y_wc <- mean(Prestige.2$prestige[Prestige.2$group==1])
diff <- y_wc-y_bc
cbind(y_bc,y_wc, diff)
```
So, y_bc is $b_0$ and diff is $b_1$...right?
```{r}
coef(fit)
```
---
# The General Linear Model
A number of different statistical models are extensions of this same idea of a GLM:
- Ordinary least squares (OLS) linear regression (simple and multiple): 1+ predictors may be continuous or factors
- *t*-test: a *t*-test is basically a regression model where the 1 predictor is a factor with exactly 2 levels
- Analysis of Variance/Covariance (ANOVA/ANCOVA): an ANOVA is basically a regression model where the 1+ predictors are factors
- Multivariate Analysis of Variance/Covariance (MANOVA/MANCOVA): a MANOVA is basically a regression model with 2+ DVs where the 1+ predictors are factors
---
# Further food for thought...
The independent samples *t*-test is a special case of an ANOVA. Specifically, a one-way ANOVA (definition: 1 IV that is a factor and 1 DV that is continuous) in which the IV factor has *exactly* two levels.
Again, trust but verify!
```{r}
anova(lm(prestige ~ group, data = Prestige.2))  # doing an ANOVA in R
```
(hint: square the *t*-statistic to get the *F*-statistic equivalent)
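Checking that hint directly with the pooled-variance *t*-test from earlier (the object names below are just illustrative):
```{r}
t_out <- t.test(prestige ~ group, data = Prestige.2, var.equal = TRUE)
f_out <- anova(lm(prestige ~ group, data = Prestige.2))

# Squaring the t statistic reproduces the F statistic
c(t_squared = unname(t_out$statistic)^2, F = f_out$`F value`[1])
```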
---
# The non-centrality parameter of the *t* distribution
In an earlier class, I alluded to the fact that the *t*-distribution has another parameter in addition to the degrees of freedom: a non-centrality parameter, $\delta$.
> When $H_0$ is true, $\delta=0$.
> When $H_0$ is false, $\delta\neq0$.
Why is this? Let's look at the formula, where $\sigma$ is the common population standard deviation:
$$\delta=\sqrt{\frac{n_1n_2}{n_1+n_2}}\left(\frac{\mu_1-\mu_2}{\sigma}\right)$$
As you can see, if $\mu_1-\mu_2=0$ then $\delta=0$.
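As a rough sketch only, we can plug sample statistics from `Prestige.2` (the group means and the pooled SD) into this formula in place of the population values; note that this plug-in estimate works out to the observed *t* statistic, apart from the sign set by the group ordering:
```{r}
bc <- Prestige.2$prestige[Prestige.2$group == 0]
wc <- Prestige.2$prestige[Prestige.2$group == 1]
n1 <- length(bc); n2 <- length(wc)

s_pooled <- sqrt(((n1 - 1) * var(bc) + (n2 - 1) * var(wc)) / (n1 + n2 - 2))

# Plug-in estimate of the non-centrality parameter
sqrt((n1 * n2) / (n1 + n2)) * (mean(wc) - mean(bc)) / s_pooled
```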
---
# The *t*-distribution...
.pull-left[
> When the null is true
```{r echo=FALSE}
# Script 12.4 - The Normal distribution and several t-distributions
# Setting the values for the abscissa
tval <- pretty(c(-3.25, 3.25), 1000)
# Find the ordinates for the normal distribution
ordn <- dnorm(tval)
# Drawing the normal distribution
plot(tval, ordn, type = "l", lty = 1, lwd = 3,
ylab = "Ordinate", xlab = "Value")
# Setting df for three different t-distributions
nu <- c(1, 3, 10)
# Drawing the three different t-distributions
lines(tval, dt(tval, nu[1]), type = "l" ,lty = 2, lwd = 2)
lines(tval, dt(tval, nu[2]), type = "l", lty = 3, lwd = 2)
lines(tval, dt(tval, nu[3]), type = "l", lty = 5, lwd = 2)
legend(-3,.40, c("Normal", "t, df=1", "t, df=3", "t, df=10"),
lwd = c(3,2,2,2), lty = c(1,2,3,5))
```
]
.pull-right[
> When the null is false
```{r echo=FALSE}
# Script 12.4 - The Normal distribution and several t-distributions
# Setting the values for the abscissa
tval <- pretty(c(-3.25, 3.25), 1000)
# Find the ordinates for the normal distribution
ordn <- dnorm(tval)
# Drawing the normal distribution
plot(tval, ordn, type = "l", lty = 1, lwd = 3,
ylab = "Ordinate", xlab = "Value")
# Setting df for three different t-distributions
nu <- c(9, 9, 9)
delta <- c(.5, 1, 1.5)
# Drawing the three different t-distributions
lines(tval, dt(tval, nu[1], delta[1]), type = "l" ,lty = 2, lwd = 2)
lines(tval, dt(tval, nu[2], delta[2]), type = "l", lty = 3, lwd = 2)
lines(tval, dt(tval, nu[3], delta[3]), type = "l", lty = 5, lwd = 2)
legend(-3,.40, c("Normal", "t, delta=.5", "t, delta=1", "t, delta=1.5"),
lwd = c(3,2,2,2), lty = c(1,2,3,5))
```
]
---
# Non-central *t*-distributions...
.pull-left[
> Degrees of freedom=9
```{r echo=FALSE}
nu <- c(9, 9, 9, 9)
delta <- c(0, .5, 1, 1.5)
plot(tval, dt(tval, nu[1], delta[1]), type = "l" ,lty = 1, lwd = 3,
ylab = "Ordinate", xlab = "Value", col="#d7191c")
# Drawing the three different t-distributions
lines(tval, dt(tval, nu[2], delta[2]), type = "l" ,lty = 2, lwd = 2, col="#fdae61")
lines(tval, dt(tval, nu[3], delta[3]), type = "l", lty = 3, lwd = 2, col="#abdda4")
lines(tval, dt(tval, nu[4], delta[4]), type = "l", lty = 5, lwd = 2, col="#2b83ba")
legend(-3,.40, c("t, delta=0", "t, delta=.5", "t, delta=1", "t, delta=1.5"),
lwd = c(3,2,2,2), lty = c(1,2,3,5), col=c("#d7191c", "#fdae61","#abdda4", "#2b83ba"))
```
]
.pull-right[
> $\delta$=5
```{r echo=FALSE}
# Setting the values for the abscissa
ntval <- pretty(c(0,12), 1000)
nu <- c(10, 20, 30, 40)
delta <- c(5,5,5,5)
plot(ntval, dt(ntval, nu[1], delta[1]), type = "l" ,lty = 1, lwd = 3,
ylab = "Ordinate", xlab = "Value", xlim=c(0, 12), ylim=c(0, .45), col="#d7191c")
# Drawing the three different t-distributions
lines(ntval, dt(ntval, nu[2], delta[2]), type = "l" ,lty = 2, lwd = 2, col="#fdae61")
lines(ntval, dt(ntval, nu[3], delta[3]), type = "l", lty = 3, lwd = 2, col="#abdda4")
lines(ntval, dt(ntval, nu[4], delta[4]), type = "l", lty = 5, lwd = 2, col="#2b83ba")
legend(0,.45, c("df=10", "df=20", "df=30", "df=40"),
lwd = c(3,2,2,2), lty = c(1,2,3,5), col=c("#d7191c", "#fdae61","#abdda4", "#2b83ba"))
```
]
---
# Using non-central *t*-distributions...
```{r}
help(TDist)    # documentation for R's t-distribution functions, including the ncp argument
pt(-1, 20, 0)  # P(T <= -1) for a central t with 20 df
pt(-1, 20, 5)  # P(T <= -1) for a non-central t with 20 df and ncp = 5
```
Recommended:
```{r eval=FALSE}
power.t.test()
```
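For example (the sample size, mean difference, and SD below are made-up values, just to illustrate the arguments):
```{r}
# Power of a two-sample t-test with n = 30 per group,
# a true mean difference of 5, and a common SD of 10
power.t.test(n = 30, delta = 5, sd = 10, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")
```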
---
# Welch's *t*-test: Dealing with unequal variances
Recall the formula for the independent groups *t*-test:
$$t=\frac{(\bar{x}_1-\bar{x}_2)-(\mu_1-\mu_2)}{SE}$$
In Welch's formula, we calculate the SE differently:
$$SE'=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$$
So the formula for Welch's $t'$:
$$t'=\frac{(\bar{x}_1-\bar{x}_2)-(\mu_1-\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$$
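In R, `t.test()` computes Welch's $t'$ by default (`var.equal = FALSE` is the default). Re-running the earlier `Prestige.2` comparison without the equal-variance assumption:
```{r}
# var.equal defaults to FALSE, so this is Welch's t-test
t.test(prestige ~ group, data = Prestige.2)
```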
---
# Welch's *t*-test: Modified degrees of freedom
Recall the degrees of freedom for the independent groups *t*-test:
$$\nu=n_1+n_2-2$$
The degrees of freedom are modified for Welch's $t'$:
$$\nu'=\frac{(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2})^2}{\frac{s_1^4}{n_1^2(n_1-1)}+\frac{s_2^4}{n_2^2(n_2-1)}}$$
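A minimal sketch of this formula in R, checked against the `parameter` element returned by `t.test()` for the `Prestige.2` comparison:
```{r}
bc <- Prestige.2$prestige[Prestige.2$group == 0]
wc <- Prestige.2$prestige[Prestige.2$group == 1]
v1 <- var(bc) / length(bc)   # s1^2 / n1
v2 <- var(wc) / length(wc)   # s2^2 / n2

# Welch-Satterthwaite degrees of freedom
nu_prime <- (v1 + v2)^2 /
  (v1^2 / (length(bc) - 1) + v2^2 / (length(wc) - 1))

c(by_hand = nu_prime,
  from_t_test = unname(t.test(prestige ~ group, data = Prestige.2)$parameter))
```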
---
# George E. P. Box
> "Equally, the statistician knows, for example, that in nature there never was a normal distribution. There never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world." -*Journal of the American Statistical Association, 71*(356), pp. 791-799.