---
title: "CONJ620: CM 4.2"
subtitle: "The General Linear Model"
author: "Alison Hill"
output:
  xaringan::moon_reader:
    css: ["default", "css/ohsu.css", "css/ohsu-fonts.css"]
    lib_dir: libs
    nature:
      highlightStyle: atelier-lakeside-light
      highlightLines: true
      countIncrementalSlides: false
---

```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
options(knitr.table.format = "html")
knitr::opts_chunk$set(warning = FALSE, message = FALSE,
                      comment = NA, dpi = 300,
                      fig.align = "center", cache = FALSE)
library(tidyverse)
library(infer)
library(moderndive)
```

## Family of *t*-tests

- One-sample *t*-test
- Independent samples *t*-test
- Dependent samples *t*-test (also known as paired)

---

## What do they all have in common?

We want to compare two things: mean 1 and mean 2!

---

## Ways to get independent samples

* Random assignment of participants to two different experimental conditions
    * Scream versus Scream 2
    * Classical music versus silence
* Naturally occurring assignment of participants to two different groups
    * Male versus female
    * Young versus old

---

## Formula for the independent groups *t*-test

$$t=\frac{(\bar{y}_1-\bar{y}_2)-(\mu_1-\mu_2)}{SE}$$

Generally,

> $H_0: \mu_1-\mu_2=0$ and $H_1: \mu_1-\mu_2\neq0$

So $\mu_1-\mu_2$ is often excluded from the formula.

Since we now have two sample means and therefore two sample variances, we need some way to combine these two variances in a logical way. The answer is to __pool__ the variances to estimate the SE of the difference, $(\bar{y}_1-\bar{y}_2)$:

$$s_{\bar{y}_1-\bar{y}_2}=s_{pooled}\times{\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$

where $s_{pooled}$ is:

$$s_{pooled}=\sqrt{\frac{(n_1-1)s_{y_1}^2+(n_2-1)s_{y_2}^2}{n_1+n_2-2}}$$
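
---

## Checking the *t* formula by hand

Before moving on, here is a minimal R sketch of these formulas, using two small made-up samples (`y1` and `y2` are hypothetical, not the Prestige data used later):

```{r}
y1 <- c(52, 48, 55, 60, 57)  # hypothetical scores, group 1
y2 <- c(44, 50, 41, 46, 49)  # hypothetical scores, group 2
n1 <- length(y1)
n2 <- length(y2)

# Pooled standard deviation
s_pooled <- sqrt(((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2))

# SE of the difference, then t (under H0: mu1 - mu2 = 0)
se_diff <- s_pooled * sqrt(1 / n1 + 1 / n2)
(t_by_hand <- (mean(y1) - mean(y2)) / se_diff)

# Should match the built-in pooled-variance t-test
unname(t.test(y1, y2, var.equal = TRUE)$statistic)
```
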

---

## The *t*-test as a general linear model (GLM)

All statistical procedures are basically the same thing:

$$outcome_i=(model)+error_i$$

In my simple linear regression from the past few labs, this has been:

$$prestige_i=(intercept + education_i)+error_i$$

Or more generally:

$$y_i=(b_0 + b_1x_i)+error_i$$

> __N.B.__ $Error_i$ is an unknowable, incalculable quantity: it is the deviation of the $i^{th}$ value from the (unobservable) true value. You can think of it as measurement error. This is different from the error of the regression formula, which is defined as the *residual* of the $i^{th}$ value and calculated as the difference between the observed $(y_i)$ and predicted $(\hat{y_i})$ scores.

---

## Sidebar: The GLM in matrix notation

The GLM we have been dealing with thus far includes just one independent variable and thus just one $b_1$. However, the full GLM is better represented in matrix form:

$$\boldsymbol{Y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}$$

where...

* $\boldsymbol{Y}$ is the *response vector* of length *N*;
* $\boldsymbol{\epsilon}$ is the *error vector* of length *N*;
* $\boldsymbol{\beta}$ is the vector of parameters of length *p*+1, where *p* is the number of IVs and the 1 accounts for the intercept;
* and $\boldsymbol{X}$ is the *design matrix*, with *N* rows and *p*+1 columns.
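
---

## Sidebar: Building a design matrix in R

To make $\boldsymbol{X}$ concrete, R's `model.matrix()` builds the design matrix for any model formula. A minimal sketch with a tiny made-up data frame (`d` and its variables are hypothetical, just for illustration):

```{r}
# Hypothetical data: one continuous IV and one two-level group IV
d <- data.frame(y = c(3, 5, 4, 6),
                x = c(1.2, 2.5, 3.1, 4.8),
                group = factor(c("bc", "bc", "wc", "wc")))

model.matrix(y ~ x, data = d)      # a column of 1's plus the continuous x
model.matrix(y ~ group, data = d)  # a column of 1's plus a 0/1 dummy
```
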

---

# Design matrices with one independent variable

.pull-left[
In both simple linear regression and the independent samples *t*-test, $\boldsymbol{X}$ is a matrix with *N* rows and *2* columns. Note that the number of columns in $\boldsymbol{X}$ must always equal the number of rows in $\boldsymbol{\beta}$.

![here](latex-image-2.pdf)
]

.pull-right[
In simple linear regression, the vector *x* can take on any value. In the independent samples *t*-test, this vector simply contains 0's and 1's. Below, *n* = 4 with 2 in each group to illustrate.

![here](latex-image-1.pdf)
]

---

## Dummy variables

.pull-left[
Now, let's switch IVs in our example from education (a continuous variable) to a group variable:

$$prestige_i=(intercept + group_i)+error_i$$

where group is a dummy or indicator variable that can take only two values: 0 or 1.

This is easy to do in R. Remember that my Prestige dataset has a categorical variable called "type." Let's let the variable group denote:

> 0 for blue collar (type = "bc") and

> 1 for white collar (type = "wc")
]

.pull-right[
]

---

# Creating a dummy variable in R

```{r warning = FALSE, message = FALSE}
library(car) # the Prestige data lives here
table(Prestige$type)
```

```{r}
Prestige$group <- ifelse(Prestige$type == "bc", 0,
                  ifelse(Prestige$type == "wc", 1, NA))
table(Prestige$type, Prestige$group)
```

---

# Cleaning up the new dataset

```{r}
Prestige.2 <- subset(Prestige, group %in% c(0, 1))
Prestige.2$type <- droplevels(Prestige.2$type) # drop the unused type = "prof" level
```

---

# OK, we are all set now with two groups

```{r}
table(Prestige.2$type, Prestige.2$group)
```

```{r}
head(Prestige.2)
```

---

# Plotting data for two groups

Not surprisingly, mean prestige ratings appear to be higher among white collar workers than blue collar workers.

```{r warning=FALSE, message=FALSE, echo=FALSE}
library(ggplot2)
Prestige.2$group.l <- factor(Prestige.2$group, labels = c("bc", "wc"))
ggplot(Prestige.2, aes(x = factor(group.l), y = prestige)) +
  geom_dotplot(stackdir = "centerwhole", binwidth = 1, binaxis = 'y', alpha = .5) +
  theme_bw() +
  geom_segment(aes(x = 0.75, y = mean(Prestige.2$prestige[Prestige.2$group == 0]),
                   xend = 1.25, yend = mean(Prestige.2$prestige[Prestige.2$group == 0]))) +
  geom_segment(aes(x = 1.75, y = mean(Prestige.2$prestige[Prestige.2$group == 1]),
                   xend = 2.25, yend = mean(Prestige.2$prestige[Prestige.2$group == 1])))
```

---

# But are the two groups different?

Let's do an independent samples *t*-test to find the answer:

```{r}
t.test(prestige ~ group, data = Prestige.2, var.equal = TRUE)
```

---

# What happens if I now run a linear regression?

---

# Kidding. Here is the linear regression summary...

```{r}
fit <- lm(formula = prestige ~ group, data = Prestige.2)
summary(fit)
```

---

# But what does it MEAN?

Let's do a quick review of the linear regression formula:

$$y_i=(b_0 + b_1x_i)+error_i$$

To solve for the intercept term, for example, we need to apply some summation algebra:

$$\frac{1}{n}\sum_i^n{y_i}=\frac{1}{n}\sum_i^n{b_0} + \frac{1}{n}\sum_i^n{b_1x_i}+\frac{1}{n}\sum_i^n{error_i}$$

A few reminders about summation algebra in this context:

- The mean is defined as: $\frac{1}{n}\sum_i^n{y_i}$
- The sum of a constant is just *n* times the constant: $\sum_i^n{b_0}=n\times{b_0}$
- The sum of a constant times a random variable is the constant times the sum of the variable: $\sum_i^n{b_1x_i}=b_1\sum_i^n{x_i}$
- By definition, in GLM, we assume the mean of the error is 0: $\frac{1}{n}\sum_i^n{error_i}=0$

---

# Solving for the regression coefficients

Applying the summation algebra from the previous slide, we get:

$$\frac{1}{n}\sum_i^n{y_i}=\frac{1}{n}nb_0 + \frac{1}{n}b_1\sum_i^n{x_i}$$

$$\bar{y}=b_0 + b_1\bar{x}$$

This should look familiar! The formula for the regression intercept term, $b_0$, is:

$$b_0=\bar{y}-b_1\bar{x}$$

But look again: we also have the formula for $\bar{y}$. The intercept term in linear regression is the expected mean value of $y$ when $x_i=0$.
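
---

# Solving for the regression coefficients, verified

Since this identity is easy to check, here is a quick verification in R using the `fit` model from the earlier slide (this check is mine, not part of the original derivation):

```{r}
b1 <- coef(fit)["group"]
b0_by_hand <- mean(Prestige.2$prestige) - b1 * mean(Prestige.2$group)

# OLS guarantees b0 = ybar - b1 * xbar; compare with lm's intercept
c(by_hand = unname(b0_by_hand), from_lm = unname(coef(fit)["(Intercept)"]))
```
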

---

# Solving for $\bar{x}$

.pull-left[
$$\bar{y}=b_0 + b_1\bar{x}$$

But, you protest, how do we calculate $\bar{x}$? Is it the mean of the 0/1 values in *x*? What is happening?
]

.pull-right[
]

---

# Solving for $\bar{x}$

.pull-left[
$$t=\frac{(\bar{y}_1-\bar{y}_2)-(\mu_1-\mu_2)}{SE}$$

Remember the original formula I gave you for the *t* statistic. There is no *x*! But implicitly, this formula is actually:

$$t=\frac{(\bar{y}_{x=1}-\bar{y}_{x=2})-(\mu_{x=1}-\mu_{x=2})}{SE}$$

So what we are actually interested in is the mean of *y* when *x* takes on one value versus another (in my current example, *x* = 0 or 1).
]

.pull-right[
]

---

# The intercept

For the *t*-test, we can solve for the intercept just as we can for the simple linear regression. Remember our formula for $\bar{y}$:

$$\bar{y}=b_0 + b_1\bar{x}$$

Let's re-write it two ways:

$$\bar{y}_{x=0}=b_0 + b_1\bar{x}_0$$

$$\bar{y}_{x=1}=b_0 + b_1\bar{x}_1$$

Start with the top formula: what is $\bar{x}_0$? This is simple to think about in matrix notation: the mean of a vector of 0's is 0. So, when the group variable is equal to zero (blue collar)...

$$\bar{y}_{bc}=(b_0 + b_1\times0)=b_0$$

Therefore, $b_0$ (the intercept term) is equal to the mean prestige score of the blue collar group (the one coded as 0).

---

# The slope

Now, let's tackle the second formula:

$$\bar{y}_{x=1}=b_0 + b_1\bar{x}_1$$

When the group variable is equal to 1 (white collar), $\bar{x}_1=1$ because the mean of a vector of 1's is 1.

$$\bar{y}_{wc}=(b_0 + b_1\times1)$$

$$\bar{y}_{wc}=b_0+ b_1$$

$$\bar{y}_{wc}=\bar{y}_{bc}+ b_1$$

Solving for $b_1$:

$$b_1=\bar{y}_{wc}-\bar{y}_{bc}$$

Therefore, $b_1$ (the slope) is equal to the difference between group means in prestige scores.

---

# What does this mean?

We could represent a two-group experiment as a regression equation in which the regression coefficient $b_1$ is equal to the difference between group means and the intercept term $b_0$ is the mean of the group coded as 0. Our independent samples *t*-test would take the form:

$$y_i=\bar{y}_{bc} + (\bar{y}_{wc}-\bar{y}_{bc})x_i+error_i$$

Think of it this way: the regression line must pass through these two points:

- (0, $\bar{y}_{bc}$)
- (1, $\bar{y}_{wc}$)

---

# Trust but verify

```{r}
y_bc <- mean(Prestige.2$prestige[Prestige.2$group == 0])
y_wc <- mean(Prestige.2$prestige[Prestige.2$group == 1])
diff <- y_wc - y_bc
cbind(y_bc, y_wc, diff)
```

So, y_bc is $b_0$ and diff is $b_1$...right?

```{r}
coef(fit)
```

---

# The General Linear Model

A number of different statistical models are extensions of this same idea of a GLM:

- Ordinary least squares (OLS) linear regression (simple and multiple): 1+ predictors, which may be continuous or factors
- *t*-test: a *t*-test is basically a regression model where the 1 predictor is a factor with exactly 2 levels
- Analysis of Variance/Covariance (ANOVA/ANCOVA): an ANOVA is basically a regression model where the 1+ predictors are factors
- Multivariate Analysis of Variance/Covariance (MANOVA/MANCOVA): a MANOVA is basically a regression model with 2+ DVs where the 1+ predictors are factors

---

# Further food for thought...

The independent samples *t*-test is a special case of an ANOVA. Specifically, it is a one-way ANOVA (definition: 1 IV that is a factor and 1 DV that is continuous) in which the IV factor has *exactly* two levels.

Again, trust but verify!

```{r}
anova(lm(prestige ~ group, data = Prestige.2)) # doing an ANOVA in R
```

(hint: square the *t*-statistic to get the *F*-statistic equivalent)
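
---

# Further food for thought, verified

A quick check of that hint, reusing the `fit` object from earlier (just arithmetic on the stored results):

```{r}
t_stat <- summary(fit)$coefficients["group", "t value"]
F_stat <- anova(fit)["group", "F value"]

# t^2 for the two-level factor should equal the one-way ANOVA F
c(t_squared = t_stat^2, F = F_stat)
```
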

---

# The non-centrality parameter of the *t* distribution

In an earlier class, I alluded to the fact that the *t*-distribution has another parameter in addition to the degrees of freedom: a non-centrality parameter, $\delta$.

> When $H_0$ is true, $\delta=0$.

> When $H_0$ is false, $\delta\neq0$.

Why is this? Let's look at the formula:

$$\delta=\sqrt{\frac{n_1n_2}{n_1+n_2}}\left(\frac{\mu_1-\mu_2}{\sigma}\right)$$

As you can see, if $\mu_1-\mu_2=0$, then $\delta=0$.
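
---

# The non-centrality parameter, computed

To see $\delta$ in action, here is a small sketch with made-up population values (the mean difference, SD, and group sizes below are all hypothetical):

```{r}
n1 <- 20; n2 <- 20  # hypothetical group sizes
mu_diff <- 5        # hypothetical true mean difference
sigma <- 10         # hypothetical common population SD

(delta <- sqrt(n1 * n2 / (n1 + n2)) * (mu_diff / sigma))

# Two-sided power at alpha = .05, from the non-central t-distribution
crit <- qt(.975, df = n1 + n2 - 2)
pt(-crit, df = n1 + n2 - 2, ncp = delta) +
  (1 - pt(crit, df = n1 + n2 - 2, ncp = delta))
```
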

---

# The *t*-distribution...

.pull-left[
> When the null is true

```{r echo=FALSE}
# Script 12.4 - The normal distribution and several central t-distributions
# Setting the values for the abscissa
tval <- pretty(c(-3.25, 3.25), 1000)
# Ordinates for the normal distribution
ordn <- dnorm(tval)
# Drawing the normal distribution
plot(tval, ordn, type = "l", lty = 1, lwd = 3,
     ylab = "Ordinate", xlab = "Value")
# Setting df for three different t-distributions
nu <- c(1, 3, 10)
# Drawing the three different t-distributions
lines(tval, dt(tval, nu[1]), type = "l", lty = 2, lwd = 2)
lines(tval, dt(tval, nu[2]), type = "l", lty = 3, lwd = 2)
lines(tval, dt(tval, nu[3]), type = "l", lty = 5, lwd = 2)
legend(-3, .40, c("Normal", "t, df=1", "t, df=3", "t, df=10"),
       lwd = c(3, 2, 2, 2), lty = c(1, 2, 3, 5))
```
]

.pull-right[
> When the null is false

```{r echo=FALSE}
# The normal distribution and several non-central t-distributions
tval <- pretty(c(-3.25, 3.25), 1000)
ordn <- dnorm(tval)
plot(tval, ordn, type = "l", lty = 1, lwd = 3,
     ylab = "Ordinate", xlab = "Value")
# Fixed df, varying non-centrality parameters
nu <- c(9, 9, 9)
delta <- c(.5, 1, 1.5)
lines(tval, dt(tval, nu[1], delta[1]), type = "l", lty = 2, lwd = 2)
lines(tval, dt(tval, nu[2], delta[2]), type = "l", lty = 3, lwd = 2)
lines(tval, dt(tval, nu[3], delta[3]), type = "l", lty = 5, lwd = 2)
legend(-3, .40, c("Normal", "t, delta=.5", "t, delta=1", "t, delta=1.5"),
       lwd = c(3, 2, 2, 2), lty = c(1, 2, 3, 5))
```
]

---

# Non-central *t*-distributions...

.pull-left[
> Degrees of freedom=9

```{r echo=FALSE}
tval <- pretty(c(-3.25, 3.25), 1000)
nu <- c(9, 9, 9, 9)
delta <- c(0, .5, 1, 1.5)
plot(tval, dt(tval, nu[1], delta[1]), type = "l", lty = 1, lwd = 3,
     ylab = "Ordinate", xlab = "Value", col = "#d7191c")
# Drawing the three other t-distributions
lines(tval, dt(tval, nu[2], delta[2]), type = "l", lty = 2, lwd = 2, col = "#fdae61")
lines(tval, dt(tval, nu[3], delta[3]), type = "l", lty = 3, lwd = 2, col = "#abdda4")
lines(tval, dt(tval, nu[4], delta[4]), type = "l", lty = 5, lwd = 2, col = "#2b83ba")
legend(-3, .40, c("t, delta=0", "t, delta=.5", "t, delta=1", "t, delta=1.5"),
       lwd = c(3, 2, 2, 2), lty = c(1, 2, 3, 5),
       col = c("#d7191c", "#fdae61", "#abdda4", "#2b83ba"))
```
]

.pull-right[
> $\delta$=5

```{r echo=FALSE}
# Setting the values for the abscissa
ntval <- pretty(c(0, 12), 1000)
nu <- c(10, 20, 30, 40)
delta <- c(5, 5, 5, 5)
plot(ntval, dt(ntval, nu[1], delta[1]), type = "l", lty = 1, lwd = 3,
     ylab = "Ordinate", xlab = "Value",
     xlim = c(0, 12), ylim = c(0, .45), col = "#d7191c")
# Drawing the three other t-distributions
lines(ntval, dt(ntval, nu[2], delta[2]), type = "l", lty = 2, lwd = 2, col = "#fdae61")
lines(ntval, dt(ntval, nu[3], delta[3]), type = "l", lty = 3, lwd = 2, col = "#abdda4")
lines(ntval, dt(ntval, nu[4], delta[4]), type = "l", lty = 5, lwd = 2, col = "#2b83ba")
legend(0, .45, c("df=10", "df=20", "df=30", "df=40"),
       lwd = c(3, 2, 2, 2), lty = c(1, 2, 3, 5),
       col = c("#d7191c", "#fdae61", "#abdda4", "#2b83ba"))
```
]

---

# Using non-central *t*-distributions...

```{r}
help(TDist)
pt(-1, 20, 0)
pt(-1, 20, 5)
```

Recommended:

```{r eval=FALSE}
power.t.test()
```
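
---

# Using `power.t.test()`

For reference, here is one way `power.t.test()` might be called; the inputs below are the same hypothetical values used on the non-centrality slide, so the reported power should agree with the hand computation there:

```{r}
power.t.test(n = 20, delta = 5, sd = 10, sig.level = .05,
             type = "two.sample", alternative = "two.sided")
```
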

---

# Welch's *t*-test: Dealing with unequal variances

Recall the formula for the independent groups *t*-test:

$$t=\frac{(\bar{y}_1-\bar{y}_2)-(\mu_1-\mu_2)}{SE}$$

In Welch's formula, we calculate the SE differently:

$$SE'=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$$

So the formula for Welch's $t'$ is:

$$t'=\frac{(\bar{y}_1-\bar{y}_2)-(\mu_1-\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$$

---

# Welch's *t*-test: Modified degrees of freedom

Recall the degrees of freedom for the independent groups *t*-test:

$$\nu=n_1+n_2-2$$

The degrees of freedom are modified for Welch's $t'$:

$$\nu'=\frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{s_1^4}{n_1^2(n_1-1)}+\frac{s_2^4}{n_2^2(n_2-1)}}$$
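
---

# Welch's *t*-test in R

Note that R's `t.test()` performs Welch's test by default; the pooled test we ran earlier required `var.equal = TRUE`. Comparing the two on the same data (assuming `Prestige.2` is still in the workspace):

```{r}
t.test(prestige ~ group, data = Prestige.2)                    # Welch's t'
t.test(prestige ~ group, data = Prestige.2, var.equal = TRUE)  # pooled-variance t
```
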

---

# George E. P. Box

> "Equally, the statistician knows, for example, that in nature there never was a normal distribution. There never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world."

-*Journal of the American Statistical Association, 71*(356), pp. 791-799.