---
title: "CONJ620: CM 4.2"
subtitle: "The General Linear Model"
author: "Alison Hill"
output:
  xaringan::moon_reader:
    css: ["default", "css/ohsu.css", "css/ohsu-fonts.css"]
    lib_dir: libs
    nature:
      highlightStyle: atelier-lakeside-light
      highlightLines: true
      countIncrementalSlides: false
---

```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
options(knitr.table.format = "html")
knitr::opts_chunk$set(warning = FALSE, message = FALSE,
                      comment = NA, dpi = 300,
                      fig.align = "center", cache = FALSE)
library(tidyverse)
library(infer)
library(moderndive)
```

## Family of *t*-tests

- One-sample *t*-test
- Independent samples *t*-test
- Dependent samples *t*-test (also known as paired)

---

## What do they all have in common?

We want to compare two things: mean 1 and mean 2!

---

## Ways to get independent samples

* Random assignment of participants to two different experimental conditions
    * Scream versus Scream 2
    * Classical music versus silence
* Naturally occurring assignment of participants to two different groups
    * Male versus female
    * Young versus old

---

## Formula for the independent groups *t*-test

$$t=\frac{(\bar{y}_1-\bar{y}_2)-(\mu_1-\mu_2)}{SE}$$

Generally,

> $H_0: \mu_1-\mu_2=0$ and $H_1: \mu_1-\mu_2\neq0$

So $\mu_1-\mu_2$ is often excluded from the formula.

Since we now have two sample means and therefore two sample variances, we need some way to combine these two variances in a logical way. The answer is to __pool__ the variances to estimate the SE of the difference, $(\bar{y}_1-\bar{y}_2)$:

$$s_{\bar{y}_1-\bar{y}_2}=s_{pooled}\times{\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$

where $s_{pooled}$ is:

$$s_{pooled}=\sqrt{\frac{(n_1-1)s_{y_1}^2+(n_2-1)s_{y_2}^2}{n_1+n_2-2}}$$
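
---

## Checking the *t* formula by hand

Before moving on, here is a minimal R sketch of these formulas, using two small made-up samples (`y1` and `y2` are hypothetical, not the Prestige data used later):

```{r}
y1 <- c(52, 48, 55, 60, 57)  # hypothetical scores, group 1
y2 <- c(44, 50, 41, 46, 49)  # hypothetical scores, group 2
n1 <- length(y1)
n2 <- length(y2)

# Pooled standard deviation
s_pooled <- sqrt(((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2))

# SE of the difference, then t (under H0: mu1 - mu2 = 0)
se_diff <- s_pooled * sqrt(1 / n1 + 1 / n2)
(t_by_hand <- (mean(y1) - mean(y2)) / se_diff)

# Should match the built-in pooled-variance t-test
unname(t.test(y1, y2, var.equal = TRUE)$statistic)
```
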

---

## The *t*-test as a general linear model (GLM)

All statistical procedures are basically the same thing:

$$outcome_i=(model)+error_i$$

In my simple linear regression from the past few labs, this has been:

$$prestige_i=(intercept + education_i)+error_i$$

Or more generally:

$$y_i=(b_0 + b_1x_i)+error_i$$

> __N.B.__ $Error_i$ is an unknowable, incalculable quantity: it is the deviation of the $i^{th}$ value from the (unobservable) true value. You can think of it as measurement error. This is different from the error of the regression formula, which is defined as the *residual* of the $i^{th}$ value and calculated as the difference between the observed $(y_i)$ and predicted $(\hat{y_i})$ scores.

---

## Sidebar: The GLM in matrix notation

The GLM we have been dealing with thus far includes just one independent variable and thus just one $b_1$. However, the full GLM is better represented in matrix form:

$$\boldsymbol{Y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}$$

where...

* $\boldsymbol{Y}$ is the *response vector* of length *N*;
* $\boldsymbol{\epsilon}$ is the *error vector* of length *N*;
* $\boldsymbol{\beta}$ is the vector of parameters of length *p*+1, where *p* is the number of IVs and the 1 accounts for the intercept;
* and $\boldsymbol{X}$ is the *design matrix*, with *N* rows and *p*+1 columns.
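
---

## Sidebar: Building a design matrix in R

To make $\boldsymbol{X}$ concrete, R's `model.matrix()` builds the design matrix for any model formula. A minimal sketch with a tiny made-up data frame (`d` and its variables are hypothetical, just for illustration):

```{r}
# Hypothetical data: one continuous IV and one two-level group IV
d <- data.frame(y = c(3, 5, 4, 6),
                x = c(1.2, 2.5, 3.1, 4.8),
                group = factor(c("bc", "bc", "wc", "wc")))

model.matrix(y ~ x, data = d)      # a column of 1's plus the continuous x
model.matrix(y ~ group, data = d)  # a column of 1's plus a 0/1 dummy
```
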

---

# Design matrices with one independent variable

.pull-left[
In both simple linear regression and the independent samples *t*-test, $\boldsymbol{X}$ is a matrix with *N* rows and *2* columns. Note that the number of columns in $\boldsymbol{X}$ must always equal the number of rows in $\boldsymbol{\beta}$.

![here](latex-image-2.pdf)
]

.pull-right[
In simple linear regression, the vector *x* can take on any value. In the independent samples *t*-test, this vector simply contains 0's and 1's. Below, *n* = 4 with 2 in each group to illustrate.

![here](latex-image-1.pdf)
]

---

## Dummy variables

.pull-left[
Now, let's switch IVs in our example from education (a continuous variable) to a group variable:

$$prestige_i=(intercept + group_i)+error_i$$

where group is a dummy or indicator variable that can take only two values: 0 or 1.

This is easy to do in R. Remember that my Prestige dataset has a categorical variable called "type." Let's let the variable group denote:

> 0 for blue collar (type = "bc") and

> 1 for white collar (type = "wc")
]

.pull-right[
]

---

# Creating a dummy variable in R

```{r warning = FALSE, message = FALSE}
library(car) # the Prestige data lives here
table(Prestige$type)
```

```{r}
Prestige$group <- ifelse(Prestige$type == "bc", 0,
                  ifelse(Prestige$type == "wc", 1, NA))
table(Prestige$type, Prestige$group)
```

---

# Cleaning up the new dataset

```{r}
Prestige.2 <- subset(Prestige, group %in% c(0, 1))
Prestige.2$type <- droplevels(Prestige.2$type) # drop the unused type = "prof" level
```

---

# OK, we are all set now with two groups

```{r}
table(Prestige.2$type, Prestige.2$group)
```

```{r}
head(Prestige.2)
```

---

# Plotting data for two groups

Not surprisingly, mean prestige ratings appear to be higher among white collar workers than blue collar workers.

```{r warning=FALSE, message=FALSE, echo=FALSE}
library(ggplot2)
Prestige.2$group.l <- factor(Prestige.2$group, labels = c("bc", "wc"))
ggplot(Prestige.2, aes(x = factor(group.l), y = prestige)) +
  geom_dotplot(stackdir = "centerwhole", binwidth = 1, binaxis = 'y', alpha = .5) +
  theme_bw() +
  geom_segment(aes(x = 0.75, y = mean(Prestige.2$prestige[Prestige.2$group == 0]),
                   xend = 1.25, yend = mean(Prestige.2$prestige[Prestige.2$group == 0]))) +
  geom_segment(aes(x = 1.75, y = mean(Prestige.2$prestige[Prestige.2$group == 1]),
                   xend = 2.25, yend = mean(Prestige.2$prestige[Prestige.2$group == 1])))
```

---

# But are the two groups different?

Let's do an independent samples *t*-test to find the answer:

```{r}
t.test(prestige ~ group, data = Prestige.2, var.equal = TRUE)
```

---

# What happens if I now run a linear regression?

---

# Kidding. Here is the linear regression summary...

```{r}
fit <- lm(formula = prestige ~ group, data = Prestige.2)
summary(fit)
```

---

# But what does it MEAN?

Let's do a quick review of the linear regression formula:

$$y_i=(b_0 + b_1x_i)+error_i$$

To solve for the intercept term, for example, we need to apply some summation algebra:

$$\frac{1}{n}\sum_i^n{y_i}=\frac{1}{n}\sum_i^n{b_0} + \frac{1}{n}\sum_i^n{b_1x_i}+\frac{1}{n}\sum_i^n{error_i}$$

A few reminders about summation algebra in this context:

- The mean is defined as: $\frac{1}{n}\sum_i^n{y_i}$
- The sum of a constant is just *n* times the constant: $\sum_i^n{b_0}=n\times{b_0}$
- The sum of a constant times a random variable is the constant times the sum of the variable: $\sum_i^n{b_1x_i}=b_1\sum_i^n{x_i}$
- By definition, in GLM, we assume the mean of the error is 0: $\frac{1}{n}\sum_i^n{error_i}=0$

---

# Solving for the regression coefficients

Applying the summation algebra from the previous slide, we get:

$$\frac{1}{n}\sum_i^n{y_i}=\frac{1}{n}nb_0 + \frac{1}{n}b_1\sum_i^n{x_i}$$

$$\bar{y}=b_0 + b_1\bar{x}$$

This should look familiar! The formula for the regression intercept term, $b_0$, is:

$$b_0=\bar{y}-b_1\bar{x}$$

But look again: we also have the formula for $\bar{y}$. The intercept term in linear regression is the expected mean value of $y$ when $x_i=0$.
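
---

# Solving for the regression coefficients, verified

Since this identity is easy to check, here is a quick verification in R using the `fit` model from the earlier slide (this check is mine, not part of the original derivation):

```{r}
b1 <- coef(fit)["group"]
b0_by_hand <- mean(Prestige.2$prestige) - b1 * mean(Prestige.2$group)

# OLS guarantees b0 = ybar - b1 * xbar; compare with lm's intercept
c(by_hand = unname(b0_by_hand), from_lm = unname(coef(fit)["(Intercept)"]))
```
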

---

# Solving for $\bar{x}$

.pull-left[
$$\bar{y}=b_0 + b_1\bar{x}$$

But, you protest, how do we calculate $\bar{x}$? Is it the mean of the 0/1 values in *x*? What is happening?
]

.pull-right[
]

---

# Solving for $\bar{x}$

.pull-left[
$$t=\frac{(\bar{y}_1-\bar{y}_2)-(\mu_1-\mu_2)}{SE}$$

Remember the original formula I gave you for the *t* statistic. There is no *x*! But implicitly, this formula is actually:

$$t=\frac{(\bar{y}_{x=1}-\bar{y}_{x=2})-(\mu_{x=1}-\mu_{x=2})}{SE}$$

So what we are actually interested in is the mean of *y* when *x* takes on one value versus another (in my current example, *x* = 0 or 1).
]

.pull-right[
]

---

# The intercept

For the *t*-test, we can solve for the intercept just as we can for the simple linear regression. Remember our formula for $\bar{y}$:

$$\bar{y}=b_0 + b_1\bar{x}$$

Let's re-write it two ways:

$$\bar{y}_{x=0}=b_0 + b_1\bar{x}_0$$

$$\bar{y}_{x=1}=b_0 + b_1\bar{x}_1$$

Start with the top formula: what is $\bar{x}_0$? This is simple to think about in matrix notation: the mean of a vector of 0's is 0. So, when the group variable is equal to zero (blue collar)...

$$\bar{y}_{bc}=(b_0 + b_1\times0)=b_0$$

Therefore, $b_0$ (the intercept term) is equal to the mean prestige score of the blue collar group (the one coded as 0).

---

# The slope

Now, let's tackle the second formula:

$$\bar{y}_{x=1}=b_0 + b_1\bar{x}_1$$

When the group variable is equal to 1 (white collar), $\bar{x}_1=1$ because the mean of a vector of 1's is 1.

$$\bar{y}_{wc}=(b_0 + b_1\times1)$$

$$\bar{y}_{wc}=b_0+ b_1$$

$$\bar{y}_{wc}=\bar{y}_{bc}+ b_1$$

Solving for $b_1$:

$$b_1=\bar{y}_{wc}-\bar{y}_{bc}$$

Therefore, $b_1$ (the slope) is equal to the difference between group means in prestige scores.

---

# What does this mean?

We could represent a two-group experiment as a regression equation in which the regression coefficient $b_1$ is equal to the difference between group means and the intercept term $b_0$ is the mean of the group coded as 0. Our independent samples *t*-test would take the form:

$$y_i=\bar{y}_{bc} + (\bar{y}_{wc}-\bar{y}_{bc})x_i+error_i$$

Think of it this way: the regression line must pass through these two points:

- (0, $\bar{y}_{bc}$)
- (1, $\bar{y}_{wc}$)

---

# Trust but verify

```{r}
y_bc <- mean(Prestige.2$prestige[Prestige.2$group == 0])
y_wc <- mean(Prestige.2$prestige[Prestige.2$group == 1])
diff <- y_wc - y_bc
cbind(y_bc, y_wc, diff)
```

So, y_bc is $b_0$ and diff is $b_1$...right?

```{r}
coef(fit)
```

---

# The General Linear Model

A number of different statistical models are extensions of this same idea of a GLM:

- Ordinary least squares (OLS) linear regression (simple and multiple): 1+ predictors, which may be continuous or factors
- *t*-test: a *t*-test is basically a regression model where the 1 predictor is a factor with exactly 2 levels
- Analysis of Variance/Covariance (ANOVA/ANCOVA): an ANOVA is basically a regression model where the 1+ predictors are factors
- Multivariate Analysis of Variance/Covariance (MANOVA/MANCOVA): a MANOVA is basically a regression model with 2+ DVs where the 1+ predictors are factors

---

# Further food for thought...

The independent samples *t*-test is a special case of an ANOVA. Specifically, it is a one-way ANOVA (definition: 1 IV that is a factor and 1 DV that is continuous) in which the IV factor has *exactly* two levels.

Again, trust but verify!

```{r}
anova(lm(prestige ~ group, data = Prestige.2)) # doing an ANOVA in R
```

(hint: square the *t*-statistic to get the *F*-statistic equivalent)
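
---

# Further food for thought, verified

A quick check of that hint, reusing the `fit` object from earlier (just arithmetic on the stored results):

```{r}
t_stat <- summary(fit)$coefficients["group", "t value"]
F_stat <- anova(fit)["group", "F value"]

# t^2 for the two-level factor should equal the one-way ANOVA F
c(t_squared = t_stat^2, F = F_stat)
```
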

---

# The non-centrality parameter of the *t* distribution

In an earlier class, I alluded to the fact that the *t*-distribution has another parameter in addition to the degrees of freedom: a non-centrality parameter, $\delta$.

> When $H_0$ is true, $\delta=0$.

> When $H_0$ is false, $\delta\neq0$.

Why is this? Let's look at the formula:

$$\delta=\sqrt{\frac{n_1n_2}{n_1+n_2}}\left(\frac{\mu_1-\mu_2}{\sigma}\right)$$

As you can see, if $\mu_1-\mu_2=0$, then $\delta=0$.
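
---

# The non-centrality parameter, computed

To see $\delta$ in action, here is a small sketch with made-up population values (the mean difference, SD, and group sizes below are all hypothetical):

```{r}
n1 <- 20; n2 <- 20  # hypothetical group sizes
mu_diff <- 5        # hypothetical true mean difference
sigma <- 10         # hypothetical common population SD

(delta <- sqrt(n1 * n2 / (n1 + n2)) * (mu_diff / sigma))

# Two-sided power at alpha = .05, from the non-central t-distribution
crit <- qt(.975, df = n1 + n2 - 2)
pt(-crit, df = n1 + n2 - 2, ncp = delta) +
  (1 - pt(crit, df = n1 + n2 - 2, ncp = delta))
```
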

---

# The *t*-distribution...

.pull-left[
> When the null is true

```{r echo=FALSE}
# Script 12.4 - The normal distribution and several central t-distributions
# Setting the values for the abscissa
tval <- pretty(c(-3.25, 3.25), 1000)
# Ordinates for the normal distribution
ordn <- dnorm(tval)
# Drawing the normal distribution
plot(tval, ordn, type = "l", lty = 1, lwd = 3,
     ylab = "Ordinate", xlab = "Value")
# Setting df for three different t-distributions
nu <- c(1, 3, 10)
# Drawing the three different t-distributions
lines(tval, dt(tval, nu[1]), type = "l", lty = 2, lwd = 2)
lines(tval, dt(tval, nu[2]), type = "l", lty = 3, lwd = 2)
lines(tval, dt(tval, nu[3]), type = "l", lty = 5, lwd = 2)
legend(-3, .40, c("Normal", "t, df=1", "t, df=3", "t, df=10"),
       lwd = c(3, 2, 2, 2), lty = c(1, 2, 3, 5))
```
]

.pull-right[
> When the null is false

```{r echo=FALSE}
# The normal distribution and several non-central t-distributions
tval <- pretty(c(-3.25, 3.25), 1000)
ordn <- dnorm(tval)
plot(tval, ordn, type = "l", lty = 1, lwd = 3,
     ylab = "Ordinate", xlab = "Value")
# Fixed df, varying non-centrality parameters
nu <- c(9, 9, 9)
delta <- c(.5, 1, 1.5)
lines(tval, dt(tval, nu[1], delta[1]), type = "l", lty = 2, lwd = 2)
lines(tval, dt(tval, nu[2], delta[2]), type = "l", lty = 3, lwd = 2)
lines(tval, dt(tval, nu[3], delta[3]), type = "l", lty = 5, lwd = 2)
legend(-3, .40, c("Normal", "t, delta=.5", "t, delta=1", "t, delta=1.5"),
       lwd = c(3, 2, 2, 2), lty = c(1, 2, 3, 5))
```
]

---

# Non-central *t*-distributions...

.pull-left[
> Degrees of freedom=9

```{r echo=FALSE}
tval <- pretty(c(-3.25, 3.25), 1000)
nu <- c(9, 9, 9, 9)
delta <- c(0, .5, 1, 1.5)
plot(tval, dt(tval, nu[1], delta[1]), type = "l", lty = 1, lwd = 3,
     ylab = "Ordinate", xlab = "Value", col = "#d7191c")
# Drawing the three other t-distributions
lines(tval, dt(tval, nu[2], delta[2]), type = "l", lty = 2, lwd = 2, col = "#fdae61")
lines(tval, dt(tval, nu[3], delta[3]), type = "l", lty = 3, lwd = 2, col = "#abdda4")
lines(tval, dt(tval, nu[4], delta[4]), type = "l", lty = 5, lwd = 2, col = "#2b83ba")
legend(-3, .40, c("t, delta=0", "t, delta=.5", "t, delta=1", "t, delta=1.5"),
       lwd = c(3, 2, 2, 2), lty = c(1, 2, 3, 5),
       col = c("#d7191c", "#fdae61", "#abdda4", "#2b83ba"))
```
]

.pull-right[
> $\delta$=5

```{r echo=FALSE}
# Setting the values for the abscissa
ntval <- pretty(c(0, 12), 1000)
nu <- c(10, 20, 30, 40)
delta <- c(5, 5, 5, 5)
plot(ntval, dt(ntval, nu[1], delta[1]), type = "l", lty = 1, lwd = 3,
     ylab = "Ordinate", xlab = "Value",
     xlim = c(0, 12), ylim = c(0, .45), col = "#d7191c")
# Drawing the three other t-distributions
lines(ntval, dt(ntval, nu[2], delta[2]), type = "l", lty = 2, lwd = 2, col = "#fdae61")
lines(ntval, dt(ntval, nu[3], delta[3]), type = "l", lty = 3, lwd = 2, col = "#abdda4")
lines(ntval, dt(ntval, nu[4], delta[4]), type = "l", lty = 5, lwd = 2, col = "#2b83ba")
legend(0, .45, c("df=10", "df=20", "df=30", "df=40"),
       lwd = c(3, 2, 2, 2), lty = c(1, 2, 3, 5),
       col = c("#d7191c", "#fdae61", "#abdda4", "#2b83ba"))
```
]

---

# Using non-central *t*-distributions...

```{r}
help(TDist)
pt(-1, 20, 0)
pt(-1, 20, 5)
```

Recommended:

```{r eval=FALSE}
power.t.test()
```
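
---

# Using `power.t.test()`

For reference, here is one way `power.t.test()` might be called; the inputs below are the same hypothetical values used on the non-centrality slide, so the reported power should agree with the hand computation there:

```{r}
power.t.test(n = 20, delta = 5, sd = 10, sig.level = .05,
             type = "two.sample", alternative = "two.sided")
```
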

---

# Welch's *t*-test: Dealing with unequal variances

Recall the formula for the independent groups *t*-test:

$$t=\frac{(\bar{y}_1-\bar{y}_2)-(\mu_1-\mu_2)}{SE}$$

In Welch's formula, we calculate the SE differently:

$$SE'=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$$

So the formula for Welch's $t'$ is:

$$t'=\frac{(\bar{y}_1-\bar{y}_2)-(\mu_1-\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$$

---

# Welch's *t*-test: Modified degrees of freedom

Recall the degrees of freedom for the independent groups *t*-test:

$$\nu=n_1+n_2-2$$

The degrees of freedom are modified for Welch's $t'$:

$$\nu'=\frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{s_1^4}{n_1^2(n_1-1)}+\frac{s_2^4}{n_2^2(n_2-1)}}$$
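
---

# Welch's *t*-test in R

Note that R's `t.test()` performs Welch's test by default; the pooled test we ran earlier required `var.equal = TRUE`. Comparing the two on the same data (assuming `Prestige.2` is still in the workspace):

```{r}
t.test(prestige ~ group, data = Prestige.2)                    # Welch's t'
t.test(prestige ~ group, data = Prestige.2, var.equal = TRUE)  # pooled-variance t
```
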

---

# George E. P. Box

> "Equally, the statistician knows, for example, that in nature there never was a normal distribution. There never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world."

-*Journal of the American Statistical Association, 71*(356), pp. 791-799.