One of the implicit assumptions up to this point was that the models were being
applied to a single homogeneous population.
In many cases, we take a sample
from a population but that overall group is likely a combination of individuals from
different sub-populations. For example, the SAT study was interested in all
students at the university but that contains the obvious sub-populations based
on the gender of the students. It is dangerous to fit MLR models across
subpopulations but we can also use MLR models to address more sophisticated
research questions by comparing groups. We will be able to compare the
intercepts (mean levels) and the slopes to see if they differ between the
groups. For example, does the relationship between the SATV and FYGPA
differ for male and female students? We can add the grouping information to
the scatterplot of FYGPA vs SATV (Figure 2.167) and
consider whether there is visual evidence of a difference in the slope and/or
intercept between the two groups, with men coded126 as 1 and women coded as 2. Code below changes this variable to GENDER
with more explicit labels, even though they might not be correct and the students were likely forced to choose one or the other.
It appears that the slope for females might be larger (steeper) in this relationship than it is for males. So increases in SAT Verbal percentiles for females might have more of an impact on the average first year GPA. We’ll handle this sort of situation in Section 8.11, where we will formally consider how to change the slopes for different groups. In this section, we develop new methods needed to begin to handle these situations and explore creating models that assume the same slope coefficient for all groups but allow for different y-intercepts. This material ends up resembling what we did for the Two-Way ANOVA additive model.
The results for SATV contrast with Figure 2.168 for the relationship between first year college GPA and SATM percentile by gender of the students. The lines for the two groups appear to be mostly parallel and just seem to have different y-intercepts. In this section, we will learn how we can use our MLR techniques to fit a model to the entire data set that allows for different y-intercepts. The real power of this idea is that we can then also test whether the different groups have different y-intercepts – whether the shift between the groups is “real”. In this example, it appears to suggest that females generally have slightly higher GPAs than males, on average, but that an increase in SATM has about the same impact on GPA for both groups. If this difference in y-intercepts is not “real”, then there appears to be no difference between the sexes in their relationship between SATM and GPA and we can safely continue using a model that does not differentiate the two groups. We could also just subset the data set and do two analyses, but that approach will not allow us to assess whether things are “really” different between the two groups.
satGPA$GENDER <- factor(satGPA$sex) #Make 1,2 coded sex into factor GENDER
levels(satGPA$GENDER) <- c("MALE", "FEMALE") #Make category names clear but names might be wrong
scatterplot(FYGPA~SATV|GENDER, lwd=3, data=satGPA, smooth=F,
main="Scatterplot of GPA vs SATV by Sex")
scatterplot(FYGPA~SATM|GENDER, lwd=3, data=satGPA, smooth=F,
main="Scatterplot of GPA vs SATM by Sex")
To fit one model to a data set that contains multiple groups, we need a way of
entering categorical variable information in an MLR model. Regression models
require quantitative predictor variables for the \(x\text{'s}\) so we can’t
directly enter the text coded information on the gender of the students into the regression model since it
contains “words” and how can multiply a word times a slope coefficient. To be able to
put in “numbers” as predictors, we create what are called
indicator variables127
that are made up of 0s and 1s, with the 0 reflecting one category and 1 the
other, changing depending on the category of the individual in that row of the data set. The
lm
function does this whenever a
factor variable is used as an explanatory variable.
It sets up the indicator
variables using a baseline category (which gets coded as a 0) and the deviation
category for the other level of the variable (which gets coded as a 1). We can see how this works by
exploring what happens when we put GENDER
into our lm
with SATM
, after first making sure it is categorical using
the factor
function and making the factor levels
explicit instead of 1s
and 2s.
satGPA$GENDER <- factor(satGPA$sex) #Make 1,2 coded sex into factor GENDER
levels(satGPA$GENDER) <- c("MALE", "FEMALE") #Make category names clear but names might be wrong
SATGENDER1 <- lm(FYGPA~SATM+GENDER, data=satGPA) #Fit lm with SATM and SEX
summary(SATGENDER1)
##
## Call:
## lm(formula = FYGPA ~ SATM + GENDER, data = satGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.42124 -0.42363 0.01868 0.46540 1.66397
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.21589 0.14858 1.453 0.147
## SATM 0.03861 0.00258 14.969 < 2e-16
## GENDERFEMALE 0.31322 0.04360 7.184 1.32e-12
##
## Residual standard error: 0.6667 on 997 degrees of freedom
## Multiple R-squared: 0.1917, Adjusted R-squared: 0.1901
## F-statistic: 118.2 on 2 and 997 DF, p-value: < 2.2e-16
The GENDER
row contains information that the linear model chose MALE as
the baseline category and FEMALE as the deviation category since MALE does
not show up in the output. To see what lm
is doing for us when we give it a
two-level categorical variable, we can create our own “numerical” predictor that
is 0 for males and 1 for females that we called GENDERINDICATOR
, displayed
for the first 10 observations:
# Convert logical to 0 for male, 1 for female
satGPA$GENDERINDICATOR <- as.numeric(satGPA$GENDER=="FEMALE")
# Explore first few observations
head(tibble(GENDER=satGPA$GENDER, GENDERINDICATOR=satGPA$GENDERINDICATOR), 10)
## # A tibble: 10 x 2
## GENDER GENDERINDICATOR
## <fct> <dbl>
## 1 MALE 0
## 2 FEMALE 1
## 3 FEMALE 1
## 4 MALE 0
## 5 MALE 0
## 6 FEMALE 1
## 7 MALE 0
## 8 MALE 0
## 9 FEMALE 1
## 10 MALE 0
We can define the indicator variable more generally by calling it
\(I_{\text{Female},i}\) to denote that it is an indicator
\((I)\) that takes on a value of 1 for
observations in the category Female and 0 otherwise (Male) – changing based
on the observation (\(i\)). Indicator variables, once created,
are quantitative variables that take on values of 0 or 1 and we can put them
directly
into linear models with other \(x\text{'s}\) (quantitative or categorical). If we
replace the categorical GENDER
variable with our quantitative GENDERINDICATOR
and re-fit the model, we get:
##
## Call:
## lm(formula = FYGPA ~ SATM + GENDERINDICATOR, data = satGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.42124 -0.42363 0.01868 0.46540 1.66397
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.21589 0.14858 1.453 0.147
## SATM 0.03861 0.00258 14.969 < 2e-16
## GENDERINDICATOR 0.31322 0.04360 7.184 1.32e-12
##
## Residual standard error: 0.6667 on 997 degrees of freedom
## Multiple R-squared: 0.1917, Adjusted R-squared: 0.1901
## F-statistic: 118.2 on 2 and 997 DF, p-value: < 2.2e-16
This matches all the previous lm
output except that we didn’t get any
information on the categories used since lm
didn’t know that GENDERINDICATOR
was anything different from other quantitative predictors.
Now we want to think about what this model means. We can write the estimated model as
\[\widehat{\text{FYGPA}}_i = 0.216 + 0.0386\cdot\text{SATM}_i + 0.313I_{\text{Female},i}\ .\]
When we have a male observation, the indicator takes on a value of 0 so the 0.313 drops out of the model, leaving an SLR just in terms of SATM. For a female student, the indicator is 1 and we add 0.313 to the previous y-intercept. The following works this out step-by-step, simplifying the MLR into two SLRs:
Simplified model for Males (plug in a 0 for \(I_{\text{Female},i}\)):
Simplified model for Females (plug in a 1 for \(I_{\text{Female},i}\)):
\(\widehat{\text{FYGPA}}_i = 0.216 + 0.0386\cdot\text{SATM}_i + 0.313*1\)
\(= 0.216 + 0.0386\cdot\text{SATM}_i + 0.313\) (combine “like” terms to simplify the equation)
\(= 0.529 + 0.0386\cdot\text{SATM}_i\)
In this situation, we then end up with two SLR models that relate SATM to GPA, one model for males \((\widehat{\text{FYGPA}}_i= 0.216 + 0.0386\cdot\text{SATM}_i)\) and one for females \((\widehat{\text{FYGPA}}_i= 0.529 + 0.0386\cdot\text{SATM}_i)\). The only difference between these two models is in the y-intercept, with the female model’s y-intercept shifted up from the male y-intercept by 0.313. And that is what adding indicator variables into models does in general128 – it shifts the intercept up or down from the baseline group (here selected as males) to get a new intercept for the deviation group (here females).
To make this visually clearer, Figure 2.169 contains the
regression lines that were estimated for each
group. For any SATM, the difference in the groups is the 0.313 coefficient from
the GENDERFEMALE
or GENDERINDICATOR
row of the model summaries. For example,
at SATM=50, the difference in terms of predicted average first year GPAs
between males and females is displayed as a difference between 2.15 and 2.46.
This model assumes that the slope on SATM is the same for both groups except
that they are allowed to have different y-intercepts, which is reasonable here
because we saw approximately parallel relationships for the two groups in
Figure 2.168.
(ref:fig8-19) Plot of estimated model for FYGPA vs SATM by GENDER of students (female line is thicker dark line). Dashed lines aid in seeing the consistent vertical difference of 0.313 in the two estimated lines based on the model containing a different intercept for each group.
Remember that lm
selects baseline categories typically based on the
alphabetical order of the levels of the categorical variable when it is created unless the reorder
function is used to change the order. Here, the GENDER
variable started with a coding of 1 and 2 and retained that
order even with the recoding of levels that we created to give it more explicit
names. Because we allow lm
to create indicator variables for us, the main
thing you need to do is explore the model summary and look for the hint at the
baseline level that is not displayed after the name of the categorical variable.
We can also work out the impacts of adding an indicator variable to the model in general in the theoretical model with a single quantitative predictor \(x_i\) and indicator \(I_i\). The model starts as
\[y_i=\beta_0+\beta_1x_i + \beta_2I_i + \varepsilon_i\ .\]
Again, there are two versions:
For any observation \(i\) in the baseline category, \(I_i=0\) and the model is \(y_i=\beta_0+\beta_1x_i + \varepsilon_i\).
For any observation \(i\) in the non-baseline (deviation) category, \(I_i=1\) and the model simplifies to \(y_i=(\beta_0+\beta_2)+\beta_1x_i + \varepsilon_i\).
The interpretation and inferences for \(\beta_1\) resemble the work with any MLR model, noting that these results are “controlled for”, “adjusted for”, or “allowing for differences based on” the categorical variable in the model. The interpretation of \(\beta_2\) is as a shift up or down in the y-intercept for the model that includes \(x_i\). When we make term-plots in a model with a quantitative and additive categorical variable, the two reported model components match with the previous discussion – the same estimated term from the quantitative variable for all observations and a shift to reflect the different y-intercepts in the two groups. In Figure 2.170, the females are estimated to be that same 0.313 points higher on first year GPA. The males have a mean GPA slightly above 2.3 which is the predicted GPA for the average SATM percentile for a male (remember that we have to hold the other variable at its mean to make each term-plot). When making the SATM term-plot, the categorical variable is held at the most frequently occurring value (modal category) in the data set, which is for males in this data set.
(ref:fig8-20) Term-plots for the estimated model for \(\text{FYGPA}\sim\text{SATM} + \text{GENDER}\).
## 1
## GENDER 1
## MALE 516
## FEMALE 484
The model summary and confidence intervals provide some potential interesting
inferences in these models. Again, these are just applications of MLR methods
we have already seen except that the definition of one of the variables is
“different” using the indicator coding idea. For the same model, the GENDER
coefficient can be used to generate inferences for differences in the mean the
groups, controlling for their SATM scores.
## Estimate Std. Error t value Pr(>|t|)
## GENDERFEMALE 0.31322 0.04360 7.184 1.32e-12
Testing the null hypothesis that \(H_0: \beta_2=0\) vs \(H_A: \beta_2\ne 0\) using our regular \(t\)-test provides the opportunity to test for a difference in intercepts between the groups. In this situation, the test statistic is \(t=7.184\) and, based on a \(t_{997}\)-distribution if the null is true, the p-value is \(<0.0001\). We have very strong evidence against the null hypothesis that there is no difference in the true y-intercept in a SATM model for first year college GPA between males and females, so we would conclude that there is difference in their true mean GPA levels controlled for SATM. The confidence interval is also informative:
## 2.5 % 97.5 %
## (Intercept) -0.07566665 0.50744709
## SATM 0.03355273 0.04367726
## GENDERFEMALE 0.22766284 0.39877160
We are 95% confident that the true mean GPA for females is between 0.228 and 0.399 points higher than for males, after adjusting for the SATM in the population of students. If we had subset the data set and fit two SLRs, we could have obtained the same simplified regression models but we never could have performed inferences for the differences between the two groups without putting all the observations together in one model and then assessing those differences with targeted coefficients. We also would not be able to get an estimate of their common slope for SATM, after adjusting for differences in the intercept for each group.
We are actually making an educated guess about what these codes mean. Other similar data sets used 1 for males but the documentation on these data is a bit sparse. We proceed with a small potential that the conclusions regarding differences in gender are in the wrong direction.↩
Some people also call them dummy variables to reflect that they are stand-ins for dealing with the categorical information. But it seems like a harsh anthropomorphism so I prefer “indicators”.↩
This is true for additive uses of indicator variables. In Section 8.11, we consider interactions between quantitative and categorical variables which has the effect of changing slopes and intercepts. The simplification ideas to produce estimated equations for each group are used there but we have to account for changing slopes by group too.↩