The same techniques can be extended to more than two groups. A study was
conducted to explore sound tolerances using \(n=98\) subjects with the data
available in the Headache
data set from the heplots
package
(Fox and Friendly 2018).
Each
subject was initially exposed to a tone, stopping when the tone became definitely
intolerable (DU) and that decibel level was recorded (variable called du1
).
Then the subjects were randomly assigned to one of four treatments: T1
(Listened again to the tone at their initial DU level, for the same amount of
time they were able to tolerate it before); T2 (Same as T1, with one
additional minute of exposure); T3 (Same as T2, but the subjects were
explicitly instructed to use the relaxation techniques); and Control (these
subjects experienced no further exposure to the noise tone until the final
sensitivity measures were taken). Then the DU was measured again (variable
called du2
). One would expect that there would be a relationship between the
upper tolerance levels of the subjects before and after treatment. But
maybe the treatments impact that relationship? We can use our indicator
approach to see if the treatments provide a shift to higher tolerances after
accounting for the relationship between the two measurements129. The scatterplot130
of the results in Figure 2.171 shows some variation in the
slopes and the intercepts for the groups although the variation in intercepts
seems more prominent than differences in slopes. Note that the relevel
function was applied to the treatment
variable with an option of "Control"
to make the Control category the baseline category as the person who created the data set had set T1
as the baseline in the treatment
variable.
(ref:fig8-21) Scatterplot of post-treatment decibel tolerance (du2) vs pre-treatment tolerance (du1) by treatment level (4 groups).
## # A tibble: 98 x 6
## type treatment u1 du1 u2 du2
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Migrane T3 2.34 5.3 5.8 8.52
## 2 Migrane T1 2.73 6.85 4.68 6.68
## 3 Tension T1 0.37 0.53 0.55 0.84
## 4 Migrane T3 7.5 9.12 5.7 7.88
## 5 Migrane T3 4.63 7.21 5.63 6.75
## 6 Migrane T3 3.6 7.3 4.83 7.32
## 7 Migrane T2 2.45 3.75 2.5 3.18
## 8 Migrane T1 2.31 3.25 2 3.3
## 9 Migrane T1 1.38 2.33 2.23 3.98
## 10 Tension T3 0.85 1.42 1.37 1.89
## # ... with 88 more rows
Headache$treatment <- factor(Headache$treatment)
Headache$treatment <- relevel(Headache$treatment, "Control") #Make Control the baseline category
scatterplot(du2~du1|treatment, data=Headache, smooth=F, lwd=2,
main="Plot of Maximum DB tolerances before & after treatment (by treatment)",
legend=list(coords="topleft",columns=2), col=viridis(4))
This data set contains a categorical variable with 4 levels. To go beyond two groups, we have to add more than one indicator variable, defining three indicators to turn on (1) or off (0) for three of the levels of the variable with the same reference level used for all the indicators. For this example, the Control group is chosen as the baseline group so it hides in the background while we define indicators for the other three levels. The indicators for T1, T2, and T3 treatment levels are:
Indicator for T1: \(I_{T1,i}=\left\{\begin{array}{rl} 1 & \text{if Treatment}=T1 \\ 0 & \text{else} \end{array}\right.\)
Indicator for T2: \(I_{T2,i}=\left\{\begin{array}{rl} 1 & \text{if Treatment}=T2 \\ 0 & \text{else} \end{array}\right.\)
Indicator for T3: \(I_{T3,i}=\left\{\begin{array}{rl} 1 & \text{if Treatment}=T3 \\ 0 & \text{else} \end{array}\right.\)
We can see the values of these indicators for a few observations and their
original variable (treatment
) in the following output. The bolded
observations show each of the indicators being “turned on”. For Control all the
indicators stay at 0.
Treatment | I_T1 | I_T2 | I_T3 |
---|---|---|---|
T3 | 0 | 0 | 1 |
T1 | 1 | 0 | 0 |
T1 | 1 | 0 | 0 |
T3 | 0 | 0 | 1 |
T3 | 0 | 0 | 1 |
T3 | 0 | 0 | 1 |
T2 | 0 | 1 | 0 |
T1 | 1 | 0 | 0 |
T1 | 1 | 0 | 0 |
T3 | 0 | 0 | 1 |
T3 | 0 | 0 | 1 |
T2 | 0 | 1 | 0 |
T3 | 0 | 0 | 1 |
T1 | 1 | 0 | 0 |
T3 | 0 | 0 | 1 |
Control | 0 | 0 | 0 |
T3 | 0 | 0 | 1 |
When we fit the additive model of the form y~x+group
, the lm
function takes the \(\boldsymbol{J}\) categories and creates \(\boldsymbol{J-1}\)
indicator variables.
The baseline level is always handled in the intercept.
The true model will be of the form
\[y_i=\beta_0 + \beta_1x_i +\beta_2I_{\text{Level}2,i}+\beta_3I_{\text{Level}3,i} +\cdots+\beta_{J}I_{\text{Level}J,i}+\varepsilon_i\]
where the \(I_{\text{CatName}j,i}\text{'s}\) are the different indicator variables. Note that each indicator variable gets a coefficient associated with it and is “turned on” whenever the \(i^{th}\) observation is in that category. At most only one of the \(I_{\text{CatName}j,i}\text{'s}\) is a 1 for any observation, so the y-intercept will either be \(\beta_0\) for the baseline group or \(\beta_0+\beta_j\) for \(j=2,\ldots,J\). It is important to remember that this is an “additive” model since the effects just add and there is no interaction between the grouping variable and the quantitative predictor. To be able to trust this model, we need to check that we do not need different slope coefficients for the groups as discussed in the next section.
For these types of models, it is always good to start with a plot of the data
set with regression lines for each group – assessing whether the lines look
relatively parallel or not.
In Figure 2.171, there are some
differences in slopes – we investigate that further in the next section. For
now, we can proceed with fitting the additive model with different intercepts
for the four levels of treatment
and the quantitative explanatory variable, du1
.
##
## Call:
## lm(formula = du2 ~ du1 + treatment, data = Headache)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9085 -0.9551 -0.3118 1.1141 10.5364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.25165 0.51624 0.487 0.6271
## du1 0.83705 0.05176 16.172 <2e-16
## treatmentT1 0.55752 0.61830 0.902 0.3695
## treatmentT2 0.63444 0.63884 0.993 0.3232
## treatmentT3 1.36671 0.60608 2.255 0.0265
##
## Residual standard error: 2.14 on 93 degrees of freedom
## Multiple R-squared: 0.7511, Adjusted R-squared: 0.7404
## F-statistic: 70.16 on 4 and 93 DF, p-value: < 2.2e-16
The complete estimated regression model is
\[\widehat{\text{du2}}_i=0.252+0.837\cdot\text{du1}_i +0.558I_{\text{T1},i}+0.634I_{\text{T2},i}+1.367I_{\text{T3},i}\ .\]
For each group, the model simplifies to an SLR as follows:
\[\begin{array}{rl} \widehat{\text{du2}}_i &=0.252+0.837\cdot\text{du1}_i +0.558I_{\text{T1},i}+0.634I_{\text{T2},i}+1.367I_{\text{T3},i} \\ &= 0.252+0.837\cdot\text{du1}_i+0.558*0+0.634*0+1.367*0 \\ &= 0.252+0.837\cdot\text{du1}_i. \end{array}\]
\[\begin{array}{rl} \widehat{\text{du2}}_i &=0.252+0.837\cdot\text{du1}_i +0.558I_{\text{T1},i}+0.634I_{\text{T2},i}+1.367I_{\text{T3},i} \\ &= 0.252+0.837\cdot\text{du1}_i+0.558*1+0.634*0+1.367*0 \\ &= 0.809+0.837\cdot\text{du1}_i + 0.558 \\ &= 0.81+0.837\cdot\text{du1}_i. \end{array}\]
\[\widehat{\text{du2}}_i = 0.886 + 0.837\cdot\text{du1}_i\ .\]
\[\widehat{\text{du2}}_i = 1.62 + 0.837\cdot\text{du1}_i\ .\]
To reinforce what this additive model is doing, Figure 2.172 displays the estimated regression lines for all four groups, showing the shifts in the y-intercepts among the groups.
The right panel of the term-plot (Figure 2.173) shows how the T3 group
seems to have shifted up the most relative to
the others and the Control group seems to have a mean that is a bit lower than the others,
in the model that otherwise assumes that the same linear relationship holds between
du1
and du2
for all the groups. After controlling for the Treatment
group, for a 1 decibel increase in initial tolerances, we estimate, on average,
to obtain a 0.84 decibel change in the second tolerance measurement. The
R2 shows that this is a decent model for the responses, with
this model explaining 75.1% percent of the
variation in the second decibel tolerance measure. We should check the
diagnostic plots and VIFs to check for any issues – all the diagnostics and
assumptions are as before except that there is no assumption of linearity
between the grouping variable and the responses.
Additionally, sometimes we
need to add group information to diagnostics to see if any patterns in residuals look
different in different groups, like linearity or non-constant variance, when we are fitting models that might contain multiple groups.
The diagnostic plots in Figure 2.174 provides some indications of a few observations in the tails that deviate from a normal distribution to having slightly heavier tails but only one outlier is of real concern and causes some concern about the normality assumption. There is a small indication of increasing variability as a function of the fitted values as both the Residuals vs. Fitted and Scale-Location plots show some fanning out for higher values but this is a minor issue. There are no influential points here since all the Cook’s D values are less than 0.5.
par(mfrow=c(2,2), oma=c(0,0,2,0))
plot(head1,
sub.caption="Plot of diagnostics for additive model with du1 and treatment for du2", pch=16)
Additionally, sometimes we
need to add group information to diagnostics to see if any patterns in residuals look
different in different groups, like linearity or non-constant variance, when we are fitting models that might contain multiple groups. We can use the scatterplot
function to plot the residuals (extracted using the residuals
function) versus the fitted values (extracted using the fitted
function) by groups as in Figure 2.175. In this example, there are no additional patterns extracted by making this plot, but it is a good additional check in these multi-group situations.
The VIFs are different for categorical variables than for quantitative
predictors in MLR. The 4 levels
are combined in a measure called the generalized VIF (GVIF). For GVIFs,
we only focus on the inflation of the SE scale (square root for 1 df effects
and raised to the power \(1/(2*J)\) for a \(J\)-level predictor). On this scale,
the interpretation is as the multiplicative increase in the SEs for the coefficients on all the indicator variables due to
multicollinearity with other predictors. In this model, the SE for du1
is
1.009 times larger due to multicollinearity with other predictors
and the SEs for the indicator variables for treatment
are 1.003 times larger due to multicollinearity than they otherwise
would have been. Neither are large so multicollinearity is not a problem in
this model.
## GVIF Df GVIF^(1/(2*Df))
## du1 1.01786 1 1.008891
## treatment 1.01786 3 1.002955
While there are inferences available in the model output, the tests for the indicator variables are not too informative since they only compare each group to the baseline. In Section 8.12, we see how to use ANOVA F-tests to help us ask general questions about including a categorical predictor in the model. But we can compare adjusted R2 values with and without Treatment to see if including the categorical variable was “worth it”:
##
## Call:
## lm(formula = du2 ~ du1, data = Headache)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9887 -0.8820 -0.2765 1.1529 10.4165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.84744 0.36045 2.351 0.0208
## du1 0.85142 0.05189 16.408 <2e-16
##
## Residual standard error: 2.165 on 96 degrees of freedom
## Multiple R-squared: 0.7371, Adjusted R-squared: 0.7344
## F-statistic: 269.2 on 1 and 96 DF, p-value: < 2.2e-16
The adjusted R2 in the model with both Treatment and du1 is 0.7404 and the adjusted R2 for this reduced model with just du1 is 0.7344, suggesting the Treatment is useful. The next section provides a technique to be able to work with different slopes on the quantitative predictor for each group. Comparing those results to the results for the additive model allows assessment of the assumption in this section that all the groups had the same slope coefficient for the quantitative variable.
Fox, John, and Michael Friendly. 2018. Heplots: Visualizing Hypothesis Tests in Multivariate Linear Models. https://CRAN.R-project.org/package=heplots.
Models like this with a categorical variable and quantitative variable are often called ANCOVA or analysis of covariance models but really are just versions of our linear models we’ve been using throughout this material.↩
Note that we employed some specific options in the legend
option to get the legend to fit on this scatterplot better. Usually you can avoid this but the coords
option defined a location and the columns
option made it a two column legend. The viridis(4)
code makes the plot in a suite of four colors from the viridis
package. (Garnier 2018)↩