Many universities require students to have certain test scores in order to be admitted into their institutions, so they obviously must think that those scores are useful predictors of student success. Quality assessments of recruiting classes are also based on their test scores. The Educational Testing Service (the company behind such fun exams as the SAT and GRE) collected a data set to validate their SAT on \(n=1000\) students from an unnamed Midwestern university; the data are available in the satGPA data set in the openintro package (Diez, Barr, and Cetinkaya-Rundel 2017).
It is unclear from the documentation whether a random sample was collected; in fact, it looks like it certainly was not a random sample of all incoming students at a large university (more later). What potential issues would arise if a company provided a data set to show the performance of its own test and the data were not based on a random sample?
We will proceed assuming they used good methods in developing their test
(there are sophisticated
statistical models underlying the development of the SAT and GRE) and also in
obtaining a data set for testing out the performance of their tests that is at
least representative of the students (or some types of students) at this
university. They provided information on the SAT Verbal (SATV
)
and Math (SATM
) percentiles (these are not the scores but the ranking
percentile that each score translated to in a particular year),
High School GPA (HSGPA
), First Year of college GPA (FYGPA
), Gender (Sex
of the students coded 1 and 2 with possibly 1 for males and 2 for females – the documentation was also unclear this). Should Sex
even be displayed in a plot with correlations since it is a categorical variable? Our interests here are in whether the two SAT percentiles are (together?)
related to the first year college GPA, describing the size of their impacts
and assessing the predictive potential of SAT-based measures for first year in
college GPA. There are certainly other possible research questions that can be
addressed with these data but this will keep us focused.
library(openintro)
library(psych)    # provides pairs.panels
library(tibble)   # provides as_tibble
data(satGPA)
satGPA <- as_tibble(satGPA)
pairs.panels(satGPA[,-4], ellipses = FALSE, col = "red", lwd = 2)
There are positive relationships in Figure 2.163 among all the pre-college measures and the college GPA, but none are above the moderate strength level. The HSGPA has the highest correlation with first-year college results, but even that correlation is not very strong. Maybe together in a model the SAT percentiles can also be useful… Also note that the plot shows an odd HSGPA of 4.5 that probably should be removed if that variable is going to be used (HSGPA was not used in the following models, so the observation remains in the data).
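If that odd observation were to be removed, a simple subset would do it. The following is just a sketch using a small stand-in data frame so it is self-contained; with the real data loaded, `satGPA_demo` would be replaced by satGPA:

```r
# Sketch of dropping the suspicious high-school GPA value (HSGPA > 4.0);
# satGPA_demo is a tiny hypothetical stand-in for the satGPA tibble.
satGPA_demo <- data.frame(HSGPA = c(3.2, 4.5, 3.8, 2.9),
                          FYGPA = c(2.8, 3.9, 3.1, 2.2))
satGPA_clean <- subset(satGPA_demo, HSGPA <= 4.0)  # keep only plausible GPAs
nrow(satGPA_clean)  # one observation removed, three remain
```

If the observation were removed, we would also note that inferences involving HSGPA would not extend above 4.0.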
In MLR, the modeling process is a bit more complex and often involves more than one model, so we will often avoid the 6+ steps in testing initially and try to generate a model we can use in that more specific process. In this case, the first model of interest using the two SAT percentiles,
\[\text{FYGPA}_i = \beta_0 + \beta_{\text{SATV}}\text{SATV}_i + \beta_{\text{SATM}}\text{SATM}_i +\varepsilon_i,\]
looks like it might be worth interrogating further, so we can jump straight into the 6+ steps involved in hypothesis testing for the two slope coefficients to address our RQ about the predictive ability of and relationship between the SAT percentiles and first-year college GPA. We will use \(t\)-based inferences, assuming that we can trust the assumptions; the initial plots gave us some idea of the potential relationships.
Note that this is not a randomized experiment but we can assume that it is representative of the students at that single university. We would not want to extend these inferences to other universities (who might be more or less selective) or to students who did not get into this university and, especially, not to students that failed to complete the first year. The second and third constraints point to a severe limitation in this research – only students who were accepted, went to, and finished one year at this university could be studied. Lower SAT percentile students might not have been allowed in or may not have finished the first year and higher SAT students might have been attracted to other more prestigious institutions. So the scope of inference is just limited to students that were invited and chose to attend this institution and successfully completed one year of courses. It is hard to know if the SAT “works” when the inferences are so restricted in who they might apply to… But you could see why the company that administers the SAT might want to analyze these data. Admissions people also often focus on predicting first year retention rates, but that is a categorical response variable (retained/not) and so not compatible with the linear models considered here.
The following code fits the model of interest and provides a model summary and the diagnostic plots, allowing us to consider the tests of interest:
gpa1 <- lm(FYGPA ~ SATV + SATM, data = satGPA)
summary(gpa1)
(ref:fig8-14) Diagnostic plots for the \(\text{FYGPA}\sim\text{ SATV }+\text{ SATM}\) model.
##
## Call:
## lm(formula = FYGPA ~ SATV + SATM, data = satGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.19647 -0.44777 0.02895 0.45717 1.60940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.007372 0.152292 0.048 0.961
## SATV 0.025390 0.002859 8.879 < 2e-16
## SATM 0.022395 0.002786 8.037 2.58e-15
##
## Residual standard error: 0.6582 on 997 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2106
## F-statistic: 134.2 on 2 and 997 DF, p-value: < 2.2e-16
par(mfrow=c(2,2), oma=c(0,0,2,0))
plot(gpa1, sub.caption="Diagnostics for GPA model with SATV and SATM")
Hypotheses of interest:
\(H_0: \beta_\text{SATV}=0\) given SATM in the model vs \(H_A: \beta_\text{SATV}\ne 0\) given SATM in the model.
\(H_0: \beta_\text{SATM}=0\) given SATV in the model vs \(H_A: \beta_\text{SATM}\ne 0\) given SATV in the model.
Plot the data and assess validity conditions:

* Quantitative variables condition: All the variables used in the model (SATV, SATM, and FYGPA) are quantitative, so this condition is met.
* Independence of observations: Each student contributes a single observation; without more detail on how the sample was obtained (see the discussion above), we mostly have to assume independence.
* Linearity of relationships: The initial scatterplots (Figure 2.163) do not show any clear nonlinearities with each predictor used in this model. The Residuals vs Fitted and Scale-Location plots (Figure 2.164) do not show much more than a football shape, which is our desired result. The partial residuals are displayed in Figure 2.165 and do not suggest any clear missed curvature.
* Multicollinearity checked for: The original scatterplots suggest that there is some collinearity between the two SAT percentiles, with a correlation of 0.47. That is actually a bit lower than one might expect and suggests that each exam must be measuring some independent information about different characteristics of the students. The VIFs also do not suggest a major issue with multicollinearity in the model, with the VIFs for both variables the same at 1.278. This suggests that both SEs are about 13% larger than they otherwise would have been due to shared information between the two predictor variables:
library(car)
vif(gpa1)
## SATV SATM
## 1.278278 1.278278
sqrt(vif(gpa1))
## SATV SATM
## 1.13061 1.13061
* Equal (constant) variance: As noted above, the Residuals vs Fitted and Scale-Location plots (Figure 2.164) show little more than a football shape, so there is no clear evidence of non-constant variance.
* Normality of residuals: The Normal Q-Q plot in Figure 2.164 can be used to assess this condition.
* No influential points: The Residuals vs Leverage plot in Figure 2.164 can be used to assess this condition.

So we are fairly comfortable that none of the assumptions are clearly violated, and the inferences from our model should be relatively trustworthy.
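For a model with just two predictors, the VIFs reported above can be verified by hand, since in that case \(\text{VIF} = 1/(1-r^2)\), where \(r\) is the correlation between the two predictors. Using the rounded correlation of 0.47 quoted above:

```r
# Two-predictor VIF check: VIF = 1 / (1 - r^2), with r the correlation
# between SATV and SATM (0.47 as rounded above).
r <- 0.47
vif_two <- 1 / (1 - r^2)
round(vif_two, 2)        # close to the reported 1.278
round(sqrt(vif_two), 2)  # SE inflation factor, about 1.13 (SEs ~13% larger)
```

The small discrepancy from 1.278 comes only from rounding the correlation to two decimal places.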
Calculate the test statistics and p-values:
For SATV: \(t=\dfrac{0.02539}{0.002859}=8.88\) with the \(t\) having \(df=997\) and p-value \(<0.0001\).
For SATM: \(t=\dfrac{0.02240}{0.002786}=8.04\) with the \(t\) having \(df=997\) and p-value \(<0.0001\).
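These test statistics and p-values can be reproduced directly from the estimate, SE, and df reported in the model summary:

```r
# Recompute the SATV test statistic and its two-sided p-value
# from the values in the summary output above.
t_satv <- 0.025390 / 0.002859
p_satv <- 2 * pt(-abs(t_satv), df = 997)
round(t_satv, 2)  # 8.88
p_satv < 0.0001   # TRUE
```

The SATM statistic is computed the same way from its own estimate and SE.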
Conclusions:
For SATV: There is strong evidence against the null hypothesis of no linear relationship between SATV and FYGPA (\(t_{997}=8.88\), p-value < 0.0001), so we conclude that there is, in fact, a linear relationship between the SATV percentile and the first-year college GPA, after controlling for the SATM percentile, in the population of students that completed their first year at this university.
For SATM: There is strong evidence against the null hypothesis of no linear relationship between SATM and FYGPA (\(t_{997}=8.04\), p-value < 0.0001), so we conclude that there is, in fact, a linear relationship between the SATM percentile and the first-year college GPA, after controlling for the SATV percentile, in the population of students that completed their first year at this university.
Size:
\[\widehat{\text{FYGPA}}_i=0.00737+0.0254\cdot\text{SATV}_i+0.0224\cdot\text{SATM}_i\ .\]
So for a 1 percent increase in the SATV percentile, we estimate, on average, a 0.0254 point change in GPA, after controlling for SATM percentile. Similarly, for a 1 percent increase in the SATM percentile, we estimate, on average, a 0.0224 point change in GPA, after controlling for SATV percentile. While this is a correct interpretation of the slope coefficients, it is often easier to assess “practical” importance of the results by considering how much change this implies over the range of observed predictor values.
The term-plots (Figure 2.165) provide a visualization of the “size” of the differences in the response variable explained by each predictor. The SATV term-plot shows that across the range of percentiles from around the 30th to the 70th, the mean first-year GPA is predicted to go from approximately 1.9 to 3.0. That is a pretty wide range of differences in GPAs across the range of observed percentiles, so this looks like an interesting and important change in the mean first-year GPA. Similarly, the SATM term-plot shows that the SATM percentiles were observed to range between around the 30th and 70th percentiles and predict mean GPAs between 1.95 and 2.8. It seems that the SAT Verbal percentile has a slightly larger impact in the model, holding the other variable constant, but both are important variables. The 95% confidence intervals for the means in both plots suggest that the results are fairly precisely estimated; there is little variability around the predicted means in each plot. This is mostly a function of the sample size, as opposed to the model itself explaining most of the variation in the responses.
(ref:fig8-15) Term-plots for the \(\text{FYGPA}\sim\text{SATV} + \text{SATM}\) model with partial residuals.
* The confidence intervals also help us pin down the uncertainty in each estimated slope coefficient. As always, the “easy” way to get 95% confidence intervals is with the confint function:
confint(gpa1)
## 2.5 % 97.5 %
## (Intercept) -0.29147825 0.30622148
## SATV 0.01977864 0.03100106
## SATM 0.01692690 0.02786220
* So, for a 1 percent increase in the *SATV* percentile, we are 95% confident that the true mean FYGPA changes between 0.0198 and 0.031 points, in the population of students who completed the first year at this institution, after controlling for SATM. The SATM result is similar, with an interval from 0.0169 to 0.0279. Both of these intervals might benefit from re-scaling the interpretation to, say, a 10 percentile increase in the predictor variable: a 10 percentile increase in SATV gives an interval from 0.198 to 0.31 points, and in SATM an interval from 0.169 to 0.279 points. So a boost of 10 percentiles in either exam likely results in a noticeable but not huge average FYGPA increase.
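The re-scaling is just a multiplication of both interval endpoints, as a quick sketch shows (endpoints taken from the confint output above):

```r
# Rescale the SATV slope CI from a 1-percentile to a 10-percentile increase.
ci_satv_1 <- c(0.01977864, 0.03100106)
ci_satv_10 <- 10 * ci_satv_1
round(ci_satv_10, 3)  # 0.198 to 0.310 GPA points
```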
Scope of Inference:
The term-plots also inform us about the types of students attending this university and successfully completing the first year of school. This seems like a good, but maybe not great, institution, with few students scoring over the 75th percentile on either SAT Verbal or Math (at least among those that ended up in this data set). This result raises the earlier questions about the sampling mechanism again: who might this data set actually be representative of?
Note that neither inference is causal because there was no random assignment of SAT percentiles to the subjects. The inferences are also limited to students who stayed in school long enough to get a GPA from their first year of college at this university.
One final use of these methods is to do prediction and generate prediction intervals, which could be quite informative for a student considering going to this university who has a particular set of SAT scores. For example, suppose that the student is interested in the average FYGPA to expect with SATV at the 30th percentile and SATM at the 60th percentile. The predicted mean value is
\[\begin{array}{rl} \hat{\mu}_{\text{GPA}_i} &= 0.00737 + 0.0254\cdot\text{SATV}_i + 0.0224\cdot\text{SATM}_i \\ &= 0.00737 + 0.0254*30 + 0.0224*60 = 2.113. \end{array}\]
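The arithmetic can be checked directly with the rounded coefficients (small rounding differences from the full-precision predict results are expected):

```r
# By-hand predicted mean FYGPA at SATV = 30, SATM = 60, using the rounded
# estimated coefficients from the model summary.
b0 <- 0.00737; bV <- 0.0254; bM <- 0.0224
pred <- b0 + bV * 30 + bM * 60
round(pred, 3)  # 2.113
```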
This result and the 95% confidence interval for the mean student GPA at these scores can be found using the predict function as:
predict(gpa1, newdata = tibble(SATV = 30, SATM = 60))
predict(gpa1, newdata = tibble(SATV = 30, SATM = 60), interval = "confidence")
## 1
## 2.11274
## fit lwr upr
## 1 2.11274 1.982612 2.242868
For students at the 30th percentile of SATV and 60th percentile of SATM, we are 95% confident that the true mean first-year GPA is between 1.98 and 2.24 points. For an individual student, we would want the 95% prediction interval:
predict(gpa1, newdata = tibble(SATV = 30, SATM = 60), interval = "prediction")
## fit lwr upr
## 1 2.11274 0.8145859 3.410894
For a student with SATV=30 and SATM=60, we are 95% sure that their first year GPA will be between 0.81 and 3.4 points. You can see that while we are very certain about the mean in this situation, there is a lot of uncertainty in the predictions for individual students. The PI is so wide as to almost not be useful.
To see why it is so difficult to get a precise prediction for a new student, review the original scatterplots and partial residuals: there is quite a bit of vertical variability in first-year GPAs at each level of any of the predictors. The residual SE, \(\hat{\sigma}\), is also informative in this regard; remember that it is the standard deviation of the residuals around the regression line. It is 0.6582, so the SD of new observations around the line is about 0.66 GPA points, which is pretty large on a GPA scale. Remember that if the residuals meet our assumptions and follow a normal distribution around the line, observations within 2 or 3 SDs of the mean would be expected, which spans a large range of GPA values. Figure 2.166 remakes both term-plots, holding the other predictor at its mean, and adds the 95% prediction intervals to show the difference in variability between estimating the mean and pinning down the value of a new observation. The R code is messy and rarely needed, but hopefully this helps reinforce the differences between these two types of intervals. To make them in MLR, you have to fix all but one of the predictor variables, and we usually do that by fixing the other variables at their means.
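A quick back-of-the-envelope check connects the residual SE to the width of the prediction interval (ignoring the small extra uncertainty from estimating the mean):

```r
# Approximate the 95% PI as the fitted mean +/- about 2 residual SDs.
sigma_hat <- 0.6582
approx_width <- 2 * qnorm(0.975) * sigma_hat
round(approx_width, 2)  # about 2.58 GPA points
```

This is close to the width of the reported PI, 3.41 - 0.81 = 2.60 GPA points, with the small difference due to the uncertainty in estimating the mean.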
(ref:fig8-16) Term-plots for the \(\text{FYGPA}\sim\text{SATV} + \text{SATM}\) model with 95% confidence intervals (red, dashed lines) and 95% PIs (light grey, dotted lines).
# Remake effects plots with 95% PIs
dv1 <- tibble(SATV=seq(from=24,to=76,length.out=50), SATM=rep(54.4,50))
dm1 <- tibble(SATV=rep(48.93,50), SATM=seq(from=29,to=77,length.out=50))
mv1 <- as_tibble(predict(gpa1, newdata=dv1, interval="confidence"))
pv1 <- as_tibble(predict(gpa1, newdata=dv1, interval="prediction"))
mm1 <- as_tibble(predict(gpa1, newdata=dm1, interval="confidence"))
pm1 <- as_tibble(predict(gpa1, newdata=dm1, interval="prediction"))
par(mfrow=c(1,2))
plot(dv1$SATV, mv1$fit, lwd=2, ylim=c(pv1$lwr[1],pv1$upr[50]), type="l",
xlab="SATV Percentile", ylab="GPA", main="SATV Effect, CI and PI")
lines(dv1$SATV, mv1$lwr, col="red", lty=2, lwd=2)
lines(dv1$SATV, mv1$upr, col="red", lty=2, lwd=2)
lines(dv1$SATV, pv1$lwr, col="grey", lty=3, lwd=3)
lines(dv1$SATV, pv1$upr, col="grey", lty=3, lwd=3)
legend("topleft", c("Estimate", "CI","PI"), lwd=3, lty=c(1,2,3),
col = c("black", "red","grey"))
plot(dm1$SATM, mm1$fit, lwd=2, ylim=c(pm1$lwr[1],pm1$upr[50]), type="l",
xlab="SATM Percentile", ylab="GPA", main="SATM Effect, CI and PI")
lines(dm1$SATM, mm1$lwr, col="red", lty=2, lwd=2)
lines(dm1$SATM, mm1$upr, col="red", lty=2, lwd=2)
lines(dm1$SATM, pm1$lwr, col="grey", lty=3, lwd=3)
lines(dm1$SATM, pm1$upr, col="grey", lty=3, lwd=3)
Diez, David M, Christopher D Barr, and Mine Cetinkaya-Rundel. 2017. openintro: Data Sets and Supplemental Functions from 'OpenIntro' Textbooks. https://CRAN.R-project.org/package=openintro.
Either someone had a weighted GPA with bonus points, or more likely here, there was a coding error in the data set since only one observation was over 4.0 in the GPA data. Either way, we could remove it and note that our inferences for HSGPA do not extend above 4.0.↩
When there are just two predictors, the VIFs have to be the same since the proportion of information shared is the same in both directions. With more than two predictors, each variable can have a different VIF value.↩