We can now apply the new confidence interval methods to the STAT 217 grade data.
This time we start with the parametric 95% confidence interval “by hand” in R
and then use lm to verify our result. The favstats output provides us with the required information to calculate the confidence interval, with the estimated difference in the sample mean GPAs (females minus males) of 3.338 - 3.0886 = 0.2494 (computed from rounded means; the unrounded difference is 0.2498):
##   Sex  min  Q1 median   Q3 max     mean        sd  n missing
## 1   F 2.50 3.1  3.400 3.70   4 3.338378 0.4074549 37       0
## 2   M 1.96 2.8  3.175 3.46   4 3.088571 0.4151789 42       0
The \(df\) are \(37+42-2 = 77\). Using the SDs from the two groups and their sample sizes, we can calculate \(s_p\):
## [1] 0.4116072
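The pooled standard deviation combines the two group variances, weighted by their degrees of freedom: \(s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}\). A minimal sketch of that calculation follows. The object names are illustrative, and entering the SDs rounded to 0.4075 and 0.41518 is an assumption about the inputs that appears to reproduce the printed value.

```r
# Sample sizes and SDs from the favstats output
n_F <- 37
n_M <- 42
s_F <- 0.4075    # assumed rounding of 0.4074549
s_M <- 0.41518   # assumed rounding of 0.4151789

# Pooled SD: df-weighted average of the group variances, then the square root
sp <- sqrt(((n_F - 1) * s_F^2 + (n_M - 1) * s_M^2) / (n_F + n_M - 2))
sp
```

Using the unrounded SDs changes the result only in the fifth decimal place.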
The margin of error is:
## [1] 0.1847982
All together, the 95% confidence interval is:
## [1] 0.0646018 0.4341982
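The margin of error is the \(t^*\) multiplier times the standard error of the difference, \(t^*_{df} \cdot s_p\sqrt{1/n_1 + 1/n_2}\), and the interval is the estimated difference plus or minus that amount. One way to sketch both steps is shown here; the object names are illustrative, and the pooled SD is typed in from the printed value rather than recomputed.

```r
n_F <- 37
n_M <- 42
sp  <- 0.4116072                          # pooled SD as printed above

# t* multiplier for 95% confidence with n_F + n_M - 2 = 77 df
tstar <- qt(0.975, df = n_F + n_M - 2)

# Margin of error: multiplier times the SE of the difference in means
ME <- tstar * sp * sqrt(1 / n_F + 1 / n_M)
ME

# Estimated difference (females minus males, rounded means) plus/minus ME
diff_means <- 3.338 - 3.0886
CI <- diff_means + c(-1, 1) * ME
CI
```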
So we are 95% confident that the difference in the true mean GPAs between
females and males (females minus males) is between 0.065 and 0.434 GPA points.
We get a similar result from confint on lm, except that lm switched the direction of the comparison from what was done “by hand” above, with an estimated mean difference of -0.25 GPA points (male - female) and a similarly switched CI:
## 
## Call:
## lm(formula = GPA ~ Sex, data = s217)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12857 -0.28857  0.06162  0.36162  0.91143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.33838    0.06766  49.337  < 2e-16
## SexM        -0.24981    0.09280  -2.692  0.00871
## 
## Residual standard error: 0.4116 on 77 degrees of freedom
## Multiple R-squared:  0.08601,    Adjusted R-squared:  0.07414 
## F-statistic: 7.246 on 1 and 77 DF,  p-value: 0.008713
##                  2.5 %      97.5 %
## (Intercept)  3.2036416  3.47311517
## SexM        -0.4345955 -0.06501838
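The SexM interval can be checked directly from the coefficient table, since it is the estimate plus or minus \(t^*\) times its standard error. The sketch below types in the rounded values from the summary output, so it agrees with confint only to about five decimal places; the object names are illustrative.

```r
est <- -0.24981   # SexM estimate from the coefficient table
se  <-  0.09280   # its standard error
df  <-  77        # residual degrees of freedom

# 95% interval: estimate plus/minus the t multiplier times the SE
ci_95 <- est + c(-1, 1) * qt(0.975, df = df) * se
ci_95
```

Reversing the signs and the order of these endpoints gives, up to rounding, the females-minus-males interval calculated by hand.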
Note that we can easily switch to 90% or 99% confidence intervals by simply
changing the percentile in qt or changing the level option in the confint function.
## [1] 1.664885
## [1] 2.641198
##                    5 %        95 %
## (Intercept)  3.2257252  3.45103159
## SexM        -0.4043084 -0.09530553
##                  0.5 %       99.5 %
## (Intercept)  3.1596636  3.517093108
## SexM        -0.4949103 -0.004703598
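For a two-sided interval with confidence level \(C\), the percentile passed to qt is \(1 - (1-C)/2\), so 90% uses 0.95 and 99% uses 0.995. As a rough sketch of how the multiplier and the SexM interval change with the level, the code below reuses the rounded estimate and standard error from the coefficient table; the object names are illustrative, and the endpoints match the confint output only up to rounding.

```r
est <- -0.24981   # SexM estimate (rounded, from the summary output)
se  <-  0.09280   # its standard error
df  <-  77

levels <- c(0.90, 0.95, 0.99)
tstar  <- qt(1 - (1 - levels) / 2, df = df)   # one multiplier per level

# Interval endpoints and widths for each confidence level
intervals <- data.frame(level = levels,
                        tstar = tstar,
                        lower = est - tstar * se,
                        upper = est + tstar * se)
intervals$width <- intervals$upper - intervals$lower
intervals
```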
As a review of some basic ideas with confidence intervals, make sure you can answer the following questions:
What is the impact of increasing the confidence level in this situation?
What happens to the width of the confidence interval if the size of the SE increases or decreases?
What about increasing the sample size – should that increase or decrease the width of the interval?
All the general results you learned before about impacts on the widths of CIs hold in this situation, whether we are considering the parametric or bootstrap methods.
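To experiment with these questions numerically, a small helper can recompute the parametric margin of error under different settings. The function moe below is hypothetical (it is not part of any package), and it treats the pooled SD as a fixed value, which is an assumption made only for illustration because a new sample would give a new \(s_p\).

```r
# Hypothetical helper: margin of error for a difference in two means,
# with the pooled SD (sp) held fixed for illustration
moe <- function(n1, n2, sp = 0.4116, level = 0.95) {
  tstar <- qt(1 - (1 - level) / 2, df = n1 + n2 - 2)
  tstar * sp * sqrt(1 / n1 + 1 / n2)
}

me_observed <- moe(37, 42)             # sample sizes in the GPA data
me_doubled  <- moe(74, 84)             # both groups twice as large
me_more_sd  <- moe(37, 42, sp = 0.6)   # more variability, so a larger SE

c(observed = me_observed, doubled_n = me_doubled, larger_sd = me_more_sd)
```

The width of the interval is twice the margin of error, so comparing these three values shows the direction of each change.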
To finish this example, we will generate the comparable bootstrap 90% confidence interval using the bootstrap distribution in Figure 2.27.
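library(mosaic)  # provides resample() and qdata()
# Observed difference in mean GPAs (male minus female) in the original sample
Tobs <- coef(lm(GPA~Sex, data=s217))[2]
Tobs
##       SexM 
## -0.2498069
B <- 1000
set.seed(1234)
Tstar <- matrix(NA, nrow=B)
for (b in (1:B)){
  # Refit the model to a bootstrap resample and store the estimated difference
  lmP <- lm(GPA~Sex, data=resample(s217))
  Tstar[b] <- coef(lmP)[2]
}
# The 5th and 95th percentiles are the endpoints of the 90% bootstrap interval
quantiles <- qdata(Tstar, c(0.05, 0.95))
quantiles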
##        quantile    p
## 5%  -0.39290566 0.05
## 95% -0.09622185 0.95
The output tells us that the 90% confidence interval is from -0.393 to -0.096 GPA points. The bootstrap distribution with the observed difference in the sample means and these cut-offs is displayed in Figure 2.27 using this code:
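par(mfrow=c(1,2))  # two panels side by side
# Histogram of the bootstrap differences: observed value in red, 90% cut-offs in blue
hist(Tstar, labels=TRUE)
abline(v=Tobs, col="red", lwd=2)
abline(v=quantiles$quantile, col="blue", lwd=3, lty=2)
# Density curve of the same bootstrap distribution
plot(density(Tstar), main="Density curve of Tstar")
abline(v=Tobs, col="red", lwd=2)
abline(v=quantiles$quantile, col="blue", lwd=3, lty=2)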
In the previous output, the parametric 90% confidence interval is from -0.404 to -0.095, suggesting similar results again from the two approaches. Based on the bootstrap CI, we can say that we are 90% confident that the difference in the true mean GPAs for STAT 217 students is between -0.393 and -0.096 GPA points (males minus females). This result would be a useful addition to step 5 in the 6+ steps of the hypothesis testing protocol, with an updated result of:
Report and discuss an estimate of the size of the differences, with confidence interval(s) if appropriate.
Throughout the text, pay attention to the distinctions between parameters and statistics, focusing on the differences between estimates based on the sample and inferences for the population of interest in the form of the parameters of interest. Remember that statistics are summaries of the sample information and parameters are characteristics of populations (which we rarely know), and that our inferences are limited to the population that we randomly sampled from, if we randomly sampled.