Imagine, for example, that we randomly assigned 10 subjects to a control study group and 10 to a treatment study group. On each subject, we measure a numeric response variable, say body temperature. Thus, we have a single numeric variable (body temperature) that we wish to compare between the two categories (control and treatment) of the categorical variable “study group”. To do this, we compare mean body temperature between the two categories of study group.
When comparing the means of two groups, you must choose between two statistical tests, depending upon the study design.
In a paired design, both treatments are applied to every sampled unit. In the two-sample design, each treatment group is composed of an independent random sample of units.
For a paired design, such as “before and after” measurements on the same subjects, one simply calculates the differences between the paired measurements, then conducts a one-sample t-test on these resulting differences. We also calculate the 95% confidence interval for the difference, using methods we learned in the Inference for a normal distribution tutorial.
For an independent-groups design, we use a two-sample t-test, and we calculate the 95% confidence interval for the difference using a new procedure that includes calculating the pooled sample variance (which R does for us).
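As a preview, here is a minimal sketch (with made-up body-temperature vectors; all object names are hypothetical) showing how the two designs translate into two different calls to R’s t.test function:

# Hypothetical paired data: the same 10 subjects measured before and after
x.before <- c(36.8, 37.1, 36.9, 37.0, 36.7, 37.2, 36.9, 37.0, 36.8, 37.1)
x.after <- c(37.0, 37.3, 37.0, 37.2, 36.9, 37.4, 37.0, 37.3, 36.9, 37.2)
# Paired design: a one-sample t-test on the differences
t.test(x.after - x.before, mu = 0)
# Two-sample design: two independent groups of subjects (hypothetical data)
y.control <- c(36.9, 37.0, 36.8, 37.1, 36.7, 37.0, 36.9, 37.2, 36.8, 37.0)
y.treatment <- c(37.1, 37.3, 37.0, 37.2, 37.4, 37.1, 37.0, 37.3, 37.2, 37.1)
t.test(y.control, y.treatment, var.equal = TRUE)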
In sum, in this tutorial you will learn about the following:
* the paired t-test
* the 2-sample t-test
* calculating the 95% confidence interval for the difference between the means of two paired groups, and of two independent groups
You must consult the Checking assumptions and data transformations tutorial, as we will use some of the methods described therein.
This tutorial requires the following packages:

* tigerstats
* tidyr
* car

We’ve used the first two packages before, but not the car package. In any case, if you haven’t installed the tidyr or car packages yet, type the following into your command console:
install.packages("tidyr")
install.packages("car")
Load the packages:
library(tigerstats, warn.conflicts = FALSE, quietly = TRUE)
library(tidyr, warn.conflicts = FALSE, quietly = TRUE)
library(car, warn.conflicts = FALSE, quietly = TRUE)
blackbird <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/blackbird.csv"), header = TRUE)
students <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/students.csv"), header = TRUE)
We’ll use the blackbird dataset for this example.
For 13 red-winged blackbirds, measurements of antibodies were taken before and after implantation with testosterone. Thus, the same bird was measured twice. Clearly, these measurements are not independent, hence the need for a “paired” t-test.
Let’s look at how the data are stored, as this is key to deciding how to proceed:
blackbird
## blackbird time Antibody
## 1 1 Before 4.653960
## 2 2 Before 3.912023
## 3 3 Before 4.912655
## 4 4 Before 4.499810
## 5 5 Before 4.804021
## 6 6 Before 4.882802
## 7 7 Before 4.875197
## 8 8 Before 4.779123
## 9 9 Before 4.976734
## 10 10 Before 4.867534
## 11 11 Before 4.753590
## 12 12 Before 4.700480
## 13 13 Before 4.927254
## 14 1 After 4.442651
## 15 2 After 4.304065
## 16 3 After 4.976734
## 17 4 After 4.454347
## 18 5 After 4.997212
## 19 6 After 4.997212
## 20 7 After 5.010635
## 21 8 After 4.955827
## 22 9 After 5.017280
## 23 10 After 4.727388
## 24 11 After 4.770685
## 25 12 After 4.595120
## 26 13 After 5.010635
The data frame has 26 rows and includes 3 variables, the first of which, “blackbird”, simply keeps track of each bird’s individual ID.
The response variable of interest, “Antibody”, represents antibody production rate, measured in units of the natural logarithm (ln) of 10^{-3} optical density per minute (ln[mOD/min]).

The factor variable time has two levels: “After” and “Before”.
These data are stored in long format, which is the ideal format for storing data. I encourage you to read this webpage regarding “tidy data”.
Sometimes you may get data in wide format, in which case, for instance, we would have a column for the “Before” antibody measurements and another column for the “After” measurements.
It is always preferable to work with long-format data.
Consult the following webpage for instructions on using the tidyr package to convert between wide and long data formats.
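For example, here is a minimal sketch (blackbird.wide is a hypothetical wide-format version of the blackbird data) using tidyr’s gather function to convert from wide to long format:

# Hypothetical wide-format data: one row per bird, one column per time point
blackbird.wide <- data.frame(blackbird = 1:3,
                             Before = c(4.65, 3.91, 4.91),
                             After = c(4.44, 4.30, 4.98))
# gather the "Before" and "After" columns into a key column ("time")
# and a value column ("Antibody")
blackbird.long <- gather(blackbird.wide, key = "time", value = "Antibody",
                         Before, After)
blackbird.long

(The spread function from the same package reverses the operation, converting long back to wide.)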
With our data in the preferred long format, we can proceed with our hypothesis test.
The hypotheses for this paired t-test focus on the mean of the differences between the paired measurements, denoted by \(\mu_d\):

H0: The mean change in antibody production after testosterone implants was zero (\(\mu_d = 0\)).

HA: The mean change in antibody production after testosterone implants was not zero (\(\mu_d \neq 0\)).
Steps to a hypothesis test:
The assumptions of the paired t-test are the same as the assumptions for the one-sample t-test; we list and check them below.
The best way to visualize the data for a paired t-test is to create a histogram of the calculated differences between the paired observations.
We can calculate the differences between the “After” and “Before” measurements in a couple different ways.
First, we can use the filter command from the dplyr package (attached along with the packages we loaded above), and use the $ symbol to extract only the variable of interest:
antibody.diffs <- filter(blackbird, time == "After")$Antibody - filter(blackbird, time == "Before")$Antibody
Or we can use simple subsetting:
antibody.diffs <- blackbird[blackbird$time == "After", "Antibody"] - blackbird[blackbird$time == "Before", "Antibody"]
Either way, our result is a vector of 13 differences, which we can now visualize with a histogram.
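Note that both approaches rely on the rows being ordered identically within the “After” and “Before” subsets; a quick sanity check using the blackbird ID column confirms the pairing lines up:

# TRUE means the bird IDs line up in the same order in both subsets
all(blackbird$blackbird[blackbird$time == "After"] ==
    blackbird$blackbird[blackbird$time == "Before"])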
NOTE: Although previously we’ve used the histogram function to generate histograms, it is often easier to get easily interpreted histograms using the hist function from base R. We’ll use this function now.
We can also use the segments command to add a vertical dashed line corresponding to the hypothesized mean difference of zero:
hist(antibody.diffs, nclass = 8, ## asks for 8 bars
xlab = "Antibody production rate (ln[mOD/min])",
las = 1, main = "",
col = "lightgrey")
segments(x0 = 0, y0 = 0, x1 = 0, y1 = 5, lty = 2, lwd = 2, col = "red") # add vertical dashed line at hypothesized mu
Figure 1: Histogram of the difference in antibody production rate before and after treatment (n = 13).
With such a small sample size (13), the histogram is not particularly informative. But we do see most observations are just above zero.
NOTE: If you’re curious about how to reproduce Figure 12.2-1 in the text, see this webpage, about a third of the way down.
The paired t-test assumes:

* the sampled units (and hence the paired differences) represent a random sample from the population
* the paired differences have a normal distribution in the population

We assume the first assumption is met.

The second assumption we test using graphical methods and a type of goodness-of-fit (GOF) test, as described in the Checking assumptions and data transformations tutorial.
Let’s first check the normality assumption visually using a Normal Quantile Plot, and note that we’re assessing the single response variable representing the difference in before and after measurements:
qqnorm(antibody.diffs, las = 1, main = "");
qqline(antibody.diffs)
Figure 2: Normal quantile plot of the difference in antibody production rate (ln[mOD/min]) before and after testosterone implants within 13 red-winged blackbirds.
If the observations come from a normal distribution, they will generally fall close to the straight line.
Here, we would conclude:
“The normal quantile plot shows that the data generally fall close to the line (except perhaps the highest value), suggesting that the data are drawn from a normal distribution.”
And now a formal goodness of fit test, called the Shapiro-Wilk Normality Test, which tests the null hypothesis that the data are sampled from a normal distribution:
shapiro.result <- shapiro.test(antibody.diffs)
shapiro.result
##
## Shapiro-Wilk normality test
##
## data: antibody.diffs
## W = 0.97806, p-value = 0.9688
Given that the P-value is large (and much greater than 0.05), there is no reason to reject the null hypothesis. Thus, our normality assumption is met.
When testing the normality assumption using the Shapiro-Wilk test, there is no need to conduct all the steps associated with a hypothesis test. Simply report the results of the test (the test statistic W and the associated P-value).
For instance: “A Shapiro-Wilk test revealed no evidence against the assumption that the data are drawn from a normal distribution (W = 0.98, P-value = 0.969).”
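If you prefer to pull these numbers directly from the stored result rather than retyping them, the object returned by shapiro.test is a list with named components:

shapiro.result$statistic # the W test statistic
shapiro.result$p.value # the associated P-value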
NOTE: A tutorial covering “non-parametric” tests, which are used when the assumptions of parametric tests (like the t-test) are not met, is in preparation, but will not be complete until early 2019. In the meantime, you can explore some non-parametric tests here.
We can conduct a paired t-test in two different ways:

* conduct a one-sample t-test on the differences using the t.test function and the methods you learned in the Comparing one mean to a hypothesized value tutorial
* conduct a paired t-test using the t.test function with the argument paired = TRUE

Using the t.test function and the antibody.diffs vector you created above (representing the differences in antibody production rates), you can conduct all the steps of a hypothesis test yourself, with the null hypothesis being that \(\mu_d = 0\), following the methods from the Comparing one mean to a hypothesized value tutorial.

Here, let’s proceed with the second approach, remembering to include the mosaic package name prior to the function name, to ensure we use the correct t.test function:
blackbird.ttest <- mosaic::t.test(Antibody ~ time, data = blackbird, paired = TRUE, conf.level = 0.95)
blackbird.ttest
##
## Paired t-test
##
## data: Antibody by time
## t = 1.2435, df = 12, p-value = 0.2374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04134676 0.15128638
## sample estimates:
## mean of the differences
## 0.05496981
NOTE: When you specify paired = TRUE, the t.test function assumes that the observations are ordered identically in each group (the “Before” and “After” groups).
The output from the t.test function includes the calculated value of the test statistic t, the degrees of freedom (df), the P-value associated with the calculated test statistic, the 95% confidence interval for the difference, and the sample-based estimate of the difference (the mean of the differences).
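All of these quantities can be extracted by name from the stored blackbird.ttest object, which is handy when reporting results:

blackbird.ttest$statistic # the t test statistic
blackbird.ttest$parameter # the degrees of freedom
blackbird.ttest$p.value # the P-value
blackbird.ttest$conf.int # the 95% confidence interval for the mean difference
blackbird.ttest$estimate # the mean of the differences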
We see that the P-value of 0.237 is larger than our stated \(\alpha\) (0.05), hence we do not reject the null hypothesis.
For a refresher from a previous tutorial, let’s use R to figure out what the critical value of t is for df = 12:
alpha <- 0.05 # define alpha
n <- 13
upper.crit <- qt(alpha/2, df = n - 1, lower.tail = FALSE) # if 2-tailed, divide alpha by 2
lower.crit <- qt(alpha/2, df = n - 1, lower.tail = TRUE) # if 2-tailed, divide alpha by 2
c(lower.crit, upper.crit)
## [1] -2.178813 2.178813
This shows the lower and upper critical values of t associated with df = 12 and \(\alpha\) = 0.05.
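Equivalently, we can recover the P-value reported in the t.test output from the observed test statistic (t = 1.2435, df = 12), using the pt function:

t.obs <- 1.2435 # observed t from the output above
2 * pt(abs(t.obs), df = 12, lower.tail = FALSE) # two-tailed P-value, approx. 0.237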
Given that the t.test function calculated the 95% confidence interval for the difference for us, we need not do any additional steps.
We fail to reject the null hypothesis, and conclude that there was no significant change in antibody production after testosterone implants (paired t-test; t = 1.24; df = 12; P-value = 0.237; 95% confidence limits: -0.041, 0.151).
Have a look at the students dataset:
inspect(students)
##
## categorical variables:
## name class levels n missing
## 1 sex factor 2 154 0
## 2 dominant_hand factor 2 154 0
## 3 dominant_foot factor 2 154 0
## 4 dominant_eye factor 2 154 0
## distribution
## 1 f (58.4%), m (41.6%)
## 2 r (90.3%), l (9.7%)
## 3 r (89%), l (11%)
## 4 r (68.8%), l (31.2%)
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd
## 1 height_cm numeric 150 165 171.475 180.0 210.8 171.971234 10.027280
## 2 head_circum_cm numeric 53 56 57.000 58.5 63.0 57.185065 1.884848
## 3 number_siblings integer 0 1 2.000 2.0 6.0 1.707792 1.053629
## n missing
## 1 154 0
## 2 154 0
## 3 154 0
These data include measurements taken on 154 students in BIOL202 a few years ago.
Note that the categories in the sex variable are “f” and “m”:
levels(students$sex)
## [1] "f" "m"
Let’s change these to be more informative: “Female” and “Male”. We do this using the levels function (used above), which tells us what the categories of a categorical (factor) variable are; we then simply assign new names in the same order:
levels(students$sex) <- c("Female", "Male")
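A quick check confirms the renaming worked (the new names are assigned in the same order as the old levels, so “f” becomes “Female” and “m” becomes “Male”):

levels(students$sex) # should now show "Female" "Male"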
H0: The mean height of male and female students is the same (\(\mu_M = \mu_F\)).

HA: The mean height of male and female students is not the same (\(\mu_M \neq \mu_F\)).
Steps to a hypothesis test:
The assumptions of the 2-sample t-test are similar to those of the one-sample t-test; we list and check them below.
NOTE: Read bottom of page 340 in the text, which describes how robust this test is to violations of the assumptions.
We learned in an earlier tutorial that we can use a stripchart or boxplot to visualize a numerical response variable in relation to a categorical variable.
Here we want to visualize height in relation to sex (or gender).
We use a stripchart for relatively small sample sizes in each group (e.g. < 20), and a boxplot otherwise.
Let’s calculate sample sizes by tabulating the frequency of each sex:
samp.sizes <- xtabs(~ sex, data = students)
samp.sizes
## sex
## Female Male
## 90 64
So, we have large sample sizes in each group, and therefore a boxplot is warranted.
boxplot(height_cm ~ sex, data = students,
ylab = "Height (cm)",
xlab = "Gender",
las = 1) # orients y-axis tick labels properly
Figure 3: Boxplot of height in relation to gender among 154 students. Boxes delimit the first to third quartiles, bold lines represent the group medians, bold circles the group means, and whiskers extend to 1.5 times the IQR. Points beyond whiskers are extreme observations.
We see that males are, on average, quite a bit taller than females.
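To attach numbers to that impression, one option (a sketch using the favstats function from the mosaic package, which is loaded along with tigerstats) is to tabulate descriptive statistics by group:

favstats(height_cm ~ sex, data = students) # mean, sd, n, etc., per group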
NOTE: In the Biology guidelines to data presentation, it is recommended that the first boxplot you present should include, in the figure caption, a description of all the features of the boxplot (as shown above).
Be sure to follow those instructions in any assignment.
TIP: You can create even better boxplots using the ggplot2 package, as described here.
The assumptions of the 2-sample t-test are the same as the assumptions for the one-sample t-test, with the addition of an equal-variance assumption:

* each of the two groups represents a random sample from its population
* the numerical variable is normally distributed in each population
* the variance of the numerical variable is the same in both populations
Test for normality
Now let’s check the normality assumption by plotting a normal quantile plot for each group. We use the par function to position 2 graphs side by side:
par(mfrow = c(1,2)) # create one row of 2 columns for graphs
qqnorm(students$height_cm[students$sex == "Female"], las = 1, main = "Female");
qqline(students$height_cm[students$sex == "Female"]) # add the line
qqnorm(students$height_cm[students$sex == "Male"], las = 1, main = "Male");
qqline(students$height_cm[students$sex == "Male"]) # add the line
Figure 4: Normal quantile plots of the heights (cm) of 90 female and 64 male students.
We see that the male data deviate a little from the line, but we know that, thanks to the central limit theorem, the 2-sample t-test is robust to departures from normality when one has large sample sizes. Thus, we’ll proceed with testing the next assumption (there’s no need to conduct a Shapiro-Wilk test).
Test for equal variances
Now we need to test the assumption of equal variance among the groups, as described in the Checking assumptions and data transformations tutorial.
We’ll use Levene’s test to test the null hypothesis that the variances are equal among the groups.
height.vartest <- leveneTest(height_cm ~ sex, data = students)
height.vartest
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0 0.999
## 152
Levene’s test uses a test statistic “F”, and we see here that the P-value associated with the test statistic is nearly 1, so clearly not significant.
We state “A Levene’s test showed no evidence against the assumption of equal variance (F = 0; P-value = 0.999).”
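As before, you can extract these values from the stored height.vartest object, which behaves like a data frame with one row per term:

height.vartest$"F value"[1] # the F test statistic
height.vartest$"Pr(>F)"[1] # the associated P-value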
Thus, we’ll proceed with conducting the test.
We use the t.test function again (or you can use the ttestGC function if you wish), but this time we leave the paired argument at its default of FALSE, and we set var.equal = TRUE because Levene’s test gave no evidence against the equal-variance assumption.
height.ttest <- t.test(height_cm ~ sex, data = students, var.equal = TRUE, conf.level = 0.95)
height.ttest
##
## Two Sample t-test
##
## data: height_cm by sex
## t = -12.834, df = 152, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -16.87672 -12.37372
## sample estimates:
## mean in group Female mean in group Male
## 165.8932 180.5184
We see that the test produced an extremely small P-value, much smaller than our \(\alpha\), so we reject the null hypothesis.
NOTE: Given the negative t value shown in the output, it’s clear that the function calculated the difference in means as (Female minus Male). This is fine, but we need to make sure our concluding statement recognizes this, and reports the t value either as positive or negative, depending on the wording.
Note also that the output includes a confidence interval for the difference in group means. We need to include this in our concluding statement.
On average, male students are significantly taller than female students (2-sample t-test; t = 12.83; df = 152; P-value < 0.001; 95% confidence limits for the true difference in mean height: 12.374, 16.877).
When one or more assumptions of the test of choice are not met, you have other options, such as transforming the variables.
These options are covered in the Checking assumptions and data transformations tutorial.
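Another option, specific to the 2-sample design, is available when only the equal-variance assumption fails: Welch’s t-test, which does not assume equal variances, and which the t.test function performs by default when var.equal is left at its default of FALSE. A sketch using our students data, purely for illustration:

# Welch's t-test: var.equal = FALSE (the default) uses adjusted df
height.welch <- t.test(height_cm ~ sex, data = students,
                       var.equal = FALSE, conf.level = 0.95)
height.welch # output is labelled "Welch Two Sample t-test"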
Getting started:

* read.csv
* url
* library

Data management / manipulation:

* inspect (tigerstats / mosaic packages)
* levels
* filter (dplyr package)

The “t” distribution:

* qt
* t.test (mosaic package, loaded as part of the tigerstats package)

Graphs:

* hist
* boxplot
* qqnorm
* qqline
* segments
* par

Assumptions:

* leveneTest (car package)
* qqnorm
* qqline
* shapiro.test