This page was last updated on October 22, 2018.
In this tutorial we will learn about the t distribution, the one-sample t-test, and confidence intervals for a mean.
First, load the tigerstats package and import the data:
library(tigerstats)
bodytemp <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/bodytemp.csv"), header = TRUE)
stalkies <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/stalkies.csv"), header = TRUE)
inspect(bodytemp)
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n missing
## 1 individual integer 1.0 7 13.0 19 25 13.000 7.3598007 25 0
## 2 temperature numeric 97.4 98 98.6 99 100 98.524 0.6777905 25 0
inspect(stalkies)
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n missing
## 1 individual integer 1.00 3.00 5.00 7.00 9.00 5.000000 2.738613 9 0
## 2 eyespan numeric 8.15 8.63 8.69 8.96 9.45 8.777778 0.398302 9 0
As described in the text on pages 304-306, the t distribution resembles the standard normal distribution (the Z distribution), but is slightly fatter in the tails.
The t distribution is what we use in practice, i.e. when drawing inferences from sample data about normal populations whose \(\mu\) and \(\sigma\) are unknown.
As with the Z distribution, we can look up probability values (areas under the curve) associated with values of t in a table, such as the one provided on page 708 of the text.
Unlike the Z distribution, the t distribution changes shape depending upon the degrees of freedom:
df = n - 1
Example:
What is the probability (P-value) associated with obtaining a t statistic value of 2.1 or larger, given a sample size n = 11 (df = 10)?
To calculate this, we use the pt function (from the stats package that loads automatically with R):
pt(2.1, df = 10, lower.tail = FALSE) # note lower.tail argument
## [1] 0.03103862
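As an illustrative aside (not from the text), we can see the t distribution converging to the Z distribution by comparing the same upper-tail area across increasing degrees of freedom:

```r
# Upper-tail area P(T > 2.1) shrinks toward the normal value as df increases
pt(2.1, df = 2, lower.tail = FALSE)     # fat tails: largest area
pt(2.1, df = 10, lower.tail = FALSE)    # 0.03103862, as above
pt(2.1, df = 1000, lower.tail = FALSE)  # nearly normal
pnorm(2.1, lower.tail = FALSE)          # Z distribution: ~0.0179
```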
Without a computer, one would use tables to look up critical values of a test statistic like the t statistic.
In R, we can use the qt function to find the critical value of t associated with a given \(\alpha\) level and degrees of freedom.
For instance, if we were testing a 2-tailed hypothesis, with \(\alpha\) = 0.05, and sample size n = 11, here’s the code:
alpha <- 0.05 # define alpha
n <- 11 # define n
upper.crit <- qt(alpha/2, df = n - 1, lower.tail = FALSE) # if 2-tailed, divide alpha by 2
lower.crit <- qt(alpha/2, df = n - 1, lower.tail = TRUE) # if 2-tailed, divide alpha by 2
c(lower.crit, upper.crit)
## [1] -2.228139 2.228139
This shows the lower and upper critical values of t associated with df = 10 and \(\alpha\) = 0.05.
If we had a different value of \(\alpha\), say \(\alpha\) = 0.1, and the same sample size, here’s the code:
alpha <- 0.10 # define alpha
upper.crit <- qt(alpha/2, df = n - 1, lower.tail = FALSE) # if 2-tailed, divide alpha by 2
lower.crit <- qt(alpha/2, df = n - 1, lower.tail = TRUE) # if 2-tailed, divide alpha by 2
c(lower.crit, upper.crit)
## [1] -1.812461 1.812461
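As a quick sanity check (an aside, not from the text), qt and pt are inverses of one another: feeding a critical value from qt back into pt should recover the tail area \(\alpha\)/2 we started with.

```r
# qt and pt are inverses: the area above the upper critical value is alpha/2
alpha <- 0.05
upper.crit <- qt(alpha/2, df = 10, lower.tail = FALSE)
pt(upper.crit, df = 10, lower.tail = FALSE)  # 0.025
```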
We generally don’t use the qt and pt functions much on their own, but now you know what they can be used for!
We previously learned statistical tests for testing hypotheses about categorical response variables. For instance, we learned how to conduct a \(\chi^2\) contingency test to test the null hypothesis that there is no association between two categorical variables.
Here we are going to learn our first statistical test for testing hypotheses about a numeric response variable, specifically one whose frequency distribution in the population is normally distributed.
We’ll use the body temperature data for this example, as described in example 11.3 (Page 310) in the text.
Americans are taught as kids that the normal human body temperature is 98.6 degrees Fahrenheit.
Are the data consistent with this assertion?
The hypotheses for this test:
H0: The mean human body temperature is 98.6\(^\circ\)F.
HA: The mean human body temperature is not 98.6\(^\circ\)F.
The assumptions of the one-sample t-test are as follows:
- The sampling units are randomly sampled from the population.
- The variable is normally distributed in the population.
Under the “Extra tutorials” section of the lab webpage, go through the Checking assumptions and data transformations tutorial. There you’ll learn how to check the assumption of normality. For now we’ll assume both assumptions are met.
NOTE: A tutorial covering “non-parametric” tests, which are used when the assumptions of parametric tests (like the t-test) are not met, is in preparation, but will not be complete until early 2019. In the meantime, you can explore some non-parametric tests here.
Let’s view a histogram of the body temperatures:
histogram(~ temperature, data = bodytemp,
          type = "count",
          breaks = seq(from = 97, to = 100.5, by = 0.5),
          col = "firebrick",
          las = 1,
          xlab = "Body temperature (degrees F)",
          ylab = "Frequency",
          main = "")
Figure 1: Frequency distribution of body temperature for 25 randomly chosen healthy people.
We can see in Figure 1 that the modal temperature among the 25 subjects is between 98.5 and 99\(^\circ\)F, which is consistent with conventional wisdom, but there are 7 people with temperatures below 98\(^\circ\)F, and 5 with temperatures above 99\(^\circ\)F. The frequency distribution is unimodal but not particularly symmetrical.
We use the t.test command from the mosaic package (installed as part of the tigerstats package) to conduct a one-sample t-test.
NOTE: The base stats package that automatically loads when you start R also includes a t.test function, but it doesn’t have the same functionality as the mosaic package version. BE SURE to load the tigerstats package prior to using the t.test function!
?t.test # select the version associated with the "mosaic" package
TIP: We can ensure that the correct function is used by including the package name before the function name, separated by two colons:
mosaic::t.test
This function is used for both one-sample and two-sample t-tests (next tutorial), and for calculating 95% confidence intervals for a mean.
Because this function has multiple purposes, be sure to pay attention to the arguments.
You can find out more about the function at the tigerstats website here.
Let’s think again about what the test is doing. In essence, it takes the observed sample mean and transforms it into a value of t on the t distribution. This is analogous to what we learned to do with “Z-scores”, but here we use the t distribution instead of the Z distribution, because we don’t know the true population parameters (\(\mu\) and \(\sigma\)) and we’re dealing with sample data.
Once we have our observed value of the test statistic t, we can calculate the probability of observing that value, or one of greater magnitude, assuming the null hypothesis were true.
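To make this concrete, here is a sketch of the calculation done by hand, plugging the summary values reported by inspect(bodytemp) above into the formula t = (sample mean - \(\mu_0\)) / (s / \(\sqrt{n}\)):

```r
# Manual calculation of the one-sample t statistic (illustrative sketch;
# summary values taken from the inspect(bodytemp) output above)
xbar <- 98.524      # observed sample mean
s    <- 0.6777905   # sample standard deviation
n    <- 25          # sample size
mu0  <- 98.6        # null hypothesized mean
t.obs <- (xbar - mu0) / (s / sqrt(n))
t.obs                                               # about -0.56
2 * pt(abs(t.obs), df = n - 1, lower.tail = FALSE)  # about 0.58
```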
Here’s the code, and we include the null hypothesized body temperature:
body.ttest <- mosaic::t.test(~ temperature, data = bodytemp,
                             mu = 98.6,
                             alternative = "two.sided",
                             conf.level = 0.95)
body.ttest
##
## One Sample t-test
##
## data: temperature
## t = -0.56065, df = 24, p-value = 0.5802
## alternative hypothesis: true mean is not equal to 98.6
## 95 percent confidence interval:
## 98.24422 98.80378
## sample estimates:
## mean of x
## 98.524
The output includes a 95% confidence interval for \(\mu\), the calculated test statistic t, the degrees of freedom, and the associated P-value.
The observed P-value for our test is larger than our \(\alpha\) level of 0.05. We therefore fail to reject the null hypothesis.
We have no reason to reject the hypothesis that the mean body temperature of a healthy human is 98.6\(^\circ\)F (one-sample t-test; t = -0.56; n = 25 or df = 24; P = 0.58).
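When writing up results like the statement above, the individual numbers can be pulled directly from the stored test object rather than re-typed (a sketch; these are the standard components of the “htest” object that t.test returns):

```r
# Components of the stored t-test object, handy for reporting results
body.ttest$statistic  # t value
body.ttest$parameter  # degrees of freedom
body.ttest$p.value    # P-value
body.ttest$conf.int   # 95% confidence interval
body.ttest$estimate   # sample mean
```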
In an earlier tutorial (here) we learned about the rule-of-thumb 95% confidence interval.
Now we will learn how to calculate confidence intervals correctly.
Here’s the code, again using the t.test function, but this time, if all we wish to do is calculate confidence intervals, we do not include a value for “mu” (as we do when we are conducting a one-sample t-test):
body.conf <- mosaic::t.test(~ temperature, data = bodytemp,
                            alternative = "two.sided",
                            conf.level = 0.95)
body.conf
##
## One Sample t-test
##
## data: temperature
## t = 726.8, df = 24, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 98.24422 98.80378
## sample estimates:
## mean of x
## 98.524
Note that the t statistic and P-value in this output test the default null hypothesis that \(\mu\) = 0, which is not meaningful here; only the confidence interval is of interest.
It is good practice to report confidence intervals (or any measure of precision) to 3 decimal places, and to include units.
We can do this in R as follows:
lower.cl <- round(body.conf$conf.int[1], 3)
upper.cl <- round(body.conf$conf.int[2], 3)
c(lower.cl, upper.cl)
## [1] 98.244 98.804
TIP: You can get R to provide “inline” evaluation of code. For instance, type the following code in the main text area, NOT in a chunk:
The 95% confidence interval for the mean is `r lower.cl` to `r upper.cl` $^\circ$F.
… and you will get the following:
The 95% confidence interval for the mean is 98.244 to 98.804 \(^\circ\)F.
Now we can re-write our concluding statement and include the confidence interval:
We have no reason to reject the hypothesis that the mean body temperature of a healthy human is 98.6\(^\circ\)F (one-sample t-test; t = -0.56; n = 25 or df = 24; P = 0.58; 95% CI: 98.244 - 98.804 \(^\circ\)F).
Using the “stalkies” dataset:
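A sketch of one way to start (this assumes the activity asks for a one-sample analysis of eyespan; adapt as your lab instructions direct):

```r
# Sketch: 95% confidence interval for the mean eyespan in the stalkies data
stalk.conf <- mosaic::t.test(~ eyespan, data = stalkies,
                             alternative = "two.sided",
                             conf.level = 0.95)
round(stalk.conf$conf.int, 3)
```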
Getting started: read.csv, url, library
Data frame structure: head, inspect (tigerstats / mosaic packages)
The “t” distribution: pt, qt, t.test (mosaic package loaded as part of the tigerstats package)
Graphs: histogram