This page was last updated on October 14, 2019.
In this tutorial we will learn about the following:
cancer <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/cancer.csv"), header = TRUE)
inspect(cancer)
##
## categorical variables:
## name class levels n missing
## 1 aspirinTreatment factor 2 39876 0
## 2 cancer factor 2 39876 0
## distribution
## 1 Placebo (50%), Aspirin (50%)
## 2 No cancer (92.8%), Cancer (7.2%)
worm <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/worm.csv"), header = TRUE)
inspect(worm)
##
## categorical variables:
## name class levels n missing
## 1 infection factor 3 141 0
## 2 fate factor 2 141 0
## distribution
## 1 uninfected (35.5%), highly (32.6%) ...
## 2 not eaten (66%), eaten (34%)
When testing for an association between two categorical variables, the most common test that is used is the \(\chi\)2 contingency test, which is described below.
When the two categorical variables have only 2 categories each, the Fisher’s Exact test (a type of contingency test) provides an EXACT probability for a contingency test, and is therefore preferred over the \(\chi\)2 contingency test (below) when you have a computer to do the calculations.
We’ll use the cancer study data again for this example, as described in example 9.2 (Page 238) in the text.
The hypotheses for this test:
H0: There is no association between the use of aspirin and the probability of developing cancer.
HA: There is an association between the use of aspirin and the probability of developing cancer.
HOWEVER: it is recommended that you report the “odds ratio” in your concluding statement, along with its appropriate confidence interval; this is a useful stand-in test statistic for the Fisher’s Exact Test
To create a 2 x 2 contingency table from these data, we’ll make use of the xtabs
function we’ve used before:
## aspirinTreatment
## cancer Aspirin Placebo
## Cancer 1438 1427
## No cancer 18496 18515
NOTE: When dealing with data from studies on human health (e.g. evaluating healthy versus sick subjects), it is convention to organize the contingency table as shown above, with (i) the outcome of interest (here, cancer) in the top row and the alternative outcome on the bottom row, and (ii) the treatment in the first column and placebo (control group) in the second column.
In order to display the data as shown above, it is important that the variable containing the outcome data (here, cancer
) comes first after the ~
symbol.
When the data are not related to human health, you do not need to worry about the ordering of the rows of data.
Let’s visualize the data using a mosaic plot.
NOTE: We need to “transpose” (rotate) our table to display the graph properly. For this we use the t
function:
mosaicplot(cancerTable.transposed,
col = c("firebrick", "goldenrod1"), # define colours to use
cex.axis = 0.9, # change font size for axis labels
xlab = "Treatment",
ylab = "Condition",
main = "") # don't show a main heading above the graph
Figure 1: The relative frequency of cancer among study subjects (all women) who received 100mg of aspirin every other day (n = 19934) or a placebo (n = 19942)
NOTE: You’ll note that there is no y-axis in the mosaic plot, unlike those shown in the text book. This is OK. It is still showing relative frequency (hence the figure caption).
To do the Fisher’s exact test on the cancer data, it is straightforward, using the “fisher.test” command:
?fisher.test
##
## Fisher's Exact Test for Count Data
##
## data: cancerTable
## p-value = 0.8311
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9342128 1.0892376
## sample estimates:
## odds ratio
## 1.008744
The P-value associated with the test is 0.831, which is clearly greater than our \(\alpha\) of 0.05. We therefore do NOT reject the null hypothesis.
You’ll notice that the output includes the **odds ratio* and its 95% confidence interval. The interval it provides is slightly different from the one we calculated in the Odds Ratio tutorial, but is fine to report here.
There is no evidence that the probability of developing cancer differs between the control group and the aspirin treatment group (Fisher’s Exact Test; P-value = 0.831; odds ratio = 1.01; 95% CI: 0.934 - 1.089). [NOTE: report odds ratios to 2 decimal places, and associated measures of uncertainty to 3 decimal places]
When the contingency table is of dimensions greater than 2 x 2, the most commonly applied test is the \(\chi\)2 Contingency Test.
For this activity we’re using the “worm” data associated with Example 9.4 on page 246 of the test. Please read the example!
Here we have a 2 x 3 contingency table, and we’re testing for an association between two categorical variables.
Here are the null and alternative hypotheses:
H0: There is no association between the level of trematode parasitism and the frequency (or probability) of being eaten.
HA: There is an association between the level of trematode parasitism and the frequency (or probability) of being eaten.
Let’s generate a contingency table:
## infection
## fate highly lightly uninfected
## eaten 37 10 1
## not eaten 9 35 49
Hmm, the ordering of the categories (or “levels”) of the categorical (factor) variable “infection” is the opposite to what it should be.
In order to change the order levels in a factor variable, here’s what you do (consult this resource for additional info).
First, check the current ordering of the levels using the levels
function:
## [1] "highly" "lightly" "uninfected"
Now change the ordering as follows:
worm$infection <- factor(worm$infection, levels = c("uninfected", "lightly", "highly"))
levels(worm$infection)
## [1] "uninfected" "lightly" "highly"
Now re-display the contingency table:
## infection
## fate uninfected lightly highly
## eaten 1 10 37
## not eaten 49 35 9
If we wish to calculate the corresponding relative frequencies, we can use the prop.table
function:
?prop.table
wormTable.rel.freq <- prop.table(wormTable, margin = 2) # margin=2 specifies divide by column totals
wormTable.rel.freq
## infection
## fate uninfected lightly highly
## eaten 0.0200000 0.2222222 0.8043478
## not eaten 0.9800000 0.7777778 0.1956522
NOTE: We need to “transpose” (rotate) our table to display the graph properly. For this we use the t
function:
mosaicplot(wormTable.transposed,
col = c("firebrick", "goldenrod1"), # define colours to use
cex.axis = 0.9, # change font size for axis labels
xlab = "Infection level",
ylab = "Relative frequency",
main = "") # don't show a main heading above the graph
Figure 2: The relative frequency of being eaten among 141 killifish that exhibited different levels of trematode parasitism.
The \(\chi\)2 contingency test (also known as association test) has assumptions that must be checked prior to proceeding:
To test these assumptions, we need to actually conduct the test, because in doing so R calculates the expected frequencies for us.
So, conduct the test using the chisq.test
function, and assign the output to an object:
The resulting object contains a lot of information, and we can get an idea of what it contains by using the names
function (the str
and inspect
functions aren’t really useful for objects that represent the output of statistical tests):
## [1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
## [7] "expected" "residuals" "stdres"
As you can see, one of the names is expected
. This is what holds our expected frequencies:
## infection
## fate uninfected lightly highly
## eaten 17.02128 15.31915 15.65957
## not eaten 32.97872 29.68085 30.34043
We see that all our assumptions are met: none of the categories have an expected frequency of less than one, and no more than 20% of the categories have expected frequencies less than five.
We can see the results of the \(\chi\)2 test by simply typing the name of the results object:
##
## Pearson's Chi-squared test
##
## data: wormTable
## X-squared = 69.756, df = 2, p-value = 7.124e-16
This shows a very large value of \(\chi\)2 (69.76) and a very small P-value (7.12 x 10-16), which is much smaller than our stated \(\alpha\). So we reject the null hypothesis.
The probability of being eaten is significantly associated with the level of trematode parasitism (\(\chi\)2 contingency test; df = 2; \(\chi\)2 = 69.76; P < 0.001). Based on Figure 2, the probability of being eaten increases substantially with increasing intensity of parasitism.
Getting started:
library
Data frame structure:
inspect
names
levels
Tabulation:
xtabs
prop.table
Manipulate data:
t
(base package, for transposing matrix)Graphs:
mosaicplot
Contingency analysis:
fisher.test
chisq.test