This page was last updated on October 14, 2019.


Getting started

In this tutorial we will learn about the following:

  • the Fisher’s Exact Test
  • the \(\chi\)2 Contingency Test

Required packages

  • tigerstats

Load the package:


Required data

  • the “cancer.csv” dataset (described in Example 9.2 in the text, page 238)
  • the “worm.csv” dataset (described in Example 9.4 in the text, page 246)
## 
## categorical variables:  
##               name  class levels     n missing
## 1 aspirinTreatment factor      2 39876       0
## 2           cancer factor      2 39876       0
##                                    distribution
## 1 Placebo (50%), Aspirin (50%)                 
## 2 No cancer (92.8%), Cancer (7.2%)
## 
## categorical variables:  
##        name  class levels   n missing
## 1 infection factor      3 141       0
## 2      fate factor      2 141       0
##                                    distribution
## 1 uninfected (35.5%), highly (32.6%) ...       
## 2 not eaten (66%), eaten (34%)

Fisher’s Exact Test on a 2 x 2 table

When testing for an association between two categorical variables, the most common test that is used is the \(\chi\)2 contingency test, which is described below.

When the two categorical variables have only 2 categories each, the Fisher’s Exact test (a type of contingency test) provides an EXACT probability for a contingency test, and is therefore preferred over the \(\chi\)2 contingency test (below) when you have a computer to do the calculations.


Hypothesis statement

We’ll use the cancer study data again for this example, as described in example 9.2 (Page 238) in the text.

The hypotheses for this test:

H0: There is no association between the use of aspirin and the probability of developing cancer.
HA: There is an association between the use of aspirin and the probability of developing cancer.

  • We’ll use an \(\alpha\) level of 0.05.
  • It is a two-tailed alternative hypothesis
  • We’ll use a Fisher’s Exact test to test the null hypothesis, because this is the most powerful test when analyzing a 2 x 2 contingency table.
  • There is no test statistic for the Fisher’s Exact test, and nor does it use “degrees of freedom” (the latter is only required when we use a theoretical distribution for a test statistic)

HOWEVER: it is recommended that you report the “odds ratio” in your concluding statement, along with its appropriate confidence interval; this is a useful stand-in test statistic for the Fisher’s Exact Test

  • We also don’t need to worry about assumptions for this test
  • It is always a good idea to present a figure to accompany your analysis; in the case of a Fisher’s Exact test, the figure heading will include information about the sample size / total number of observations, whereas the concluding statement typically does not

Display a contingency table

To create a 2 x 2 contingency table from these data, we’ll make use of the xtabs function we’ve used before:

##            aspirinTreatment
## cancer      Aspirin Placebo
##   Cancer       1438    1427
##   No cancer   18496   18515

NOTE: When dealing with data from studies on human health (e.g. evaluating healthy versus sick subjects), it is convention to organize the contingency table as shown above, with (i) the outcome of interest (here, cancer) in the top row and the alternative outcome on the bottom row, and (ii) the treatment in the first column and placebo (control group) in the second column.

In order to display the data as shown above, it is important that the variable containing the outcome data (here, cancer) comes first after the ~ symbol.

When the data are not related to human health, you do not need to worry about the ordering of the rows of data.


Display a mosaic plot

Let’s visualize the data using a mosaic plot.

NOTE: We need to “transpose” (rotate) our table to display the graph properly. For this we use the t function:

Figure 1: The relative frequency of cancer among study subjects (all women) who received 100mg of aspirin every other day (_n_ = 19934) or a placebo (_n_ = 19942)

Figure 1: The relative frequency of cancer among study subjects (all women) who received 100mg of aspirin every other day (n = 19934) or a placebo (n = 19942)

NOTE: You’ll note that there is no y-axis in the mosaic plot, unlike those shown in the text book. This is OK. It is still showing relative frequency (hence the figure caption).


Conduct the Fisher’s Exact Test

To do the Fisher’s exact test on the cancer data, it is straightforward, using the “fisher.test” command:

?fisher.test
## 
##  Fisher's Exact Test for Count Data
## 
## data:  cancerTable
## p-value = 0.8311
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9342128 1.0892376
## sample estimates:
## odds ratio 
##   1.008744

The P-value associated with the test is 0.831, which is clearly greater than our \(\alpha\) of 0.05. We therefore do NOT reject the null hypothesis.

You’ll notice that the output includes the **odds ratio* and its 95% confidence interval. The interval it provides is slightly different from the one we calculated in the Odds Ratio tutorial, but is fine to report here.


Concluding statement

There is no evidence that the probability of developing cancer differs between the control group and the aspirin treatment group (Fisher’s Exact Test; P-value = 0.831; odds ratio = 1.01; 95% CI: 0.934 - 1.089). [NOTE: report odds ratios to 2 decimal places, and associated measures of uncertainty to 3 decimal places]


\(\chi\)2 Contingency Test on a m x n table

When the contingency table is of dimensions greater than 2 x 2, the most commonly applied test is the \(\chi\)2 Contingency Test.

For this activity we’re using the “worm” data associated with Example 9.4 on page 246 of the test. Please read the example!


Hypothesis statement

Here we have a 2 x 3 contingency table, and we’re testing for an association between two categorical variables.

Here are the null and alternative hypotheses:

H0: There is no association between the level of trematode parasitism and the frequency (or probability) of being eaten.
HA: There is an association between the level of trematode parasitism and the frequency (or probability) of being eaten.

  • We use \(\alpha\) = 0.05.
  • It is a two-tailed alternative hypothesis
  • We’ll use a contingency test to test the null hypothesis, because this is appropriate for analyzing for association between two categorical variables, and when the resulting contingency table has dimension 2 x 3 or greater.
  • We will use the \(\chi\)2 test statistic, with degrees of freedom equal to (r-1)(c-1), where “r” is the number of rows, and “c” is the number of colums, so (2-1)(3-1) = 2.
  • We must check assumptions of the \(\chi\)2 contingency test
  • It is always a good idea to present a figure to accompany your analysis; in the case of a contingency test, the figure heading will include information about the sample size / total number of observations

Visualize the contingency table

Let’s generate a contingency table:

##            infection
## fate        highly lightly uninfected
##   eaten         37      10          1
##   not eaten      9      35         49

Hmm, the ordering of the categories (or “levels”) of the categorical (factor) variable “infection” is the opposite to what it should be.

In order to change the order levels in a factor variable, here’s what you do (consult this resource for additional info).

First, check the current ordering of the levels using the levels function:

## [1] "highly"     "lightly"    "uninfected"

Now change the ordering as follows:

## [1] "uninfected" "lightly"    "highly"

Now re-display the contingency table:

##            infection
## fate        uninfected lightly highly
##   eaten              1      10     37
##   not eaten         49      35      9

If we wish to calculate the corresponding relative frequencies, we can use the prop.table function:

?prop.table
##            infection
## fate        uninfected   lightly    highly
##   eaten      0.0200000 0.2222222 0.8043478
##   not eaten  0.9800000 0.7777778 0.1956522

Visualize a mosaic plot

NOTE: We need to “transpose” (rotate) our table to display the graph properly. For this we use the t function:

Figure 2: The relative frequency of being eaten among 141 killifish that exhibited different levels of trematode parasitism.

Figure 2: The relative frequency of being eaten among 141 killifish that exhibited different levels of trematode parasitism.


Check the assumptions

The \(\chi\)2 contingency test (also known as association test) has assumptions that must be checked prior to proceeding:

  • none of the categories should have an expected frequency of less than one
  • no more than 20% of the categories should have expected frequencies less than five

To test these assumptions, we need to actually conduct the test, because in doing so R calculates the expected frequencies for us.

So, conduct the test using the chisq.test function, and assign the output to an object:

The resulting object contains a lot of information, and we can get an idea of what it contains by using the names function (the str and inspect functions aren’t really useful for objects that represent the output of statistical tests):

## [1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed" 
## [7] "expected"  "residuals" "stdres"

As you can see, one of the names is expected. This is what holds our expected frequencies:

##            infection
## fate        uninfected  lightly   highly
##   eaten       17.02128 15.31915 15.65957
##   not eaten   32.97872 29.68085 30.34043

We see that all our assumptions are met: none of the categories have an expected frequency of less than one, and no more than 20% of the categories have expected frequencies less than five.


Get the results of the test

We can see the results of the \(\chi\)2 test by simply typing the name of the results object:

## 
##  Pearson's Chi-squared test
## 
## data:  wormTable
## X-squared = 69.756, df = 2, p-value = 7.124e-16

This shows a very large value of \(\chi\)2 (69.76) and a very small P-value (7.12 x 10-16), which is much smaller than our stated \(\alpha\). So we reject the null hypothesis.


Concluding statement

The probability of being eaten is significantly associated with the level of trematode parasitism (\(\chi\)2 contingency test; df = 2; \(\chi\)2 = 69.76; P < 0.001). Based on Figure 2, the probability of being eaten increases substantially with increasing intensity of parasitism.


List of functions

Getting started:

  • library

Data frame structure:

  • inspect
  • names
  • levels

Tabulation:

  • xtabs
  • prop.table

Manipulate data:

  • t (base package, for transposing matrix)

Graphs:

  • mosaicplot

Contingency analysis:

  • fisher.test
  • chisq.test