In order to assess the evidence against our null hypotheses of no difference in distributions or no relationship between the variables, we need to define a test statistic and find its distribution under the null hypothesis. The test statistic used with both types of tests is called the \(\mathbf{X^2}\) statistic (we want to call the statistic X-squared, not Chi-squared). The statistic compares the observed counts in the contingency table to the expected counts under the null hypothesis, with large differences between what we observed and what we would expect under the null leading to evidence against the null hypothesis. To help this statistic follow a named parametric distribution and to provide some insights into the sources of interesting differences from the null hypothesis, we standardize84 the difference between the observed and expected counts by the square root of the expected count. The \(\mathbf{X^2}\) statistic is based on the sum of these squared standardized differences,
\[\boldsymbol{X^2 = \sum^{RC}_{i=1}\left(\frac{\text{Observed}_i-\text{Expected}_i} {\sqrt{\text{Expected}_i}}\right)^2},\]
which is the sum over all (\(R\) times \(C\)) cells in the contingency table of the square of the difference between observed and expected cell counts divided by the square root of the expected cell count. To calculate this test statistic, it is useful to start with a table of expected cell counts to go with our contingency table of observed counts. The expected cell counts are easiest to understand in the homogeneity situation but are calculated the same in either scenario.
The idea underlying finding the expected cell counts is to determine how many observations we would expect in category \(c\) given the sample size in that group, \(\mathbf{n_{r\bullet}}\), if the null hypothesis is true. Under the null hypothesis, the conditional probabilities in each response category must be the same across all \(R\) groups. Consider Figure 2.78 where, under the null hypothesis, the probabilities of None, Some, and Marked are the same in both treatment groups. Specifically, we have \(\text{Pr}(None)=0.5\), \(\text{Pr}(Some)=0.167\), and \(\text{Pr}(Marked)=0.333\). With \(\mathbf{n_{Placebo\bullet}}=43\) and \(\text{Pr}(None)=0.50\), we would expect \(43*0.50=21.5\) subjects to be found in the Placebo, None combination if the null hypothesis were true. Similarly, with \(\text{Pr}(Some)=0.167\), we would expect \(43*0.167=7.18\) in the Placebo, Some cell. And for the Treated group with \(\mathbf{n_{Treated\bullet}}=41\), the expected count in the Marked improvement category would be \(41*0.333=13.65\). Those conditional probabilities came from aggregating across the rows because, under the null, the row (Treatment) should not matter. So the conditional probability was actually calculated as \(\mathbf{n_{\bullet c}/N}\) = total number of responses in category \(c\) divided by the table total. Since each expected cell count was a conditional probability times the number of observations in the row, we can re-write the expected cell count formula for row \(r\) and column \(c\) as:
\[\mathbf{Expected\ cell\ count_{rc} = \frac{(n_{r\bullet}*n_{\bullet c})}{N}} = \frac{(\text{row } r \text{ total }*\text{ column } c \text{ total})} {\text{table total}}.\]
Table 2.10 demonstrates the calculations of the expected cell counts using this formula for all 6 cells in the \(2\times 3\) table.
(ref:fig5-7) Stacked bar chart that could occur if the null hypothesis were true for the Arthritis study.
(ref:tab5-3) Demonstration of calculation of expected cell counts for Arthritis data.
|   | None | Some | Marked | Totals |
|---|---|---|---|---|
| Placebo | \(\boldsymbol{\dfrac{n_{\text{Placebo}\bullet}*n_{\bullet\text{None}}}{N} =\dfrac{43*42}{84} =\color{red}{\mathbf{21.5}}}\) | \(\boldsymbol{\dfrac{n_{\text{Placebo}\bullet}*n_{\bullet\text{Some}}}{N} =\dfrac{43*14}{84} =\color{red}{\mathbf{7.167}}}\) | \(\boldsymbol{\dfrac{n_{\text{Placebo}\bullet}*n_{\bullet\text{Marked}}}{N} =\dfrac{43*28}{84} =\color{red}{\mathbf{14.33}}}\) | \(\boldsymbol{n_{\text{Placebo}\bullet}=43}\) |
| Treated | \(\boldsymbol{\dfrac{n_{\text{Treated}\bullet}*n_{\bullet\text{None}}}{N} =\dfrac{41*42}{84} =\color{red}{\mathbf{20.5}}}\) | \(\boldsymbol{\dfrac{n_{\text{Treated}\bullet}*n_{\bullet\text{Some}}}{N} =\dfrac{41*14}{84} =\color{red}{\mathbf{6.83}}}\) | \(\boldsymbol{\dfrac{n_{\text{Treated}\bullet}*n_{\bullet\text{Marked}}}{N} =\dfrac{41*28}{84} =\color{red}{\mathbf{13.67}}}\) | \(\boldsymbol{n_{\text{Treated}\bullet}=41}\) |
| Totals | \(\boldsymbol{n_{\bullet\text{None}}=42}\) | \(\boldsymbol{n_{\bullet\text{Some}}=14}\) | \(\boldsymbol{n_{\bullet\text{Marked}}=28}\) | \(\boldsymbol{N=84}\) |
Of course, using R can help us avoid tedium like this… The main engine for results in this chapter is the `chisq.test` function. It operates on a table of counts that has been produced without row or column totals. For example, `Arthtable` below contains just the observed cell counts. Applying the `chisq.test` function85 to `Arthtable` provides a variety of useful output. For the moment, we are just going to extract the information in the "`expected`" attribute of the results from running this function (using `chisq.test(Arthtable)$expected`). These are the expected cell counts, which match the previous calculations except for some rounding in the hand-calculations.
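The output below can be reproduced roughly as follows — a sketch assuming the `Arthritis` data frame (available in the `vcd` package) and the `tally` function from the `mosaic` package, both of which appear in the permutation code later in this section:

library(mosaic)  # provides tally() and shuffle()
library(vcd)     # assumed source of the Arthritis data frame

# Table of observed counts: rows are Treatment, columns are Improved
Arthtable <- tally(~Treatment + Improved, data=Arthritis)
Arthtable

# Expected cell counts under the null hypothesis
chisq.test(Arthtable)$expected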
## Improved
## Treatment None Some Marked
## Placebo 29 7 7
## Treated 13 7 21
## Improved
## Treatment None Some Marked
## Placebo 21.5 7.166667 14.33333
## Treated 20.5 6.833333 13.66667
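As a sanity check on the formula, the expected cell counts can also be built in one step from the margin totals; this is a minimal base-R sketch using `Arthtable` from above:

# (row total * column total) / table total for every cell at once
outer(rowSums(Arthtable), colSums(Arthtable)) / sum(Arthtable)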
With the observed and expected cell counts in hand, we can turn our attention to calculating the test statistic. It is possible to lay out the "contributions" to the \(X^2\) statistic in a table format, allowing a simple way to finally calculate the statistic without losing any information. For each cell we need to find
\[\frac{\text{observed}-\text{expected}}{\sqrt{\text{expected}}},\]
square it, and then add all of these squared quantities up. In the current example, there are 6 cells to add up (\(R=2\) times \(C=3\)), shown in Table 2.11.
|   | None | Some | Marked |
|---|---|---|---|
| Placebo | \(\left(\frac{29-21.5}{\sqrt{21.5}}\right)^2=\color{red}{\mathbf{2.616}}\) | \(\left(\frac{7-7.167}{\sqrt{7.167}}\right)^2=\color{red}{\mathbf{0.004}}\) | \(\left(\frac{7-14.33}{\sqrt{14.33}}\right)^2=\color{red}{\mathbf{3.752}}\) |
| Treated | \(\left(\frac{13-20.5}{\sqrt{20.5}}\right)^2=\color{red}{\mathbf{2.744}}\) | \(\left(\frac{7-6.833}{\sqrt{6.833}}\right)^2=\color{red}{\mathbf{0.004}}\) | \(\left(\frac{21-13.67}{\sqrt{13.67}}\right)^2=\color{red}{\mathbf{3.935}}\) |
Finally, the \(X^2\) statistic here is the sum of these six results: \({\color{red}{2.616+0.004+3.752+2.744+0.004+3.935}}=13.055\).
Our favorite function in this chapter, `chisq.test`, does not provide the contributions to the \(X^2\) statistic directly. It provides a related quantity called the
\[\textbf{standardized residual}=\left(\frac{\text{Observed}_i - \text{Expected}_i}{\sqrt{\text{Expected}_i}}\right),\]
which, when squared (in R, squaring is accomplished using `^2`), is the contribution of that particular cell to the \(X^2\) statistic that is displayed in Table 2.11.
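The table of contributions displayed below can be obtained by squaring the `residuals` component of the `chisq.test` results — a sketch, again assuming `Arthtable` holds the observed counts:

# Squared standardized (Pearson) residuals = cell contributions to X^2
(chisq.test(Arthtable)$residuals)^2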
## Improved
## Treatment None Some Marked
## Placebo 2.616279070 0.003875969 3.751937984
## Treated 2.743902439 0.004065041 3.934959350
The most common error made in calculating the \(X^2\) statistic by hand involves having observed less than expected counts in some cells and then failing to make the \(X^2\) contribution positive for all cells (remember that you are squaring the entire quantity in the parentheses, so every contribution must end up positive!). In R, we can add up the cells using the `sum` function over the entire table of numbers, as sketched below.
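A minimal sketch, assuming (as the matching output suggests) that the squared Pearson residuals from `chisq.test` are what gets summed:

# Add up all six cell contributions to get the X^2 statistic
sum((chisq.test(Arthtable)$residuals)^2)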
## [1] 13.05502
Or we can let R do all this hard work for us and get straight to the good stuff, sketched below.
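Calling `chisq.test` directly on the table of observed counts produces the full test results; the `data: Arthtable` line in the output matches this call:

chisq.test(Arthtable)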
##
## Pearson's Chi-squared test
##
## data: Arthtable
## X-squared = 13.055, df = 2, p-value = 0.001463
The `chisq.test` function reports a p-value by default. Before we discover how it got that result, we can rely on our permutation methods to obtain a distribution for the \(X^2\) statistic under the null hypothesis. As in Chapters 2 and ??, this will allow us to find a p-value while relaxing one of our assumptions86.
In the One-Way ANOVA in Chapter ??, we permuted the
grouping variable relative
to the responses, mimicking the null hypothesis that the groups are the same
and so we can shuffle them around if the null is true. That same technique is
useful here. If we randomly permute the grouping variable used to form the rows
in the contingency table relative to the responses in the other variable and
track the possibilities available for the \(X^2\) statistic under
permutations, we can find the probability of getting a result as extreme as or
more extreme than what we observed assuming the null is true, our p-value.
The observed statistic is the
\(X^2\) calculated using the formula above.
As with the \(F\)-statistic, only results in the right tail of this distribution provide evidence against the null hypothesis, because deviations from the null in any direction make positive contributions to the statistic. You can see this by observing that
values of the \(X^2\) statistic close to 0 are generated when the
observed values are close to the expected values and that sort of result should
not be used to find evidence against the null. When the observed and expected
values are “far apart”, then we should find evidence against the null. It is
helpful to work through some examples to be able to understand how the \(X^2\)
statistic “measures” differences between observed and expected.
To start, compare the previous observed \(X^2\) of 13.055 to the sort of results we obtain in a single permutation of the treated/placebo labels – Figure 2.79 (top left panel) shows a permuted data set that produced \(X^{2*} = 0.62\). Visually, you can see only minimal differences between the treatment and placebo groups in the stacked bar chart. Three other permuted data sets are displayed in Figure 2.79, showing the variability in results across permutations, but none get close to showing the differences in the bars observed in the real data set in Figure 2.73.
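A single permuted data set like those in the figure can be generated along these lines — a sketch using `shuffle` and `tally` from `mosaic`, with the `PermTreatment` name chosen to match the output below:

# Permute the treatment labels to mimic the null hypothesis, then re-tabulate
Arthritis$PermTreatment <- shuffle(Arthritis$Treatment)
Arthpermtable <- tally(~PermTreatment + Improved, data=Arthritis)
Arthpermtable
chisq.test(Arthpermtable)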
## Improved
## PermTreatment None Some Marked
## Placebo 22 6 15
## Treated 20 8 13
##
## Pearson's Chi-squared test
##
## data: Arthpermtable
## X-squared = 0.47646, df = 2, p-value = 0.788
(ref:fig5-8) Stacked bar charts of four permuted Arthritis data sets that produced \(X^2\) between 0.62 and 2.38.
To build the permutation-based null distribution for the \(X^2\) statistic, we need to collect up the test statistics (\(X^{2*}\)) from many of these permuted results. The code is similar to that for the permutation tests in Chapters 2 and ?? except that each permutation generates a new contingency table that is summarized and provided to `chisq.test` to analyze. We extract the `$statistic` attribute of the results from running `chisq.test`.
(ref:fig5-9) Permutation distribution for the \(X^2\) statistic for the Arthritis data with an observed \(X^2\) of 13.1 (bold, vertical line).
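The observed statistic is stored as `Tobs` so it can be reused in the permutation loop and plots below; a sketch of its extraction, matching the output that follows:

# Observed X^2 statistic from the original (un-permuted) table
Tobs <- chisq.test(Arthtable)$statistic
Tobs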
## X-squared
## 13.05502
par(mfrow=c(1,2))
B <- 1000  # number of permutations
Tstar <- matrix(NA, nrow=B)  # storage for the permuted X^2 statistics
for (b in (1:B)){
  # Shuffle the treatment labels, re-tabulate, and store X^2* for each permutation
  Tstar[b] <- chisq.test(tally(~shuffle(Treatment) + Improved,
                               data=Arthritis))$statistic
}
# Proportion of permutations with X^2* at or above the observed statistic
pdata(Tstar, Tobs, lower.tail=F)[[1]]

## [1] 0.002

# Histogram and density curve of Tstar with the observed X^2 marked
hist(Tstar, xlim=c(0,Tobs+1))
abline(v=Tobs, col="red", lwd=3)
plot(density(Tstar), main="Density curve of Tstar",
     xlim=c(0,Tobs+1), lwd=2)
abline(v=Tobs, col="red", lwd=3)
For an observed \(X^2\) statistic of 13.055, two out of 1,000 permutation results matched or exceeded this value (`pdata` returned a value of 0.002) as displayed in Figure 2.80. This suggests that our observed result is quite extreme relative to the null hypothesis and provides strong evidence against it.
**Validity conditions** for a permutation \(X^2\) test are:

* Independence of observations.
* Both variables are categorical.
* Expected cell counts \(> 0\) (otherwise \(X^2\) is not defined).
For the permutation approach described here to provide valid inferences, we need to be working with observations that are independent of one another. One way that a violation of independence can occur in this situation is when a single subject shows up in the table more than once, for example if a single individual completes a survey more than once and those results are reported as if they came from \(N\) independent individuals. Be careful about this, as it is really easy to make tables of poorly collected or non-independent observations and then consider them for these analyses. Poor data still lead to poor conclusions even if you have fancy new statistical tools to use!
Standardizing involves dividing by the standard deviation of a quantity so that it has a standard deviation of 1 regardless of its original variability, and that is what is happening here, even though it doesn't look like the standardization you are used to with continuous variables.↩
Note that in smaller data sets, to get results as discussed here, use the `correct=F` option. If you get output that contains "...with Yates' continuity correction", a slightly modified version of this test is being used.↩
Here it allows us to relax a requirement that all the expected cell counts are larger than 5 for the parametric test (Section 5.6).↩