13.4 Chi-square: chisq.test()
Next, we’ll cover chi-square tests. In a chi-square test, we test whether or not there is a difference in the rates of outcomes on a nominal scale (like sex, eye color, first name, etc.). The test statistic of a chi-square test is \(\chi^2\), which can range from 0 to Infinity. The null hypothesis of a chi-square test is that \(\chi^2\) = 0, which means that there is no relationship between the variables.
A key difference between chisq.test() and the other hypothesis tests we’ve covered is that chisq.test() requires a table created using the table() function as its main argument. You’ll see how this works when we get to the examples.
13.4.0.1 1-sample Chi-square test
If you conduct a 1-sample chi-square test, you are testing if there is a difference in the number of members of each category in the vector. Or in other words, are all category memberships equally prevalent? Here’s the general form of a one-sample chi-square test:
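(In the sketch below, x is just a placeholder for whatever nominal vector you want to test.)
# General form of a one-sample chi-square test
#  (x is a placeholder for a nominal vector)
chisq.test(x = table(x))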
As you can see, the main argument to chisq.test() should be a table of values created using the table() function. For example, let’s conduct a chi-square test to see if all pirate colleges are equally prevalent in the pirates data. We’ll start by creating a table of the college data:
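The call is sketched below; the counts shown are reconstructed from the contingency table printed later in this section (225 + 433 = 658 for CCCC and 117 + 225 = 342 for JSSFP), and they also reproduce the test statistic reported below.
# Table of pirate college memberships
#  (counts reconstructed from the contingency table in the next subsection)
table(pirates$college)
##
##  CCCC JSSFP
##   658   342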
Just by looking at the table, it looks like pirates are much more likely to come from Captain Chunk’s Cannon Crew (CCCC) than Jack Sparrow’s School of Fashion and Piratery (JSSFP). For this reason, we should expect a very large test statistic and a very small p-value. Let’s test it using the chisq.test() function.
# Are all colleges equally prevalent?
college.cstest <- chisq.test(x = table(pirates$college))
college.cstest
##
## Chi-squared test for given probabilities
##
## data: table(pirates$college)
## X-squared = 100, df = 1, p-value <2e-16
Indeed, with a test statistic of 99.86 and a tiny p-value, we can safely reject the null hypothesis and conclude that certain colleges are more popular than others.
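In case you’re curious where that value comes from, the statistic follows the standard goodness-of-fit formula \(\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}\). With 1,000 pirates split 658 to 342 (the counts implied by the contingency table below) against an expected 500 per college, \(\chi^2 = \frac{(658 - 500)^2}{500} + \frac{(342 - 500)^2}{500} \approx 99.86\).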
13.4.0.2 2-sample Chi-square test
If you want to see if the frequency of one nominal variable depends on a second nominal variable, you’d conduct a 2-sample chi-square test. For example, we might want to know if there is a relationship between the college a pirate went to, and whether or not he/she wears an eyepatch. We can get a contingency table of the data from the pirates dataframe as follows:
# Do pirates that wear eyepatches have come from different colleges
# than those that do not wear eyepatches?
table(pirates$eyepatch,
pirates$college)
##
## CCCC JSSFP
## 0 225 117
## 1 433 225
To conduct a chi-square test on these data, we will enter a table of the two data vectors:
# Is there a relationship between a pirate's
# college and whether or not they wear an eyepatch?
colpatch.cstest <- chisq.test(x = table(pirates$college,
pirates$eyepatch))
colpatch.cstest
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(pirates$college, pirates$eyepatch)
## X-squared = 0, df = 1, p-value = 1
It looks like we got a test statistic of \(\chi^2\) = 0 and a p-value of 1 (indeed, about 66% of pirates wear an eyepatch in both colleges: 433 / 658 and 225 / 342 are both roughly 0.66). At the traditional p = .05 threshold for significance, we fail to reject the null hypothesis and conclude that we do not have enough evidence to determine whether pirates from different colleges differ in how likely they are to wear an eyepatch.
13.4.1 Getting APA-style conclusions with the apa function
Most people think that R pirates are a completely unhinged, drunken bunch of pillaging buffoons. But nothing could be further from the truth! R pirates are a very organized and formal people who like their statistical output to follow strict rules. The most famous rules are those written by the American Pirate Association (APA). These rules specify exactly how an R pirate should report the results of the most common hypothesis tests to her fellow pirates.
For example, in reporting a t-test, APA style dictates that the result should be in the form t(df) = X, p = Y (Z-tailed), where df is the degrees of freedom of the test, X is the test statistic, Y is the p-value, and Z is the number of tails in the test. Now you can of course read these values directly from the test result, but if you want to save some time and get the APA-style conclusion quickly, just use the apa function. Here’s how it works:
Consider the following two-sample t-test on the pirates dataset that compares whether or not there is a significant age difference between pirates who wear headbands and those who do not:
test.result <- t.test(age ~ headband,
data = pirates)
test.result
##
## Welch Two Sample t-test
##
## data: age by headband
## t = 0.4, df = 135, p-value = 0.7
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -1.0 1.5
## sample estimates:
## mean in group no mean in group yes
## 28 27
It looks like the test statistic is 0.35, the degrees of freedom are 135.47, and the p-value is 0.73. Let’s see how the apa function gets these values directly from the test object:
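A sketch of the call is below; the quoted output is reconstructed from the values reported above and the t(df) = X, p = Y (Z-tailed) format described earlier, so the exact string printed by yarrr::apa() may differ slightly in its wording.
# Print an APA-style conclusion of the t-test
yarrr::apa(test.result)
## [1] "t(135.47) = 0.35, p = 0.73 (2-tailed)"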
As you can see, the apa function got the values we wanted and reported them in proper APA style. The apa function will even automatically adapt the output for Chi-Square and correlation tests if you enter such a test object. Let’s see how it works on a correlation test where we correlate a pirate’s age with the number of parrots she has owned:
# Print an APA style conclusion of the correlation
# between a pirate's age and # of parrots
age.parrots.ctest <- cor.test(formula = ~ age + parrots,
data = pirates)
# Print result
age.parrots.ctest
##
## Pearson's product-moment correlation
##
## data: age and parrots
## t = 6, df = 998, p-value = 1e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.13 0.25
## sample estimates:
## cor
## 0.19
# Print the apa style conclusion!
yarrr::apa(age.parrots.ctest)
## [1] "r = 0.19, t(998) = 6.13, p < 0.01 (2-tailed)"
The apa function has a few optional arguments that control things like the number of significant digits in the output and the number of tails in the test. Run ?apa to see all the options.
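As a rough illustration only, a call using those optional arguments might look like the sketch below; the argument names sig.digits and tails are assumptions here, so check ?apa for the actual names and their defaults.
# Hypothetical call with optional arguments
#  (sig.digits and tails are assumed argument names -- see ?apa)
yarrr::apa(age.parrots.ctest, sig.digits = 3, tails = 2)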