This page was last updated on October 04, 2019.
In this tutorial we will learn about the following:
Each of these packages has been used in previous tutorials:
tigerstats
packageknitr
packageThis is a new one:
kableExtra
packageinstall.packages("kableExtra")
Import the nhlbirths.csv
data set and look at its structure:
nhlbirths <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/nhlbirths.csv"), header = TRUE)
str(nhlbirths)
## 'data.frame': 12 obs. of 3 variables:
## $ month : Factor w/ 12 levels "April","August",..: 5 4 8 1 9 7 6 2 12 11 ...
## $ num_players_born : int 141 128 122 120 117 122 95 91 80 89 ...
## $ canada_births_2000_2005: int 161151 151824 172419 169800 176455 171169 176618 173397 173116 166950 ...
This data frame has only 12 rows, so let’s look at it all:
## month num_players_born canada_births_2000_2005
## 1 January 141 161151
## 2 February 128 151824
## 3 March 122 172419
## 4 April 120 169800
## 5 May 117 176455
## 6 June 122 171169
## 7 July 95 176618
## 8 August 91 173397
## 9 September 80 173116
## 10 October 89 166950
## 11 November 74 156284
## 12 December 88 155695
This dataset shows the number of NHL hockey players (active in 2006) that were born in each month of the year (“num players born” variable). It also shows the total number of births per month in the general population, over the years 2000-2005.
And now let’s produce a nicer table showing just the first two columns.
OPTIONAL: This table formatting section is optional, but recommended
We’ll use the methods described in the Generating tables in R Markdown tutorial:
kable(nhlbirths[,1:2], format = "html",
caption = "Table 1: Frequency table of NHL player birth months for players active in 2006",
digits = c(0, 0), align = "cc",
col.names = c("Month", "Number of players born")) %>%
kable_styling(full_width = FALSE, position = "left")
Month | Number of players born |
---|---|
January | 141 |
February | 128 |
March | 122 |
April | 120 |
May | 117 |
June | 122 |
July | 95 |
August | 91 |
September | 80 |
October | 89 |
November | 74 |
December | 88 |
You can produce the appropriate symbol for “chi-square” using this syntax:
$\chi$^2^
produces \(\chi\)2
We are still learning how to test hypotheses about categorical data.
We previously learned how to test a hypothesis about frequencies or proportions when there are only two categories of interest, i.e. success and failure (a binary variable). We used a binomial test for this purpose. Importantly, one can start with a categorical variable with more than two categories (e.g. hair colour), but if we define one particular category (e.g. red hair) as a “success”, and the remaining categories as failures, then we have simplified our variable to a binary categorical variable.
For testing hypotheses about frequencies or proportions when there are more than two categories, we use goodness of fit (GOF) tests. We test how well an observed discrete frequency (or probability) distribution fits some specified expectation.
It is hypothesized that, among sports that use December 31st as the cutoff for age categories, kids born early in the calendar year (e.g. January) will have an advantage within their training years (i.e. when they’re young kids learning to play), because they are comparatively larger than kids in the same age group but born later in the year.
Evidence consistent with this hypothesis would come in the form of a significant bias towards early-year birthdates among professional athletes.
Identify the appropriate test statistic, and calculate the test statistic value using observed data
Simulate data and use them to construct a null distribution for the test statistic, under the assumption of a true null hypothesis
OR: Instead of simulating data to construct a null distribution, use an appropriate statistical test, which provides a theoretical null distribution
Draw the appropriate conclusion and communicate it clearly
Let’s test our research hypothesis using the NHL data. We need to construct an appropriate null and alternative statistical hypothesis, and the null hypothesis must be specific about the expected proportions or frequencies under a true null hypothesis (i.e. when nothing is going on).
If there is no advantage to professional athletes by being born earlier in the year, then the “null expectation” is that the birth dates of NHL players will be distributed among the 12 months in direct proportion to the fraction of the year encompassed by each month, i.e. the number of days in each month divided by 365 days.
This is an example of the proportional model.
H0: The probability of an NHL player being born in any given month is proportional to the fraction of days in that month out 365 days.
HA: The probability of an NHL player being born in any given month is not proportional to the fraction of days in that month out 365 days.
Thus, we are testing a hypothesis about a discrete probability distribution that includes one null proportion for each month of the year (thus 12 categories in the variable birth month).
This contrasts with a binomial test, in which we test a hypothesis about a single proportion.
Let’s visualize the data, as one should always do when testing hypotheses.
First, we’ll present the data frame in table format (see the section above for info about presenting a table like this in R Markdown):
kable(nhlbirths[,1:2], format = "html",
caption = "Table 1: Frequency table of NHL player birth months for players active in 2006",
digits = c(0, 0), align = "cc",
col.names = c("Month", "Number of players born")) %>%
kable_styling(full_width = FALSE, position = "left")
Month | Number of players born |
---|---|
January | 141 |
February | 128 |
March | 122 |
April | 120 |
May | 117 |
June | 122 |
July | 95 |
August | 91 |
September | 80 |
October | 89 |
November | 74 |
December | 88 |
Now we’ll display a bar chart, which is appropriate for categorical data.
barplot(nhlbirths$num_players_born,
names.arg = nhlbirths$month,
las = 2,
xlab = "", # no room for x-axis label on graph
ylab = "Number of births")
Figure 1: Frequency of births per month of NHL players active in 2006
You can see from Figure 1 that there does seem to be considerable variation in the number of births across the months, and that in general there are more births than expected in the early months of the calendar year, and fewer births than expected in the later months of the year. Now we need to figure how whether these discrepancies (i.e. differences from expected) are large enough to be considered interesting, i.e. to warrant rejecting the null hypothesis in favour of the alternative.
The appropriate test statistic is the the \(\chi\)2 statistic, which measures the discrepancy between observed and expected frequencies
The \(\chi\)2 goodness-of-fit (GOF) test is the appropriate test to use here: it evaluates discrepancies between observed frequencies (for more than 2 categories) and those expected based on some appropriate model.
For this example, we’re using the Proportional model
You don’t need to know how to create this graph, nor do you need to include this kind of graph in any assignment:
curve(dchisq(x, df = 11),
main = "",
from = 0, to = 50,
ylab = "Probability density",
xlab = expression(paste(chi^2, " value")),
las = 1)
The chi-square probability distribution for degrees of freedom = 11.
Note that this is a continuous probability distribution, so rather than asking “what’s the probability of observing a given value of the test statistic”, we’re going to ask questions like “what’s the probability of observing a test statistic value greater than some value”; thus, we’re dealing with areas under the curve.
In the simplest scenario, the expected proportions for a proportional model would simply be 1 over the number of categories. Thus, in this example, it would be 1/12.
However, in our example the proportion associated with each month differs depending upon the number of days in the month: For example, the probability of giving birth in February, which as 28 days, is smaller (all else being equal) than in a month with 31 days such as January.
We therefore need to calculate the correct probabilities (proportions) for each month.
First let’s figure out how many players we have birthdates for, i.e. the total count of birth dates:
## [1] 1267
Now let’s create a vector object that holds the number of days in each month, and add this to the nhlbirths
data frame as a new variable called days.per.month
:
nhlbirths$days.per.month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
sum(nhlbirths$days.per.month) # make sure adds up to 365
## [1] 365
Now let’s calculate the expected proportion of births per month, and add this to the data frame as a new variable:
nhlbirths$expected.proportions <- nhlbirths$days.per.month / sum(nhlbirths$days.per.month) # the expected proportions
nhlbirths$expected.proportions
## [1] 0.08493151 0.07671233 0.08493151 0.08219178 0.08493151 0.08219178
## [7] 0.08493151 0.08493151 0.08219178 0.08493151 0.08219178 0.08493151
When using R to conduct Goodness of fit tests, you provide the function with the expected proportions for each category (see below).
When conducting Goodness of fit tests by hand, you must calculate expected frequencies.
We use the expected proportions to calculate the expected frequencies of births in each month.
We do this by multiplying the expected proportions by the total number of NHL births:
nhlbirths$expected.freqs <- nhlbirths$expected.proportions * total.nhl.births
nhlbirths$expected.freqs
## [1] 107.60822 97.19452 107.60822 104.13699 107.60822 104.13699 107.60822
## [8] 107.60822 104.13699 107.60822 104.13699 107.60822
NOTE: Expected frequencies do not need to be whole numbers (integers).
The \(\chi\)2 goodness of fit test has assumptions that must be checked prior to proceeding:
For our current example, these assumptions are met.
When we conduct the goodness of fit test, R will tell us what the P-value is associated with our observed value of \(\chi\)2
However, we can also use the qchisq
command to determine what the critical value of \(\chi\)2 is for our test; that is, beyond what value of \(\chi\)2 (along the x-axis in the sampling distribution figure above) is the area under the curve equal to our \(\alpha\) = 0.05?
?qchisq
NOTE: Values of \(\chi\)2 increase with increasing overall discrepancy between observed and expected frequencies. We therefore only consider the right-hand tail of the \(\chi\)2 distribution. This applies to all tests that use the \(\chi\)2 test statistic
Use this command to find the critical value of \(\chi\)2 for \(\alpha\) = 0.05 and degrees of freedom = 11:
## [1] 19.67514
Let’s show this critical value on our distribution:
curve(dchisq(x, df = 11),
main = "",
from = 0,
to = 50,
ylab = "Probability density",
xlab = expression(paste(chi^2, " value")),
las = 1)
segments(qchisq(0.05, 11, lower.tail = FALSE), 0, qchisq(0.05, 11, lower.tail = FALSE), 0.08, lwd = 2, col = "red")
The chi-square probability distribution for degrees of freedom = 11. The critical value for alpha = 0.05 is shown.
So values of \(\chi\)2 that fall on or to the right of the red line will be considered “significant” at \(\alpha\) = 0.05.
NOTE: The \(\chi\)2 probability distribution changes form depending upon the degrees of freedom. Thus, the critical value associated with \(\alpha\) = 0.05 will also change according to the degrees of freedom.
The text book includes a table of critical \(\chi\)2 values, for different degrees of freedom, on pages 703-705.
For this test, we’ll use the chisqtestGC
function from the tigerstats
package:
?chisqtestGC
This function can be used on raw data (“long format”) or on summarized data.
We have the summarized data in our data frame, in the variable num_players_born
. So, we need to provide this as a vector to the function. We will show an example with raw data in future tutorials.
The function also optionally provides a graph of the probability distribution, with a shaded area corresponding to the area under the curve to the right of the observed test statistic value.
Also, we need to provide a vector of expected proportions to the argument p
:
chisq.results <- chisqtestGC(nhlbirths$num_players_born,
p = nhlbirths$expected.proportions,
graph = FALSE)
chisq.results
## Chi-squared test for given probabilities
##
## Observed counts Expected by Null Contr to chisq stat
## A 141 107.61 10.36
## B 128 97.19 9.76
## C 122 107.61 1.92
## D 120 104.14 2.42
## E 117 107.61 0.82
## F 122 104.14 3.06
## G 95 107.61 1.48
## H 91 107.61 2.56
## I 80 104.14 5.59
## J 89 107.61 3.22
## K 74 104.14 8.72
## L 88 107.61 3.57
##
##
## Chi-Square Statistic = 53.4979
## Degrees of Freedom of the table = 11
## P-Value = 0
The resulting output includes lots of information, including a table of observed frequencies, expected frequencies, and “Contr to chisq stat”, which is the contribution of that observation to the overall \(\chi\)2 test statistic value. The bigger the value, the larger the discrepancy between the observed and expected frequencies.
The output also includes the calculated test statistic value of \(\chi\)2, which here is 53.5. This is clearly far beyond the critical value of \(\chi\)2 in the figure above.
Correspondintly, the P-value provided in the output (which represents the area under the curve beyond the calculated \(\chi\)2 value) is very small: 0.
The probability of birth for NHL players is not the same for each month (\(\chi\)2 goodness of fit test; \(\chi\)2 = 53.5; df = 11; P < 0.001). Based on our Figure 1 (showing the observed frequencies) we see that the frequency of births was high in January and February, and low in September and November. More generally, there tends to be a higher frequency of births in the first half of the year compared to the second half.
NOTE: There is no need to report a confidence interval when reporting the results of a \(\chi\)2 goodness-of-fit test.
The expected proportions for each category can be defined in any manner, depending on the null hypothesis.
Test the following null and alternative hypotheses:
H0: The probability of an NHL player being born in any given month is equal to the proportion of births in the general population that occurr in each month.
HA: The probability of an NHL player being born in any given month is not equal to the proportion of births in the general population that occurr in each month.
Use the frequency data within the canada_births_2000_2005
variable in the nhlbirths
data frame to calculate the expected proportions.
Getting started:
library
Data frame structure:
str
inspect
Create tables:
kable
Math:
sum
Graphs:
barplot
Goodness of fit test:
chisqtestGC
(from the tigerstats
package)