This page was last updated on October 04, 2019.


Getting started

In this tutorial we will learn about the following:

  • Conduct a \(\chi\)2 goodness of fit test using a proportional model

Required packages

Each of these packages has been used in previous tutorials:

  • load the tigerstats package
  • load the knitr package

This is a new one:

  • install and load the kableExtra package
install.packages("kableExtra")

Required data

Import the nhlbirths.csv data set and look at its structure:

## 'data.frame':    12 obs. of  3 variables:
##  $ month                  : Factor w/ 12 levels "April","August",..: 5 4 8 1 9 7 6 2 12 11 ...
##  $ num_players_born       : int  141 128 122 120 117 122 95 91 80 89 ...
##  $ canada_births_2000_2005: int  161151 151824 172419 169800 176455 171169 176618 173397 173116 166950 ...

This data frame has only 12 rows, so let’s look at it all:

##        month num_players_born canada_births_2000_2005
## 1    January              141                  161151
## 2   February              128                  151824
## 3      March              122                  172419
## 4      April              120                  169800
## 5        May              117                  176455
## 6       June              122                  171169
## 7       July               95                  176618
## 8     August               91                  173397
## 9  September               80                  173116
## 10   October               89                  166950
## 11  November               74                  156284
## 12  December               88                  155695

This dataset shows the number of NHL hockey players (active in 2006) that were born in each month of the year (“num players born” variable). It also shows the total number of births per month in the general population, over the years 2000-2005.

And now let’s produce a nicer table showing just the first two columns.


Presenting a nice HTML table using R Markdown

OPTIONAL: This table formatting section is optional, but recommended

We’ll use the methods described in the Generating tables in R Markdown tutorial:

Table 1: Frequency table of NHL player birth months for players active in 2006
Month Number of players born
January 141
February 128
March 122
April 120
May 117
June 122
July 95
August 91
September 80
October 89
November 74
December 88

Greek / math symbols in R Markdown

You can produce the appropriate symbol for “chi-square” using this syntax:

$\chi$^2^ produces \(\chi\)2


Refresher

We are still learning how to test hypotheses about categorical data.

We previously learned how to test a hypothesis about frequencies or proportions when there are only two categories of interest, i.e. success and failure (a binary variable). We used a binomial test for this purpose. Importantly, one can start with a categorical variable with more than two categories (e.g. hair colour), but if we define one particular category (e.g. red hair) as a “success”, and the remaining categories as failures, then we have simplified our variable to a binary categorical variable.

For testing hypotheses about frequencies or proportions when there are more than two categories, we use goodness of fit (GOF) tests. We test how well an observed discrete frequency (or probability) distribution fits some specified expectation.


A GOF test example

The research hypothesis

It is hypothesized that, among sports that use December 31st as the cutoff for age categories, kids born early in the calendar year (e.g. January) will have an advantage within their training years (i.e. when they’re young kids learning to play), because they are comparatively larger than kids in the same age group but born later in the year.

Evidence consistent with this hypothesis would come in the form of a significant bias towards early-year birthdates among professional athletes.


Steps to hypothesis testing

  • State the null (H0) and alternative (HA) hypotheses
  • Set an \(\alpha\) level (usually 0.05)
  • Determine whether a one-tailed or two-tailed test is appropriate (usually the latter)
  • Identify the appropriate test statistic, and calculate the test statistic value using observed data

  • Simulate data and use them to construct a null distribution for the test statistic, under the assumption of a true null hypothesis
    OR: Instead of simulating data to construct a null distribution, use an appropriate statistical test, which provides a theoretical null distribution

  • Evaluate any assumptions associated the statistical test
  • Use the null distribution to determine the P-value associated with the observed test statistic
  • Draw the appropriate conclusion and communicate it clearly


The statistical hypotheses

Let’s test our research hypothesis using the NHL data. We need to construct an appropriate null and alternative statistical hypothesis, and the null hypothesis must be specific about the expected proportions or frequencies under a true null hypothesis (i.e. when nothing is going on).

If there is no advantage to professional athletes by being born earlier in the year, then the “null expectation” is that the birth dates of NHL players will be distributed among the 12 months in direct proportion to the fraction of the year encompassed by each month, i.e. the number of days in each month divided by 365 days.

This is an example of the proportional model.

H0: The probability of an NHL player being born in any given month is proportional to the fraction of days in that month out 365 days.
HA: The probability of an NHL player being born in any given month is not proportional to the fraction of days in that month out 365 days.

Thus, we are testing a hypothesis about a discrete probability distribution that includes one null proportion for each month of the year (thus 12 categories in the variable birth month).

This contrasts with a binomial test, in which we test a hypothesis about a single proportion.


Visualizing the data

Let’s visualize the data, as one should always do when testing hypotheses.

First, we’ll present the data frame in table format (see the section above for info about presenting a table like this in R Markdown):

Table 1: Frequency table of NHL player birth months for players active in 2006
Month Number of players born
January 141
February 128
March 122
April 120
May 117
June 122
July 95
August 91
September 80
October 89
November 74
December 88

Now we’ll display a bar chart, which is appropriate for categorical data.

Figure 1: Frequency of births per month of NHL players active in 2006

Figure 1: Frequency of births per month of NHL players active in 2006

You can see from Figure 1 that there does seem to be considerable variation in the number of births across the months, and that in general there are more births than expected in the early months of the calendar year, and fewer births than expected in the later months of the year. Now we need to figure how whether these discrepancies (i.e. differences from expected) are large enough to be considered interesting, i.e. to warrant rejecting the null hypothesis in favour of the alternative.


The \(\chi\)2 test statistic

  • The appropriate test statistic is the the \(\chi\)2 statistic, which measures the discrepancy between observed and expected frequencies

  • The \(\chi\)2 goodness-of-fit (GOF) test is the appropriate test to use here: it evaluates discrepancies between observed frequencies (for more than 2 categories) and those expected based on some appropriate model.

For this example, we’re using the Proportional model


The sampling distribution of the \(\chi\)2 test statistic

  • The sampling distribution for the \(\chi\)2 statistic changes shape depending upon the degrees of freedom, which equals the number of categories minus 1; so here, 12-1 = 11

You don’t need to know how to create this graph, nor do you need to include this kind of graph in any assignment:

The chi-square probability distribution for degrees of freedom = 11.

The chi-square probability distribution for degrees of freedom = 11.

Note that this is a continuous probability distribution, so rather than asking “what’s the probability of observing a given value of the test statistic”, we’re going to ask questions like “what’s the probability of observing a test statistic value greater than some value”; thus, we’re dealing with areas under the curve.


Calculating the expected proportions

In the simplest scenario, the expected proportions for a proportional model would simply be 1 over the number of categories. Thus, in this example, it would be 1/12.

However, in our example the proportion associated with each month differs depending upon the number of days in the month: For example, the probability of giving birth in February, which as 28 days, is smaller (all else being equal) than in a month with 31 days such as January.

We therefore need to calculate the correct probabilities (proportions) for each month.

First let’s figure out how many players we have birthdates for, i.e. the total count of birth dates:

## [1] 1267

Now let’s create a vector object that holds the number of days in each month, and add this to the nhlbirths data frame as a new variable called days.per.month:

## [1] 365

Now let’s calculate the expected proportion of births per month, and add this to the data frame as a new variable:

##  [1] 0.08493151 0.07671233 0.08493151 0.08219178 0.08493151 0.08219178
##  [7] 0.08493151 0.08493151 0.08219178 0.08493151 0.08219178 0.08493151

Calculating the expected frequencies

When using R to conduct Goodness of fit tests, you provide the function with the expected proportions for each category (see below).

When conducting Goodness of fit tests by hand, you must calculate expected frequencies.

We use the expected proportions to calculate the expected frequencies of births in each month.

We do this by multiplying the expected proportions by the total number of NHL births:

##  [1] 107.60822  97.19452 107.60822 104.13699 107.60822 104.13699 107.60822
##  [8] 107.60822 104.13699 107.60822 104.13699 107.60822

NOTE: Expected frequencies do not need to be whole numbers (integers).


Assumptions of the \(\chi\)2 GOF test

The \(\chi\)2 goodness of fit test has assumptions that must be checked prior to proceeding:

  • none of the categories should have an expected frequency of less than one
  • no more than 20% of the categories should have expected frequencies less than five

For our current example, these assumptions are met.


Finding the critical value of \(\chi\)2

When we conduct the goodness of fit test, R will tell us what the P-value is associated with our observed value of \(\chi\)2

However, we can also use the qchisq command to determine what the critical value of \(\chi\)2 is for our test; that is, beyond what value of \(\chi\)2 (along the x-axis in the sampling distribution figure above) is the area under the curve equal to our \(\alpha\) = 0.05?

?qchisq

NOTE: Values of \(\chi\)2 increase with increasing overall discrepancy between observed and expected frequencies. We therefore only consider the right-hand tail of the \(\chi\)2 distribution. This applies to all tests that use the \(\chi\)2 test statistic

Use this command to find the critical value of \(\chi\)2 for \(\alpha\) = 0.05 and degrees of freedom = 11:

## [1] 19.67514

Let’s show this critical value on our distribution:

The chi-square probability distribution for degrees of freedom = 11. The critical value for alpha = 0.05 is shown.

The chi-square probability distribution for degrees of freedom = 11. The critical value for alpha = 0.05 is shown.

So values of \(\chi\)2 that fall on or to the right of the red line will be considered “significant” at \(\alpha\) = 0.05.

NOTE: The \(\chi\)2 probability distribution changes form depending upon the degrees of freedom. Thus, the critical value associated with \(\alpha\) = 0.05 will also change according to the degrees of freedom.

The text book includes a table of critical \(\chi\)2 values, for different degrees of freedom, on pages 703-705.


Conduct the \(\chi\)2 GOF test:

For this test, we’ll use the chisqtestGC function from the tigerstats package:

?chisqtestGC

This function can be used on raw data (“long format”) or on summarized data.

We have the summarized data in our data frame, in the variable num_players_born. So, we need to provide this as a vector to the function. We will show an example with raw data in future tutorials.

The function also optionally provides a graph of the probability distribution, with a shaded area corresponding to the area under the curve to the right of the observed test statistic value.

Also, we need to provide a vector of expected proportions to the argument p:

## Chi-squared test for given probabilities 
## 
##   Observed counts Expected by Null Contr to chisq stat
## A             141           107.61               10.36
## B             128            97.19                9.76
## C             122           107.61                1.92
## D             120           104.14                2.42
## E             117           107.61                0.82
## F             122           104.14                3.06
## G              95           107.61                1.48
## H              91           107.61                2.56
## I              80           104.14                5.59
## J              89           107.61                3.22
## K              74           104.14                8.72
## L              88           107.61                3.57
## 
## 
## Chi-Square Statistic = 53.4979 
## Degrees of Freedom of the table = 11 
## P-Value = 0

The resulting output includes lots of information, including a table of observed frequencies, expected frequencies, and “Contr to chisq stat”, which is the contribution of that observation to the overall \(\chi\)2 test statistic value. The bigger the value, the larger the discrepancy between the observed and expected frequencies.

The output also includes the calculated test statistic value of \(\chi\)2, which here is 53.5. This is clearly far beyond the critical value of \(\chi\)2 in the figure above.

Correspondintly, the P-value provided in the output (which represents the area under the curve beyond the calculated \(\chi\)2 value) is very small: 0.


Concluding statement

The probability of birth for NHL players is not the same for each month (\(\chi\)2 goodness of fit test; \(\chi\)2 = 53.5; df = 11; P < 0.001). Based on our Figure 1 (showing the observed frequencies) we see that the frequency of births was high in January and February, and low in September and November. More generally, there tends to be a higher frequency of births in the first half of the year compared to the second half.

NOTE: There is no need to report a confidence interval when reporting the results of a \(\chi\)2 goodness-of-fit test.


  1. Practice goodness of fit problem

The expected proportions for each category can be defined in any manner, depending on the null hypothesis.

Test the following null and alternative hypotheses:

H0: The probability of an NHL player being born in any given month is equal to the proportion of births in the general population that occurr in each month.

HA: The probability of an NHL player being born in any given month is not equal to the proportion of births in the general population that occurr in each month.

Use the frequency data within the canada_births_2000_2005 variable in the nhlbirths data frame to calculate the expected proportions.


List of functions

Getting started:

  • library

Data frame structure:

  • str
  • inspect

Create tables:

  • kable

Math:

  • sum

Graphs:

  • barplot

Goodness of fit test:

  • chisqtestGC (from the tigerstats package)