This page was last updated on October 04, 2019.


Activity 1

  1. Practice goodness of fit problem

The expected proportions for each category can be defined in any manner, depending on the null hypothesis.

Test the following null and alternative hypotheses:

H0: The probability of an NHL player being born in any given month is equal to the proportion of births in the general population that occur in each month.

HA: The probability of an NHL player being born in any given month is not equal to the proportion of births in the general population that occur in each month.

Use the frequency data within the canada_births_2000_2005 variable in the nhlbirths data frame to calculate the expected proportions.


  • We’ll use an \(\alpha\) level of 0.05.

Create a vector to store the proportion of general population births in each month; divide the general population births in each month by the sum of general population births over the year. Do the same for the NHL population.
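A minimal sketch of this step, using the monthly NHL counts that appear in the "Observed counts" column of the test output further down this page (January through December). The names `nhl.counts` and `to.proportions` are illustrative and not taken from the tutorial.

```r
# Monthly counts of NHL player births, January to December
# (copied from the "Observed counts" column of the test output on this page).
nhl.counts <- c(141, 128, 122, 120, 117, 122, 95, 91, 80, 89, 74, 88)

# Hypothetical helper: convert a vector of counts to proportions.
to.proportions <- function(counts) counts / sum(counts)

nhl.birth.prop <- to.proportions(nhl.counts)

# Sanity check: the proportions should sum to 1.
sum(nhl.birth.prop)

# With the tutorial's data frame loaded, the general population vector would
# follow the same pattern (not run here, because the data are not included):
# can.birth.prop <- to.proportions(nhlbirths$canada_births_2000_2005)
```

Dividing by `sum(counts)` rather than a hard-coded total keeps the helper correct if the counts are ever updated.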

Visualize the data

For each month we now have the observed proportion of births per month for NHL players, and the proportion of births per month in the general population. How should we visualize these data? A grouped bar graph is one useful way, but rather than showing frequencies, we’ll show proportions.

NOTE: In the Associations between two variables tutorial you learned how to create a grouped bar graph from raw (long format) data. Here we have summary data in the nhlbirths data frame, and you have not yet been taught how to build a grouped bar graph from summary data. The code is provided below for future reference.

The familiar barplot function is used, but we need to manipulate our data frame a bit in order to produce the desired graph.

In short, we need to produce a matrix (rather than a data frame) that includes 2 rows (one for each vector of proportions) and 12 columns (one for each month). For this, we need to recall some fundamental R code we learned in the introductory R tutorials, and we also need to introduce the as.matrix function, which coerces an object into a matrix class, and the t function, which stands for “transpose”:

##                      [,1]      [,2]       [,3]       [,4]       [,5]
## nhl.birth.prop 0.11128650 0.1010260 0.09629045 0.09471192 0.09234412
## can.birth.prop 0.08037945 0.0757273 0.08599975 0.08469343 0.08801284
##                      [,6]       [,7]       [,8]       [,9]      [,10]
## nhl.birth.prop 0.09629045 0.07498027 0.07182320 0.06314128 0.07024467
## can.birth.prop 0.08537627 0.08809414 0.08648756 0.08634740 0.08327190
##                     [,11]      [,12]
## nhl.birth.prop 0.05840568 0.06945541
## can.birth.prop 0.07795188 0.07765809
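The matrix printed above can be reproduced with a sketch along these lines. The two proportion vectors are typed in from the printed output so that the example runs on its own; the object names `temp.data` and `two.props` are assumptions, and the tutorial's own code may differ.

```r
# Proportions copied from the printed matrix above (January to December).
nhl.birth.prop <- c(0.11128650, 0.1010260, 0.09629045, 0.09471192,
                    0.09234412, 0.09629045, 0.07498027, 0.07182320,
                    0.06314128, 0.07024467, 0.05840568, 0.06945541)
can.birth.prop <- c(0.08037945, 0.0757273, 0.08599975, 0.08469343,
                    0.08801284, 0.08537627, 0.08809414, 0.08648756,
                    0.08634740, 0.08327190, 0.07795188, 0.07765809)

# Combine the vectors as columns of a data frame (12 rows, 2 columns) ...
temp.data <- data.frame(nhl.birth.prop, can.birth.prop)

# ... coerce to a matrix, then transpose to get 2 rows and 12 columns.
two.props <- t(as.matrix(temp.data))

dim(two.props)  # 2 12
two.props
```

Calling `rbind(nhl.birth.prop, can.birth.prop)` would build the same 2 x 12 matrix in one step; the `as.matrix` and `t` route is shown because those are the functions the text introduces.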

Now we can use the ‘barplot’ function as follows:
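One way the call might look, as a sketch rather than the tutorial's exact code. The matrix is rebuilt here so the example is self-contained, and the axis labels, colours, and legend placement are illustrative choices.

```r
# Proportions copied from the printed matrix above (January to December).
nhl.birth.prop <- c(0.11128650, 0.1010260, 0.09629045, 0.09471192,
                    0.09234412, 0.09629045, 0.07498027, 0.07182320,
                    0.06314128, 0.07024467, 0.05840568, 0.06945541)
can.birth.prop <- c(0.08037945, 0.0757273, 0.08599975, 0.08469343,
                    0.08801284, 0.08537627, 0.08809414, 0.08648756,
                    0.08634740, 0.08327190, 0.07795188, 0.07765809)
two.props <- t(as.matrix(data.frame(nhl.birth.prop, can.birth.prop)))

# beside = TRUE draws the two rows as side-by-side (grouped) bars
# instead of stacking them; one group of two bars per month.
bar.mids <- barplot(two.props,
                    beside = TRUE,
                    names.arg = month.abb,   # built-in "Jan", "Feb", ...
                    las = 2,                 # rotate the month labels
                    ylim = c(0, 0.12),
                    xlab = "Month",
                    ylab = "Proportion of births",
                    col = c("firebrick", "grey60"),
                    legend.text = c("NHL players", "General population"),
                    args.legend = list(x = "topright", bty = "n"))
```

`barplot` returns the bar midpoints (stored here in `bar.mids`), which is handy if you later want to add error bars or text above specific bars.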

The proportion of births in each month for 1267 NHL players who were active in 2006, and for 2004878 births among the general population between the years 2000 and 2005


The proportion of births per month varies noticeably in the general Canadian population. A higher proportion of babies are born in the summer months (wise parental planning in Canada!).

We also see that the proportion of NHL players born early in the year is higher than the proportion of the general population born early in the year; this trend reverses for the second half of the year.

Conduct the \(\chi^2\) test

Our sample is the same set of NHL births used before, and the expected counts under this null hypothesis are all well above 5 (see the output below), so the assumptions of the \(\chi^2\) test are met. Similarly, our degrees of freedom and critical value do not change.
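As a quick reminder of where those numbers come from, the degrees of freedom and the critical value can be recomputed as in the sketch below. The variable names are illustrative.

```r
alpha <- 0.05         # significance level chosen above
n.categories <- 12    # one category per month

# No parameters are estimated from the sample, so df = categories - 1.
deg.free <- n.categories - 1

# Critical value: the chi-square quantile with area alpha in the upper tail.
crit.value <- qchisq(1 - alpha, df = deg.free)
crit.value  # about 19.68
```

A test statistic larger than this value leads to rejecting the null hypothesis at the 0.05 level.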

Use the vector of NHL players born and the proportion of births in the general population to perform the \(\chi^2\) test:
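A sketch of the test using base R's `chisq.test`. The output printed on this page has a different layout, so the tutorial evidently used another function; base R is swapped in here because it needs no extra packages. The counts and proportions are typed in from the output on this page, and `nhl.counts` and `gof.result` are assumed names.

```r
# Observed NHL births per month, January to December.
nhl.counts <- c(141, 128, 122, 120, 117, 122, 95, 91, 80, 89, 74, 88)

# Expected proportions under H0: general population births per month.
can.birth.prop <- c(0.08037945, 0.0757273, 0.08599975, 0.08469343,
                    0.08801284, 0.08537627, 0.08809414, 0.08648756,
                    0.08634740, 0.08327190, 0.07795188, 0.07765809)

# rescale.p = TRUE rescales the proportions to sum to exactly 1; the values
# above were copied from rounded printed output, so their sum is only
# approximately 1 and chisq.test could otherwise reject them.
gof.result <- chisq.test(x = nhl.counts, p = can.birth.prop, rescale.p = TRUE)

gof.result$statistic   # chi-square test statistic
gof.result$parameter   # degrees of freedom
gof.result$p.value     # P-value
```

When the proportions are computed directly from the counts in the data frame, they sum to 1 and `rescale.p` is not needed.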

## Chi-squared test for given probabilities 
## 
##   Observed counts Expected by Null Contr to chisq stat
## A             141           101.84               15.06
## B             128            95.95               10.71
## C             122           108.96                1.56
## D             120           107.31                1.50
## E             117           111.51                0.27
## F             122           108.17                1.77
## G              95           111.62                2.47
## H              91           109.58                3.15
## I              80           109.40                7.90
## J              89           105.51                2.58
## K              74            98.77                6.21
## L              88            98.39                1.10
## 
## 
## Chi-Square Statistic = 54.2804 
## Degrees of Freedom of the table = 11 
## P-Value = 0
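The columns of the table above can also be recomputed by hand, which makes it clear which months drive the test statistic. This is a sketch with assumed object names, again using values typed in from the output on this page.

```r
nhl.counts <- c(141, 128, 122, 120, 117, 122, 95, 91, 80, 89, 74, 88)
can.birth.prop <- c(0.08037945, 0.0757273, 0.08599975, 0.08469343,
                    0.08801284, 0.08537627, 0.08809414, 0.08648756,
                    0.08634740, 0.08327190, 0.07795188, 0.07765809)

# Expected count per month = total NHL births x null proportion
# (dividing by the sum guards against rounding in the copied proportions).
expected <- sum(nhl.counts) * can.birth.prop / sum(can.birth.prop)

# Contribution of each month to the statistic: (O - E)^2 / E.
contribution <- (nhl.counts - expected)^2 / expected

data.frame(month = month.abb,
           observed = nhl.counts,
           expected = round(expected, 2),
           contribution = round(contribution, 2))

# The test statistic is the sum of the contributions; the P-value is the
# upper-tail area of the chi-square distribution with 11 df.
chisq.stat <- sum(contribution)
p.value <- pchisq(chisq.stat, df = length(nhl.counts) - 1, lower.tail = FALSE)
```

The printed "P-Value = 0" is a rounded value; the P-value is extremely small but not exactly zero, which is why it is reported as P < 0.001.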

Concluding statement

The probability of birth in each month for NHL players differs from the probabilities observed in the general population (\(\chi^2\) goodness of fit test; \(\chi^2\) = 54.28; df = 11; P < 0.001). Based on our Figure 3, we see that the proportion of births was substantially higher than expected in January and February, and substantially lower than expected in September and November. More generally, there tends to be a higher proportion of births in the first half of the year compared to the second half.