This page was last updated on October 04, 2019.
nhlbirths <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/nhlbirths.csv"), header = TRUE)
The expected proportions for each category can be defined in any manner, depending on the null hypothesis.
Test the following null and alternative hypotheses:
H0: The probability of an NHL player being born in any given month is equal to the proportion of births in the general population that occur in each month.
HA: The probability of an NHL player being born in any given month is not equal to the proportion of births in the general population that occur in each month.
Use the frequency data within the canada_births_2000_2005 variable in the nhlbirths data frame to calculate the expected proportions.
Create a vector to store the proportion of general population births in each month; divide the general population births in each month by the sum of general population births over the year. Do the same for the NHL population.
nhlbirths$can.birth.prop <- nhlbirths$canada_births_2000_2005 / (sum(nhlbirths$canada_births_2000_2005))
nhlbirths$nhl.birth.prop <- nhlbirths$num_players_born / (sum(nhlbirths$num_players_born))
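As a quick sanity check on the two calculations above, each vector of proportions should sum to 1. The sketch below is self-contained: rather than downloading the file, it re-enters the monthly NHL counts as they appear in the test output later on this page (January to December). The name nhl.prop is used only for illustration. It also shows base R's prop.table function, which performs the same division in one step.

```r
# Monthly NHL birth counts (Jan-Dec), copied from the test output on this page
num_players_born <- c(141, 128, 122, 120, 117, 122, 95, 91, 80, 89, 74, 88)

# Proportion in each month = count in that month / total count
nhl.prop <- num_players_born / sum(num_players_born)

sum(num_players_born)   # total number of players
sum(nhl.prop)           # a valid set of proportions sums to 1

# prop.table() performs the same division
all.equal(nhl.prop, prop.table(num_players_born))
```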
For each month we now have the observed proportion of NHL player births and the proportion of births in the general population. How should we visualize these data? A grouped bar graph is one useful way, but rather than showing frequencies, we’ll show proportions.
NOTE: In the Associations between two variables tutorial you learned how to create a grouped bar graph from raw (long format) data. Here we have summary data in the nhlbirths data frame. You have not been taught how to use such data for grouped bar graphs, so the code is provided below for future reference.
The familiar barplot function is used, but we need to manipulate our data frame a bit in order to produce the desired graph.
In short, we need to produce a matrix (rather than a data frame) that includes 2 rows (one for each vector of proportions) and 12 columns (one for each month). For this, we need to recall some fundamental R code we learned in the introductory R tutorials, and we also need to introduce the as.matrix function, which coerces an object into a matrix class, and the t function, which stands for “transpose”:
nhl.proportion.matrix <- t(as.matrix(nhlbirths[, c("nhl.birth.prop", "can.birth.prop")]))
nhl.proportion.matrix
##                      [,1]      [,2]       [,3]       [,4]       [,5]
## nhl.birth.prop 0.11128650 0.1010260 0.09629045 0.09471192 0.09234412
## can.birth.prop 0.08037945 0.0757273 0.08599975 0.08469343 0.08801284
##                      [,6]       [,7]       [,8]       [,9]      [,10]
## nhl.birth.prop 0.09629045 0.07498027 0.07182320 0.06314128 0.07024467
## can.birth.prop 0.08537627 0.08809414 0.08648756 0.08634740 0.08327190
##                     [,11]      [,12]
## nhl.birth.prop 0.05840568 0.06945541
## can.birth.prop 0.07795188 0.07765809
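To see what as.matrix and t are doing, it can help to check the dimensions at each step. The sketch below uses a small hypothetical data frame (called toy, with three made-up rows) in place of nhlbirths, so the numbers themselves are illustrative only.

```r
# Hypothetical three-month data frame standing in for nhlbirths (made-up values)
toy <- data.frame(month = c("Jan", "Feb", "Mar"),
                  nhl.birth.prop = c(0.5, 0.3, 0.2),
                  can.birth.prop = c(0.4, 0.3, 0.3))

# Keep only the two proportion columns and coerce to a matrix:
# 3 rows (months) by 2 columns (groups)
m <- as.matrix(toy[, c("nhl.birth.prop", "can.birth.prop")])
dim(m)

# Transpose so each row is one group and each column is one month:
# 2 rows (groups) by 3 columns (months)
tm <- t(m)
dim(tm)
rownames(tm)
```

With beside = TRUE, barplot draws one cluster of bars per matrix column and one bar per row within each cluster, which is why the months need to be in the columns.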
Now we can use the ‘barplot’ function as follows:
barplot(nhl.proportion.matrix, beside = TRUE,
names.arg = nhlbirths$month,
legend.text = c("NHL births", "General pop births"),
las = 2)
The proportion of births in each month for 1,267 NHL players who were active in 2006, and for 2,004,878 births among the general population between the years 2000 and 2005.
Notice that the proportion of births per month varies in the general Canadian population. A higher proportion of babies are born in the summer months - wise parental planning in Canada!
We also see that the proportion of NHL players born early in the year is higher than the proportion of the general population born early in the year; this trend reverses for the second half of the year.
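The reversal described above can be checked numerically by subtracting the general-population proportion from the NHL proportion in each month. This sketch re-enters the proportions from the matrix printed earlier, rounded to four decimal places, so that it runs on its own. It assumes the rows of nhlbirths run from January to December; month.abb is R's built-in vector of month abbreviations, and prop.diff is a name chosen here for illustration.

```r
# Proportions copied from the printed matrix, rounded to 4 decimal places (Jan-Dec)
nhl.prop <- c(0.1113, 0.1010, 0.0963, 0.0947, 0.0923, 0.0963,
              0.0750, 0.0718, 0.0631, 0.0702, 0.0584, 0.0695)
can.prop <- c(0.0804, 0.0757, 0.0860, 0.0847, 0.0880, 0.0854,
              0.0881, 0.0865, 0.0863, 0.0833, 0.0780, 0.0777)

# Positive values: NHL players over-represented relative to the general population
prop.diff <- nhl.prop - can.prop
names(prop.diff) <- month.abb
round(prop.diff, 3)

# Is the difference positive in every month of the first half of the year,
# and negative in every month of the second half?
all(prop.diff[1:6] > 0)
all(prop.diff[7:12] < 0)
```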
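Our observed counts are the same NHL births as used before, and the expected frequencies under the new null hypothesis are all still well above 5, so the assumptions of the \(\chi^2\) test are met. The number of categories is unchanged, so our degrees of freedom and critical value do not change either.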
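These points can be checked directly. The sketch below re-enters the general-population proportions from the printed matrix so that it runs on its own, and it assumes a significance level of 0.05 (an assumption made here for illustration; use whatever level was set earlier in the tutorial). The names n.players, expected and alpha are likewise illustrative.

```r
# General-population proportions (Jan-Dec), copied from the printed matrix
can.prop <- c(0.08037945, 0.0757273, 0.08599975, 0.08469343,
              0.08801284, 0.08537627, 0.08809414, 0.08648756,
              0.08634740, 0.08327190, 0.07795188, 0.07765809)
n.players <- 1267   # total number of NHL players in the sample

# Expected frequency in each month under the null hypothesis
expected <- n.players * can.prop
min(expected)        # smallest expected frequency
all(expected >= 5)   # all expected frequencies are at least 5

# Degrees of freedom: number of categories minus 1
df <- length(can.prop) - 1
df

# Critical value of the chi-squared distribution at the assumed alpha
alpha <- 0.05        # assumed significance level
qchisq(1 - alpha, df = df)
```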
Use the vector of NHL players born and the proportion of births in the general population to perform the \(\chi^2\) test:
nhl.vs.gen <- chisqtestGC(x = nhlbirths$num_players_born,
p = nhlbirths$can.birth.prop,
graph = FALSE
)
nhl.vs.gen
## Chi-squared test for given probabilities
##
## Observed counts Expected by Null Contr to chisq stat
## A 141 101.84 15.06
## B 128 95.95 10.71
## C 122 108.96 1.56
## D 120 107.31 1.50
## E 117 111.51 0.27
## F 122 108.17 1.77
## G 95 111.62 2.47
## H 91 109.58 3.15
## I 80 109.40 7.90
## J 89 105.51 2.58
## K 74 98.77 6.21
## L 88 98.39 1.10
##
##
## Chi-Square Statistic = 54.2804
## Degrees of Freedom of the table = 11
## P-Value = 0
The probability of birth in each month for NHL players differs from the probabilities observed in the general population (\(\chi^2\) goodness of fit test; \(\chi^2\) = 54.28; df = 11; P < 0.001). Based on Figure 3, we see that the proportion of NHL player births was substantially higher than expected in January and February, and substantially lower than expected in September and November. More generally, there tends to be a higher proportion of births in the first half of the year compared to the second half.
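The statistic reported by chisqtestGC can also be reproduced by hand, which makes it clear that the printed P-value of 0 is a rounded value and not an exact zero. The following is a minimal sketch in base R: it re-enters the observed counts and the general-population proportions from the output on this page, and the names observed, contrib, chi.stat and p.value are chosen here for illustration. Because the proportions are copied from rounded output, the result may differ very slightly from 54.2804.

```r
# Observed NHL births (Jan-Dec) and general-population proportions,
# both copied from the output shown on this page
observed <- c(141, 128, 122, 120, 117, 122, 95, 91, 80, 89, 74, 88)
can.prop <- c(0.08037945, 0.0757273, 0.08599975, 0.08469343,
              0.08801284, 0.08537627, 0.08809414, 0.08648756,
              0.08634740, 0.08327190, 0.07795188, 0.07765809)

# Rescale so the copied (rounded) proportions sum to exactly 1
can.prop <- can.prop / sum(can.prop)

# Expected counts under the null, and each month's contribution to the statistic
expected <- sum(observed) * can.prop
contrib <- (observed - expected)^2 / expected
round(contrib, 2)

# Test statistic, degrees of freedom, and upper-tail P-value
chi.stat <- sum(contrib)
df <- length(observed) - 1
p.value <- pchisq(chi.stat, df = df, lower.tail = FALSE)
chi.stat
p.value
```

Base R's chisq.test(x = observed, p = can.prop) performs the same goodness of fit test and can serve as a further cross-check.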