---
title: "17. Chi-square goodness-of-fit test"
author: "Put your name here"
date: "Put the date here"
output:
  html_notebook:
    toc: yes
    toc_float: yes
---

`r options(scipen=999)`


::: {.summary}

### Functions introduced in this chapter:

`chisq.test`

:::

## Introduction

In this assignment we will learn how to run the chi-square goodness-of-fit test. A chi-square goodness-of-fit test is similar to a test for a single proportion except, instead of two categories (success/failure), we now try to understand the distribution among three or more categories.

### Install new packages

There are no new packages used in this chapter.

### Download the R notebook file

Check the upper-right corner in RStudio to make sure you're in your `intro_stats` project. Then click on the following link to download this chapter as an R notebook file (`.Rmd`). Once the file is downloaded, move it to your project folder in RStudio and open it there.

### Restart R and run all chunks

In RStudio, select "Restart R and Run All Chunks" from the "Run" menu.

## Load packages

We load the standard `tidyverse`, `janitor`, and `infer` packages and the `openintro` package for the `hsb2` data.

```{r}
library(tidyverse)
library(janitor)
library(infer)
library(openintro)
```

## Research question

We use the classic data set `mtcars`, from a 1974 issue of *Motor Trend* magazine, to examine the distribution of the number of engine cylinders (with values 4, 6, or 8). We'll assume that this data set is representative of all cars from 1974.

In recent years, 4-cylinder vehicles and 6-cylinder vehicles have each comprised about 38% of the market, with nearly all the rest (24%) being 8-cylinder cars. (This ignores a very small number of cars manufactured with 3- or 5-cylinder engines.) Were car engines in 1974 manufactured according to the same distribution?

Here is the structure of the data:

```{r}
glimpse(mtcars)
```

Note that the variable of interest, `cyl`, is not coded as a factor variable. Let's convert `cyl` to a factor variable first and add it to a new data frame called `mtcars2`. (Since the levels are already called 4, 6, and 8, we do not need to specify `levels` or `labels`.) Be sure to remember to use `mtcars2` from here on out, and not the original `mtcars`.

```{r}
mtcars2 <- mtcars %>%
    mutate(cyl_fct = factor(cyl))
mtcars2
```

```{r}
glimpse(mtcars2)
```

## Chi-squared

When we have three or more categories in a categorical variable, it is natural to ask how the observed counts in each category compare to the counts we would expect to see under the assumption of some null hypothesis. In other words, we're assuming that there is some "true" distribution to which we are going to compare our data. Sometimes this null comes from substantive expert knowledge. (For example, we will be comparing the 1974 distribution to a known distribution from recent years.) Sometimes we're interested in seeing whether our data deviate from a null distribution that predicts an equal number of observations in each category.

First of all, what is the actual distribution of cylinders in our data? Here's a frequency table.

```{r}
tabyl(mtcars2, cyl_fct) %>%
    adorn_totals() %>%
    adorn_pct_formatting()
```

The counts of our frequency table are the "observed" values, usually denoted by the letter $O$ (uppercase "O", which is a little unfortunate, because it also looks like a zero).

What are the expected counts? Well, since there are 32 cars, we need to multiply 32 by the percentages listed in the research question. If the distribution of engines in 1974 were the same as today, we would expect to see $32 * 0.38$, or about 12.2, 4-cylinder cars in our sample, and the same for 6-cylinder cars. For 8-cylinder cars, we expect $32 * 0.24$, or about 7.7, cars in our sample to have 8-cylinder engines. These "expected" counts are usually denoted by the letter $E$.
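If you'd rather let R do this arithmetic, a quick chunk like the one below works. (This is just a sanity check of the multiplication, not part of the formal analysis.)

```{r}
# Expected counts under the null: sample size times the null proportions
32 * c("4" = 0.38, "6" = 0.38, "8" = 0.24)
```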
Why aren't the expected counts whole numbers? In any given data set, of course, we will see a whole number of cars with 4, 6, or 8 cylinders. However, since we're looking only at expected counts, they are the average over lots of possible sets of 32 cars under the assumption of the null. We don't need these averages to be whole numbers.

How should the deviation between the data and the null distribution be measured? We could simply look at the difference between the observed counts and the expected counts, $O - E$. However, there will be some positive values (cells where we have more than the expected number of cars) and some negative values (cells where we have fewer than the expected number of cars). These will all cancel out. If this sounds vaguely familiar, it is because we encountered the same problem with the formula for the standard deviation. The differences $y - \bar{y}$ had the same issue. Do you recall the solution in that case? It was to square these values, making them all positive. So instead of $O - E$, we will consider $(O - E)^{2}$.

Finally, to make sure that cells with large expected values don't dominate, we divide by $E$:

$$
\frac{(O - E)^{2}}{E}.
$$

This puts each cell on equal footing.

Now that we have a reasonable measure of the deviation between observed and expected counts for each cell, we define $\chi^{2}$ ("chi-squared", pronounced "kye-squared"---rhymes with "die-scared", or if that's too dark, how about "pie-shared"^[Rhyming is fun!]) as the sum of all these fractions, one for each cell:

$$
\chi^{2} = \sum \frac{(O - E)^{2}}{E}.
$$

A $\chi^{2}$ value of zero would indicate perfect agreement between observed and expected values. As the $\chi^{2}$ value gets larger and larger, this indicates more and more deviation between observed and expected values.

As an example, for our data, we calculate chi-squared as follows:

$$
\chi^{2} = \frac{(11 - 12.2)^{2}}{12.2} + \frac{(7 - 12.2)^{2}}{12.2} + \frac{(14 - 7.7)^{2}}{7.7} \approx 7.5.
$$
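The formula is short enough that we can also code it directly as a sanity check of the hand calculation. (The observed counts 11, 7, and 14 come from the frequency table above.)

```{r}
# Chi-squared by hand: sum of (O - E)^2 / E over the three cells
observed <- c(11, 7, 14)
expected <- 32 * c(0.38, 0.38, 0.24)
sum((observed - expected)^2 / expected)
```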
Or we could just do it in R with the `infer` package. To do so, we have to state explicitly the proportions that correspond to the null hypothesis. In this case, since the order of entries in the frequency table is 4-cylinder, 6-cylinder, then 8-cylinder, we need to give `infer` a vector of entries `c("4" = 0.38, "6" = 0.38, "8" = 0.24)` that represents the 38%, 38%, and 24% expected for 4, 6, and 8 cylinders, respectively.

```{r}
obs_chisq <- mtcars2 %>%
    specify(response = cyl_fct) %>%
    hypothesize(null = "point",
                p = c("4" = 0.38, "6" = 0.38, "8" = 0.24)) %>%
    calculate(stat = "chisq")
obs_chisq
```

## The chi-square distribution

We know that even if the true distribution were 38%, 38%, 24%, we would not see exactly 12.2, 12.2, and 7.7 cars in a sample of 32. (In fact, seeing those exact counts is physically impossible because they are not whole numbers!) So what kinds of numbers could we get? Let's do a quick simulation to find out.

Under the assumption of the null, there should be a 38%, 38%, and 24% chance of seeing 4, 6, or 8 cylinders, respectively. To get a sense of the extent of sampling variability, we can use the `sample` command to see what happens in a sample of size 32 taken from a population where the true percentages are 38%, 38%, and 24%.

```{r}
set.seed(99999)
sample1 <- sample(c(4, 6, 8), size = 32, replace = TRUE,
                  prob = c(0.38, 0.38, 0.24))
sample1
```

```{r}
sample1 %>%
    table()
```

```{r}
sample2 <- sample(c(4, 6, 8), size = 32, replace = TRUE,
                  prob = c(0.38, 0.38, 0.24))
sample2
```

```{r}
sample2 %>%
    table()
```

```{r}
sample3 <- sample(c(4, 6, 8), size = 32, replace = TRUE,
                  prob = c(0.38, 0.38, 0.24))
sample3
```

```{r}
sample3 %>%
    table()
```

We can calculate the chi-squared value for each of these samples to get a sense of the possibilities. The `chisq.test` command from base R is a little unusual because it requires a frequency table (generated from the `table` command) as input. For the most part, we won't use the `chisq.test` command directly because we will use `infer` to do this work (although we will return to `chisq.test` in the bonus section on residuals at the end of this chapter). But just to see some examples:

```{r}
sample1 %>%
    table() %>%
    chisq.test()
```

```{r}
sample2 %>%
    table() %>%
    chisq.test()
```

```{r}
sample3 %>%
    table() %>%
    chisq.test()
```

##### Exercise 1

Look more carefully at the three random samples above. Why does sample 1 have a chi-squared value closer to 0 while samples 2 and 3 have chi-squared values that are a little larger? (Hint: look at the counts of 4s, 6s, and 8s in those samples. How do those counts compare to the expected numbers of 4s, 6s, and 8s?)

::: {.answer}

Please write up your answer here.

:::

*****

The `infer` pipeline below (the `generate` command specifically) takes the values "4", "6", or "8" and grabs them at random according to the probabilities specified until it has 32 values. In other words, it will randomly select "4" about 38% of the time, "6" about 38% of the time, and "8" about 24% of the time, until it gets a list of 32 total cars. Then it will calculate the chi-squared value for that simulated set of 32 cars. But because randomness is involved, the simulated samples are subject to sampling variability, and the chi-squared values obtained will differ from each other. This is exactly what we did above with the `sample` and `chisq.test` commands, but the benefit now is that we get 1000 random samples very quickly.

```{r}
set.seed(99999)
cyl_test_sim <- mtcars2 %>%
    specify(response = cyl_fct) %>%
    hypothesize(null = "point",
                p = c("4" = 0.38, "6" = 0.38, "8" = 0.24)) %>%
    generate(reps = 1000, type = "draw") %>%
    calculate(stat = "chisq")
cyl_test_sim
```

The `stat` column above contains 1000 random values of $\chi^{2}$. Let's graph these values and include the chi-squared value for our actual data in the same graph.

```{r}
cyl_test_sim %>%
    visualize() +
    shade_p_value(obs_chisq, direction = "greater")
```

A few things are apparent:

1. The values are all positive. (The leftmost bar is sitting at 0, but it represents values greater than zero.) This makes sense when you remember that each piece of the $\chi^{2}$ calculation was positive. This is different from our earlier simulations that looked like normal models. (Z scores can be positive or negative, but not $\chi^{2}$.)
2. This is a severely right-skewed graph. Although most values are near zero, the occasional unusual sample can have a large value of $\chi^{2}$.
3. You can see that our sample (the red line) is pretty far to the right. It is an unusual value given the assumption of the null hypothesis.

In fact, we can count the proportion of simulated values that are to the right of the red line:

```{r}
cyl_test_sim %>%
    get_p_value(obs_chisq, direction = "greater")
```

This is the simulated P-value. Keep this number in mind when we calculate the P-value using a sampling distribution model below.
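By the way, `get_p_value` with `direction = "greater"` is essentially just counting: it reports the proportion of simulated statistics at least as large as the observed one. A minimal sketch of that count, using the objects defined above:

```{r}
# Fraction of the 1000 simulated chi-squared values that meet or
# exceed the observed chi-squared value
mean(cyl_test_sim$stat >= obs_chisq$stat)
```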
## Chi-square as a sampling distribution model

Just like there was a mathematical model for our simulated data before (the normal model back then), there is also a mathematical model for this type of simulated data. It's called (not surprisingly) the *chi-square distribution*.

There is one new idea, though. Although all normal models have the same bell shape, there are many different chi-square models. This is because the number of cells can change the sampling distribution. Our engine cylinder example has three cells (corresponding to the categories "4", "6", and "8"). But what if there were 10 categories? The shape of the chi-square model would be different.

The terminology used by statisticians to distinguish these models is *degrees of freedom*, abbreviated $df$. The reason for this name and the mathematics behind it are somewhat technical. Suffice it to say for now that if there are $c$ cells, you use $c - 1$ degrees of freedom. For our car example, there are 3 cylinder categories, so $df = 2$.

Look at the graph below that shows the theoretical chi-square models for varying degrees of freedom.

```{r}
# Don't worry about the syntax here.
# You won't need to know how to do this on your own.
ggplot(data.frame(x = c(0, 20)), aes(x)) +
    stat_function(fun = dchisq, args = list(df = 2), aes(color = "2")) +
    stat_function(fun = dchisq, args = list(df = 5), aes(color = "5")) +
    stat_function(fun = dchisq, args = list(df = 10), aes(color = "10")) +
    scale_color_manual(name = "df",
                       values = c("2" = "red", "5" = "blue", "10" = "green"),
                       breaks = c("2", "5", "10"))
```

The red curve (corresponding to $df = 2$) looks a lot like our simulation above. But as the degrees of freedom increase, the mode shifts further to the right.

## Chi-square goodness-of-fit test

The formal inferential procedure for examining whether data from a categorical variable fits a proposed distribution in the population is called a *chi-square goodness-of-fit test*. We can use the chi-square model as the sampling distribution as long as the sample size is large enough. This is checked by verifying that the expected cell counts (not the observed cell counts!) are at least 5 in each cell.

The following `infer` pipeline will run a hypothesis test using the theoretical chi-square distribution with 2 degrees of freedom.

```{r}
cyl_test <- mtcars2 %>%
    specify(response = cyl_fct) %>%
    assume(distribution = "chisq")
cyl_test
```

Here is the theoretical distribution:

```{r}
cyl_test %>%
    visualize()
```

And here it is with our test statistic (the chi-squared value for our observed data) marked:

```{r}
cyl_test %>%
    visualize() +
    shade_p_value(obs_chisq, direction = "greater")
```

Finally, here is the P-value associated with the shaded area to the right of the test statistic:

```{r}
cyl_test %>%
    get_p_value(obs_chisq, direction = "greater")
```

Note that this P-value is quite similar to the P-value derived from the simulation earlier.
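As an aside, the same theoretical P-value can be obtained from base R's `pchisq` function, which computes areas under a chi-square curve. A quick sketch, plugging in our test statistic and 2 degrees of freedom:

```{r}
# Area to the right of the observed chi-squared value under a
# chi-square distribution with 2 degrees of freedom
pchisq(7.5011, df = 2, lower.tail = FALSE)
```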
We'll walk through the engine cylinder example from top to bottom using the rubric. Most of this is just repeating work we've already done, but showing this work in the context of the rubric will help you as you take over in the "Your Turn" section later.

## Exploratory data analysis

### Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure.

Type `?mtcars` at the Console to read the help file. *Motor Trend* is a reputable publication and, therefore, we do not doubt the accuracy of the data. It's not clear, however, why these specific 32 cars were chosen and whether they reflect a representative sample of cars on the road in 1974.

```{r}
mtcars
```

```{r}
glimpse(mtcars)
```

### Prepare the data for analysis.

```{r}
# Although we've already done this above,
# we include it here again for completeness.
mtcars2 <- mtcars %>%
    mutate(cyl_fct = factor(cyl))
mtcars2
```

```{r}
glimpse(mtcars2)
```

### Make tables or plots to explore the data visually.

```{r}
tabyl(mtcars2, cyl_fct) %>%
    adorn_totals() %>%
    adorn_pct_formatting()
```

## Hypotheses

### Identify the sample (or samples) and a reasonable population (or populations) of interest.

The sample is a set of `r NROW(mtcars2)` cars from a 1974 issue of *Motor Trend* magazine. The population is all cars from 1974.

### Express the null and alternative hypotheses as contextually meaningful full sentences.

$H_{0}:$ In 1974, the proportion of cars with 4, 6, and 8 cylinders was 38%, 38%, and 24%, respectively.

$H_{A}:$ In 1974, the proportion of cars with 4, 6, and 8 cylinders was not 38%, 38%, and 24%.

### Express the null and alternative hypotheses in symbols (when possible).

$H_{0}: p_{4} = 0.38, p_{6} = 0.38, p_{8} = 0.24$

There is no easy way to express the alternative hypothesis in symbols because any deviation in any of the categories can lead to rejection of the null. You can't just say $p_{4} \neq 0.38, p_{6} \neq 0.38, p_{8} \neq 0.24$ because one of these categories might have the correct proportion while the other two differ, and that would still be consistent with the alternative hypothesis. So the only requirement here is to express the null in symbols.

## Model

### Identify the sampling distribution model.

We use a $\chi^{2}$ model with 2 degrees of freedom.

Commentary: Unlike the normal model, there are infinitely many different $\chi^{2}$ models, so you have to specify the degrees of freedom when you identify it as the sampling distribution model.

### Check the relevant conditions to ensure that model assumptions are met.

* Random - We do not know how *Motor Trend* magazine sampled these 32 cars, so we're not sure if this list is random or representative of all cars from 1974. We should be cautious in our conclusions.
* 10% - As long as there are at least 320 different car models, we are okay. This sounds like a lot, so this condition might not quite be met. Again, we need to be careful. (Also note that the population is not all automobiles manufactured in 1974. It is all *types* of automobile manufactured in 1974. There's a big difference.)
* Expected cell counts - This condition says that under the null, we should expect to see at least 5 cars in each category. The expected counts are $32(0.38) = 12.2$, $32(0.38) = 12.2$, and $32(0.24) = 7.7$. So this condition is met.

Commentary: The expected counts condition is necessary for using the theoretical chi-square distribution. If we were using simulation instead, we would not need this condition.

## Mechanics

### Compute the test statistic.

```{r}
obs_chisq <- mtcars2 %>%
    specify(response = cyl_fct) %>%
    hypothesize(null = "point",
                p = c("4" = 0.38, "6" = 0.38, "8" = 0.24)) %>%
    calculate(stat = "chisq")
obs_chisq
```

### Report the test statistic in context (when possible).

The value of $\chi^{2}$ is `r obs_chisq %>% pull(1)`.

Commentary: The $\chi^{2}$ test statistic is, of course, the same value we computed by hand earlier.
Also, the formula for $\chi^{2}$ is a complicated function of observed and expected values, making it difficult to say anything about this number in the context of cars and engine cylinders. So even though the requirement is to "report the test statistic in context," there's not much one can say here other than just to report the test statistic.

### Plot the null distribution.

```{r}
cyl_test <- mtcars2 %>%
    specify(response = cyl_fct) %>%
    assume(distribution = "chisq")
cyl_test
```

```{r}
cyl_test %>%
    visualize() +
    shade_p_value(obs_chisq, direction = "greater")
```

Commentary: We will use the theoretical chi-square distribution here since the expected cell counts condition is met.

### Calculate the P-value.

```{r}
cyl_test_p <- cyl_test %>%
    get_p_value(obs_chisq, direction = "greater")
cyl_test_p
```

### Interpret the P-value as a probability given the null.

The P-value is `r cyl_test_p`. If the true distribution of cars in 1974 were 38% 4-cylinder, 38% 6-cylinder, and 24% 8-cylinder, there would be a `r 100 * cyl_test_p`% chance of seeing data at least as extreme as what we saw.

## Conclusion

### State the statistical conclusion.

The P-value is less than 0.05, so we reject the null.

### State (but do not overstate) a contextually meaningful conclusion.

There is sufficient evidence that in 1974, the distribution of cars was not 38% 4-cylinder, 38% 6-cylinder, and 24% 8-cylinder.

### Express reservations or uncertainty about the generalizability of the conclusion.

As long as we restrict our attention to cars in 1974, we are pretty safe, although we are still uncertain if the sample we had was representative of all cars in 1974.

### Identify the possibility of either a Type I or Type II error and state what making such an error means in the context of the hypotheses.

Because we rejected the null, the only error we could have made is a Type I error. If we made a Type I error, that would mean the true distribution of cars in 1974 was 38% 4-cylinder, 38% 6-cylinder, and 24% 8-cylinder, but our sample showed otherwise.

## Confidence interval

There is no confidence interval for a chi-square test. Since our test is not about measuring some parameter of interest (like $p$ or $p_{1} - p_{2}$), there is no interval to produce.

## Your turn

Use the `hsb2` data to determine whether the proportion of high school students who attend general programs, academic programs, and vocational programs is 15%, 60%, and 25%, respectively.

The rubric outline is reproduced below. You may refer to the worked example above and modify it accordingly. Remember to strip out all the commentary. That is just exposition for your benefit in understanding the steps, but is not meant to form part of the formal inference process.

Another word of warning: the copy/paste process is not a substitute for your brain. You will often need to modify more than just the names of the data frames and variables to adapt the worked examples to your own work. Do not blindly copy and paste code without understanding what it does. And you should **never** copy and paste text. All the sentences and paragraphs you write are expressions of your own analysis. They must reflect your own understanding of the inferential process.

**Also, so that your answers here don't mess up the code chunks above, use new variable names everywhere.**

##### Exploratory data analysis

###### Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure.

::: {.answer}

Please write up your answer here.

```{r}
# Add code here to print the data
```

```{r}
# Add code here to glimpse the variables
```

:::

###### Prepare the data for analysis. [Not always necessary.]
::: {.answer}

```{r}
# Add code here to prepare the data for analysis.
```

:::

###### Make tables or plots to explore the data visually.

::: {.answer}

```{r}
# Add code here to make tables or plots.
```

:::

##### Hypotheses

###### Identify the sample (or samples) and a reasonable population (or populations) of interest.

::: {.answer}

Please write up your answer here.

:::

###### Express the null and alternative hypotheses as contextually meaningful full sentences.

::: {.answer}

$H_{0}:$ Null hypothesis goes here.

$H_{A}:$ Alternative hypothesis goes here.

:::

###### Express the null and alternative hypotheses in symbols (when possible).

::: {.answer}

$H_{0}: math$

$H_{A}: math$

:::

##### Model

###### Identify the sampling distribution model.

::: {.answer}

Please write up your answer here.

:::

###### Check the relevant conditions to ensure that model assumptions are met.

::: {.answer}

Please write up your answer here. (Some conditions may require R code as well.)

:::

##### Mechanics

###### Compute the test statistic.

::: {.answer}

```{r}
# Add code here to compute the test statistic.
```

:::

###### Report the test statistic in context (when possible).

::: {.answer}

Please write up your answer here.

:::

###### Plot the null distribution.

::: {.answer}

```{r}
# Add code here to plot the null distribution.
```

:::

###### Calculate the P-value.

::: {.answer}

```{r}
# Add code here to calculate the P-value.
```

:::

###### Interpret the P-value as a probability given the null.

::: {.answer}

Please write up your answer here.

:::

##### Conclusion

###### State the statistical conclusion.

::: {.answer}

Please write up your answer here.

:::

###### State (but do not overstate) a contextually meaningful conclusion.

::: {.answer}

Please write up your answer here.

:::

###### Express reservations or uncertainty about the generalizability of the conclusion.

::: {.answer}

Please write up your answer here.

:::

###### Identify the possibility of either a Type I or Type II error and state what making such an error means in the context of the hypotheses.

::: {.answer}

Please write up your answer here.

:::

## Bonus section: residuals

The chi-square test can tell us if there is some difference from the expected distribution of counts across the categories, but it doesn't tell us which category has a higher or lower count than expected. For that, we'll need to turn to another tool: *residuals*.

For technical reasons, the `infer` package doesn't provide residuals, so we'll have to turn to slightly different tools. Here's how this works; we'll return to the example of the distribution of cars across the different categories of number of cylinders.

The function we'll use is called `chisq.test`. It requires us to give it input in the form of a table of counts, together with the proportions we wish to compare to:

```{r}
cyl_chisq.test <- table(mtcars2$cyl_fct) %>%
    chisq.test(p = c(0.38, 0.38, 0.24))
cyl_chisq.test
```

Notice that the chi-squared value 7.5011 and the P-value 0.0235 are the same as those we calculated using `infer` tools above.

Here's how to obtain the table of residuals:

```{r}
cyl_chisq.test$residuals
```

What do these numbers mean in the real world? Not much. (Essentially, they are the values that were squared to become the individual cell contributions to the overall chi-squared score of the table.)
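In fact, we can verify that parenthetical claim directly: squaring the residuals and summing them recovers the chi-squared statistic. (A quick check, using the `cyl_chisq.test` object from above.)

```{r}
# Each residual is (O - E) / sqrt(E), so the squared residuals
# sum to the chi-squared statistic
sum(cyl_chisq.test$residuals^2)
```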
What we'll do with them is look for the most positive and most negative values.

- We see that the 8-cylinder category has the most positive value: this means that the number of 8-cylinder cars in 1974 was substantially *higher* than we expected.
- We see that the 6-cylinder category has the most negative value: this means that the number of 6-cylinder cars in 1974 was substantially *lower* than we expected.

### Your turn

Determine which of the high school program types is the most substantially overrepresented and which is the most substantially underrepresented, according to our hypothesized distribution.

::: {.answer}

```{r}
# Add code here to produce the chisq.test result.
# Add code here to examine the residuals.
```

Please write up your answer here.

:::

## Conclusion

When a categorical variable has three or more categories, we can run a chi-square goodness-of-fit test to determine if the distribution of counts across those categories matches some pre-specified null hypothesis. The key new mathematical tools we need are the chi-squared statistic, which measures the deviation between the observed counts and the counts expected under the null, and the chi-square distribution, which models how that statistic behaves under the null.

### Preparing and submitting your assignment

1. From the "Run" menu, select "Restart R and Run All Chunks".
2. Deal with any code errors that crop up. Repeat steps 1--2 until there are no more code errors.
3. Spell check your document by clicking the icon with "ABC" and a check mark.
4. Hit the "Preview" button one last time to generate the final draft of the `.nb.html` file.
5. Proofread the HTML file carefully. If there are errors, go back and fix them, then repeat steps 1--5 again.

If you have completed this chapter as part of a statistics course, follow the directions you receive from your professor to submit your assignment.