--- title: "Multiway Frequency Tables" --- ## Packages ```{r} library(tidyverse ) ``` ## Multi-way frequency analysis - A study of gender and eyewear-wearing finds the following frequencies: ``` gender contacts glasses none female 121 32 129 male 42 37 85 ``` - Is there association between eyewear and gender? - Normally answer this with chisquare test (based on observed and expected frequencies from null hypothesis of no association). - Two categorical variables and a frequency. - We assess in way that generalizes to more categorical variables. ## The data file ``` gender contacts glasses none female 121 32 129 male 42 37 85 ``` - This is *not tidy!* - Two variables are gender and *eyewear*, and those numbers all frequencies. ```{r bMultiway-1, message=F} my_url <- "http://ritsokiguess.site/datafiles/eyewear.txt" (eyewear <- read_delim(my_url, " ")) ``` ## Tidying the data ```{r bMultiway-2, size="footnotesize"} eyewear %>% pivot_longer(contacts:none, names_to="eyewear", values_to="frequency") -> eyes eyes ``` ## Making tidy data back into a table - use `pivot_wider` - which you can also use to reformat: ```{r bMultiway-3} eyes %>% pivot_wider(names_from = gender, values_from = frequency) ``` ## Modelling - Predict frequency from other factors and combos. - `glm` with `poisson` family. ```{r bMultiway-4 } eyes.1 <- glm(frequency ~ gender * eyewear, data = eyes, family = "poisson" ) ``` - Called **log-linear model**. ## What can we get rid of? \small ```{r bMultiway-5 } drop1(eyes.1, test = "Chisq") ``` nothing! ## Conclusions - `drop1` says what we can remove at this step. Significant = must stay. - Cannot remove anything. - Frequency depends on `gender-eye wear` *combination*, cannot be simplified further. - Gender and eyewear are *associated*. - For modelling, stop here. ## Making a graph ```{r} ggplot(eyes, aes(x = gender, y = frequency, fill = eyewear)) + geom_col(position = "fill") ``` ## Conclusions - Females are more likely to wear contacts than males are. - Females are *less* likely to wear glasses than males are. - The previous two comments are the reasons for the significant association. ## Code comments 1/2 - The code again: ```{r} #| eval: false ggplot(eyes, aes(x = gender, y = frequency, fill = eyewear)) + geom_col(position = "fill") ``` - Variation on two-variable bar chart that we saw in C32. - Comparing (most easily) *proportions*, so `fill` clearer than `dodge`. - Each row of dataframe represents many people (the number in `frequency`), so use `geom_col` rather than `geom_bar`. - `geom_col` takes a `y` that should be the frequency. ## Code comments 2/2 ```{r} #| eval: false ggplot(eyes, aes(x = gender, y = frequency, fill = eyewear)) + geom_col(position = "fill") ``` - Often in this work, one variable in association is explanatory rather than response. Have that as `x` (here `gender`); `eyewear` is response and goes in `fill`. - Interpretation: out of each category of explanatory ("out of females"), what proportion in each response category and where do they differ? ## No association - Suppose table had been as shown below: ```{r bMultiway-8, message=F} my_url <- "http://ritsokiguess.site/datafiles/eyewear2.txt" eyewear2 <- read_table(my_url) eyewear2 eyewear2 %>% pivot_longer(contacts:none, names_to = "eyewear", values_to = "frequency") -> eyes2 ``` ## Comments - Females and males wear contacts and glasses *in same proportions* - though more females and more contact-wearers. - No *association* between gender and eyewear. ## Analysis for revised data ```{r bMultiway-9} eyes.2 <- glm(frequency ~ gender * eyewear, data = eyes2, family = "poisson" ) drop1(eyes.2, test = "Chisq") ``` No longer any association. Take out interaction. ## No interaction ```{r bMultiway-10 } eyes.3 <- update(eyes.2, . ~ . - gender:eyewear) drop1(eyes.3, test = "Chisq") ``` - More females (gender effect) over all eyewear - fewer glasses-wearers (eyewear effect) over both genders - no association (no interaction). ## Graph shows no association ```{r} ggplot(eyes2, aes(x = gender, y = frequency, fill = eyewear)) + geom_col(position = "fill") ``` ## Chest pain, being overweight and being a smoker - In a hospital emergency department, 176 subjects who attended for acute chest pain took part in a study. - Each subject had a normal or abnormal electrocardiogram reading (ECG), were overweight (as judged by BMI) or not, and were a smoker or not. - How are these three variables related, or not? ## The data In modelling-friendly format: ``` ecg bmi smoke count abnormal overweight yes 47 abnormal overweight no 10 abnormal normalweight yes 8 abnormal normalweight no 6 normal overweight yes 25 normal overweight no 15 normal normalweight yes 35 normal normalweight no 30 ``` ## First step \small ```{r bMultiway-11, message=F} my_url <- "http://ritsokiguess.site/datafiles/ecg.txt" chest <- read_delim(my_url, " ") chest.1 <- glm(count ~ ecg * bmi * smoke, data = chest, family = "poisson" ) drop1(chest.1, test = "Chisq") ``` \normalsize That 3-way interaction comes out. ## Removing the 3-way interaction \small ```{r bMultiway-12} chest.2 <- update(chest.1, . ~ . - ecg:bmi:smoke) drop1(chest.2, test = "Chisq") ``` \normalsize At $\alpha=0.05$, `bmi:smoke` comes out. ## Removing `bmi:smoke` ```{r bMultiway-13} chest.3 <- update(chest.2, . ~ . - bmi:smoke) drop1(chest.3, test = "Chisq") ``` - `ecg:smoke` has become significant. So we have to stop. - `ecg` is associated with both `bmi` and `smoke`, but separately (it doesn't depend on the combination of `bmi` and `smoke`). ## Understanding the final model - For each of the significant associations, make a bar chart (here, two-variable because two-way interactions) - Here, `ecg` is response (patients came into the study being smokers or overweight) so use as `fill` in both graphs. - `y` is the frequency column. ## `ecg:bmi` ```{r} ggplot(chest, aes(x = bmi, y = count, fill = ecg)) + geom_col(position = "fill") ``` ## Comment - Most of the normal weight people had a normal ECG as well, but for the overweight people, a small majority had an abnormal ECG. ## `ecg:smoke` ```{r} ggplot(chest, aes(x = smoke, y = count, fill = ecg)) + geom_col(position = "fill") ``` ## Comments - Most nonsmokers have a normal ECG, but smokers are about 50--50 normal and abnormal ECG. - Don't look at `smoke:bmi` since not significant. ## Simpson's paradox: the airlines example ``` Alaska Airlines America West Airport On time Delayed On time Delayed Los Angeles 497 62 694 117 Phoenix 221 12 4840 415 San Diego 212 20 383 65 San Francisco 503 102 320 129 Seattle 1841 305 201 61 Total 3274 501 6438 787 ``` Use `status` as variable name for "on time/delayed". - Alaska: 13.3% flights delayed ($501/(3274+501)$). - America West: 10.9% ($787/(6438+787)$). - America West more punctual, right? ## Arranging the data - Can only have single thing in columns, so we have to construct column names like this: \small ``` airport aa_ontime aa_delayed aw_ontime aw_delayed LosAngeles 497 62 694 117 Phoenix 221 12 4840 415 SanDiego 212 20 383 65 SanFrancisco 503 102 320 129 Seattle 1841 305 201 61 ``` \normalsize - Read in: ```{r bMultiway-16, message=F} my_url <- "http://ritsokiguess.site/datafiles/airlines.txt" airlines <- read_table(my_url) ``` ## Tidying - Some tidying gets us the right layout, with frequencies all in one column and the airline and delayed/on time status separated out. This uses one of the fancy versions of `pivot_longer`: ```{r bMultiway-17, message=F, size="small"} airlines %>% pivot_longer(-airport, names_to = c("airline", "status"), names_sep = "_", values_to = "freq" ) -> punctual ``` ## The data frame `punctual` ```{r bMultiway-18, echo=F} punctual ``` ## Proportions delayed by airline ```{r} ggplot(punctual, aes(x = airline, y = freq, fill = status)) + geom_col(position = "fill") ``` ## Comment - Most flights are on time, but Alaska Airlines is late a little more often. ## Proportion delayed by airport, for each airline We now have *three* categorical variables, so use one of the explanatories (for me, airport) as facets: ## The graph(s) ```{r} ggplot(punctual, aes(x = airline, y = freq, fill = status)) + geom_col(position = "fill") + facet_wrap(~ airport) ``` ## Simpson's Paradox - America West more punctual overall, - but worse at *every single* airport! - How is that possible? - Log-linear analysis sheds some light. ## Model 1 and output ```{r bMultiway-22} punctual.1 <- glm(freq ~ airport * airline * status, data = punctual, family = "poisson" ) drop1(punctual.1, test = "Chisq") ``` ## Remove 3-way interaction \footnotesize ```{r bMultiway-23} punctual.2 <- update(punctual.1, ~ . - airport:airline:status) drop1(punctual.2, test = "Chisq") ``` \normalsize Stop here, and draw graphs to understand significant results. ## `airline:status`: ```{r} ggplot(punctual, aes(x = airline, y = freq, fill = status)) + geom_col(position = "fill") ``` ## Comments - We did this one before. - Slightly more of Alaska Airlines' flights delayed overall. ## `airport:status`: ```{r} ggplot(punctual, aes(x = airport, y = freq, fill = status)) + geom_col(position = "fill") ``` ## Comments - Flights into San Francisco (and maybe Seattle) are often late, and flights into Phoenix are usually on time. - Considerable variation among airports. ## `airport:airline`: ```{r} ggplot(punctual, aes(x = airport, y = freq, fill = airline)) + geom_col(position = "fill") ``` ## Comments - What fraction of each airline's flights are to each airport. - Most of Alaska Airlines' flights to Seattle and San Francisco. - Most of America West's flights to Phoenix. ## The resolution - Most of America West's flights to Phoenix, where it is easy to be on time. - Most of Alaska Airlines' flights to San Francisco and Seattle, where it is difficult to be on time. - Overall comparison looks bad for Alaska because of this. - But, *comparing like with like*, if you compare each airline's performance *to the same airport*, Alaska does better. - Aggregating over the very different airports was a (big) mistake: that was the cause of the Simpson's paradox. - Alaska Airlines is *more* punctual when you do the proper comparison. ## Ovarian cancer: a four-way table - Retrospective study of ovarian cancer done in 1973. - Information about 299 women operated on for ovarian cancer 10 years previously. - Recorded: - stage of cancer (early or advanced) - type of operation (radical or limited) - X-ray treatment received (yes or no) - 10-year survival (yes or no) - Survival looks like response (suggests logistic regression). - Log-linear model finds any associations at all. ## The data after tidying: \scriptsize ``` stage operation xray survival freq early radical no no 10 early radical no yes 41 early radical yes no 17 early radical yes yes 64 early limited no no 1 early limited no yes 13 early limited yes no 3 early limited yes yes 9 advanced radical no no 38 advanced radical no yes 6 advanced radical yes no 64 advanced radical yes yes 11 advanced limited no no 3 advanced limited no yes 1 advanced limited yes no 13 advanced limited yes yes 5 ``` \normalsize ## Reading in data \small ```{r bMultiway-27, message=F} my_url <- "http://ritsokiguess.site/datafiles/cancer.txt" cancer <- read_delim(my_url, " ") cancer %>% slice(1:6) ``` \normalsize ## Model 1 hopefully looking familiar by now: ```{r bMultiway-28} cancer.1 <- glm(freq ~ stage * operation * xray * survival, data = cancer, family = "poisson") ``` ## Output 1 See what we can remove: \scriptsize ```{r bMultiway-29} drop1(cancer.1, test = "Chisq") ``` \normalsize Non-significant interaction can come out. ## Model 2 \scriptsize ```{r bMultiway-30} cancer.2 <- update(cancer.1, . ~ . - stage:operation:xray:survival) drop1(cancer.2, test = "Chisq") ``` \normalsize Least significant term is `stage:xray:survival`: remove. ## Take out `stage:xray:survival` \scriptsize ```{r bMultiway-31} cancer.3 <- update(cancer.2, . ~ . - stage:xray:survival) drop1(cancer.3, test = "Chisq") ``` \normalsize `operation:xray:survival` comes out next. ## Remove `operation:xray:survival` \scriptsize ```{r bMultiway-32} cancer.4 <- update(cancer.3, . ~ . - operation:xray:survival) drop1(cancer.4, test = "Chisq") ``` \normalsize ## Comments - `stage:operation:xray` has now become significant, so won't remove that. - Shows value of removing terms one at a time. - There are no higher-order interactions containing both `xray` and `survival`, so now we get to test (and remove) `xray:survival`. ## Remove `xray:survival` \scriptsize ```{r bMultiway-33} cancer.5 <- update(cancer.4, . ~ . - xray:survival) drop1(cancer.5, test = "Chisq") ``` \normalsize ## Remove `stage:operation:survival` \scriptsize ```{r bMultiway-34} cancer.6 <- update(cancer.5, . ~ . - stage:operation:survival) drop1(cancer.6, test = "Chisq") ``` \normalsize ## Last step? Remove `operation:survival`. \footnotesize ```{r bMultiway-35, size="footnotesize"} cancer.7 <- update(cancer.6, . ~ . - operation:survival) drop1(cancer.7, test = "Chisq") ``` \normalsize Finally done! ## Conclusions - What matters is things associated with `survival` (`survival` is "response"). - Only significant such term is `stage:survival`. ## The graph ```{r} ggplot(cancer, aes(x = stage, y = freq, fill = survival)) + geom_col(position = "fill") ``` ## Comments - Most people in early stage of cancer survived, and most people in advanced stage did not survive. - This true *regardless* of type of operation or whether or not X-ray treatment was received. These things have no impact on survival. ## What about that other interaction? ```{r} ggplot(cancer, aes(x = stage, y = freq, fill = xray)) + geom_col(position = "fill") + facet_wrap(~ operation) ``` ## Comments - The association is between `stage` and `xray` *only for those who had the limited operation*. - For those who had the radical operation, there was no association between `stage` and `xray`. - This is of less interest than associations with `survival`. ## General procedure - Start with "complete model" including all possible interactions. - `drop1` gives highest-order interaction(s) remaining, remove least non-significant. - Repeat as necessary until everything significant. - Look at graphs of significant interactions. - Main effects not usually very interesting. - Interactions with "response" usually of most interest: show association with response.