--- title: "Logistic Regression" editor: markdown: wrap: 72 --- ## Logistic regression - When response variable is measured/counted, regression can work well. - But what if response is yes/no, lived/died, success/failure? - Model *probability* of success. - Probability must be between 0 and 1; need method that ensures this. - *Logistic regression* does this. In R, is a *generalized linear model* with binomial "family": ```{r bLogistic-1, eval=FALSE} glm(y ~ x, family="binomial") ``` - Begin with simplest case. ## Packages ```{r bLogistic-2, message = FALSE} library(MASS) library(tidyverse) library(marginaleffects) library(broom) library(nnet) library(conflicted) conflict_prefer("select", "dplyr") conflict_prefer("filter", "dplyr") conflict_prefer("rename", "dplyr") conflict_prefer("summarize", "dplyr") ``` ## The rats, part 1 - Rats given dose of some poison; either live or die: ``` dose status 0 lived 1 died 2 lived 3 lived 4 died 5 died ``` ## Read in: ```{r bLogistic-3, message=FALSE} my_url <- "http://ritsokiguess.site/datafiles/rat.txt" rats <- read_delim(my_url, " ") rats ``` ## Basic logistic regression - Make response into a factor first: ```{r bLogistic-4} rats2 <- rats %>% mutate(status = factor(status)) rats2 ``` - then fit model: ```{r bLogistic-5, error=T} status.1 <- glm(status ~ dose, family = "binomial", data = rats2) ``` ## Output ```{r bLogistic-6} summary(status.1) ``` ## Interpreting the output - Like (multiple) regression, get tests of significance of individual $x$'s - Here not significant (only 6 observations). - "Slope" for dose is negative, meaning that as dose increases, probability of event modelled (survival) decreases. ## Output part 2: predicted survival probs ```{r bLogistic-7 } cbind(predictions(status.1)) %>% select(dose, estimate, conf.low, conf.high) ``` ## On a graph ```{r, fig.height=4} plot_predictions(status.1, condition = "dose") ``` ## The rats, more - More realistic: more rats at each dose (say 10). 
- Listing each rat on one line makes a big data file.
- Use the format below: dose, number of survivals, number of deaths.

```
dose lived died
0 10 0
1 7 3
2 6 4
3 4 6
4 2 8
5 1 9
```

- 6 lines of data correspond to 60 actual rats.
- Saved in `rat2.txt`.

## These data

```{r bLogistic-8}
my_url <- "http://ritsokiguess.site/datafiles/rat2.txt"
rat2 <- read_delim(my_url, " ")
rat2
```

## Response matrix:

- Each row contains *multiple* observations.
- Create a *two-column* response with `cbind`:
  - #survivals in first column,
  - #deaths in second.

## Fit logistic regression

- constructing the response in the `glm`:

```{r bLogistic-11}
rat2.1 <- glm(cbind(lived, died) ~ dose, family = "binomial", data = rat2)
```

## Output

Significant effect of dose now:

```{r bLogistic-12}
summary(rat2.1)
```

## Predicted survival probs

```{r bLogistic-13 }
#| warning = FALSE
new <- datagrid(model = rat2.1, dose = 0:5)
cbind(predictions(rat2.1, newdata = new)) %>% 
  select(estimate, dose, conf.low, conf.high)
```

## On a picture

```{r}
#| error: true
plot_predictions(rat2.1, condition = "dose")
```

## Comments

- Significant effect of dose.
- Effect of a larger dose is to *decrease* survival probability ("slope" negative; also seen in the decreasing predictions).
- Confidence intervals around the predictions are narrower (more data).

## Multiple logistic regression

- With more than one $x$, works much like multiple regression.
- Example: study of patients with blood poisoning severe enough to warrant surgery. Relate survival to other potential risk factors.
- Variables, 1=present, 0=absent:
  - survival (death from sepsis=1), response
  - shock
  - malnutrition
  - alcoholism
  - age (as numerical variable)
  - bowel infarction
- See what relates to death.
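## The logit link, in code

Before the bigger model: `family = "binomial"` means the model is linear on the *log-odds* scale, and probabilities come back through the inverse logit. A small base-R sketch (the intercept 2 and slope $-0.5$ here are made-up numbers for illustration, not fitted values):

```{r}
# logit (log-odds) and its inverse, the default link for family = "binomial"
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))

# round trip: probability -> log-odds -> probability
inv_logit(logit(0.8)) # 0.8

# a made-up linear predictor: intercept 2, "slope" -0.5 per unit dose
inv_logit(2 - 0.5 * 0:5) # survival probabilities decrease with dose
```

Any straight line on the log-odds scale becomes an S-shaped curve between 0 and 1 on the probability scale, which is what guarantees valid probabilities.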
## Read in data

```{r bLogistic-14}
my_url <- "http://ritsokiguess.site/datafiles/sepsis.txt"
sepsis <- read_delim(my_url, " ")
sepsis
```

## Make sure categoricals really are

```{r}
sepsis %>% 
  mutate(across(-age, \(x) factor(x))) -> sepsis
```

## The data (some)

```{r bLogistic-15}
sepsis
```

## Fit model

```{r bLogistic-16 }
sepsis.1 <- glm(death ~ shock + malnut + alcohol + age + bowelinf,
  family = "binomial", data = sepsis
)
```

## Output part 1

```{r bLogistic-17}
summary(sepsis.1)
tidy(sepsis.1)
```

- All P-values fairly small
- but `malnut` not significant: remove.

## Removing `malnut`

```{r bLogistic-18}
sepsis.2 <- update(sepsis.1, . ~ . - malnut)
tidy(sepsis.2)
```

- Everything significant now.

## Comments

- Most of the original $x$'s helped predict death. Only `malnut` seemed not to add anything.
- Removed `malnut` and tried again.
- Everything remaining is significant (though `bowelinf` actually became *less* significant).
- All coefficients are *positive*, so having any of the risk factors (or being older) *increases* the risk of death.

## Predictions from model without `malnut`

- A few (rows of the original dataframe) chosen "at random":

```{r bLogistic-19}
sepsis %>% slice(c(4, 1, 2, 11, 32)) -> new
new
cbind(predictions(sepsis.2, newdata = new)) %>% 
  select(estimate, conf.low, conf.high, shock:bowelinf)
```

## Comments

- Survival chances pretty good if no risk factors, though decreasing with age.
- Having more than one risk factor reduces survival chances dramatically.
- Model usually does a good job of predicting survival, though occasionally a patient predicted to survive actually died.
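## The model as a formula

In symbols, the fitted model says the log-odds of death is linear in the predictors, with the $\hat\beta$'s being the numbers in the `estimate` column of `tidy(sepsis.2)`; the predicted probability comes back by inverting the logit:

$$
\log \frac{\hat p}{1 - \hat p} = \hat\beta_0 + \hat\beta_1\,\text{shock} + \hat\beta_2\,\text{alcohol} + \hat\beta_3\,\text{age} + \hat\beta_4\,\text{bowelinf},
\qquad
\hat p = \frac{1}{1 + e^{-(\hat\beta_0 + \hat\beta_1\,\text{shock} + \cdots)}}.
$$

This is why `predictions` always produces values between 0 and 1, whatever the values of the risk factors.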
## Another way to assess effects of `age`:

```{r}
new <- datagrid(model = sepsis.2, age = seq(30, 70, 10))
new
```

## Assessing age effect

```{r}
cbind(predictions(sepsis.2, newdata = new)) %>% 
  select(estimate, shock:age)
```

## Assessing shock effect

```{r}
new <- datagrid(shock = c(0, 1), model = sepsis.2)
new
cbind(predictions(sepsis.2, newdata = new)) %>% 
  select(estimate, death:shock)
```

## Assessing proportionality of odds for age

- An assumption we made is that the log-odds of survival depends *linearly* on age.
- Hard to get your head around, but the basic idea is that survival chances go continuously up (or down) with age, instead of (for example) going up and then down.
- In this case, seems reasonable, but should check:

## Residuals vs. age

```{r virtusentella,fig.height=3.4, warning=F}
sepsis.2 %>% augment(sepsis) %>% 
  ggplot(aes(x = age, y = .resid, colour = death)) + 
  geom_point()
```

## Comments

- No apparent problems overall.
- Confusing "line" across: these are the patients with no risk factors who survived.

## Probability and odds

For probability $p$, the odds is $p/(1-p)$:

| Prob | Odds             | Log-odds | Words      |
|------|------------------|----------|------------|
| 0.5  | 0.5 / 0.5 = 1.00 | 0.00     | even money |
| 0.1  | 0.1 / 0.9 = 0.11 | -2.20    | 9 to 1     |
| 0.4  | 0.4 / 0.6 = 0.67 | -0.41    | 1.5 to 1   |
| 0.8  | 0.8 / 0.2 = 4.00 | 1.39     | 4 to 1 on  |

- Gamblers use odds: if you win at 9 to 1 odds, you get your original stake back plus 9 times the stake.
- Probability has to be between 0 and 1.
- Odds are between 0 and infinity.
- *Log*-odds can be anything: any log-odds corresponds to a valid probability.

## Odds ratio

- Suppose 90 of 100 men drank wine last week, but only 20 of 100 women.
- Prob of a man drinking wine $90/100=0.9$, of a woman $20/100=0.2$.
- Odds of a man drinking wine $0.9/0.1=9$, of a woman $0.2/0.8=0.25$.
- Ratio of odds is $9/0.25=36$.
- A way of quantifying the difference between men and women: "odds of drinking wine 36 times larger for males than females".
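## Odds ratio by hand

The wine example above is easy to check numerically; a quick sketch using the same numbers as in the text:

```{r}
# probabilities of drinking wine, from the example above
p_men <- 90 / 100
p_women <- 20 / 100
odds <- function(p) p / (1 - p)
odds(p_men) # 9: odds for men
odds(p_women) # 0.25: odds for women
odds(p_men) / odds(p_women) # odds ratio: 36
```

The same 36 would also come out of a logistic regression of wine-drinking on sex, as the exp of the slope (next section).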
## Sepsis data again

- Recall prediction of probability of death from risk factors:

```{r bLogistic-20}
sepsis
summary(sepsis.2)
sepsis.2.tidy <- tidy(sepsis.2)
sepsis.2.tidy
```

- Slopes in column `estimate`.

## Multiplying the odds

- Can interpret slopes by taking "exp" of them. We ignore the intercept.

```{r expo}
sepsis.2.tidy %>% 
  mutate(exp_coeff = exp(estimate)) %>% 
  select(term, exp_coeff)
```

## Interpretation

```{r bLogistic-21, ref.label="expo", echo=F}
```

- These say "how much do you *multiply* the odds of death by for an increase of 1 in the corresponding risk factor?" Or, what is the odds ratio for that factor being 1 (present) vs. 0 (absent)?
- E.g. being alcoholic vs. not increases the odds of death by 24 times.
- Being one year older multiplies the odds by about 1.1 times. Over 40 years, about $1.09^{40}=31$ times.

## Odds ratio and relative risk

- **Relative risk** is a ratio of probabilities.
- Above: 90 of 100 men (0.9) drank wine, 20 of 100 women (0.2).
- Relative risk $0.9/0.2=4.5$ (odds ratio was 36).
- When the probabilities are small, relative risk and odds ratio are similar.
- E.g. prob of a man having a disease 0.02, of a woman 0.01.
- Relative risk $0.02/0.01=2$.

## Odds ratio vs. relative risk

- Odds for men and for women:

```{r bLogistic-22 }
(od1 <- 0.02 / 0.98) # men
(od2 <- 0.01 / 0.99) # women
```

- Odds ratio:

```{r bLogistic-23 }
od1 / od2
```

- Very close to the relative risk of 2.

## More than 2 response categories

- With 2 response categories, model the probability of one, and the prob of the other is one minus that. So it doesn't matter which category you model.
- With more than 2 categories, have to think more carefully about the categories: are they
  - *ordered*: you can put them in a natural order (like low, medium, high)
  - *nominal*: ordering the categories doesn't make sense (like red, green, blue).
- R handles both kinds of response; learn how.

## Ordinal response: the miners

- Model the probability of being in a given category *or lower*.
- Example: coal-miners often suffer the disease pneumoconiosis. Likelihood of disease believed to be greater among miners who have worked longer.
- Severity of disease measured on a categorical scale: none, moderate, severe.

## Miners data

- Data are frequencies:

```
Exposure  None  Moderate  Severe
   5.8     98      0         0
  15.0     51      2         1
  21.5     34      6         3
  27.5     35      5         8
  33.5     32     10         9
  39.5     23      7         8
  46.0     12      6        10
  51.5      4      2         5
```

## Reading the data

Data are in aligned columns with more than one space between, so:

```{r bLogistic-24 }
my_url <- "http://ritsokiguess.site/datafiles/miners-tab.txt"
freqs <- read_table(my_url)
```

## The data

```{r bLogistic-25 }
freqs
```

## Tidying

```{r bLogistic-26 }
freqs %>%
  pivot_longer(-Exposure, names_to = "Severity", values_to = "Freq") %>%
  mutate(Severity = fct_inorder(Severity)) -> miners
```

## Result

```{r bLogistic-27 }
miners
```

## Plot proportions against exposure

```{r bLogistic-28, fig.height=3.5, message=F}
miners %>%
  group_by(Exposure) %>%
  mutate(proportion = Freq / sum(Freq)) -> prop
prop
ggplot(prop, aes(x = Exposure, y = proportion, colour = Severity)) +
  geom_point() + geom_smooth(se = F)
```

## Reminder of data setup

```{r bLogistic-29 }
miners
```

## Fitting ordered logistic model

Use the function `polr` from package `MASS`. It works like `glm`:

```{r bLogistic-34 }
sev.1 <- polr(Severity ~ Exposure,
  weights = Freq,
  data = miners
)
```

## Output: not very illuminating

```{r}
sev.1 <- polr(Severity ~ Exposure,
  weights = Freq,
  data = miners, Hess = TRUE
)
```

```{r bLogistic-35 }
summary(sev.1)
```

## Does exposure have an effect?

Fit the model without `Exposure`, and compare using `anova`. Note the `1` for the model with just an intercept:

```{r bLogistic-36, echo=F}
w <- getOption("width")
options(width = w - 20)
```

```{r bLogistic-37}
sev.0 <- polr(Severity ~ 1, weights = Freq, data = miners)
anova(sev.0, sev.1)
```

Exposure definitely has an effect on the severity of the disease.

## Another way

- What (if anything) can we drop from the model with `Exposure`?

```{r bLogistic-38 }
drop1(sev.1, test = "Chisq")
```

- Nothing.
Exposure definitely has an effect.

## Predicted probabilities 1/2

```{r}
freqs %>% select(Exposure) -> new
new
```

## Predicted probabilities 2/2

```{r}
cbind(predictions(sev.1, newdata = new)) %>%
  select(group, estimate, Exposure) %>%
  pivot_wider(names_from = group, values_from = estimate)
```

## Plot of predicted probabilities

```{r}
plot_predictions(model = sev.1,
                 condition = c("Exposure", "group"),
                 type = "probs") +
  geom_point(data = prop,
             aes(x = Exposure, y = proportion, colour = Severity)) -> ggg
```

## The graph

```{r, fig.height=4.5}
ggg
```

## Comments

- Model appears to match the data well enough.
- As exposure goes up, the probability of None goes down, and of Severe goes up (sharply for high exposure).
- So more exposure means worse disease.

## Unordered responses

- With unordered (nominal) responses, can use the *generalized logit*.
- Example: 735 people, record age and sex (male 0, female 1), and which of 3 brands of some product each preferred.
- Data in `mlogit.csv`, separated by commas (so `read_csv` will work):

```{r bLogistic-45 }
my_url <- "http://ritsokiguess.site/datafiles/mlogit.csv"
brandpref <- read_csv(my_url)
```

## The data (some)

```{r bLogistic-46 }
brandpref
```

## Bashing into shape

- `sex` and `brand` not meaningful as numbers, so turn them into factors:

```{r bLogistic-47 }
brandpref %>%
  mutate(
    sex = ifelse(sex == 1, "female", "male"),
    sex = factor(sex),
    brand = factor(brand)
  ) -> brandpref
```

```{r}
brandpref %>% count(sex)
```

## Fitting model

- We use `multinom` from package `nnet`. It works like `polr`:

```{r bLogistic-48 }
library(nnet)
levels(brandpref$sex)
brands.1 <- multinom(brand ~ age + sex, data = brandpref)
```

## Can we drop anything?

- Unfortunately `drop1` seems not to work:

```{r bLogistic-49, error=TRUE}
drop1(brands.1, test = "Chisq", trace = 0)
```

- So, fall back on fitting the model without what you want to test, and comparing using `anova`.

## Do age/sex help predict brand? 1/3

Fit models without each of age and sex:

```{r bLogistic-50 }
brands.2 <- multinom(brand ~ age, data = brandpref)
brands.3 <- multinom(brand ~ sex, data = brandpref)
```

## Do age/sex help predict brand? 2/3

```{r bLogistic-51}
anova(brands.2, brands.1)
anova(brands.3, brands.1)
```

## Do age/sex help predict brand? 3/3

- `age` definitely significant (second `anova`).
- `sex` significant also (first `anova`), though the P-value is less dramatic.
- Keep both.
- Expect to see a large effect of `age`, and a smaller one of `sex`.

## Another way to build the model

- Start from the model with everything and run `step`:

```{r}
#| echo = FALSE,
#| message = FALSE,
#| results = "hide"
brands.1 <- with(brandpref, multinom(brand ~ age + sex))
```

```{r bLogistic-52 }
step(brands.1, trace = 0)
```

- Final model contains both `age` and `sex`, so neither could be removed.

## Making predictions

Find the age five-number summary, and the two sexes:

```{r}
summary(brandpref)
```

Space the ages out a bit for prediction (see over).

## Combinations

```{r}
new <- datagrid(age = seq(24, 30, 2),
                sex = c("female", "male"),
                model = brands.1)
new
```

## The predictions

```{r bLogistic-54 }
cbind(predictions(brands.1, newdata = new)) %>%
  select(group, estimate, age, sex) %>%
  pivot_wider(names_from = group, values_from = estimate)
```

## Comments

- Young males prefer brand 1, but older males prefer brand 3.
- Females similar, but like brand 1 less and brand 2 more.
- A clear `brand` effect, but the `sex` effect is less clear.

## Making a plot

- I thought `plot_predictions` wouldn't work as we want, but I was (sort of) wrong about that:

```{r}
plot_predictions(brands.1,
                 condition = c("age", "brand", "sex"),
                 type = "probs")
```

## Making it go

- We have to include `group` in the `condition`:

```{r}
plot_predictions(brands.1, condition = c("age", "group"))
```

- This picks the most common `sex` in the data (females).
- See younger females preferring brand 1, older ones preferring brand 3.
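## How `multinom` gets probabilities

Behind the scenes, `multinom` fits one log-odds equation per non-baseline brand (brand 1 is the baseline here), and converts the linear predictors to probabilities with a softmax. A sketch with made-up log-odds values (not the fitted coefficients):

```{r}
# made-up log-odds relative to brand 1; the baseline has log-odds 0
eta <- c(brand1 = 0, brand2 = 0.5, brand3 = -1)
probs <- exp(eta) / sum(exp(eta)) # softmax: all positive, sums to 1
round(probs, 3)
```

With only two categories this reduces to the ordinary inverse logit, which is why binary logistic regression is a special case.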
## For each `sex`

If we add the other variable to the *end*, we get facets for `sex`:

```{r}
plot_predictions(brands.1, condition = c("age", "group", "sex"))
```

Not actually much difference between males and females.

## A better graph

- but the male-female difference *was* significant. How?
- *Don't* actually plot the graph; then plot the right things:

```{r}
plot_predictions(brands.1,
                 condition = c("age", "brand", "sex"),
                 type = "probs", draw = FALSE) %>%
  ggplot(aes(x = age, y = estimate, colour = group, linetype = sex)) +
  geom_line() -> g
```

## The graph

```{r, fig.height=4.5}
g
```

## Digesting the plot

- Brand vs. age: younger people (of both genders) prefer brand 1, but older people (of both genders) prefer brand 3. (Explains the significant age effect.)
- Brand vs. sex: females (solid) like brand 1 less than males (dashed), and like brand 2 more (for all ages).
- Not much brand difference between the genders (solid and dashed lines of the same colours are close), but enough to be significant.
- Model didn't include an interaction, so the modelled effect of gender on brand is the same for each age, and the modelled effect of age is the same for each gender. (See also later.)

## Alternative data format

Summarize all people of the same brand preference, same sex, and same age on one line of the data file, with a frequency on the end:

```{r}
brandpref
```

```
1 0 24 1
1 0 26 2
1 0 27 4
1 0 28 4
1 0 29 7
1 0 30 3
...
```

Whole data set in 65 lines, not 735! But how?

## Getting alternative data format

```{r bLogistic-60, warning=FALSE, message=FALSE}
brandpref %>%
  group_by(age, sex, brand) %>%
  summarize(Freq = n()) %>%
  ungroup() -> b
b
```

## Fitting models, almost the same

- Just have to remember `weights` to incorporate the frequencies.
- Otherwise `multinom` assumes you have just 1 obs on each line!
- Again turn the (numerical) `sex` and `brand` into factors:

```{r bLogistic-61, results="hide"}
b %>%
  mutate(sex = factor(sex)) %>%
  mutate(brand = factor(brand)) -> bf
b.1 <- multinom(brand ~ age + sex, data = bf, weights = Freq)
b.2 <- multinom(brand ~ age, data = bf, weights = Freq)
```

## P-value for `sex` identical

```{r bLogistic-62}
anova(b.2, b.1)
```

Same P-value as before, so we haven't changed anything important.

## Trying interaction between age and gender

```{r bLogistic-69, echo=F}
options(width = 60)
```

```{r bLogistic-70 }
brands.4 <- update(brands.1, . ~ . + age:sex)
anova(brands.1, brands.4)
```

- No evidence that the effect of age on brand preference differs for the two genders.

## Make the graph again

```{r}
plot_predictions(brands.4,
                 condition = c("age", "brand", "sex"),
                 type = "probs", draw = FALSE) %>%
  ggplot(aes(x = age, y = estimate, colour = group, linetype = sex)) +
  geom_line() -> g4
```

## Not much difference in the graph

```{r, fig.height=4.5}
g4
```

## Compare the model without interaction

```{r, fig.height=4.5}
g
```
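## What `anova` is doing

The interaction test above is a likelihood-ratio test: the difference in residual deviances between the two models, referred to a chi-squared distribution. A sketch of the computation with *made-up* deviance values (not the actual output):

```{r}
# hypothetical deviances for the smaller and larger models
dev_small <- 702.97 # made-up number
dev_big <- 700.79 # made-up number
lr_stat <- dev_small - dev_big
df <- 2 # age:sex adds one term for each of the 2 non-baseline brands
pchisq(lr_stat, df, lower.tail = FALSE) # large P-value: keep the smaller model
```

This is the same logic as the earlier `anova` comparisons for `age` and `sex`, just with different degrees of freedom.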