--- title: "Logistic Regression" editor: markdown: wrap: 72 --- ## Logistic regression - When response variable is measured/counted, regression can work well. - But what if response is yes/no, lived/died, success/failure? - Model *probability* of success. - Probability must be between 0 and 1; need method that ensures this. - *Logistic regression* does this. In R, is a *generalized linear model* with binomial "family": ```{r bLogistic-1, eval=FALSE} glm(y ~ x, family="binomial") ``` - Begin with simplest case. ## Packages ```{r bLogistic-2, message = FALSE} library(MASS) library(tidyverse) library(marginaleffects) library(broom) library(nnet) library(conflicted) conflict_prefer("select", "dplyr") conflict_prefer("filter", "dplyr") conflict_prefer("rename", "dplyr") conflict_prefer("summarize", "dplyr") ``` ## The rats, part 1 - Rats given dose of some poison; either live or die: ``` dose status 0 lived 1 died 2 lived 3 lived 4 died 5 died ``` ## Read in: ```{r bLogistic-3, message=FALSE} my_url <- "http://ritsokiguess.site/datafiles/rat.txt" rats <- read_delim(my_url, " ") rats ``` ## Basic logistic regression - Make response into a factor first: ```{r bLogistic-4} rats2 <- rats %>% mutate(status = factor(status)) rats2 ``` - then fit model: ```{r bLogistic-5, error=T} status.1 <- glm(status ~ dose, family = "binomial", data = rats2) ``` ## Output ```{r bLogistic-6} summary(status.1) ``` ## Interpreting the output - Like (multiple) regression, get tests of significance of individual $x$'s - Here not significant (only 6 observations). - "Slope" for dose is negative, meaning that as dose increases, probability of event modelled (survival) decreases. ## Output part 2: predicted survival probs ```{r bLogistic-7 } cbind(predictions(status.1)) %>% select(dose, estimate, conf.low, conf.high) ``` ## On a graph ```{r, fig.height=4} plot_predictions(status.1, condition = "dose") ``` ## The rats, more - More realistic: more rats at each dose (say 10). 
- Listing each rat on one line makes a big data file.
- Use the format below: dose, number of survivals, number of deaths.

```
dose lived died
0 10 0
1 7 3
2 6 4
3 4 6
4 2 8
5 1 9
```

- 6 lines of data correspond to 60 actual rats.
- Saved in `rat2.txt`.

## These data

```{r bLogistic-8}
my_url <- "http://ritsokiguess.site/datafiles/rat2.txt"
rat2 <- read_delim(my_url, " ")
rat2
```

## Response matrix:

- Each row contains *multiple* observations.
- Create a *two-column* response with `cbind`:
  - #survivals in first column,
  - #deaths in second.

## Fit logistic regression

- constructing the response in the `glm`:

```{r bLogistic-11}
rat2.1 <- glm(cbind(lived, died) ~ dose, family = "binomial", data = rat2)
```

## Output

Significant effect of dose now:

```{r bLogistic-12}
summary(rat2.1)
```

## Predicted survival probs

```{r bLogistic-13 }
#| warning = FALSE
new <- datagrid(model = rat2.1, dose = 0:5)
cbind(predictions(rat2.1, newdata = new)) %>% 
  select(estimate, dose, conf.low, conf.high)
```

## On a picture

```{r}
#| error: true
plot_predictions(rat2.1, condition = "dose")
```

## Comments

- Significant effect of dose.
- Effect of a larger dose is to *decrease* survival probability ("slope" negative; also seen in the decreasing predictions).
- Confidence intervals around the predictions are narrower (more data).

## Multiple logistic regression

- With more than one $x$, works much like multiple regression.
- Example: study of patients with blood poisoning severe enough to warrant surgery. Relate survival to other potential risk factors.
- Variables, 1=present, 0=absent:
  - survival (death from sepsis=1), response
  - shock
  - malnutrition
  - alcoholism
  - age (as numerical variable)
  - bowel infarction
- See what relates to death.
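## The logit link, in code

Before the bigger model: `family = "binomial"` means the model is linear on the *log-odds* scale, and probabilities come back through the inverse logit. A small base-R sketch (the intercept 2 and slope $-0.5$ here are made-up numbers for illustration, not fitted values):

```{r}
# logit (log-odds) and its inverse, the default link for family = "binomial"
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))

# round trip: probability -> log-odds -> probability
inv_logit(logit(0.8)) # 0.8

# a made-up linear predictor: intercept 2, "slope" -0.5 per unit dose
inv_logit(2 - 0.5 * 0:5) # survival probabilities decrease with dose
```

Any straight line on the log-odds scale becomes an S-shaped curve between 0 and 1 on the probability scale, which is what guarantees valid probabilities.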
## Read in data

```{r bLogistic-14}
my_url <- "http://ritsokiguess.site/datafiles/sepsis.txt"
sepsis <- read_delim(my_url, " ")
sepsis
```

## Make sure categoricals really are

```{r}
sepsis %>% 
  mutate(across(-age, \(x) factor(x))) -> sepsis
```

## The data (some)

```{r bLogistic-15}
sepsis
```

## Fit model

```{r bLogistic-16 }
sepsis.1 <- glm(death ~ shock + malnut + alcohol + age + bowelinf,
  family = "binomial", data = sepsis
)
```

## Output part 1

```{r bLogistic-17}
summary(sepsis.1)
tidy(sepsis.1)
```

- All P-values fairly small
- but `malnut` not significant: remove.

## Removing `malnut`

```{r bLogistic-18}
sepsis.2 <- update(sepsis.1, . ~ . - malnut)
tidy(sepsis.2)
```

- Everything significant now.

## Comments

- Most of the original $x$'s helped predict death. Only `malnut` seemed not to add anything.
- Removed `malnut` and tried again.
- Everything remaining is significant (though `bowelinf` actually became *less* significant).
- All coefficients are *positive*, so having any of the risk factors (or being older) *increases* the risk of death.

## Predictions from model without `malnut`

- A few (rows of the original dataframe) chosen "at random":

```{r bLogistic-19}
sepsis %>% slice(c(4, 1, 2, 11, 32)) -> new
new
cbind(predictions(sepsis.2, newdata = new)) %>% 
  select(estimate, conf.low, conf.high, shock:bowelinf)
```

## Comments

- Survival chances pretty good if no risk factors, though decreasing with age.
- Having more than one risk factor reduces survival chances dramatically.
- Model usually does a good job of predicting survival, though occasionally a patient predicted to survive actually died.
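## The model as a formula

In symbols, the fitted model says the log-odds of death is linear in the predictors, with the $\hat\beta$'s being the numbers in the `estimate` column of `tidy(sepsis.2)`; the predicted probability comes back by inverting the logit:

$$
\log \frac{\hat p}{1 - \hat p} = \hat\beta_0 + \hat\beta_1\,\text{shock} + \hat\beta_2\,\text{alcohol} + \hat\beta_3\,\text{age} + \hat\beta_4\,\text{bowelinf},
\qquad
\hat p = \frac{1}{1 + e^{-(\hat\beta_0 + \hat\beta_1\,\text{shock} + \cdots)}}.
$$

This is why `predictions` always produces values between 0 and 1, whatever the values of the risk factors.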
## Another way to assess effects of `age`:

```{r}
new <- datagrid(model = sepsis.2, age = seq(30, 70, 10))
new
```

## Assessing age effect

```{r}
cbind(predictions(sepsis.2, newdata = new)) %>% 
  select(estimate, shock:age)
```

## Assessing shock effect

```{r}
new <- datagrid(shock = c(0, 1), model = sepsis.2)
new
cbind(predictions(sepsis.2, newdata = new)) %>% 
  select(estimate, death:shock)
```

## Assessing proportionality of odds for age

- An assumption we made is that the log-odds of survival depends *linearly* on age.
- Hard to get your head around, but the basic idea is that survival chances go continuously up (or down) with age, instead of (for example) going up and then down.
- In this case, seems reasonable, but should check:

## Residuals vs. age

```{r virtusentella,fig.height=3.4, warning=F}
sepsis.2 %>% augment(sepsis) %>% 
  ggplot(aes(x = age, y = .resid, colour = death)) + 
  geom_point()
```

## Comments

- No apparent problems overall.
- Confusing "line" across: these are the patients with no risk factors who survived.

## Probability and odds

For probability $p$, the odds is $p/(1-p)$:

| Prob | Odds             | Log-odds | Words      |
|------|------------------|----------|------------|
| 0.5  | 0.5 / 0.5 = 1.00 | 0.00     | even money |
| 0.1  | 0.1 / 0.9 = 0.11 | -2.20    | 9 to 1     |
| 0.4  | 0.4 / 0.6 = 0.67 | -0.41    | 1.5 to 1   |
| 0.8  | 0.8 / 0.2 = 4.00 | 1.39     | 4 to 1 on  |

- Gamblers use odds: if you win at 9 to 1 odds, you get your original stake back plus 9 times the stake.
- Probability has to be between 0 and 1.
- Odds are between 0 and infinity.
- *Log*-odds can be anything: any log-odds corresponds to a valid probability.

## Odds ratio

- Suppose 90 of 100 men drank wine last week, but only 20 of 100 women.
- Prob of a man drinking wine $90/100=0.9$, of a woman $20/100=0.2$.
- Odds of a man drinking wine $0.9/0.1=9$, of a woman $0.2/0.8=0.25$.
- Ratio of odds is $9/0.25=36$.
- A way of quantifying the difference between men and women: "odds of drinking wine 36 times larger for males than females".
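## Odds ratio by hand

The wine example above is easy to check numerically; a quick sketch using the same numbers as in the text:

```{r}
# probabilities of drinking wine, from the example above
p_men <- 90 / 100
p_women <- 20 / 100
odds <- function(p) p / (1 - p)
odds(p_men) # 9: odds for men
odds(p_women) # 0.25: odds for women
odds(p_men) / odds(p_women) # odds ratio: 36
```

The same 36 would also come out of a logistic regression of wine-drinking on sex, as the exp of the slope (next section).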
## Sepsis data again

- Recall prediction of probability of death from risk factors:

```{r bLogistic-20}
sepsis
summary(sepsis.2)
sepsis.2.tidy <- tidy(sepsis.2)
sepsis.2.tidy
```

- Slopes in column `estimate`.

## Multiplying the odds

- Can interpret slopes by taking "exp" of them. We ignore the intercept.

```{r expo}
sepsis.2.tidy %>% 
  mutate(exp_coeff = exp(estimate)) %>% 
  select(term, exp_coeff)
```

## Interpretation

```{r bLogistic-21, ref.label="expo", echo=F}
```

- These say "how much do you *multiply* the odds of death by for an increase of 1 in the corresponding risk factor?" Or, what is the odds ratio for that factor being 1 (present) vs. 0 (absent)?
- E.g. being alcoholic vs. not increases the odds of death by 24 times.
- Being one year older multiplies the odds by about 1.1 times. Over 40 years, about $1.09^{40}=31$ times.

## Odds ratio and relative risk

- **Relative risk** is a ratio of probabilities.
- Above: 90 of 100 men (0.9) drank wine, 20 of 100 women (0.2).
- Relative risk $0.9/0.2=4.5$ (odds ratio was 36).
- When the probabilities are small, relative risk and odds ratio are similar.
- E.g. prob of a man having a disease 0.02, of a woman 0.01.
- Relative risk $0.02/0.01=2$.

## Odds ratio vs. relative risk

- Odds for men and for women:

```{r bLogistic-22 }
(od1 <- 0.02 / 0.98) # men
(od2 <- 0.01 / 0.99) # women
```

- Odds ratio:

```{r bLogistic-23 }
od1 / od2
```

- Very close to the relative risk of 2.

## More than 2 response categories

- With 2 response categories, model the probability of one, and the prob of the other is one minus that. So it doesn't matter which category you model.
- With more than 2 categories, have to think more carefully about the categories: are they
  - *ordered*: you can put them in a natural order (like low, medium, high)
  - *nominal*: ordering the categories doesn't make sense (like red, green, blue).
- R handles both kinds of response; learn how.

## Ordinal response: the miners

- Model the probability of being in a given category *or lower*.
- Example: coal-miners often suffer the disease pneumoconiosis. Likelihood of disease believed to be greater among miners who have worked longer.
- Severity of disease measured on a categorical scale: none, moderate, severe.

## Miners data

- Data are frequencies:

```
Exposure  None  Moderate  Severe
   5.8     98      0         0
  15.0     51      2         1
  21.5     34      6         3
  27.5     35      5         8
  33.5     32     10         9
  39.5     23      7         8
  46.0     12      6        10
  51.5      4      2         5
```

## Reading the data

Data are in aligned columns with more than one space between, so:

```{r bLogistic-24 }
my_url <- "http://ritsokiguess.site/datafiles/miners-tab.txt"
freqs <- read_table(my_url)
```

## The data

```{r bLogistic-25 }
freqs
```

## Tidying

```{r bLogistic-26 }
freqs %>%
  pivot_longer(-Exposure, names_to = "Severity", values_to = "Freq") %>%
  mutate(Severity = fct_inorder(Severity)) -> miners
```

## Result

```{r bLogistic-27 }
miners
```

## Plot proportions against exposure

```{r bLogistic-28, fig.height=3.5, message=F}
miners %>%
  group_by(Exposure) %>%
  mutate(proportion = Freq / sum(Freq)) -> prop
prop
ggplot(prop, aes(x = Exposure, y = proportion, colour = Severity)) +
  geom_point() + geom_smooth(se = F)
```

## Reminder of data setup

```{r bLogistic-29 }
miners
```

## Fitting ordered logistic model

Use the function `polr` from package `MASS`. It works like `glm`:

```{r bLogistic-34 }
sev.1 <- polr(Severity ~ Exposure,
  weights = Freq,
  data = miners
)
```

## Output: not very illuminating

```{r}
sev.1 <- polr(Severity ~ Exposure,
  weights = Freq,
  data = miners, Hess = TRUE
)
```

```{r bLogistic-35 }
summary(sev.1)
```

## Does exposure have an effect?

Fit the model without `Exposure`, and compare using `anova`. Note the `1` for the model with just an intercept:

```{r bLogistic-36, echo=F}
w <- getOption("width")
options(width = w - 20)
```

```{r bLogistic-37}
sev.0 <- polr(Severity ~ 1, weights = Freq, data = miners)
anova(sev.0, sev.1)
```

Exposure definitely has an effect on the severity of the disease.

## Another way

- What (if anything) can we drop from the model with `Exposure`?

```{r bLogistic-38 }
drop1(sev.1, test = "Chisq")
```

- Nothing.
Exposure definitely has an effect.

## Predicted probabilities 1/2

```{r}
freqs %>% select(Exposure) -> new
new
```

## Predicted probabilities 2/2

```{r}
cbind(predictions(sev.1, newdata = new)) %>%
  select(group, estimate, Exposure) %>%
  pivot_wider(names_from = group, values_from = estimate)
```

## Plot of predicted probabilities

```{r}
plot_predictions(model = sev.1,
                 condition = c("Exposure", "group"),
                 type = "probs") +
  geom_point(data = prop,
             aes(x = Exposure, y = proportion, colour = Severity)) -> ggg
```

## The graph

```{r, fig.height=4.5}
ggg
```

## Comments

- Model appears to match the data well enough.
- As exposure goes up, the probability of None goes down, and of Severe goes up (sharply for high exposure).
- So more exposure means worse disease.

## Unordered responses

- With unordered (nominal) responses, can use the *generalized logit*.
- Example: 735 people, record age and sex (male 0, female 1), and which of 3 brands of some product each preferred.
- Data in `mlogit.csv`, separated by commas (so `read_csv` will work):

```{r bLogistic-45 }
my_url <- "http://ritsokiguess.site/datafiles/mlogit.csv"
brandpref <- read_csv(my_url)
```

## The data (some)

```{r bLogistic-46 }
brandpref
```

## Bashing into shape

- `sex` and `brand` not meaningful as numbers, so turn them into factors:

```{r bLogistic-47 }
brandpref %>%
  mutate(
    sex = ifelse(sex == 1, "female", "male"),
    sex = factor(sex),
    brand = factor(brand)
  ) -> brandpref
```

```{r}
brandpref %>% count(sex)
```

## Fitting model

- We use `multinom` from package `nnet`. It works like `polr`:

```{r bLogistic-48 }
library(nnet)
levels(brandpref$sex)
brands.1 <- multinom(brand ~ age + sex, data = brandpref)
```

## Can we drop anything?

- Unfortunately `drop1` seems not to work:

```{r bLogistic-49, error=TRUE}
drop1(brands.1, test = "Chisq", trace = 0)
```

- So, fall back on fitting the model without what you want to test, and comparing using `anova`.

## Do age/sex help predict brand? 1/3

Fit models without each of age and sex:

```{r bLogistic-50 }
brands.2 <- multinom(brand ~ age, data = brandpref)
brands.3 <- multinom(brand ~ sex, data = brandpref)
```

## Do age/sex help predict brand? 2/3

```{r bLogistic-51}
anova(brands.2, brands.1)
anova(brands.3, brands.1)
```

## Do age/sex help predict brand? 3/3

- `age` definitely significant (second `anova`).
- `sex` significant also (first `anova`), though the P-value is less dramatic.
- Keep both.
- Expect to see a large effect of `age`, and a smaller one of `sex`.

## Another way to build the model

- Start from the model with everything and run `step`:

```{r}
#| echo = FALSE,
#| message = FALSE,
#| results = "hide"
brands.1 <- with(brandpref, multinom(brand ~ age + sex))
```

```{r bLogistic-52 }
step(brands.1, trace = 0)
```

- Final model contains both `age` and `sex`, so neither could be removed.

## Making predictions

Find the age five-number summary, and the two sexes:

```{r}
summary(brandpref)
```

Space the ages out a bit for prediction (see over).

## Combinations

```{r}
new <- datagrid(age = seq(24, 30, 2),
                sex = c("female", "male"),
                model = brands.1)
new
```

## The predictions

```{r bLogistic-54 }
cbind(predictions(brands.1, newdata = new)) %>%
  select(group, estimate, age, sex) %>%
  pivot_wider(names_from = group, values_from = estimate)
```

## Comments

- Young males prefer brand 1, but older males prefer brand 3.
- Females similar, but like brand 1 less and brand 2 more.
- A clear `brand` effect, but the `sex` effect is less clear.

## Making a plot

- I thought `plot_predictions` wouldn't work as we want, but I was (sort of) wrong about that:

```{r}
plot_predictions(brands.1,
                 condition = c("age", "brand", "sex"),
                 type = "probs")
```

## Making it go

- We have to include `group` in the `condition`:

```{r}
plot_predictions(brands.1, condition = c("age", "group"))
```

- This picks the most common `sex` in the data (females).
- See younger females preferring brand 1, older ones preferring brand 3.
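## How `multinom` gets probabilities

Behind the scenes, `multinom` fits one log-odds equation per non-baseline brand (brand 1 is the baseline here), and converts the linear predictors to probabilities with a softmax. A sketch with made-up log-odds values (not the fitted coefficients):

```{r}
# made-up log-odds relative to brand 1; the baseline has log-odds 0
eta <- c(brand1 = 0, brand2 = 0.5, brand3 = -1)
probs <- exp(eta) / sum(exp(eta)) # softmax: all positive, sums to 1
round(probs, 3)
```

With only two categories this reduces to the ordinary inverse logit, which is why binary logistic regression is a special case.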
## For each `sex`

If we add the other variable to the *end*, we get facets for `sex`:

```{r}
plot_predictions(brands.1, condition = c("age", "group", "sex"))
```

Not actually much difference between males and females.

## A better graph

- but the male-female difference *was* significant. How?
- *Don't* actually plot the graph; then plot the right things:

```{r}
plot_predictions(brands.1,
                 condition = c("age", "brand", "sex"),
                 type = "probs", draw = FALSE) %>%
  ggplot(aes(x = age, y = estimate, colour = group, linetype = sex)) +
  geom_line() -> g
```

## The graph

```{r, fig.height=4.5}
g
```

## Digesting the plot

- Brand vs. age: younger people (of both genders) prefer brand 1, but older people (of both genders) prefer brand 3. (Explains the significant age effect.)
- Brand vs. sex: females (solid) like brand 1 less than males (dashed), and like brand 2 more (for all ages).
- Not much brand difference between the genders (solid and dashed lines of the same colours are close), but enough to be significant.
- Model didn't include an interaction, so the modelled effect of gender on brand is the same for each age, and the modelled effect of age is the same for each gender. (See also later.)

## Alternative data format

Summarize all people of the same brand preference, same sex, and same age on one line of the data file, with a frequency on the end:

```{r}
brandpref
```

```
1 0 24 1
1 0 26 2
1 0 27 4
1 0 28 4
1 0 29 7
1 0 30 3
...
```

Whole data set in 65 lines, not 735! But how?

## Getting alternative data format

```{r bLogistic-60, warning=FALSE, message=FALSE}
brandpref %>%
  group_by(age, sex, brand) %>%
  summarize(Freq = n()) %>%
  ungroup() -> b
b
```

## Fitting models, almost the same

- Just have to remember `weights` to incorporate the frequencies.
- Otherwise `multinom` assumes you have just 1 obs on each line!
- Again turn the (numerical) `sex` and `brand` into factors:

```{r bLogistic-61, results="hide"}
b %>%
  mutate(sex = factor(sex)) %>%
  mutate(brand = factor(brand)) -> bf
b.1 <- multinom(brand ~ age + sex, data = bf, weights = Freq)
b.2 <- multinom(brand ~ age, data = bf, weights = Freq)
```

## P-value for `sex` identical

```{r bLogistic-62}
anova(b.2, b.1)
```

Same P-value as before, so we haven't changed anything important.

## Trying interaction between age and gender

```{r bLogistic-69, echo=F}
options(width = 60)
```

```{r bLogistic-70 }
brands.4 <- update(brands.1, . ~ . + age:sex)
anova(brands.1, brands.4)
```

- No evidence that the effect of age on brand preference differs for the two genders.

## Make the graph again

```{r}
plot_predictions(brands.4,
                 condition = c("age", "brand", "sex"),
                 type = "probs", draw = FALSE) %>%
  ggplot(aes(x = age, y = estimate, colour = group, linetype = sex)) +
  geom_line() -> g4
```

## Not much difference in the graph

```{r, fig.height=4.5}
g4
```

## Compare the model without interaction

```{r, fig.height=4.5}
g
```
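## What `anova` is doing

The interaction test above is a likelihood-ratio test: the difference in residual deviances between the two models, referred to a chi-squared distribution. A sketch of the computation with *made-up* deviance values (not the actual output):

```{r}
# hypothetical deviances for the smaller and larger models
dev_small <- 702.97 # made-up number
dev_big <- 700.79 # made-up number
lr_stat <- dev_small - dev_big
df <- 2 # age:sex adds one term for each of the 2 non-baseline brands
pchisq(lr_stat, df, lower.tail = FALSE) # large P-value: keep the smaller model
```

This is the same logic as the earlier `anova` comparisons for `age` and `sex`, just with different degrees of freedom.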