---
title: "CH 13: Logistic Regression"
output: pdf_document
---

\renewcommand{\vec}[1]{\mathbf{#1}}

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.height = 4, fig.width = 6, fig.align = 'center')
library(tidyverse) 
library(rstanarm)
library(rstantools)
set.seed(11062020)
```

### Motivation

Let's assume that we have access to the underlying candy face off data. 

\vfill

Consider the following model:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
\vfill

\vfill

__Q:__ What issues might we have with this model?

\vfill

__Q:__ What are some possible solutions?

\vfill

\newpage

Logistic regression is a special case of 

\vfill

### Logistic Regression

The logistic function maps an input from the unit range (0,1) to the real line:

$$logit(x) = \log \left(\frac{x}{1-x}\right)$$

\vfill


\vfill

The `qlogis` (for logit) and `plogis` (inverse-logit) functions in R can be used for this calculation. For instance `plogis(1) =` `r plogis(1)`.

\vfill

Formally, the inverse-logistic function is used as part of the GLM:

\vfill
\vfill

\newpage

Recall the `beer` dataset, but now instead of trying to model consumption, lets consider whether a day is a weekday or weekend.

```{r, message = F}
beer <- read_csv('http://math.montana.edu/ahoegh/Data/Brazil_cerveja.csv') %>% mutate(consumed = consumed - mean(consumed))
```


```{r, message = F}
beer %>% ggplot(aes(y = weekend, x = consumed)) + 
  geom_point(alpha = .1) + 
  geom_smooth(formula = 'y~x', method = 'lm', se =F) + 
  geom_smooth(formula = 'y~x', method = 'loess', color = 'red', se = F) + 
  geom_rug() + ggtitle('Weekend vs. Consumption: comparing lm and loess') + 
  theme_bw() + xlab('Difference in consumption from average daily consumption (L)')
```
\vfill

```{r}
bayes_logistic <- stan_glm(weekend ~ consumed, data = beer,
                           family = binomial(link = "logit"), refresh = 0)
```

\vfill

```{r}
freq_logistic <- glm(weekend ~ consumed, data = beer,
                           family = binomial(link = "logit"))
```

\vfill

\newpage

Now how to  interpret the model coefficients? 

```{r}
bayes_logistic
```
\vfill
```{r}
summary(freq_logistic)
```

\newpage

Interpreting the coefficients can be challenging due to the non-linear relationship between the outcome and the predictors. 

### Predictive interpretation

One way to interpret the coefficients is in a predictive standpoint.  For instance, consider an day with average consumption, then the probability of a weekend would be `invlogit(-1.2 + 0.3 * 0) =` `r round(plogis(-1.2),2)`, where as the probability of a day with 10 more liters of consumption (relative to an average day) would have a weekend probability of `invlogit(-1.2 + 0.3 * 10) =` `r round(plogis(-1.2 + 0.3 * 10),2)`

\vfill

Of course, we should always think about uncertainty, so we can extract simulations from the model. 

\vfill

`posterior_linpred` was useful with regression
```{r}
new_data <- data.frame(consumed = c(0,10))
posterior_sims <- posterior_linpred(bayes_logistic, newdata = new_data)
summary(posterior_sims)
```

\vfill

```{r}
posterior_sims <- posterior_epred(bayes_logistic, newdata = new_data)
summary(posterior_sims)
```

\newpage

It can also be useful to consider predictions of an individual data point.

```{r}
new_obs <- posterior_predict(bayes_logistic, newdata = new_data)
head(new_obs)
colMeans(new_obs)
```

### Model Comparison

We can use cross validation in the same manner a standard linear models.

```{r}
loo(bayes_logistic)

temp_model <- stan_glm(weekend~max_tmp, data = beer, refresh=0)
loo(temp_model)
```