---
title: "Logistic Regression"
output: html_notebook
---

[Packages](#packages)
[Background](#background)
[Example 1](#example-1)
[Example 2](#example-2)
[Example 3](#example-3)
[Example 4](#example-4)

### Packages

```{r}
library(ggplot2)
library(dplyr)
library(pander)
library(ggthemes)
```

### Background

Our logistic regression model allows us to find the relationship between a binary outcome measure (0 vs 1) and a set of predictors. Previously we used linear regression, which utilizes the ordinary least squares (OLS) approach to estimate model parameters. In contrast, the logistic regression of y on a set of predictors produces parameter estimates through the maximum likelihood method. Specifically, logistic regression models the logit-transformed probability as a linear combination of the predictor variables.

Here's our logistic regression equation:

$$
\log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p - 1} x_{p - 1}
$$

The left side of the equation is the logit (or log odds). Here $p_i$ is the probability that $y_i = 1$, and the logit is the natural logarithm of the odds, $p_i/(1-p_i)$. The right side is the linear combination of the predictors.

$$
\frac{p({\bf x})}{1 - p({\bf x})} = \frac{P[Y = 1 \mid {\bf X} = {\bf x}]}{P[Y = 0 \mid {\bf X} = {\bf x}]}
$$

$$
\text{logit}(x) = \log\left(\frac{x}{1 - x}\right)
$$

To calculate the predicted logit for data point $i$, substitute that observation's values of the independent variables and the sample coefficient estimates into the logistic regression equation.

$$
\text{logit}_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_{p - 1} x_{(p - 1)i}
$$

The predicted probability for data point $i$ can then be calculated with the inverse logit (i.e., the logistic function), which outputs values between 0 and 1.

$$
p_i = \frac{e^{\text{logit}_i}}{1 + e^{\text{logit}_i}}
$$

### Example 1

OK, let's implement this in R using the Motor Trend Car Road Tests dataset (`mtcars`). We will be focusing on the `vs` variable (engine configuration: 0 = V-shaped, 1 = straight) and `mpg` (miles per gallon).

```{r}
data(mtcars)
head(mtcars)
```

Fit both a linear and a logistic regression.

```{r}
# linear regression
fit_lm <- lm(vs ~ mpg, data = mtcars)

# logistic regression
fit_glm <- glm(vs ~ mpg, data = mtcars, family = binomial)

# Note: glm() is a generalization of lm(). Ordinary linear regression can also be fit with
# glm(vs ~ mpg, data = mtcars), since the family argument defaults to gaussian.
```

Let's overlay the response curves of these two models.

```{r}
# Create a scatter plot of vs against mpg
plot(vs ~ mpg, data = mtcars, pch = 20,
     ylab = "Estimated Probability",
     main = "Ordinary vs Logistic Regression")

# Overlay a straight line using the linear regression model object
abline(fit_lm, col = "darkorange")

# Overlay a curve using the estimated logistic regression model.
# Create a vector of length 100 with values ranging from the minimum to the maximum MPG
x_pred <- seq(from = min(mtcars$mpg), to = max(mtcars$mpg), length.out = 100)
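# A quick sanity check (sketch; logit_pred and manual_probs are illustrative names, not part of
# the original chunk): the same probabilities can be computed by hand with the inverse-logit
# formula from the Background section. They should match the predict() output computed next.
logit_pred   <- coef(fit_glm)[1] + coef(fit_glm)[2] * x_pred   # predicted logits
manual_probs <- exp(logit_pred) / (1 + exp(logit_pred))        # inverse logit -> probabilities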
# Calculate predicted probabilities using our logistic regression model object. Specifying
# type = "response" returns predictions on the response scale, i.e. predicted probabilities.
y_probs <- predict(fit_glm, data.frame(mpg = x_pred), type = "response")
lines(x_pred, y_probs, col = "dodgerblue", lty = 2)

legend("topleft", c("Ordinary", "Logistic", "Data"),
       lty = c(1, 2, 0), pch = c(NA, NA, 20), lwd = 2,
       col = c("darkorange", "dodgerblue", "black"))

# Here's how to plot the logistic regression curve in ggplot:
# ggplot(mtcars, aes(x = mpg, y = vs)) + geom_point() +
#   stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)
```

Model output:

```{r}
summary(fit_glm)
```

- The coefficient estimates are in logit units.
- Deviance is a measure of goodness of fit: higher deviance indicates worse fit.
- R reports two forms of deviance, the null deviance and the residual deviance. The null deviance shows how well the response variable is predicted by a model that includes only the intercept (grand mean); the residual deviance shows how well it is predicted by the model that includes the predictors.
- If your null deviance is really small, it means that the null model explains the data pretty well. Likewise with your residual deviance.
- The Fisher scoring algorithm is related to the maximum likelihood method used to fit this model. In this case it took 6 iterations to perform the fit.

### Example 2

Here we'll be using some navigation data from my lab.

- `withdrew`: our outcome measure, telling us whether or not the participant dropped out of the study
- `gender`
- `gameEXP`: a measure of video game experience for each participant

```{r}
d <- read.csv("https://raw.githubusercontent.com/jdstokes/PSC204b/master/data/YSPdata.csv",
              header = TRUE, na.strings = "")

gameEXP <- data.frame(d$ExprVG, d$withdrew, d$Sex)
colnames(gameEXP) <- c("gameEXP", "withdrew", "gender")
gameEXP <- gameEXP[which(!is.na(gameEXP$gameEXP) & gameEXP$withdrew != -99), ]
gameEXP$gender[gameEXP$gender == 1] <- "male"
gameEXP$gender[gameEXP$gender == 0] <- "female"
```

Logistic regression with no predictors.

```{r}
model1 <- glm(withdrew ~ 1, family = binomial(link = "logit"), data = gameEXP)
round(coef(model1), 3)
```

Our model has an estimated intercept of -0.651 logits. Let's use our inverse logit function from above to convert the intercept to a probability.

```{r}
intercept.logit <- as.vector(coef(model1))
odds <- exp(intercept.logit)
prob <- odds / (1 + odds)
prob
```

What is this? Well, our $p$ ends up being the overall probability of withdrawing from the study. Let's double-check this.

```{r}
mean(gameEXP$withdrew)
```

### Example 3

Now let's run a logistic regression with just a single dichotomous predictor, gender.

```{r}
# Frequency table and more
T <- gameEXP %>%
  group_by(gender, withdrew) %>%
  summarise(freq = n()) %>%
  mutate(all = sum(freq), prob = freq / all, odds = prob / (1 - prob), logodds = log(odds))
pander(T)
```

Let's rerun our logistic regression model and include a binary predictor, gender.

```{r}
model1 <- glm(withdrew ~ gender, family = binomial(link = "logit"), data = gameEXP)
summary(model1)
```

Let's convert our gender coefficient from the logit scale to an odds ratio. The odds ratio is interpreted as the multiplicative effect of a one-unit change in X on the predicted odds, all other variables held constant. In other words, it is (the odds if this variable is incremented by 1) / (the odds if it is not), the multiplicative factor that moves you from odds(x) to odds(x + 1).

```{r}
exp(coef(model1)[2])
```

So our model is telling us that the odds of males dropping out of the study are about 3.1 times the odds for females.
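It can also be useful to put an interval around this odds ratio. Here is a quick sketch (not part of the original analysis): exponentiate the coefficient together with its profile-likelihood confidence interval from `confint()`.

```{r}
# Odds ratio for gender with a 95% profile-likelihood confidence interval (illustrative sketch)
exp(cbind(OR = coef(model1), confint(model1)))
```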
Let's check this by calculating the odds ratio ourselves.

```{r}
odds_WD_male <- (8/17) / (9/17)
odds_WD_male
odds_WD_female <- (4/18) / (14/18)
odds_WD_female
odds_ratio <- odds_WD_male / odds_WD_female
# Remember: an odds ratio of 1 means the odds are the same in both groups (i.e., no effect)
odds_ratio
```

To describe this effect in terms of probability, we can evaluate our model for males and females separately.

```{r}
coefs <- coef(model1)
names(coefs) <- NULL
predMale_logit   <- coefs[1] + 1 * coefs[2]
predFemale_logit <- coefs[1] + 0 * coefs[2]
predMale_prob   <- exp(predMale_logit) / (1 + exp(predMale_logit))
predFemale_prob <- exp(predFemale_logit) / (1 + exp(predFemale_logit))
prob_diff <- predMale_prob - predFemale_prob
prob_diff
```

For reference, here are the exponentiated coefficients (the baseline odds for females and the odds ratio for males).

```{r}
exp(coefs)
```

Being male corresponds to a 0.248 increase in the probability of withdrawing from the study.

```{r}
df_probs <- data.frame(gender = c("male", "female"),
                       probability = c(predMale_prob, predFemale_prob))

df_probs %>%
  ggplot(aes(y = probability, x = gender)) +
  ylim(0, 1) +
  ylab("Probability of withdrawal") +
  geom_hline(yintercept = df_probs$probability[1], linetype = "dashed") +
  geom_hline(yintercept = df_probs$probability[2], linetype = "dashed") +
  scale_x_discrete("Gender") +
  geom_segment(aes(x = 1.5, y = probability[1], xend = 1.5, yend = probability[2]),
               arrow = arrow(length = unit(0.3, "cm"), ends = "both"), color = "blue") +
  geom_point(size = 5) +
  annotate("text", x = 1.6, y = mean(df_probs$probability),
           label = toString(round(prob_diff, 3)), color = "blue") +
  theme_base()
```

### Example 4

Logistic regression with a single continuous predictor variable, gaming experience (1-5; 1 = low, 5 = high).

```{r}
# model for game experience
model2 <- glm(withdrew ~ gameEXP, family = binomial(link = "logit"), data = gameEXP)

# Output model coefficients etc.
summary(model2)
```

Odds

```{r}
# Convert the game experience logit coefficient to an odds ratio by exponentiating it
odds_ratio <- exp(coef(model2)[2])
odds_ratio

# Interpretation: for every one-unit increase in game experience, the odds of withdrawal are
# multiplied by this odds ratio (a value below 1 means the odds of withdrawing decrease).
```

Probability

```{r}
# We can evaluate how the probability differs with a unit difference in x near the central value (Gelman & Hill)
xVals <- seq(1, 5, by = 0.25)                               # Generate x values
logit <- coef(model2)[1] + coef(model2)[2] * xVals          # Logit predictions
oddsWithdraw <- exp(logit)                                  # Exponentiate to get the odds of dropout from the log odds
probWithdraw <- oddsWithdraw / (1 + oddsWithdraw)           # odds / (1 + odds) = probability
logOddsWithdraw <- log(probWithdraw / (1 - probWithdraw))   # Transform back to the logit scale, just to show the round trip

predict(model2, newdata = data.frame(gameEXP = xVals), type = "response")

# Let's subtract the predicted probability at 3 units of experience from the predicted
# probability at 2 units of experience
probWithdraw[5] - probWithdraw[9]

# Interpretation: for a one-unit increase in game experience, we'd expect about a
# 16 percentage-point difference in the probability of leaving the study
```

Finally, let's check overall model fit and then compare the probability, odds, and log-odds scales.

```{r}
# We can see how well the model fits: the difference between the null deviance and the residual deviance
with(model2, null.deviance - deviance)
# Difference in degrees of freedom
with(model2, df.null - df.residual)
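# A quick sketch (null_fit is an illustrative name, not part of the original analysis):
# refitting the intercept-only model and calling anova() runs the same likelihood-ratio
# (chi-square) comparison in one step.
null_fit <- glm(withdrew ~ 1, family = binomial, data = gameEXP)
anova(null_fit, model2, test = "Chisq")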
# We can also use a chi-square test on the deviance drop directly to find out whether our
# model fits better than the null. This returns a p value.
with(model2, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))

par(mfrow = c(1, 3))

# plot probabilities
plot(gameEXP$gameEXP, gameEXP$withdrew,
     main = "Probability: VR-Sickness -- Gaming Exp",
     xlab = "Gaming Experience", ylab = "Probability of Withdrawal")
points(xVals, probWithdraw, pch = 16, col = "green")
abline(h = 0.5, col = "blue")
abline(v = xVals[which.min(abs(probWithdraw - 0.5))], col = "blue")

# plot odds
plot(xVals, oddsWithdraw, pch = 16, col = "red",
     main = "Odds: VR-Sickness -- Gaming Exp",
     xlab = "Gaming Experience", ylab = "Odds of Withdrawal", ylim = c(0, 3))
abline(h = 1, col = "blue")
abline(v = xVals[which.min(abs(oddsWithdraw - 1))], col = "blue")

# plot log odds
plot(xVals, logOddsWithdraw, pch = 16, col = "black",
     main = "Linear Logit Transform (Gaming Exp)",
     xlab = "Gaming Experience", ylab = "Log-Odds of Withdrawal", ylim = c(-2.5, 1.5))
abline(h = 0, col = "blue")
abline(v = xVals[which.min(abs(logOddsWithdraw))], col = "blue")
```
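As in Example 1, the fitted probability curve can also be drawn with ggplot2. This is just a sketch: `stat_smooth()` refits the same binomial GLM internally rather than reusing `model2`, and the axis labels are my own.

```{r}
# ggplot2 version of the fitted logistic curve for game experience (sketch)
ggplot(gameEXP, aes(x = gameEXP, y = withdrew)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  labs(x = "Gaming Experience", y = "Probability of Withdrawal") +
  theme_base()
```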