---
title: "Diagnostics for Logistic Regression"
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: false
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
engine: knitr
knitr:
  opts_chunk:
    fig.align: 'center'
    # fig.width: 6
    # fig.height: 4
    message: FALSE
    cache: false
---

```{r setup}
knitr::opts_chunk$set(echo = TRUE, cache = FALSE)
library(knitr)
library(ggplot2)
library(tidyverse)
```

## Logistic Regression with a Quadratic Term

Suppose the linear predictor in a logistic regression model has the following form:

$$
\eta = \beta_0 + \beta_1 x_1^2 + \beta_2 x_2
$$

We set $\beta_0 = -1$, $\beta_1 = 3$, and $\beta_2 = 5$, and generate a dataset of $n = 1000$ observations from this model, where $x_1$ and $x_2$ are independent standard normal random variables. The response $y_i$ is drawn from a Bernoulli distribution with success probability $p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}}$; no noise is added to the linear predictor.

```{r}
#| eval: TRUE
### Generate data

## Set seed for reproducibility
set.seed(1017)

## Set parameters
n_1 <- 1000
beta_0 <- -1
beta_1 <- 3
beta_2 <- 5

## Generate predictors and response
x_1 <- rnorm(n_1)
x_2 <- rnorm(n_1)
eta <- beta_0 + beta_1 * x_1^2 + beta_2 * x_2
p <- exp(eta) / (1 + exp(eta))
y <- rbinom(n_1, 1, p)
```

### Binned deviance residuals against the predictor $X_1$, assuming the true model is linear

Now we fit a logistic regression model to this dataset under the (incorrect) assumption that the linear predictor is linear in both covariates, i.e., $\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2$. We fit the model using the `glm` function.
```{r}
#| eval: TRUE
### Model the systematic component as a linear function of the predictors
linear_glm <- glm(y ~ x_1 + x_2, family = binomial)
summary(linear_glm)
```

```{r}
#| eval: TRUE
## Collect the data, fitted linear predictor, fitted probabilities,
## and deviance residuals in one data frame
data_1 <- data.frame(x_1, x_2, y,
                     linpred = predict(linear_glm),
                     predprob = predict(linear_glm, type = "response"),
                     devres = residuals(linear_glm))

### Deviance residuals against the predictor X_1
ggplot(data_1) +
  geom_point(mapping = aes(x = x_1, y = devres)) +
  labs(x = "Predictor X_1", y = "Deviance residuals",
       title = "Deviance residuals against X_1 when the model is wrong") +
  theme_minimal()
```

```{r}
#| eval: TRUE
### Binned deviance residuals against the predictor X_1
data_1 |>
  group_by(cut(x_1, breaks = unique(quantile(x_1, (1:40)/41)))) |>
  summarize(devres = mean(devres), x_1 = mean(x_1)) |>
  ggplot() +
  geom_point(mapping = aes(x = x_1, y = devres)) +
  labs(x = "Predictor X_1", y = "Binned deviance residual",
       title = "Binned deviance residuals against X_1 when the model is wrong") +
  theme_minimal()
```

The binned plot shows a clear quadratic pattern in the deviance residuals as a function of $x_1$, which indicates that the linear predictor is not linear in $x_1$: the model is misspecified.

### Binned deviance residuals against the quadratic term $X_1^2$, assuming the true model is quadratic

Now we fit a logistic regression model that matches the data-generating process, i.e., $\eta = \beta_0 + \beta_1 x_1^2 + \beta_2 x_2$, again using the `glm` function.
```{r}
#| eval: TRUE
### Model the systematic component as a quadratic function of x_1
quadratic_glm <- glm(y ~ I(x_1^2) + x_2, family = binomial)
summary(quadratic_glm)
```

```{r}
#| eval: TRUE
## Collect the data, fitted linear predictor, fitted probabilities,
## and deviance residuals in one data frame
data_2 <- data.frame(x_1, x_2, y,
                     linpred = predict(quadratic_glm),
                     predprob = predict(quadratic_glm, type = "response"),
                     devres = residuals(quadratic_glm))

### Deviance residuals against the quadratic term X_1^2
ggplot(data_2) +
  geom_point(mapping = aes(x = x_1^2, y = devres)) +
  labs(x = "Quadratic term X_1^2", y = "Deviance residuals",
       title = "Deviance residuals against X_1^2 when the model is correct") +
  theme_minimal()
```

```{r}
#| eval: TRUE
### Binned deviance residuals against the quadratic term X_1^2
data_2 |>
  mutate(x_1_sq = x_1^2) |>
  group_by(cut(x_1_sq, breaks = unique(quantile(x_1_sq, (1:40)/41)))) |>
  summarize(devres = mean(devres), x_1_sq = mean(x_1_sq)) |>
  ggplot() +
  geom_point(mapping = aes(x = x_1_sq, y = devres)) +
  labs(x = "Quadratic term X_1^2", y = "Binned deviance residual",
       title = "Binned deviance residuals against X_1^2 when the model is correct") +
  theme_minimal()
```

### Binned deviance residuals against the fitted linear predictor $\hat{\eta}$, assuming the true model is quadratic

```{r}
#| eval: TRUE
### Deviance residuals against the fitted linear predictor
ggplot(data_2) +
  geom_point(mapping = aes(x = linpred, y = devres)) +
  labs(x = "Linear predictor", y = "Deviance residuals",
       title = "Deviance residuals against the linear predictor when the model is correct") +
  theme_minimal()
```

```{r}
#| eval: TRUE
### Binned deviance residuals against the fitted linear predictor
data_2 |>
  group_by(cut(linpred, breaks = unique(quantile(linpred, (1:40)/41)))) |>
  summarize(devres = mean(devres), linpred = mean(linpred)) |>
  ggplot() +
  geom_point(mapping = aes(x = linpred, y = devres)) +
  labs(x = "Linear predictor", y = "Binned deviance residual",
       title = "Binned deviance residuals against the linear predictor when the model is correct") +
  theme_minimal()
```

### Binned deviance residuals against the fitted linear predictor $\hat{\eta}$, assuming the true model is linear

```{r}
#| eval: TRUE
### Deviance residuals against the fitted linear predictor
ggplot(data_1) +
  geom_point(mapping = aes(x = linpred, y = devres)) +
  labs(x = "Linear predictor", y = "Deviance residuals",
       title = "Deviance residuals against the linear predictor when the model is wrong") +
  theme_minimal()
```

```{r}
#| eval: TRUE
### Binned deviance residuals against the fitted linear predictor
data_1 |>
  group_by(cut(linpred, breaks = unique(quantile(linpred, (1:40)/41)))) |>
  summarize(devres = mean(devres), linpred = mean(linpred)) |>
  ggplot() +
  geom_point(mapping = aes(x = linpred, y = devres)) +
  labs(x = "Linear predictor", y = "Binned deviance residual",
       title = "Binned deviance residuals against the linear predictor when the model is wrong") +
  theme_minimal()
```

### Scatter plot of logit(binned $Y$) against $X_1$, binned by $X_1^2$

```{r}
#| eval: TRUE
## Bin by quantiles of x_1^2, then plot the empirical logit of each bin
## against the bin mean of x_1
data_1 |>
  mutate(x_1_sq = x_1^2) |>
  group_by(cut(x_1_sq, breaks = unique(quantile(x_1_sq, (1:40)/41)))) |>
  summarize(y = mean(y), x_1_sq = mean(x_1_sq), x_1_mean = mean(x_1)) |>
  ggplot() +
  geom_point(mapping = aes(x = x_1_mean, y = log(y / (1 - y)))) +
  labs(x = "X_1", y = "logit(Y)",
       title = "Scatter plot of logit(binned Y) and X_1") +
  theme_minimal() +
  ylim(-2, 5)
```

### Scatter plot of logit(binned $Y$) and $\hat{\eta}$

```{r}
#| eval: TRUE
## Empirical logit of y, binned by the fitted linear predictor (wrong model)
data_1 |>
  group_by(cut(linpred, breaks = unique(quantile(linpred, (1:40)/41)))) |>
  summarize(y = mean(y), linpred = mean(linpred)) |>
  ggplot() +
  geom_point(mapping = aes(x = linpred, y = log(y / (1 - y)))) +
  labs(x = "Linear predictor", y = "logit(Y)",
       title = "Scatter plot of logit(binned Y) and the linear predictor when the model is wrong") +
  theme_minimal() +
  ylim(-3, 5)
```

```{r}
#| eval: TRUE
## Empirical logit of y, binned by the fitted linear predictor (correct model)
data_2 |>
  group_by(cut(linpred, breaks = unique(quantile(linpred, (1:40)/41)))) |>
  summarize(y = mean(y), linpred = mean(linpred)) |>
  ggplot() +
  geom_point(mapping = aes(x = linpred, y = log(y / (1 - y)))) +
  labs(x = "Linear predictor", y = "logit(Y)",
       title = "Scatter plot of logit(binned Y) and the linear predictor when the model is correct") +
  theme_minimal() +
  ylim(-3, 5)
```

### Scatter plot of logit(binned $Y$) and $X_1$

```{r}
#| eval: TRUE
## Empirical logit of y, binned by quantiles of x_1
data_1 |>
  group_by(cut(x_1, breaks = unique(quantile(x_1, (1:40)/41)))) |>
  summarize(y = mean(y), x_1 = mean(x_1)) |>
  ggplot() +
  geom_point(mapping = aes(x = x_1, y = log(y / (1 - y)))) +
  labs(x = "X_1", y = "logit(Y)",
       title = "Scatter plot of logit(binned Y) and X_1") +
  theme_minimal() +
  ylim(-2, 5)
```

### Further exploration

Harrell, F. E., Jr. (2016). *Regression modeling strategies*. Springer International Publishing. [link](https://warin.ca/ressources/books/2015_Book_RegressionModelingStrategies.pdf), p. 235:

Several types of residuals can be computed for binary logistic model fits. Many of these residuals are used to examine the influence of individual observations on the fit. The partial residual can be used to assess directly how each predictor should be transformed. For the $i$th observation, the partial residual for the $j$th element of $X$ is defined by

$$
r_{ij} = \hat\beta_j X_{ij} + \frac{Y_i - \hat{p}_i}{\hat{p}_i (1 - \hat{p}_i)}
$$

where $X_{ij}$ is the value of the $j$th variable in the $i$th observation, $Y_i$ is the corresponding value of the response, and $\hat{p}_i$ is the predicted probability that $Y_i = 1$. A smooth plot (using, e.g., loess) of $X_{ij}$ against $r_{ij}$ provides an estimate of how $X_j$ should be transformed, adjusting for the other $X$s (using their current transformations).
```{r}
#| eval: TRUE
### Partial residual plot for the predictor X_1 (wrong model)
data_1 |>
  mutate(partial_res = coef(linear_glm)[["x_1"]] * x_1 +
           (y - predprob) / (predprob * (1 - predprob))) |>
  ggplot() +
  geom_point(mapping = aes(x = x_1, y = partial_res)) +
  geom_smooth(mapping = aes(x = x_1, y = partial_res), method = "loess") +
  theme_minimal()
```

```{r}
#| eval: TRUE
### Partial residual plot for the quadratic term X_1^2 (correct model)
data_2 |>
  mutate(partial_res = coef(quadratic_glm)[["I(x_1^2)"]] * x_1^2 +
           (y - predprob) / (predprob * (1 - predprob))) |>
  ggplot() +
  geom_point(mapping = aes(x = x_1^2, y = partial_res)) +
  geom_smooth(mapping = aes(x = x_1^2, y = partial_res), method = "loess") +
  theme_minimal()
```
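The graphical diagnostics above can be complemented by a formal comparison of the two specifications. The sketch below is illustrative and not part of the original analysis: it regenerates the simulated data so the chunk is self-contained, then tests the linear fit against a nested model that adds the quadratic term, using a likelihood-ratio (analysis-of-deviance) test and AIC. The object name `full_glm` is an assumption introduced here.

```{r}
#| eval: TRUE
## Illustrative check: does adding the quadratic term significantly
## improve the fit? Regenerate the simulated data first.
set.seed(1017)
n_1 <- 1000
x_1 <- rnorm(n_1)
x_2 <- rnorm(n_1)
eta <- -1 + 3 * x_1^2 + 5 * x_2
p <- exp(eta) / (1 + exp(eta))
y <- rbinom(n_1, 1, p)

linear_glm <- glm(y ~ x_1 + x_2, family = binomial)
full_glm <- glm(y ~ x_1 + I(x_1^2) + x_2, family = binomial)  # nests the linear fit

## Likelihood-ratio test for the quadratic term
anova(linear_glm, full_glm, test = "Chisq")

## AIC comparison of the two fits
AIC(linear_glm, full_glm)
```

Because the data were generated with a strong quadratic effect ($\beta_1 = 3$), the deviance drop should be large and the quadratic term clearly favored; note that, unlike `y ~ I(x_1^2) + x_2`, the model `full_glm` nests the linear fit, which is what the likelihood-ratio test requires.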