--- title: "Regression Beyond the Mean Worksheet, Part 1 of 2 (Lec 3)" output: html_notebook editor_options: chunk_output_type: inline --- ```{r} suppressPackageStartupMessages(library(tidyverse)) suppressPackageStartupMessages(library(broom)) suppressPackageStartupMessages(library(quantreg)) # For quantile regression ``` [(Link to Part 2 of the Worksheet)](https://ubc-mds.github.io/DSCI_562/lec4/worksheet.nb.html) This worksheet explores regression outside of estimating the conditional mean. We'll work with two data sets: Flu data: ```{r} flu <- read_csv("../data/flu-train.csv") %>% select(positive = PERCENT_POSITIVES, week = WEEK, year = YEAR) head(flu) ``` The horeshoe crab data from last time: ```{r} crab <- read_table("https://newonlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson07/crab/index.txt", col_names = FALSE) %>% select(-1) %>% setNames(c("colour","spine","width","weight","n_male")) %>% mutate(colour = factor(colour), spine = factor(spine)) head(crab) ``` ## Variance Regression We'll first investigate how to estimate the conditional variance. We'll first need the mean; let's use `loess()` to get at it: ```{r} mean_lo <- loess(positive ~ week, data = flu, span = 0.2) %>% predict() flu <- flu %>% mutate(mean_lo = mean_lo) p0 <- ggplot(flu, aes(week)) + geom_line(aes(y = positive, group = year), alpha=0.5) + theme_bw() + labs(x="Week", y="Percent Positives") (p <- p0 + geom_line(aes(y = mean_lo), colour = "blue", size = 1)) ``` Calculate the conditional variance for each row of the data, also using `loess()`. ```{r} var_lo <- loess((positive - mean_lo)^2 ~ week, data = flu, span = 0.2) %>% predict() flu <- flu %>% mutate(var_lo = var_lo, sd_lo = sqrt(var_lo)) flu_unique <- flu %>% group_by(week) %>% summarize(mean_lo = unique(mean_lo), var_lo = unique(var_lo), sd_lo = unique(sd_lo)) ``` Here's a plot of the model function: ```{r} ggplot(flu, aes(week, (mean_lo - positive)^2)) + geom_point(alpha=0.2) + theme_bw() + ylab("Squared Residuals") + geom_line(aes(y = var_lo), colour = "blue", size = 1) ``` What would a 95% prediction interval be, if we were to assume the conditional distributions are Gaussian? It's a bad assumption, but let's try it anyway. Plot the interval as a ribbon. ```{r} p + geom_ribbon( data = flu_unique, mapping = aes( ymin = mean_lo - 1.96 * sqrt(var_lo), ymax = mean_lo + 1.96 * sqrt(var_lo) ), alpha = 0.1, fill = "blue", colour = "blue" ) ``` **NEW** Let's assume a Beta distribution instead. Its parameters are alpha and beta, which we can compute from the mean and variance: ```{r} flu_unique <- flu_unique %>% mutate( alpha = ((1-mean_lo)/var_lo - 1/mean_lo) * mean_lo^2, beta = alpha * (1/mean_lo - 1) ) p + geom_ribbon( data = flu_unique, mapping = aes( ymin = qbeta(0.025, alpha, beta), ymax = qbeta(0.975, alpha, beta) ), alpha = 0.1, fill = "blue", colour = "blue" ) ``` Questions: 1. If you assume homoskedasticity, how would you estimate variance then? > Just take the mean of all squared residuals (=null model). 2. If you assume the variance increases linearly, how would you estimate variance then? - Why might this not be a good idea for this dataset, besides the trend not looking linear? > Linear regression. Not a good idea, because the data are actually cyclic (the predictor space loops around). 3. If we fit an Exponential regression model, would it make sense to estimate the variance in this way? 
## Quantile Regression

### No Assumption on the Model Function

Use the above estimates of mean and variance, together with a Gaussian assumption, to plot the 0.75-quantile regression model function.

```{r}
flu <- flu %>% 
  mutate(q75_gauss = qnorm(0.75, mean = mean_lo, sd = sd_lo))
p0 + geom_line(
  data = flu,
  colour = "blue", size = 1,
  mapping = aes(y = q75_gauss)
)
```

Let's try doing the same thing without making any assumptions. But first, let's gain familiarity with the `quantile()` function: estimate the 0.75-quantile model function for the null model.

```{r}
quantile(flu$positive, probs = 0.75)
```

Now for regression across `week`. We'll just use a "by-hand" local method: a moving window with a radius of less than one week, which here amounts to grouping the observations by week.

```{r}
flu <- flu %>% 
  group_by(week) %>% 
  mutate(q75_local = quantile(positive, probs = 0.75)) %>% 
  ungroup()  # drop the grouping so later operations aren't affected
p0 + geom_line(
  data = flu,
  colour = "blue", size = 1,
  mapping = aes(y = q75_local)
)
```

### Assumption on the Model Function; Distributional Assumption

Let's use the horseshoe crab data loaded at the start of this worksheet. Recall that last time we fit a Poisson regression model with a log link:

```{r}
crab <- glm(n_male ~ width, data = crab, family = poisson) %>% 
  augment(type.predict = "response") %>% 
  select(n_male, width, mean = .fitted)
p_crab <- ggplot(crab, aes(width, n_male)) +
  geom_point(alpha = 0.25) +
  theme_bw() +
  labs(x = "Carapace Width", y = "# Nearby Males")
p_crab + geom_line(aes(y = mean), colour = "blue", size = 1)
```

Use this model (under the Poisson assumption) to plot the 0.75-quantile model function.

```{r}
crab <- crab %>% 
  mutate(q75_glm = qpois(0.75, lambda = mean))
p_crab + geom_line(data = crab, mapping = aes(y = q75_glm), colour = "blue", size = 1)
```

### Assumption on the Model Function; No Distributional Assumption

For this, we will need to turn quantile regression into an optimization problem. We'll discuss this in Lecture 4; a preview sketch follows.
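As a preview, here is a minimal sketch using `rq()` from the quantreg package (loaded at the top of this worksheet), which fits a quantile regression model by minimizing the pinball (check) loss, the optimization problem Lecture 4 covers. For illustration, this assumes a model function that is linear in `width`:

```{r}
# Fit a linear 0.75-quantile regression of n_male on width;
# rq() minimizes the pinball (check) loss rather than squared error.
fit_rq <- rq(n_male ~ width, tau = 0.75, data = crab)
crab <- crab %>% 
  mutate(q75_rq = predict(fit_rq))
p_crab + geom_line(data = crab, mapping = aes(y = q75_rq), colour = "blue", size = 1)
```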