#' ---
#' title: "The Bivariate Linear Model"
#' layout: lab
#' permalink: /lab-scripts/lab-3/
#' filename: bivariate-ols.R
#' active: lab-scripts
#' abstract: "This lab scripts introduces students to the basic linear model in
#' its bivariate form. Rather than use real-world data, we'll be using fake data
#' we can create. Let this underscore that much of what we know about the linear
#' model and what it tells us comes from simulation. Here, we can play God. We
#' are not afforded that luxury in the real world, but the tools we use come 
#' from situations where we can cosplay as the Almighty with respect to a 
#' population we created."
#' output:
#'    md_document:
#'      variant: gfm
#'      preserve_yaml: TRUE
#' ---

#+ setup, include=FALSE
knitr::opts_chunk$set(collapse = TRUE, 
                      fig.path = "figs/lab-3/",
                      cache.path = "cache/lab-3/",
                      fig.width = 11,
                      comment = "#>")
#+


#' ## R Packages/Data for This Session
#' 
#' You should've already installed the R packages for this lab session. 
#' `{tidyverse}` will be for all things workflow and that's all we'll be using 
#' today.

library(tidyverse)

#' For those of you wanting the packages we'll be using today without loading 
#' all of the {tidyverse}, run this. I won't, but here are the component
#' packages.

# library(dplyr)    # the workhorse
# library(magrittr) # for the pipe
# library(tibble)   # for tibbles
# library(ggplot2)  # for plots

#' We'll run one other command, because you may know how I feel by now about
#' `{ggplot2}`'s default theme. I have `{stevethemes}` and I'd really like to 
#' use it here, but we want to stay strictly within `{tidyverse}` today. So, 
#' we'll do this.

theme_set(theme_minimal())

#' Cool beans. Let's get going.
#' 
#' ## Creating Fake Data to Learn About the Linear Model
#' 
#' The lecture uses actual (toy) data from the World Bank to introduce you to the
#' linear model in the bivariate case. Now, I think it might be time for you to 
#' get it in the completely "fabricated" sense. In this lab script, you are going 
#' to create data that you will use to learn more about what the linear model 
#' with its OLS estimator is telling you (by way of `lm()` output).
#' 
#' Take careful inventory of what's going to happen here. We're going to set a 
#' reproducible seed, and then we're going to create two new "random" variables, 
#' called `x` and `e` from a normal distribution.
#' 
set.seed(8675309) # Jenny, I got your number...
tibble(x = rnorm(100),
       e = rnorm(100)) -> Fake

Fake

#' `x` is going to be our independent variable, which is exogenously created 
#' prior to an observed outcome. `e` is going to be our error term, which 
#' includes everything not included in the linear model. The presence of an error 
#' term is an important assumption in OLS, and notice here that the error term is 
#' created *independently from the x variable*. Yes, they're in the same data set, 
#' and yes they were created by the same command. However, the generation of *x* 
#' and *e* are simultaneously created, but completely independent from each other. 
#' If they were correlated with each other, that would violate a major assumption 
#' about endogeneity in the linear model (that we'll belabor in another lab 
#' script).
#' 
#' You can test that we successfully pulled this off with a correlation test.
#' 
cor.test(Fake$x, Fake$e)

#' Notice that the correlation between *x* and *e* is about .06, but the 95% 
#' interval includes 0. The modest correlation we do observation is indeed due to 
#' random chance.
#' 
#' Now, watch this. We're going to create a new variable, *y*, that is a linear 
#' function of *x* and *e*, plus some baseline value (intercept) when *x* is 0.
#' 
Fake %>%
  mutate(y = 5 + x + e) -> Fake

Fake

#' Here's how I want you to think about what you're going to do here. By 
#' simulating data to evaluate a model, you're playing god to see how well the 
#' model can capture the "true" parameters you're generating. The "true" value 
#' of *y* is 5 when *x* is 0. The "true" effect of *x* on *y* is to increase *y* 
#' by 1 for every one-unit increase in *x*. The error term here has equal footing 
#' to *x*, but recall that the error term (*e*) includes everything else that 
#' influences *y* but is not included in the model. It's ultimately going to 
#' introduce noise into how well we can capture the other true parameters.
#' 
#' The data are fake, sure, but the process of simulating so-called "random 
#' variables" (drawn from a normal distribution with pre-defined parameters) is 
#' useful for illustrating the properties of a statistical model like this. It'll 
#' also help to show what these data "look like." Here, for example, is the 
#' distribution of *x*. When you have granularity in a variable like this 
#' (with so many trailing decimals), it's useful to jettison a bar chart or 
#' histogram (which has somewhat arbitrary bin width) in favor of the density 
#' plot. The density plot may not have the same easy takeaway as to what's 
#' happening on the *y*-axis, but it's arguably more useful for getting the 
#' basic shape of a distribution.
#' 
#' There are two ways of doing this. Here's the base R way.
#' 
plot(density(Fake$x))
#'
#' Here, we see the data look nice and "normal" with a clear central tendency 
#' and a shape approximating a true bell curve.
#' 
#' Here's the prettier ggplot way. ggplot stuff (by way of `{ggplot2}`) comes 
#' default in `{tidyverse}`. Once you load `{tidyverse}`, you load all the 
#' `{ggplot2}` functionality you could want.
#' 
Fake %>%
  ggplot(.,aes(x)) +
  geom_density() 
#'
#' `ggplot()`'s defaults want to draw attention to the fact that there's more a 
#' left  skew than a right skew. In other words, the maximum of *x* is ~2.02 
#' whereas  the minimum is about -2.59, but the shape of the data still look 
#' normal. Any  irregularities in shape could be attributed to the fact we only 
#' drew 100 observations, which is fairly small set of observations (all things 
#' considered).
#' 
#' We can do the same thing for the outcome variable *y*.
#' 
plot(density(Fake$y))
#'
#' Also, in `{ggplot2}` format...
#' 
Fake %>%
  ggplot(.,aes(y)) +
  geom_density()
#' 
#' It's worth reiterating the *y* is "endogenously" created, and the only 
#' endogenous part of our data set. It is an outcome of two exogenous variables 
#' (*x*, which we can measure, and *e*, which serves as a plug-in for background 
#' noise in the underlying population).
#' 
#' Can OLS capture these known attributes about the data? We'll use the `lm()` 
#' function in R to find out. `lm()` is the base R function for estimating 
#' linear models. It's syntax is simple, especially in a simple case like this. 
#' The two basic things you'll want to feed it are the formula, separated by a 
#' comma, then the name of the data set for which you want to regress your 
#' dependent variable on your independent variable(s). Here, we're doing a super 
#' simple regression of *y* on *x*. Then, we'll get a summary
#' of the model with `summary()`.
#' 
M1 <- lm(y ~ x, data=Fake)
summary(M1)

#' Our question was whether OLS can capture the known population parameters that 
#' we created. The answer is yes. Yes, it can. In a simple bivariate case, you 
#' can see what this would look like visually with a scatterplot. I want you to 
#' see what this is doing. First, we're going to use our data and, with the pipe 
#' operator (`%>%`), we're going to pass the data to a `ggplot()` function with 
#' a simple aesthetic (`aes()`) argument creating an x-axis for `x` and `y` axis 
#' for `y`. Then, using `+` (which is the `{ggplot2}` continuation operator),
#' we're going to create a scatterplot with `geom_point()`. Then, using another 
#' continuation operator (`+`), we're going to use `geom_smooth()` to create a 
#' line of best fit for the data. Importantly, you have to  specify you want a 
#' linear smoother (`method = "lm"`) to communicate what a simple bivariate case 
#' is doing.  The default is `method = "loess"`, which has its own nice features 
#' but it's not what we want here. Finally, we're going to create two lines over 
#' the plot. The first uses `geom_vline(xintercept = mean(Fake$x))` to add a 
#' vertical line through the mean of `x`. The second uses 
#' `geom_vline(yintercept = mean(Fake$y))` to create a horizontal line through 
#' the mean of `y`. Notice that the line of best fit runs through the 
#' intersection of both them.

Fake %>%
  ggplot(.,aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_vline(xintercept = mean(Fake$x)) +
  geom_hline(yintercept = mean(Fake$y))

#' ## How to Evaluate Your Linear Model Output
#' 
#' This graph is great for visualizing a super simple bivariate case, but things 
#' are never so simple. So, let's  walk through a little bit of the important 
#' output from the model on its own terms. Let's start with the "coefficients",
#' since they are the main reason you're doing this. We mainly wanted to know if 
#' we could capture the "true" effect of *x* on *y*. The "true" effect is 1. The 
#' OLS estimator says it's about 1.06. That's technically wrong, sure, but the 
#' coefficient is what it is because it's also capturing the correlation between 
#' *x* and the error term *e*, which we *don't* get to include in the model. So, 
#' OLS says it's 1.06, which comes with a standard error of about .10. A standard 
#' error below the estimate is about .961. A standard error above the estimate is 
#' about 1.175.
#' 
#' Put another way, given estimation and fundamental uncertainty about the model, 
#' our estimated effect of *x* on *y* is within a standard error of what we know 
#' it actually is. We captured the true population parameter of interest to us.
#' 
#' The other thing you want to know from your regression model is whether the 
#' effect of your independent variable of interest on the dependent variable is 
#' "statistically significant." This term is very misleading, to be frank. About 
#' 10 years ago, it appeared on this list of [scientific ideas that 
#' scientists wish you'd stop misusing](https://gizmodo.com/10-scientific-ideas-that-scientists-wish-you-would-stop-1591309822).
#' Briefly:
#' 
#' > "Statistically significant" is one of those phrases scientists would love 
#' to have a chance to take back and rename. "Significant" suggests importance; 
#' but the test of  statistical significance, developed by the British 
#' statistician R.A. Fisher, doesn't measure the importance or size of an 
#' effect; only whether we are able to distinguish it, using our keenest 
#' statistical tools, from zero. "Statistically noticeable" or "Statistically 
#' discernible" would be much better.
#'   
#' Notice what this means for what you're doing. You're not testing, necessarily, 
#' whether you've identified the "true" effect when you evaluate statistical 
#' significance. You're testing whether you could rule out something else: 0. 
#' That's at least what gets returned to you here, and the *t*-statistic that 
#' gets reported is conveniently just the coefficient over the standard error. 
#' When the absolute value of that *t*-statistic is large, and its associated 
#' *p*-value falls below some threshold of interest to you, you can reject the 
#' null hypothesis and assert that what you observed is highly improbable if the 
#' true effect is 0 with precision. That's indeed the case here. The true effect 
#' is 1. We observed 1.06. The *t*-statistic associated with that coefficient 
#' and standard error is about 9.9. That's comically large, given the 
#' distribution of Student's *t* for 98 degrees of freedom. The probability of 
#' observing what we got is basically 0 if the true effect were 0 (and it's not, 
#' because it's 1). So, you can reject the null hypothesis and assert that 
#' what you got is closer to what it truly is, assuming repeated sampling and 
#' fixed population parameters. The first part of that may not apply here,
#' but the second one does. We created the data.
#' 
#' ### What About the Intercept?
#' 
#' You don't evaluate "statistical significance" in the same way for the 
#' *y*-intercept. There is no "null hypothesis" to note in the context of the 
#' intercept. It's just an estimate of *y* when *x* is 0. Here, we are adequately
#' capturing it's true value (i.e. it's actually 5, but we got 4.94). Dividing 
#' that estimated intercept over its standard error returns a ridiculously 
#' large *t*-statistic that means you can comfortably rule out 0. Then again, 
#' you weren't testing that it's 0. It's just some constant baseline for *y*.
#' 
#' Here's another way of illustrating this. Let's create another variable, *y2*. 
#' It has this formula.
#' 
# I could alternatively write this as just `y2 = x + e`.
Fake %>%
  mutate(y2 = 0 + x  + e) -> Fake

Fake

#' Let's run another regression.
#' 
M2 <- lm(y2 ~ x, Fake)
summary(M2)

#' Notice *x* is unaffected, but now the *y*-intercept is "statistically 
#' insignificant?" Except that it's not, because that's not how you should 
#' evaluate the *y*-intercept. The statistics you get back just tell you the 
#' estimated value of *y* when *x* is 0 could be 0 (because that's what it is).
#' 
#' This is often why you'll read regression tables that just outright omit the 
#' *y*-intercept and take no effort to model them. I'll only retort that you 
#' may encounter or want to learn more complicated models, so called "mixed 
#' (effects) models", that include random intercepts for different grouping 
#' clusters in a data set. For those models to converge, they'll often need a 
#' so-called "global" intercept in order to know against what baseline a group 
#' may vary. This is why I encourage students to get comfortable with scaling 
#' inputs into the model in order to create more meaningful *y*-intercepts 
#' that, with real world data, might tell the reader (and yourself) something 
#' about the typical value of *y*.
#' 
#' Indeed, that's what the intercept is communicating here, plus-or-minus error.
#' 
Fake %>%
  summarize(mean_y = mean(y),
            mean_y2 = mean(y2)) %>% data.frame

#' ## Other Things You Should Know About Your OLS Model
#' 
#' I'll go into the other things you should know about the OLS model, based 
#' loosely on the order of importance of them.
#' 
#' ### Fitted Values and Residuals
#' 
#' The OLS model draws a line of best fit through the data and that line of best
#' fit is the one that minimizes the sum of squared prediction errors. This 
#' means there are estimates for a particular data point, given this line of 
#' "best fit", and invariably an "error" (which is the distance between the 
#' estimated value of *y*, and what the "true" value of *y* is).
#' 
#' In R speak, these are the fitted values and the residuals. The fitted values are the estimated values of *y* for each
#' data point (given the regressors). You may also call these the *y*-hats. The residuals are the distances between what
#' was fitted and what the observation really is. These are the estimates of *e* from the data-generating process.
#' 
#' You can extract them both pretty easily in R. Observe, for `M1`.

Fake %>%
  mutate(fitted = fitted(M1),
         resid = resid(M1)) -> Fake

Fake

#' Here's a way to wrap your head around what this is telling you. For that first row, the value of *x*
#' is about -.997. The value of *y* is 4.75. Given the line that minimizes the sum of squared prediction
#' errors, the coefficient for a one-unit change of *x* on *y* is about 1.06. The *y*-intercept we
#' estimate to be about 4.94. `4.94 + 1.06*(-.997) = about 3.88`. That's the fitted value of *y*. In actuality,
#' it's about 4.75. 4.75 - 3.88 (alternatively: `y - y-hat` for that observation) is about .867. That's the
#' error, or the residual you see there. Do note there's rounding error here for presentation, but hopefully
#' the principle is clear.
#' 
#' Of note, I don't really know why R's default `summary()` option includes that kind of summary of the residuals
#' *first*, but that's the top thing you see. Observe:
#' 
tibble(x = resid(M1)) %>%
  summarize(min = min(x),
            q1 = quantile(x, .25),
            median = median(x),
            q3 = quantile(x, .75),
            max = max(x)) %>% data.frame

summary(M1)
#' 
#' Keep your fitted values and residuals handy, as they'll be very useful (really: necessary) information for
#' model diagnostics. Another good way of extracting them with the `augment()` function in the `{broom}`
#' package that is installed when you install `{tidyverse}`.
#' 
broom::augment(M1)

#' You obviously get other output with this function, which will come in handy 
#' when we delve into model diagnostics in earnest.
#' 
#' ### R-Squared
#' 
#' OLS textbooks sometimes encourage new students to read too much into R-squared. Minimally, it tells you
#' how well the model fits the data. It's bound between 0 and 1, and higher values = better fit. However,
#' ask yourself what primarily is the purpose of the regression model. Is it to identify a causal effect?
#' If so, those minimal OLS models regressing some outcome *y* on a treatment will typically have very
#' low R-squareds even if the causal effect is identified. Is the goal to fit the data? Then higher 
#' R-squared is obviously very nice but saturating the model with regressors for the sake of juicing
#' an R-squared may come with biasing coefficients (making it more difficult to identify signals from
#' the din).
#' 
#' I do want to draw your attention to a few things. For one, in the simple bivariate case, R-squared is
#' quite literally Pearson's *r*, squared. Observe.
#' 
Fake %>% summarize(cor = cor(x, y), 
                   cor2 = cor^2) -> correlations

correlations
correlations %>% pull(2)
summary(M1)$r.squared


#' Second, you'll see something called an adjusted R-squared. Simply, think of 
#' this as a kind of penalty that downweights (sic?) the normal ("multiple") 
#' R-squared for the presence of additional parameters. It's why it'll always 
#' want to be smaller than the other R-squared metric you see. Here's the hard 
#' way of calculating what comes default in R.
#' 
(1 - ((1-summary(M1)$r.squared)*(nobs(M1)-1)/(nobs(M1)-length(M1$coefficients))))

#' In plain English: adjusted R-squared  = `1 - (1 - R^2)*(n - 1)/n - k`. In 
#' this case, the difference is small, but the adjusted R-squared is telling 
#' you have more than one parameter here (i.e. the intercept).
#' 
#' Long story short: a high R-squared is nice, but it's not necessary for 
#' identifying a causal effect of *x* on *y*. There's no agreement on what a 
#' "good R-squared" looks like. 
#' 
#' ###  Residual Standard Error
#' 
#' The residual standard error is the standard error of the residuals. It's not too hard to calculate in a simple
#' case like this. Take the sum of squared residuals and square those. Divide it over the degrees of freedom in the
#' model. There's your residual standard error. You'll see the sum of squared residuals often
#' referred to as the deviance of the model.
#' 

sqrt(deviance(M1)/df.residual(M1)) 
sqrt(sum(resid(M1)^2)/98) 

#' There's more you can do with that deviance. For example, does this look familiar?
1 - (sum(resid(M1)^2)/(sum((Fake$y - mean(Fake$y))^2)))

#' It's R-squared, right? You can also calculate R-squared as the deviance over the total sum of squares. You
#' might see that formula as `1 - RSS/TSS` where the deviance is also known as the residual sum of squares. You
#' can also calculate R^2, more generally, as the sum of fitted squares (so-called "Explained Squares") over the
#' total sum of squares.
#' 

(sum((M1$fitted.values - mean(Fake$y))^2))/(sum((Fake$y - mean(Fake$y))^2))


#' ### F-statistic
#' 
#' Simply, you'll never use this. The F-statistic tests against a null hypothesis that the model you estimate
#' performs just as well as the model that guesses just the mean of *y*. The moment you know anything about
#' your data-generating process is the moment you can start to safely ignore this statistic. In our case,
#' the F-statistic of 98.69 against 98 degrees of freedom comes with a *p*-value that is basically 0. This model
#' is better than an intercept-only model.
#' 
#' Want to fail this statistic? Do this.
 
Fake %>%
  mutate(noise = rnorm(100)) -> Fake

M3 <- lm(y ~ noise, data=Fake)
summary(M3)

#' See that *p*-value with that *F* statistic? I can't say that this model 
#' performs any better than an intercept only model. Well, duh.
#'
#' ## Conclusion
#'
#' For now, take stock of the following things:
#' 
#' - *formula*: `lm()` is the function you want. You regress whatever your DV is on (`~`) whatever your independent variable
#' (or variables) is called. If you have multiple IVs, it'll be something like `y ~ x1 + x2`. Tell the formula your data come
#' from whatever your data object is called.
#' - *coefficient*: the coefficient is the rise over run from the slope intercept equation, in as many words. A one-unit change in *x* coincides
#' with an estimated change of [INSERT COEFFICIENT VALUE HERE] in *y*.
#' - *statistical significance*: you're assessing the probability of what you observe (with the standard error) if the true effect
#' is zero. Assess statistical significance by whatever evidentiary threshold you choose, but the convention is .05. Don't bother interpreting
#' "significance" for the intercept.
#' - *other model parameters*: R-square is nice when it's high but it's not necessary for identifying causal effects. Grabbing the fitted
#' values and residuals from the model will be useful for important diagnostics.