#' ---
#' title: "Linear Models"
#' layout: lab
#' permalink: /lab-scripts/linear-models/
#' filename: linear-models.R
#' active: lab-scripts
#' abstract: "This lab script teaches the implementation and interpretation of
#' the basic linear model by reference to Appel and Loyle's (2012) analysis
#' of the apparent economic benefits of justice. Students should read that
#' article to understand the context of what we're doing. This lab script will
#' just implement one of their basic models and unpack its contents."
#' output:
#'   md_document:
#'     variant: gfm
#'     preserve_yaml: TRUE
#' ---

#+ setup, include=FALSE
knitr::opts_chunk$set(collapse = TRUE,
                      fig.path = "figs/linear-models/",
                      cache.path = "cache/linear-models/",
                      fig.width = 11,
                      comment = "#>")
#+

#
#  |\      _,,,---,,_
#  /,`.-'`'    -.  ;-;;,_
# |,4-  ) )-,_..;\ (  `'-'
# '---''(_/--'  `-'\_)   Linear Models
#

#' ## If it's not installed, install it.

library(tidyverse)
library(stevethemes)
library(stevemisc)

#' ## Load the data
#'
#' This code will work for me, and it assumes you've downloaded the `.dta`
#' file in question and put it somewhere on your hard drive. Do you see that
#' `haven::read_dta()` function? Put the location of the
#' `appelloyle_jprdata.dta` file there.
#'
#' To make this simpler, we're going to use `mutate()` as a glorified
#' `rename()` to create more intuitive variable names, and select just what we
#' want for this simple exercise. However, successful execution of this
#' command requires you correctly specifying the location of the `.dta` file
#' you downloaded from Athena.

haven::read_dta("http://svmiller.com/extdata/appelloyle_jprdata.dta") %>%
  mutate(fdi = v3Mdiff,
         pcj = truthvictim,
         econ_devel = fv8,
         econ_size = fv10,
         econ_growth = fv11,
         kaopen = fv34,
         xr = xratf,
         lf = labor,
         lifeexp = v64mean) %>%
  select(testnewid_lag, ccode, id, fdi:lifeexp) -> Data

# Alternatively, for convenience
Data <- stevedata::EBJ

# Let's look at the contents.
Data

#' Incidentally, if you'd like, you can watch me go HAM on this exact data here:
#'
#' - http://svmiller.com/blog/2024/01/linear-model-diagnostics-by-ir-example/
#'
#' ## Appel and Loyle (2012)
#'
#' We're going to offer a basic replication of Appel and Loyle (2012). Some of
#' you read this for your article summary and are aware of its contents.
#' Briefly: the authors are trying to explain variation in FDI inflows as a
#' function of whether a post-conflict state engaged in post-conflict justice.
#' You may(?) remember some of these things from IRI, where you may(?) have
#' read about things like truth commissions and transitional justice. In this
#' application, the authors are measuring whether a post-conflict state
#' engaged in post-conflict justice by whether it set up a truth commission
#' and/or a reparations program. The authors outline a theoretical argument
#' for why we should expect a positive relationship, and you should know it.
#' The logic isn't too hard to follow, though it assumes some comfort with
#' thinking economically, politically, and strategically.
#'
#' ### A simple bivariate model
#'
#' Expressed in regression terms, the authors want to know the effect of
#' post-conflict justice (0 or 1) on net FDI inflows over a 10-year
#' post-conflict period. They want an estimate of that effect and whether we
#' can discern a signal from the noise.
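#' Before we let `lm()` do the work, here's a minimal sketch (my illustration,
#' not anything from the article) of what it's estimating under the hood. With
#' one predictor, the OLS slope is the covariance of *x* and *y* over the
#' variance of *x*, and the intercept is the mean of *y* minus the slope times
#' the mean of *x*. Assuming no missingness in these two columns, this should
#' match the `lm()` output below.

# OLS by hand (sketch): slope = cov(x, y)/var(x);
# intercept = mean(y) - slope*mean(x).
b1 <- cov(Data$fdi, Data$pcj)/var(Data$pcj)
b0 <- mean(Data$fdi) - b1*mean(Data$pcj)
c(intercept = b0, slope = b1)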
#' If you wanted to do this yourself in R, it'd be something like this.

M1 <- lm(fdi ~ pcj, data = Data)
summary(M1)

#' When you get this, you're getting a lot of stuff to look at. I'll go in
#' order of practical importance.
#'
#' The star of the show here is the effect of post-conflict justice. See that
#' the coefficient is 1763.5 (a large nominal number, to be sure). Absent any
#' more information, it suggests that the difference in net FDI inflows over a
#' 10-year period between post-conflict states that engage in PCJ and those
#' that don't is $1,763.5. PCJ states have higher FDI inflows, consistent with
#' the authors' argument. However, there's always the lingering chance that a
#' world surrounded by noise produced a positive coefficient when the actual
#' "true" effect could be 0. Notice we have a standard error with that
#' estimate, which is 744.0. The process that follows is not too dissimilar to
#' creating *z*-scores. Take the estimate and divide it by the standard error,
#' and you get that *t*-value you see of 2.370. Because this is a
#' *t*-distribution (with 93 degrees of freedom), the process isn't as simple
#' as inference by the normal distribution. Here, let the fourth column
#' ("Pr(>|t|)") guide you. That is communicating the probability of observing
#' the (positive) effect we observed if the true difference between groups
#' with and without post-conflict justice is 0. Here's what this is telling
#' you: if the God's-honest-truth was that there was no difference in net FDI
#' inflows for these two categories of states, the probability of us observing
#' what we observed, assuming infinity trials, is .0198. That's sufficiently
#' rare that we're inclined to argue that the effect we observed is closer to
#' what the truth is. It's not 0; it's positive, and the estimate of $1,763.5
#' may not be "correct", but it's closer to the truth than 0.
#'
#' Oh, btw, if you want to calculate that p-value yourself, here it is.

pt(2.37, df=93, lower.tail = F)*2
#' ^ Or just let R's default `summary()` method do it for you.
#'
#' Because this is a simple bivariate linear model where the IV is
#' incidentally binary, there is a nice interpretation of the intercept. The
#' estimated net FDI inflow for states without PCJ institutions is $425. The
#' estimated net FDI inflow for states with PCJ institutions is $425 +
#' $1,763.5. Notice the intercept communicates the estimate of *y* when
#' everything on the right-hand side of the equation is 0. The fact that it
#' has no "stars" doesn't tell you it's not "significant" or whatever, because
#' there is no hypothesis attached to the intercept. You're not evaluating an
#' argument in the intercept, really. However, one interesting wrinkle from
#' this that is germane to the authors' argument is that a simple comparison
#' of groups with and without PCJ institutions suggests the non-PCJ states
#' may, on average, not be taking in any FDI at all.
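#' If you want to convince yourself of that interpretation, here's a quick
#' sketch (my addition, not the authors'). With a binary IV and nothing else
#' in the model, the intercept is the mean of the DV for the 0s, and the
#' intercept plus the coefficient is the mean of the DV for the 1s.

# Group means of the DV should match the intercept (pcj = 0) and the
# intercept + coefficient (pcj = 1).
Data %>%
  group_by(pcj) %>%
  summarize(mean_fdi = mean(fdi, na.rm = TRUE))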
#' I don't want to say this is necessarily the next "most important" thing,
#' but it's probably the next place to go for beginners. Let's talk R-square,
#' both of them actually. R-square is a so-called "goodness of fit" statistic
#' because it tells you how well ('good') the model you estimated "fits" the
#' data. The "multiple" R-square is the overall fit, whereas the adjusted
#' flavor carries a small "penalty" that decreases the statistic with every
#' parameter added to the model. While multiple R-square can never be negative
#' (i.e. you're always explaining at least a little variation), the adjusted
#' R-square might be negative for models with lots of noise parameters. In
#' this simple case, notice two things: 1) The R-square is about .05,
#' suggesting only about 5% of the variation in net FDI inflows is accounted
#' for by this PCJ indicator. 2) The adjusted R-square is slightly smaller
#' than the multiple R-square because you actually have two parameters here
#' and not one. You have the one coefficient, but the intercept is a parameter
#' too.
#'
#' You may have learned this in lecture, but the multiple R-square in the
#' simple bivariate linear model has a nice property. It's literally Pearson's
#' *r*, squared. Observe.

cor(Data$fdi, Data$pcj)
cor(Data$fdi, Data$pcj)^2
broom::glance(M1)[1] %>% pull()

#' The addition of other parameters breaks this property, and the multiple
#' R-square is instead derived from knowing the residuals and the deviance of
#' the model, which is a conversation for the MA students and not you (though
#' there's a quick sketch of it after the next model).
#'
#' A few other things before closing up:
#'
#' 1) R's default summary output gives you a summary of the residuals, which
#'    is kind of wonky. By definition, the mean of the residuals is 0.
#'    Ideally, you want the median around there too. Also: a simple linear
#'    model with just a binary IV is kind of borrowing trouble, especially if
#'    we haven't looked carefully at our DV.
#' 2) It's uncomfortable that a lot of our beginner language says "don't
#'    forget the F-test", but all the homies hate the F-test. The F-test tests
#'    against a null hypothesis that the model you estimated performs just as
#'    well as one that guesses just the mean of the DV. The moment you start
#'    to know anything about your data-generating process is the moment you
#'    can ignore this statistic. It really is beginner-level stuff from like
#'    100 years ago.
#' 3) The residual standard error is an estimate derived from the model about
#'    the variability of the residuals and how far they fall from the line.
#'    "Higher" values indicate more variability, but they may also say
#'    something about the data. The DV here is in nominal dollars, so take
#'    that into consideration.
#'
#' ### A Replication of Model 1 in Table 1
#'
#' Let's replicate Model 1 in Table 1. We started with a simple bivariate
#' model, but we need to add "relevant controls" that could be influencing the
#' relationship between PCJ and FDI. After all, is FDI not an economic matter?
#' It tracks that, for example, capital openness will influence FDI inflows.
#' Post-conflict states are probably poor and without a lot of capital
#' openness. The same can be said for the other indicators in Model 1. How do
#' we "control" for this in the linear model in R? Simple: Add them to the
#' right-hand side of the equation, separated by a + sign. Observe:

M2 <- lm(fdi ~ pcj + econ_devel + econ_size + econ_growth + kaopen + xr +
           lf + lifeexp, data = Data)
summary(M2)

#' Let's talk through a few things of note here:
#'
#' - The effect of post-conflict justice is basically unchanged. It's still
#'   positive and significant.
#' - Only the economic size and exchange rate variables return significant
#'   effects in their own right.
#' - R-square is .423 but adjusted R-square is .369. This is an informal
#'   observation to make, but the discrepancy between the two suggests there
#'   are parameters in the model that are basically noise. In particular, the
#'   development and growth variables seem to be basically just noise.
#' - Ignore the F-statistic. Of course you know that your model is performing
#'   better than just guessing the mean.
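#' About that claim above that multiple R-square comes from the residuals once
#' you have more parameters: here's a sketch of it for this model. I'm using
#' the model frame so that missing cases (if any) drop out the same way they
#' do in the model itself.

# Multiple R-square by hand (sketch): 1 - SSR/SST.
mf <- model.frame(M2)
1 - sum(resid(M2)^2)/sum((mf$fdi - mean(mf$fdi))^2)
broom::glance(M2)[1] %>% pull() # should match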
#' ### Let's talk diagnostics
#'
#' I encourage students to think of the `LINE` mnemonic for evaluating OLS.
#'
#' - `L`: the model is linear in its form (without hidden curvilinearity).
#' - `I`: the observations are independent from each other (which is more
#'   something you have to know about your data in advance, and which we won't
#'   belabor here).
#' - `N`: the residuals are normally distributed.
#' - `E`: the distribution of residuals is *e*qual across the range of the
#'   model.
#'
#' Most of your assessment here can be visual. One of the most useful
#' diagnostics is the fitted-residual plot. For each observation, take the
#' estimated (fitted) values of *y* from the model and the residuals
#' (estimation errors) for all those observations. Make a scatterplot. By
#' definition, the rise-over-run through that scatterplot is flat at 0. The
#' LOESS smoother can tell you whether that's a good fit or not.
#'
#' We'll keep it simple, so let's focus a bit on what's going to happen here,
#' using the `augment()` function in `{broom}` (you have this).

broom::augment(M2) %>%
  select(.fitted, .resid) %>%
  bind_cols(Data, .) -> Data

# Alternatively, in case we get stuck in a loop in lab...
broom::augment(M2) -> Data

#' Let's get an idea of what's happening here before going any further.

Data %>%
  select(fdi, .fitted, .resid) %>%
  mutate(y_is_fitted_plus_resid = .fitted + .resid,
         resid_is_y_minus_fitted = fdi - .fitted)

#' Now, let's get to the heart of the matter.

Data %>%
  ggplot(.,aes(.fitted, .resid)) +
  geom_point(pch = 21) +
  theme_steve(style='generic') +
  geom_hline(yintercept = 0, linetype="dashed", color="red") +
  geom_smooth(method = "loess")

#' Wooooof, that should not look like that. What you want to see is basically
#' featureless, patternless buckshot. Something like this:

tibble(x = rnorm(95),
       y = 5 + x + rnorm(95)) -> A

M4 <- lm(y ~ x, A)

broom::augment(M4) %>%
  ggplot(.,aes(.fitted, .resid)) +
  geom_point(pch = 21) +
  theme_steve(style='generic') +
  geom_hline(yintercept = 0, linetype="dashed", color="red") +
  geom_smooth(method = "loess")

#' What you're looking at in our actual model is a cry for help. Briefly, I'm
#' teaching you this way of doing it because it makes for a better
#' presentation. If you just want to see it for yourself and don't care about
#' presentation, the `plot()` function in base R is nice.

plot(M2, which = 1) # the one we're interested in
plot(M4, which = 1) # ideal case

#' One limitation of the fitted-residual plot, however, is that it won't tell
#' you where exactly the issue might be. That's why I wrote the
#' `linloess_plot()` function in `{stevemisc}`. This plot takes a model object
#' and, for each right-hand side variable, draws a rise-over-run line of best
#' fit and the LOESS smoother with respect to either the residuals (default)
#' or the raw values of the dependent variable. Do note this tells you nothing
#' about binary IVs, but binary IVs aren't the problem here.

linloess_plot(M2, pch=21) +
  theme_steve(style='generic')

#' ^ A couple of things stand out here to me: 1) the model is so offended by
#' the presence of the exchange rate variable that it's making anything else
#' impossible to see. 2) That econ_growth variable has some weirdness
#' happening around 0 that implies some kind of non-linear effect.
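#' If you want intuition for what a genuinely non-linear effect does to these
#' diagnostics, here's a sketch with fake data of my own creation. The DGP is
#' quadratic, the model fit to it is a straight line, and the fitted-residual
#' plot duly bends.

set.seed(8675309) # Jenny, I got your number...
tibble(x = rnorm(95),
       y = 5 + x + 2*x^2 + rnorm(95)) -> B

M5 <- lm(y ~ x, B) # deliberately misspecified: omits the squared term
plot(M5, which = 1) # the smoother traces a U-shape instead of buckshot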
#' Back to the actual data: what does the histogram say?

Data %>%
  select(econ_growth, xr) %>%
  gather(var, val) %>%
  ggplot(.,aes(val)) +
  geom_histogram(bins=40) +
  facet_wrap(~var, nrow = 2, scales='free') +
  theme_steve(style='generic')

#' Iiiiiinteresting. You're free to play around with this, but use your head a
#' bit. It implies there are a few categories of "growth". Some are negative,
#' which is its own problem. Some are within "normal" (i.e. about 2%, let's
#' say). More are way beyond that and on islands to themselves. As for the
#' exchange rate variable? Let's just say such a distribution is revolting and
#' leave it at that. It implies you have to think long and hard about what
#' those observations are doing in your model.
#'
#' ### A comment on "normality"
#'
#' OLS assumes the errors are normally distributed. This is often conflated
#' with an assumption that the outcome variable is normally distributed.
#' That's not quite what it is. It does imply that the conditional
#' distribution of the dependent variable is normal, but that is not
#' equivalent to assuming the marginal distribution of the dependent variable
#' is normal. At the end of the day, the assumption of normality is more about
#' the errors than the dependent variable, even as the assumption about the
#' former does strongly imply an assumption about the latter.
#'
#' Violating the assumption of a normal distribution of the errors is not as
#' severe as a violation of some of the other assumptions. The normality
#' assumption is not necessary for point estimates to be unbiased. In one
#' prominent textbook on statistical methods, Gelman and Hill (2007, p. 46)
#' say the normality assumption is not important at all because it has no
#' strong implication for the regression line. I think this follows because
#' Gelman and Hill (2007), and later Gelman, Hill, and Vehtari (2020), are
#' nowhere near as interested in null hypothesis testing as your typical
#' social scientist likely is. No matter: violating the assumption of a normal
#' distribution of errors has some implication for drawing a line that
#' reasonably approximates individual data points (if not for the line itself,
#' per se). Thus, you may want to check it, certainly if you have a small data
#' set.
#'
#' My misgiving with these normality tests is that they all suck, even at what
#' they're supposed to do. The "textbook" normality tests involve extracting
#' the residuals from the model and checking if their distribution is
#' consistent with data that could be generated by a normal distribution. The
#' two typical implementations here are base R. One is the Shapiro-Wilk test.
#' The other is the Kolmogorov-Smirnov test. There are more, like the
#' Anderson-Darling test, which you can also do and which communicates the
#' same thing. You can explore a few of these in the `{nortest}` package.

shapiro.test(resid(M2))
ks.test(resid(M2), y=pnorm)

#' When p-values are sufficiently small, these tests are saying "I can
#' determine these weren't generated by some kind of normal distribution." My
#' misgivings with these particular tests are multiple. One, you can dupe them
#' pretty easily with a related distribution that looks like the normal
#' distribution, but is not it (e.g. Student's *t*). Two, they are deceptively
#' just a test of sample size. The more observations you have, the more
#' sensitive the test is to any observation in the distribution that looks
#' anomalous. Three, textbooks typically say to use the K-S test if you have a
#' large enough sample size, but, from my experience, that's the test that's
#' most easily duped.
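#' To see those first two complaints in action, here's a sketch with simulated
#' data of my own creation. Student's t with 10 degrees of freedom is not a
#' normal distribution, but it's close. At a modest sample size, Shapiro-Wilk
#' will likely wave it through; feed it more observations from the exact same
#' distribution and it will likely flunk it.

set.seed(8675309)
shapiro.test(rt(100, df = 10))  # probably "passes"
shapiro.test(rt(5000, df = 10)) # same distribution, more data: probably "fails"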
#'
#' Here's what I recommend, knowing that the normality assumption of errors is
#' one of the least important assumptions: visualize it instead. For one, the
#' "textbook" visual diagnostic is the Q-Q plot. This is actually a default
#' plot in base R for linear models, if you know where to look.

plot(M2, which=2)

#' The Q-Q plot graphs the theoretical quantiles of a normal distribution
#' against the standardized residuals. Ideally, they all fall on a nice line.
#' Here, they don't, suggesting a problem. The negative residuals are way more
#' negative than they should be, and the positive residuals are way more
#' positive than they should be.
#'
#' I think a better way of doing this is another visual exercise. Take your
#' residuals and create a density plot of them. Over that, draw a normal
#' distribution with a mean of 0 and a standard deviation that matches the
#' standard deviation of the residuals. In cases where the normality tests
#' still flunk you, the normality assumption may still be reasonably
#' approximated. Let's see what they look like here.

Data %>%
  ggplot(.,aes(.resid)) +
  stat_function(fun = dnorm, color="blue",
                args = list(mean = 0, sd = sd(Data$.resid, na.rm=T)),
                linetype="dashed", linewidth=1.1) +
  geom_density(linewidth = 1.1) +
  theme_steve(style='generic')

rd_plot(M2)

#' ^ wooof. Kill this with fire.
#'
#' Your solution to this particular "problem" will depend on what exactly
#' you're doing in the first place. The instances in which these plots look
#' really problematic will be situations like these. Your model may have
#' relatively few covariates, and the covariates you do include are dummy
#' variables. If you have very few observations in the model (i.e. I can count
#' the number of observations in the model on one or three hands), any
#' distribution of residuals will look crazy. That's somewhat the case here.
#' We have few observations, but enough (on paper) granularity in the
#' variables that this should not be happening in an ideal world. It's more
#' likely the case that you have a dependent variable that you are pretending
#' has a nice and pretty distribution. That's almost assuredly the case here.

Data %>%
  ggplot(.,aes(fdi)) +
  geom_density() +
  theme_steve(style='generic')

#' ^ It's one thing for there to be this kind of tail. It's another for it to
#' be driven by so few observations. It's possible you may want to condense
#' the scale here, even if you have to be mindful about the negative values
#' with a log transformation.
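#' For what it's worth, one way to condense the scale while keeping the
#' negative values is the inverse hyperbolic sine transformation, which
#' behaves like a log in the tails but is defined at and below 0. This is just
#' a sketch of what that would look like; whether it's appropriate here is a
#' longer conversation.

Data %>%
  mutate(asinh_fdi = asinh(fdi)) %>% # asinh() is in base R
  ggplot(.,aes(asinh_fdi)) +
  geom_density() +
  theme_steve(style='generic')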