#' --- #' title: "Linear Model Diagnostics (and What to Do if You Flunk One)" #' layout: lab #' permalink: /lab-scripts/lab-5/ #' filename: ols-diagnostics.R #' active: lab-scripts #' abstract: "The linear model works great until it doesn't. Alternatively: I #' accept the model's output only the extent to which I accept its assumptions #' and whether the model estimated satisfies those assumptions. There are myriad #' assumptions in the linear model and we can't possibly go over them all in #' the confines of a single lab session in such a hurried course on quantitative #' methods. We'll focus on a few key ones here." #' output: #' md_document: #' variant: gfm #' preserve_yaml: TRUE #' always_allow_html: true #' --- #+ setup, include=FALSE knitr::opts_chunk$set(collapse = TRUE, fig.path = "figs/lab-5/", cache.path = "cache/lab-5/", fig.width = 11, tab.cap = "NULL", warning = FALSE, fig.cap = NULL, comment = "#>") #+ #' ## Elsewhere in My R Cinematic Universe #' #' I will be doing *a lot* of self-plagiarism in this particular script. I #' taught earlier versions of to MA students [in this department](http://eh6105.svmiller.com/lab-scripts/ols-diagnostics.html), #' and [to PhD students at my previous employer](http://post8000.svmiller.com/lab-scripts/ols-diagnostics-lab.html). #' This version is a bit more streamlined, but the verbosity of previous versions #' of it might be useful to look over for you. #' #' I've also written elsewhere about [how cool bootstrapping is](http://svmiller.com/blog/2020/03/bootstrap-standard-errors-in-r/), #' and about assorted #' [heteroskedasticity robust standard errors](http://svmiller.com/blog/2024/01/linear-model-diagnostics-by-ir-example/). #' The latter of the two has an IR application you might find interesting. The #' former is illustrative of what exactly the procedure is doing and how you #' could do it yourself. Finally, IRIII.2 students in our department also #' get the same ["check this shit out" approach to `{modelsummary}`](http://ir3-2.svmiller.com/lab-scripts/lab-5/). #' You all will be saying hi to my mom in Ohio a lot. #' #' Finally, I want to make sure you all understand what happens in the linear #' model when there are log-transformed variables on one or both sides of the #' regression equation. A shorthand way of thinking about this is that [log #' transformations proportionalize (sic) changes](http://svmiller.com/blog/2023/01/what-log-variables-do-for-your-ols-model/) #' on the untransformed scale. Please read that blog post carefully. #' #' If I link it here, it's because I'm asking you to look at it. Look at it as #' it might offer greater insight into what I'm doing and what I'm trying to tell #' you. #' #' ## R Packages/Data for This Session #' #' You should've already installed the R packages for this lab session. #' `{tidyverse}` will be for all things workflow. `{stevedata}` will have data #' sets and `{stevemisc}` has some assorted helper functions. `{lmtest}` has #' some formal diagnostic tests and convenience functions for summarizing #' standard error adjustments done by the `{sandwich}` package. `{modelsummary}` #' is awesome and you'll kind of see what it's doing along the way. #' `{stevethemes}` is optional, but I want to use it. I'll offer a nod #' to the `{car}` package that is optional, but has a function that does #' what `linloess_plot()` does (but better). library(tidyverse) library(stevedata) library(stevemisc) library(stevethemes) # optional, but I want it... 
library(lmtest)
library(sandwich)
library(modelsummary)

options("modelsummary_factory_default" = "kableExtra")
# ^ this is new. Default is now "tinytable", but much of my code is written
# around `{kableExtra}`. As you develop your skills, you may want to make the
# pivot to `{modelsummary}`'s default.

theme_set(theme_steve()) # optional, but I want it...

#' I want to add here that the output format of this script, if you're reading
#' it on my website, is Markdown. The tables themselves are HTML, which
#' Markdown can handle quite nicely through Github. However, there is a lever I
#' need to toggle to stop the table captions from also becoming standalone
#' paragraphs in Markdown format. I don't know what that lever is based on my
#' cornball workflow spinning this `.R` file to a `.md` file. This is *not* a
#' problem you will have formatting for LaTeX or HTML, but it will make for
#' a somewhat clumsy viewing experience on the course website. Again, you'll be
#' saying hi to my Mom a lot. She lives in Ohio.
#'
#' Okay then, let's get on with the show.
#'
#' ## The Data We'll Be Using
#'
#' I'll be using the `states_war` data set in `{stevedata}`. You can find more
#' information about this data set by typing this into your RStudio console.

?states_war

#' The data offer an opportunity to explore the correlates of performance in
#' war in a way inspired by [Valentino et al. (2010)](https://doi.org/10.1017/S0022381609990831).
#' The authors primarily focus on democracy and how democracies allegedly
#' minimize the costs of war through assorted means. We'll do the same here.
#'
#' Let's offer a spiritual replication of the second model of Table 1, focusing
#' on military fatalities (which is the explicit focus of the Gibler-Miller
#' conflict data we'll be using). However, we need to create some variables.
#' First, we're going to create a kind of "loss exchange ratio" variable.
#' Formally, this variable is equal to the fatalities imposed on the enemy
#' combatant(s) divided over the fatalities suffered by the state, adding 1 to
#' the denominator of that equation to prevent division by 0. We'll create this
#' one, and another that's proportional (i.e. imposed fatalities over imposed
#' fatalities plus the state's own fatalities). Play with this to your heart's
#' content.[^min] Second, we're going to create an initiator variable (`init`)
#' to isolate those states that make conscious forays into war. From my
#' experience creating data for `{peacesciencer}`, it is usually (though not
#' always) the case that you can discern initiation from whether a state is on
#' Side A and/or is an original participant. To get to the point, the
#' initiator variable will be 1 if 1) the state is on Side A (i.e. the side
#' that made the first incident) *and* is an original participant or 2) it's
#' *not* an original participant (often meaning it self-selected into the
#' conflict). Otherwise, it will be a 0.
#'
#' [^min]: The fatalities offered by Gibler and Miller (2024) have
#' high and low estimates. We'll focus on just the low estimates here.
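#'
#' To make the arithmetic concrete before running the code, take some
#' hypothetical numbers: a state that imposes 1,000 fatalities while suffering
#' 99 of its own would get a ratio of 10 on the first measure and about .91 on
#' the proportional measure.

1000/(99 + 1)    # ler: imposed fatalities over (suffered fatalities + 1)
1000/(1000 + 99) # lerprop: imposed over (imposed + suffered)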
states_war %>%
  mutate(ler = oppfatalmin/(fatalmin + 1),
         lerprop = oppfatalmin/(fatalmin + oppfatalmin),
         init = case_when(
           sidea == 1 & orig == 1 ~ 1,
           orig == 0 ~ 1,
           TRUE ~ 0
         )) -> Data

#' Now, let's do a kind-of replication of the second model in Table 1, where
#' we'll model the first loss exchange ratio variable (LER) we created (`ler`)
#' as a function of democracy (`xm_qudsest`), GDP per capita (`wbgdppc2011est`),
#' the initiator variable (`init`), the military expenditures of the state
#' (`milex`), the duration of the participant's stay in conflict (`mindur`), and
#' the familiar estimate of power provided by CoW's National Material
#' Capabilities data as the composite index of national capabilities (`cinc`).

M1 <- lm(ler ~ xm_qudsest + wbgdppc2011est + init + milex +
           mindur + cinc, Data)

#' Okie doke, let's see what we got.

summary(M1)

#' I can already anticipate that some design decisions got us to this point,
#' but the model suggests only military expenditures matter to understanding
#' performance in war. Higher values of military expenditures coincide with
#' higher LER. That tracks, at least.
#'
#' Now, let's do some diagnostic testing.
#'
#' ## Linearity
#'
#' OLS assumes some outcome *y* is a linear function of some right-hand
#' predictors you include in the model. That is, the estimated value itself
#' follows that linear formula, and the values themselves are "additive"
#' (i.e. y = a + b + c). There are various diagnostic tests for the linearity
#' assumption, including some outright statistical tests that are either not
#' very good at what they purport to do (Rainbow test), or awkwardly
#' implemented (Harvey-Collier test).
#'
#' This is why I encourage you to explore this visually. For one, start with
#' arguably the most useful OLS diagnostic plot: the fitted-residual plot. Grab
#' the fitted values from the model and the residuals and create a scatterplot
#' from them. Then, overlay a LOESS smoother over it to check for any
#' irregularities. By definition, the "rise over run" line is flat at 0. The
#' LOESS smoother will communicate whether that's actually a line of good fit.
#'
#' Base R has a function that can do this for you. `plot()` is a default
#' function in R that, if applied to an object created by `lm()`, will create
#' a battery of graphs for you. You just want the first one, which you can
#' specify with the `which` argument. Observe.
#'

plot(M1, which=1)

#' Eww, gross. That shouldn't look like that. Gross.
#'
#' You may also want to use this as an opportunity to learn about the
#' `augment()` function in `{broom}`. You can use this to extract pertinent
#' information from the model. First, let's see what comes out.
#'

broom::augment(M1)
# ?augment.lm() # for more information

#' We want just the fitted values and the residuals here. The former goes on
#' the *x*-axis and the latter goes on the *y*-axis for the plot to follow.

broom::augment(M1) %>%
  ggplot(.,aes(.fitted, .resid)) +
  geom_point(pch = 21) +
  geom_hline(yintercept = 0, linetype="dashed", color="red") +
  geom_smooth(method = "loess")

#' Again, gross. Just, eww. Gross.
#'
#' You should've anticipated this in advance if you had some subject domain
#' expertise. These are large nominal numbers, in war, that could be decidedly
#' lopsided. For example, here's the U.S. performance against Iraq in the Gulf
#' War.

Data %>%
  filter(ccode == 2 & micnum == 3957) %>%
  select(micnum:enddate, fatalmin, oppfatalmin, ler)
#' In other words, the U.S. killed almost 27 Iraqi troops for every American
#' soldier killed by the Iraqi military. That's great for battlefield
#' performance from the perspective of the Department of Defense, but it's
#' going to point to some data issues for the analyst.
#'
#' Indeed, here's what the dependent variable looks like.

Data %>%
  ggplot(.,aes(ler)) +
  geom_density()

#' Here, btw, is what its natural logarithm would look like, and what the
#' proportion variable we created would look like by way of comparison. Believe
#' me when I say I know what I'm doing with the code here. Trust me; I'm a
#' doctor.

Data %>%
  select(ler, lerprop) %>%
  mutate(log_ler = log(ler + 1)) %>%
  gather(var, val) %>%
  ggplot(.,aes(val)) +
  geom_density() +
  facet_wrap(~var, scales='free', nrow = 3)

#' The linear model does not require normally distributed DVs in order to
#' extract (reasonably) linear relationships. However, we should not have done
#' this, and we should've known in advance that we should not have done this.
#' What estimate of central tendency could we expect from a variable that looks
#' like this? So, let's take a step back and re-estimate the model two ways.
#' The first will do a +1 and log of the DV and the second will use the
#' proportion variable we created earlier. This might be a good time as well to
#' flex with the `update()` function in R.

Data %>%
  mutate(ln_ler = log(ler + 1)) -> Data

M2 <- update(M1, ln_ler ~ .)
M3 <- update(M1, lerprop ~ .)

#' Briefly: this is just updating `M1` to change the two DVs. A dot (`.`) is a
#' kind of placeholder in R to keep things "as is." You can read `~ .` as
#' saying "regressed on the same stuff as before and keeping the other
#' arguments we supplied to it before (like the data)." You'll see me return to
#' more uses of this with the `wls()` function, which really wants a data set
#' with no missing values or for you to have supplied `na.action=na.exclude` as
#' an optional argument in the `lm()` function. We'll deal with this detail
#' later.
#'
#' Now, let's also flex our muscles with `modelsummary()` to see what these
#' alternate estimations mean for the inferences we'd like to report.

modelsummary(list("LER" = M1,
                  "log(LER + 1)" = M2,
                  "LER Prop." = M3),
             title = "Hi Mom!",
             stars = TRUE)

#' The results here show that it's not just a simple matter that "different DVs
#' = different results". Far from it. Different *reasoned* design decisions can
#' be the source of the different results you observe. Using more reasonable
#' estimates of battlefield performance (either the logged loss exchange ratio
#' or its proportion form) results in models that make more sense with respect
#' to the underlying phenomenon you should care to estimate. In our case, it's
#' the difference of saying whether there is an effect to note for democracy,
#' GDP per capita, and duration in conflict.
#'
#' Still, we may want to unpack these models a bit more, much like we did
#' above. Let's focus on `M2` for pedagogy's sake, even though there is more
#' reason to believe the third model is the less offensive of the two.[^glm]
#' Let's get our fitted-residual plot.
#'
#' [^glm]: We are also going to leave aside an important conversation to have
#' about just how advisable a +1-and-log really is for isolating coefficients
#' of interest and whether this should be a linear model at all. The linear
#' model is probably "good enough", and certainly for this purpose.
broom::augment(M2) %>%
  ggplot(.,aes(.fitted, .resid)) +
  geom_point(pch = 21) +
  geom_hline(yintercept = 0, linetype="dashed", color="red") +
  geom_smooth(method = "loess")

#' It's still not ideal. The fitted-residual plot is broadly useful for
#' other things too, and it's screaming out loud that I have a
#' heteroskedasticity problem and some discrete clusters in the DV. However, it
#' wants to imply some non-linearity as well.
#'
#' One limitation of the fitted-residual plot, however, is that it won't tell
#' you where exactly the issue might be. That's why I wrote the
#' `linloess_plot()` function in `{stevemisc}`. This plot takes a model object
#' and, for each right-hand side variable, draws a rise-over-run line of best
#' fit and the LOESS smoother. Do note this tells you nothing about binary IVs,
#' but binary IVs aren't the problem here.

linloess_plot(M2, pch=21)

#' `{car}` has this one for you, if you'd like. I'll concede it works better
#' than my function at the moment.

car::residualPlots(M2)

#' The thing I like about this kind of plot is that it can point to multiple
#' problems, though it won't point you in the direction of any potential
#' interactions. No matter, here it points to several things that I may want to
#' ask myself about the data. I'll go in the order I see them.
#'
#' 1. Does democracy need a square term? It seems like the most autocratic
#' and the most democratic perform better than those regimes "in the middle".
#' There's ample intuition behind this in the democratic peace literature, which
#' emphasizes there is a real difference between a democracy like Sweden and a
#' fledgling democratizing state like mid-1990s Serbia. Intuitively, there is
#' also a qualitative difference between a durable autocracy like Iran now
#' versus a fledgling autocracy like Iran in the early 1980s. We might be seeing
#' that here.
#'
#' 2. The GDP per capita variable has some clusters. It's already
#' log-transformed, so I'm disinclined to take a log of a log. It's technically
#' benchmarked in 2011 USD, so there's not an issue of nominal or real dollars
#' here. Some states are just poor? In our data, I see a jump from 2.428
#' (Mecklenburg in the First Schleswig War) to 5.815 (Ethiopia in the
#' Second Italian-Ethiopian War). A few things might be happening here, and it's
#' worth noting estimates of GDP per capita in the 19th century are *always*
#' tentative and prone to some kind of measurement error. It might be ideal to
#' proportionalize (sic) these, much like we did with the LER variable. However,
#' that would require more information (on my end, behind the scenes) than we
#' have here. Perhaps we just make a poverty dummy here and leave well enough
#' alone, mostly to see if there is any sensitivity to those observations. The
#' line doesn't look like it's affected.
#'
#' 3. It might make sense to log-transform military expenditures and duration.
#' We'd have to do a +1-and-log to expenditures because there are states in war
#' that don't register in the thousands, per the National Material Capabilities
#' data.
#'
#' With that out of the way, let's create some variables of interest to us and
#' re-estimate Model 2.

Data %>%
  mutate(povdum = ifelse(wbgdppc2011est < 4, 1, 0),
         ln_mindur = log(mindur),
         ln_milex = log(milex + 1)) -> Data

M4 <- lm(ln_ler ~ xm_qudsest + I(xm_qudsest^2) + wbgdppc2011est +
           ln_milex + ln_mindur + cinc, Data)

summary(M4)

#' I don't think it's going to be an issue, but now I'm curious...
summary(update(M4, . ~ . - I(xm_qudsest^2),
               data = subset(Data, povdum == 1)))
# ^ you can read this `update()` as "update M4, keep the DV and the bulk of the
# IVs, but remove the square term and change the data to just those
# observations in Data where povdum == 1."

#' We don't have a lot of observations to play with here, and the observations
#' affected here are decidedly unrepresentative of the overall population of
#' cases. I wouldn't sweat this issue, other than perhaps as a call to restrict
#' any serious analysis to just the 20th century, when we developed better
#' capacity for calculating estimates of economic size and wealth. All the
#' observations affected here are in the 19th century and, wouldn't you know it,
#' all concern the wars of Italian/German unification.
#'
#' You're welcome to play more with the diagnostics of these models, though it
#' comes with the caveat that you're making linear assumptions anyway when you
#' estimate this model. Don't go chasing waterfalls, only the rivers and the
#' lakes to which you are used... to. I know you're going to have it your way or
#' (more than likely) use no statistical model at all. Let's just not move too
#' fast.[^creep]
#'
#' [^creep]: "Creep" is 100% the best single off that record, but didn't have
#' the Windows 95-era CGI. You know I'm right. Stay out of my mentions.
#'
#' ## Independence
#'
#' The linear model assumes the independence of the model's errors and that any
#' pair of errors is going to be uncorrelated with the others. Past
#' observations (and past errors) should not inform other pairs of errors. In
#' formal terms, this assumption seems kind of Greek to students. In practice,
#' think of it this way. A lot of the theoretical intuition behind OLS assumes
#' something akin to a simple random sample of the population. You learned the
#' central limit theorem by this point and how you can simulate that. However,
#' you may have a data set that does not look like this.
#'
#' In other words, this assumption is going to be violated like mad in any
#' design that has a time-series, spatial, or multilevel component. For
#' students new to this stuff, you should be able to anticipate ahead of
#' time when your errors are no longer independent of each other.
#'
#' The implications of non-independent errors---aka "serial correlation"
#' or "autocorrelation"---are generally more about the variance of the
#' estimator than the estimator itself. That said, there is reason to
#' believe the standard errors are wrong and this can have important
#' implications for statistical inference you'd like to do. No matter, this
#' is a huge assumption about OLS and violating it ultimately means OLS
#' loses its inferential value. The data are no longer randomly sampled in that
#' sense.
#'
#' Once you know under what conditions autocorrelation is going to happen,
#' you can probably skip this and go right to assorted tricks that you know
#' will work given the nature of your serial correlation. So, this is going to
#' be a curious exercise I'm going to have you do here. The fundamental
#' grouping effect here is primarily "spatial" or cross-sectional (in that it
#' deals with state groupings) but the primary tests we teach around this
#' particular assumption are temporal (i.e. we're going to prime you at the
#' advanced level to think about time series and panel models). I just want to
#' reiterate this is something you should really know in advance about your
#' data, and should assume a priori about your data.
#' The primary tests of interest are always geared toward temporal attributes
#' of a data set (which, again, you should know in advance).
#'
#' Anywho, if you have a random walk time series, you know those can typically
#' be first-differenced in order to strip out the source of autocorrelation. If
#' you have some type of "spatial" serial correlation (e.g. citizens nested in
#' countries, students nested in schools), you can employ some kind of fixed
#' effects or random effects to outright model the source of unit
#' heterogeneity. No matter, the "textbook" tests for these are typically in
#' the time series case and involve the Durbin-Watson test or the
#' Breusch-Godfrey test.
#'
#' Of the two, practitioners I've read seem to favor the latter over the
#' former. The Durbin-Watson test has some pretty strong assumptions and only
#' looks for an order-1 autocorrelation. No matter, it has some informational
#' value when you get deeper into the weeds of time series stuff. Both can be
#' estimated by way of the `{lmtest}` package. Let's go back to our model and
#' apply both.
#'

dwtest(M2)
bgtest(M2)

#' The "null" hypothesis of both tests is "no autocorrelation." The alternative
#' hypothesis is "autocorrelation." When the *p*-value is sufficiently small,
#' it indicates a problem. Here, they seem to suggest some kind of
#' autocorrelation, though the data aren't exactly a panel or a time series.
#' Understand, again, these tests are making particular assumptions about your
#' data that don't quite hold here, and you should learn to anticipate
#' these issues without needing a textbook test to do it for you. I don't have
#' the time or opportunity to belabor these issues in detail, but:
#'
#' 1. When you read quant stuff in IR, especially the kind of stuff you might
#' see my name attached to or people I otherwise know, you'll see so-called
#' "clustered" standard errors that are often "spatial" or cross-sectional in
#' scope. For example, there is good reason to believe the American
#' observations in this model are related to each other. British observations
#' should be related to other British observations, and so on.
#'
#' 2. Much like you'd see in a panel model, you might also see so-called
#' "fixed effects" applied here. This might take on multiple forms. If I had
#' any reason to believe whatsoever that conflicts within particular regions
#' are different from each other, I could create a series of dummies that say
#' "is this war in Europe or not" or "is this in Sub-Saharan Africa or not?"
#' Or, perhaps, I have reason to believe there is a clustering of observations
#' by era or moment in time. The world war eras are different from the Cold War
#' era, and so on.
#'
#' I can create these ad hoc with the information available in the data I have.
#' The `stdate` column has the start of the first event by the participant in
#' war. I can extract that with `str_sub()` (getting the last four digits),
#' and convert that to a numeric integer for year. Then, I can use a
#' `case_when()` call to categorize whether the observation was before
#' World War I, between and involving both world wars, during the Cold War, or
#' after the Cold War.
Data %>%
  mutate(year = str_sub(stdate, -4, -1)) %>%
  mutate(year = as.numeric(year)) %>%
  mutate(period = case_when(
    year <= 1913 ~ "Pre-WW1",
    between(year, 1914, 1945) ~ "WW1 and WW2 Era",
    between(year, 1946, 1990) ~ "Cold War",
    year >= 1991 ~ "Post-Cold War"
  )) %>%
  mutate(period = fct_relevel(period, "Pre-WW1", "WW1 and WW2 Era",
                              "Cold War")) -> Data

M5 <- update(M2, . ~ . + period, Data)

modelsummary(list("log(LER + 1)" = M2,
                  "w/ Period FE" = M5),
             stars = TRUE,
             title = "Hi Mom!")

#' Do note fixed effects in this application have a tendency to "demean" other
#' things in the model, and they have a slightly different interpretation. For
#' example, though both are "significant" and "look the same", the GDP per
#' capita effect is now understood as the effect of *increasing* GDP per capita
#' *within periods* rather than higher *levels* of GDP per capita, per se. You
#' do observe some period weirdness. Compared to the pre-WWI era, logged LER is
#' generally lower in the world war era and generally lower in the Cold War,
#' importantly partialing out the other information in the model.
#'
#' No matter, if you have autocorrelation, you have to tackle it outright or
#' your OLS model has no inferential value. The exact fix depends on the exact
#' nature of the problem, which will definitely depend on you knowing your data
#' well and what you're trying to accomplish.
#'
#' ## Normality (of the Errors)
#'
#' OLS assumes the *errors* are normally distributed. This is often conflated
#' with an assumption that the outcome variable is normally distributed. That's
#' not quite what it is. It does imply that the conditional distribution of the
#' dependent variable is normal, but that is not equivalent to assuming the
#' marginal distribution of the dependent variable is normal. At the end of the
#' day, the assumption of normality is more about the errors than the dependent
#' variable, even as the assumption about the former does strongly imply an
#' assumption about the latter.
#'
#' Violating the assumption of a normal distribution of the errors is not as
#' severe as a violation of some of the other assumptions. The normality
#' assumption is not necessary for point estimates to be unbiased. In one
#' prominent textbook on statistical methods, Gelman and Hill (2007, p. 46) say
#' the normality assumption is not important at all because it has no strong
#' implication for the regression line. I think this follows because Gelman and
#' Hill (2007)---later Gelman, Hill, and Vehtari (2020)---are nowhere near as
#' interested in null hypothesis testing as your typical social scientist
#' likely is. No matter, violating the assumption of a normal distribution of
#' errors has some implication for drawing a line that reasonably approximates
#' individual data points (if not the line itself, per se). Thus, you may want
#' to check it, certainly if you have a small data set.
#'
#' My misgiving with these normality tests is that they *all* suck, even at
#' what they're supposed to do. The "textbook" normality tests involve
#' extracting the residuals from the model and checking if their distribution
#' is consistent with data that could be generated by a normal distribution.
#' The two implementations here are typically base R. One is the Shapiro-Wilk
#' test. The other is the Kolmogorov-Smirnov test. There are more---like the
#' Anderson-Darling test---which you can also run and which communicate the
#' same thing. You can explore a few of these in the `{nortest}` package; a
#' quick sketch of that follows before the base R pair.
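#'
#' Here's what the Anderson-Darling version would look like. This is a sketch
#' that assumes you've installed `{nortest}` (it's not loaded above, hence the
#' double-colon approach and why I leave it commented out).

# nortest::ad.test(resid(M2)) # same null hypothesis as the tests below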
#' As for the two base R tests:

shapiro.test(resid(M2))
ks.test(resid(M2), y=pnorm)

#' When *p*-values are sufficiently small, these tests are saying "I can
#' determine these weren't generated by some kind of normal distribution." My
#' misgivings with these particular tests are multiple. One, you can dupe them
#' pretty easily with a related distribution that looks like it, but is not it
#' (e.g. Student's *t*). For example, let me cheese the Shapiro test with the
#' Poisson distribution.

set.seed(8675309)
shapiro.test(lambdas <- rpois(1000, 100))
# c.f. http://svmiller.com/blog/2023/12/count-models-poisson-negative-binomial/

#' I should not have passed this test, and yet I did. These aren't normally
#' distributed. They're Poisson-distributed, and can't be reals. Two, these
#' normality tests are deceptively just a test of sample size. The more
#' observations you have, the more sensitive the test is to any observation in
#' the distribution that looks anomalous. Three, textbooks typically say to
#' use the K-S test if you have a large enough sample size, but, from my
#' experience, that's the test that's most easily duped.
#'
#' Here's what I recommend instead, knowing that the normality assumption of
#' errors is one of the least important assumptions: visualize this instead.
#' For one, the "textbook" visual diagnostic is the Q-Q plot. This is actually
#' a default plot in base R for linear models if you know where to look.
#'

plot(M2, which=2)

#' The Q-Q plot graphs the theoretical quantiles of the residuals against the
#' standardized residuals. Ideally, they all fall on a nice line. Here, they
#' don't, suggesting a problem. The negative residuals are more negative than
#' expected at the tail end of the distribution (which is almost always
#' unavoidable for real-world data at the tails), but the higher residuals are
#' more positive than expected for the higher quantiles. This one is more
#' problematic. You could've anticipated this knowing that there is still a
#' right skew in the DV even after its logarithmic transformation.
#'
#' I mentioned in lecture that a better way of gauging just how severe the
#' issue is will involve generating a density plot of the residuals against a
#' stylized density plot matching the description of the residuals (i.e. with a
#' mean of 0 and a standard deviation equal to the standard deviation of the
#' residuals). It would look something like this for `M2`.
#'

rd_plot(M2)

#' Or this for `M3`:

rd_plot(M3)

#' In these plots, the black solid line is a density plot of the actual
#' residuals whereas the blue, dashed line is a density plot of a normal
#' distribution with a mean of 0 and a standard deviation equal to the
#' standard deviation of the residuals. The real thing will always be kind of
#' lumpy in some ways when you're using actual data. Ask yourself how bad it
#' is. I'd argue that the distribution is not acceptable in the case of `M2`,
#' though reasonably approximated in `M3`. Both results are fairly similar to
#' each other in the stuff we care about.
#'
#' Your solution to this particular "problem" will depend on what exactly
#' you're doing in the first place. The instances in which these plots look
#' really problematic will be situations like these. Your model may have
#' relatively few covariates and the covariates you do include are dummy
#' variables. If you have *so* few observations in the model (i.e. I can count
#' the number of observations in the model on one or three hands), any
#' distribution of residuals will look crazy.
#' That's *probably* not your case, especially as you've thought through the
#' data-generating process for your dependent variable. It's more likely the
#' case that you have a discrete dependent variable and you're trying to impose
#' a linear model on it. The end result is a subset of residuals that look kind
#' of wonky at the tails of the distribution. Under those conditions, it makes
#' more sense to use the right model for the nature of your dependent variable.
#' If your DV is binary, consider a logistic regression. If you're trying to
#' model variation in a 5-item Likert measure, consider an ordinal logistic
#' regression. If you have a proportion with observations appearing at the
#' exact limit of the proportion, consider some kind of beta regression. Are
#' you trying to model a "count" variable (i.e. an integer), especially one
#' that has some kind of skew? Maybe you want a Poisson regression, or a
#' negative binomial.
#'
#' Either way, the solution to the non-normality of the residuals involves
#' outright modeling the kind of data that presents this kind of non-normality.
#' The assumption, to be clear, is about the *errors*, but the "problem" and
#' the "solution" often point to what exactly your dependent variable looks
#' like and whether OLS is the right model for the job.
#'
#' ## Equal Error Variance (Homoskedasticity)
#'
#' The final diagnostic test you should run on the OLS model involves assessing
#' whether the distribution of residuals is constant across the range of the
#' model. OLS assumes that the variance of the errors is the same regardless
#' of the particular value of x. This is often called "homoskedasticity",
#' mostly in juxtaposition to the violation of this assumption:
#' "heteroskedasticity." The change of prefix should make clear what is implied
#' here. Heteroskedasticity often means the errors are a function of something
#' else in the model.
#'
#' The implications of violating the assumption of homoskedasticity are
#' more nuisance than real problem. Unequal/non-constant variance does
#' not affect the `XB` part of the regression model, but it does have
#' implications for predicting individual data points. Here, the implication
#' is similar to the implication of violating the assumption of normally
#' distributed errors. However, it's more severe than the non-normality
#' of residuals. If null hypothesis significance testing is part of your
#' research design (and it is), heteroskedasticity means the standard
#' errors of your estimates are wrong. Worse yet: they'll be wrong in ways
#' you don't know until you start digging further into what your data
#' and model look like.
#'
#' There are two ways to assess whether you have a heteroskedasticity
#' problem in your OLS model. The first is the fitted-residual plot. You can
#' ask for that again. This time, have your eye draw kind of an upper bound
#' and lower bound on the *y*-axis (for the residuals). Are those lines
#' flat/pattern-less? Tell-tale cases of heteroskedasticity like we discussed
#' in lecture are cases where you have a cone pattern emerge. Sometimes it's
#' not as obviously cone-shaped. This one is 100% cone-shaped.
#'

plot(M2, which=1)

#' I think I see a pattern, and certainly the pattern of residuals doesn't look
#' like the random buckshot I'd want. Fortunately, there's an actual test
#' you can use that's available in the `{lmtest}` package. It's the
#' Breusch-Pagan test. The null hypothesis of the test is homoskedasticity.
#' The alternative hypothesis is heteroskedasticity. If the *p*-value is
#' sufficiently small, you have a problem.

bptest(M2)

#' Looks like we have a problem, but what's the solution? Unfortunately,
#' there's no quick fix, per se, because any single solution to
#' heteroskedasticity depends on the exact nature of heteroskedasticity
#' and whether you know for sure that's the case. If you have positive reals
#' on either side of the regression equation, and there's some kind of positive
#' skew that's causing residuals to fan out, you can consider some kind of
#' logarithmic transformation to impose normality on the variable(s). Is it
#' an outlier/influence observation that's causing this? If so, consider
#' punting out the observation from the model. At least, consider it. I'll note
#' that outliers/influence are not sufficient justification to exclude an
#' observation from a regression model. You should only do that if it's
#' *unrepresentative* of the underlying population, not necessarily
#' because it's anomalous in the underlying population.
#'
#' Therefore, "solutions" to heteroskedasticity are often just alternative
#' specifications of the model and seeing if the core results from the
#' base model change in any meaningful way. There are a few paths you
#' may want to explore.
#'
#' The basic "textbook" approach is weighted least squares. I mentioned in
#' lecture that this approach is kind of convoluted. It's not complicated.
#' It's just convoluted. The procedure here starts with running the
#' offending model (which we already did; it's `M2`). Then, grab the
#' residuals and fitted values from the model. Next, regress the absolute
#' value of the residuals on the fitted values of the original model.
#' Afterward, extract those fitted values, square them, and divide 1 over
#' those values. Finally, apply those as weights in the linear model once
#' more for a re-estimation.
#'
#' An older version of this script showed you how you can do it yourself. Here,
#' just have `{stevemisc}` do it for you with the `wls()` function. Be
#' mindful that the `wls()` function isn't keen on data frames where
#' observations have NAs. If you have those, and just want to ignore it and
#' proceed, you may want to include an `na.action = na.exclude` as an optional
#' argument in `lm()`. We can cheese that here with an `update()`.

modelsummary(list("log(LER)" = M2,
                  "WLS" = wls(update(M2, na.action=na.exclude))),
             stars = TRUE,
             title = "Hi Mom!",
             gof_map = c("nobs", "adj.r.squared"))

#' Re-estimating the model by way of weighted least squares reveals some
#' interesting changes, leaving open the possibility that the inferences we'd
#' like to report for the democracy variable, the initiator variable, the
#' military expenditure variable, and the duration variable are functions of
#' the heteroskedasticity and whether we dealt with it in the "textbook" way.
#'
#' There is an entire branch of econometrics that deals with what to do about
#' your test statistics if you have this problem, and they all seem to have a
#' unifying aversion to the weighted least squares approach. Their point of
#' contention is typically that if the implication of heteroskedasticity is
#' that the line is fine but the standard errors are wrong, then the "fix"
#' offered by a textbook is almost guaranteed to redraw lines. They instead
#' offer a battery of methods to recalibrate standard errors based on
#' information from the variance-covariance matrix to adjust for this.
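#'
#' (An aside, before we get to those methods: here's roughly what the weighted
#' least squares procedure described above looks like by hand. This is a
#' sketch that should approximate what `wls()` returned, and the object names
#' are mine.)

mf <- model.frame(M2)                  # the complete cases M2 actually used
aux <- lm(abs(resid(M2)) ~ fitted(M2)) # |residuals| on the fitted values
M2_wls <- lm(formula(M2), data = mf,
             weights = 1/fitted(aux)^2) # 1 over the squared fitted values
summary(M2_wls)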
#' Recalibrating the standard errors with that information from the
#' variance-covariance matrix is what would make them "robust."
#'
#' I mention above that I have an entire blog post that discusses some of these
#' things in more detail than I will offer here. It's a lament that it's never
#' just a "do this and you're done" thing; it's always a "fuck around and see
#' what you find" thing. However, the "classic" standard error corrections are
#' so-called "Huber-White" standard errors and are often known by the acronym
#' "HC0". The suggested defaults you often see in software for "robust"
#' standard errors are type "HC3", based on this offering from
#' [MacKinnon and White (1985)](https://www.sciencedirect.com/science/article/abs/pii/0304407685901587).
#' I can't really say this with 100% certainty (because you won't know until
#' you can see it for yourself) but it's often the case that older analyses
#' that report "robust" standard errors are Huber-White ("HC0") standard
#' errors and "heteroskedasticity-robust" standard errors in newer analyses
#' are HC3 standard errors. That's at least a heuristic I've picked up from
#' my experience. It's conceivable that "robust" standard errors in analyses
#' you see from 20 years ago might also be type "HC1". Your clue for that would
#' be whether the analysis was conducted (or advertised that it was done) in
#' Stata. "HC1" was (is?) Stata's default standard error correction if you
#' attached `, robust` at the end of the `reg` call.[^blah]
#'
#' [^blah]: The distinction among these various standard error corrections
#' depends on the intersection of leverage points and sample size. For larger
#' sample sizes, the distinctions don't really materialize. For smaller sample
#' sizes without a lot of leverage points, types "HC0" and "HC1" tend to
#' underestimate standard errors. For smaller sample sizes with a lot of
#' leverage points, "HC3" tends to perform best while "HC0" and "HC1" perform
#' worse. While it tracks that "HC3" standard error corrections may
#' overestimate standard errors, econometricians tend to like this under the
#' premise that it makes their standard error corrections "conservative" (i.e.
#' it makes it harder to reject the null hypothesis and better hedges against
#' Type I errors). It's why the R packages that do these standard error
#' corrections prioritize "HC3" over others. In principle, "HC2" is not as
#' conservative as "HC3" and performs better in smaller samples than "HC0" and
#' "HC1". However, I cannot earnestly tell you I've ever seen such a standard
#' error correction in the wild.
#' [Imbens and Kolesar (2016)](https://direct.mit.edu/rest/article-abstract/98/4/701/58336/Robust-Standard-Errors-in-Small-Samples-Some?redirectedFrom=fulltext)
#' recommend a modified version of it, though. This topic is admittedly kind
#' of frustrating. So much of what's "suggested" is often an issue of path
#' dependence and an uncritical use of current practice. In other words: we
#' suggest things as defaults because they are defaults. Do what's been done
#' before. Just because it's standard practice doesn't necessarily make it
#' good practice, though.
#'
#' There's another approach, which is 100% what I'd do if I were presented the
#' opportunity. I'd bootstrap this motherfucker. You likewise have
#' several options here. I maintain it's fun/advisable to do the bootstrap
#' yourself rather than have some software bundle do it for you, and I have
#' guides on previous course websites and on my blog that show you how to do
#' this.
#' However, we'll keep it simple and focus on some convenience functions
#' because they also illustrate the myriad options you have. Let's focus on
#' just two. The first, the simple bootstrap pioneered by Bradley Efron,
#' resamples *with replacement* from the data and re-runs the model some
#' number of times. In expectation, the mean coefficients of these re-estimated
#' models converge on what they are in the offending model (which you would
#' know from the central limit theorem). However, the *standard deviation of
#' the coefficients is your bootstrapped standard error*. Another popular
#' approach is a bootstrap from the residuals, which leaves the regressors at
#' their fixed values, resamples the residuals, and adds them back to the
#' fitted values of the response variable. (A related cousin, the "Bayesian"
#' or "fractional" bootstrap, had its hot girl summer on Twitter two or three
#' years ago.)
#'
#' If you wanted to do any one of these in isolation, you'd want to leverage
#' the `coeftest()` function in `{lmtest}` with the assorted mechanisms for
#' futzing with the variance-covariance matrix (or other parts of the model)
#' in the `{sandwich}` package. For example, what follows below is going to do
#' the simple bootstrap 1,000 times on `M2`. You can set a seed with the
#' `set.seed()` function for max reproducibility, if you'd like.

# set.seed(8675309)
coeftest(M2, vcov = sandwich::vcovBS(M2, R = 1000))
coeftest(M2) # compare to what M2 actually is.

#' Take a moment to appreciate that you just ran 1,000 regressions on 1,000
#' data replicates in a matter of seconds. Thirty years ago, that would've
#' taken a weekend at a supercomputer at some American university with a
#' billion-dollar endowment. Neat, huh?
#'
#' Anywho, the results of the simple bootstrap suggest a precise effect of
#' military expenditures that would not be discerned in the model with
#' heteroskedastic standard errors. The other significant coefficients are
#' still significant, but less precise.
#'
#' FYI, `{modelsummary}` appears to be able to do this as well. In this
#' function, notice what the `vcov` argument is doing and the order in which
#' it is doing it.

modelsummary(list("log(LER)" = M2,
                  "WLS" = wls(update(M2, na.action=na.exclude)),
                  "HC0" = M2,
                  "HC3" = M2,
                  "Bootstrap" = M2,
                  "Resid. Boot." = M2),
             vcov = list(vcovHC(M2, type='const'),
                         vcovHC(wls(update(M2, na.action=na.exclude))),
                         vcovHC(M2, type='HC0'),
                         vcovHC(M2, type='HC3'),
                         vcovBS(M2),
                         vcovBS(M2, type='residual')),
             gof_map = c("nobs", "adj.r.squared"),
             title = "Hi Mom! Last time, I promise.",
             stars = TRUE)

#' The summary here suggests that we have heteroskedastic errors, and what we
#' elect to do about it---or even if we elect to do anything about it---has
#' implications for the inferences we'd like to report. No matter, the
#' "solution" to heteroskedastic standard errors is really multiple
#' "solutions", and they involve doing robustness tests on your original model
#' to see how sensitive the major inferential takeaways are to these
#' alternative estimation procedures.
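#'
#' As a parting gift, here's a minimal sketch of the simple bootstrap done by
#' hand, following the same logic as `vcovBS()` above: resample rows with
#' replacement, re-run the model on each replicate, and treat the standard
#' deviation of the coefficients across replicates as the bootstrapped
#' standard error. The object names here are mine and purely illustrative.

set.seed(8675309) # Jenny, I got your number...
mod_data <- model.frame(M2) # just the rows/columns M2 actually used
boot_coefs <- t(replicate(1000, {
  boot_rows <- mod_data[sample(nrow(mod_data), replace = TRUE), ]
  coef(lm(formula(M2), data = boot_rows))
}))
apply(boot_coefs, 2, sd) # compare to the "Bootstrap" column above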