#' ---
#' title: "Extending the Linear Model"
#' layout: lab
#' permalink: /lab-scripts/lab-4/
#' filename: extending-ols.R
#' active: lab-scripts
#' abstract: "This lab script will encourage you to not overthink the
#' linear model. If you're comfortable with what it's telling you in the simple
#' bivariate case, extending it to include other complexities (like interactions,
#' fixed effects, or other inputs) is a really simple matter. Multiple regression
#' partials out effects of other variables on the dependent variable. Don't
#' sweat it too much."
#' output:
#'   md_document:
#'     variant: gfm
#'     preserve_yaml: TRUE
#' ---

#+ setup, include=FALSE
knitr::opts_chunk$set(collapse = TRUE,
                      fig.path = "figs/lab-4/",
                      cache.path = "cache/lab-4/",
                      fig.width = 11,
                      comment = "#>")
#+

#' ## R Packages/Data for This Session
#'
#' You should've already installed the R packages for this lab session.
#' `{tidyverse}` will be for all things workflow and `{stevedata}` will have the
#' data sets we'll be using today.

library(tidyverse)
library(stevedata)

# For those of you without `{tidyverse}`, do this:
# library(dplyr)
# library(tibble)
# library(magrittr)
# library(forcats)
# library(ggplot2)

#' ## Building Toward Multiple Regression
#'
#' Last time we had a lab, we focused on a bivariate linear model with fake
#' data that we created. The goal of that exercise was to show you what the
#' linear model looked like when you had maximum control over the model
#' parameters. We'll use some actual data this time. This is a survey data set
#' from the United Kingdom in 2018-19, largely probing individual attitudes
#' toward immigration. You can find out more about this data set by typing
#' `?ESS9GB` into RStudio. You can also
#' [go here](http://svmiller.com/stevedata/reference/ESS9GB.html) for
#' a web version of the codebook.

ESS9GB

#' You can also read some blog posts where I make considerable use of these
#' data:
#'
#' - [What Do We Know About British Attitudes Toward Immigration? A Pedagogical Exercise of Sample Inference and Regression](http://svmiller.com/blog/2020/03/what-explains-british-attitudes-toward-immigration-a-pedagogical-example/)
#' - [How Should You Think About Your Priors for a Bayesian Analysis?](http://svmiller.com/blog/2021/02/thinking-about-your-priors-bayesian-analysis/)
#'
#' Understand that some of that is slightly more complex than what we'll be doing
#' here, but I'd rather inundate you with things to look at and read than give
#' you nothing at all.
#'
#' Our primary dependent variable in this model will be the `immigsent` variable,
#' which is an additive index of three variables also included in the data set.
#' These prompts, verbatim, each range from 0 to 10 and ask 1) Would you say it
#' is generally bad or good for the United Kingdom’s economy that people come to
#' live here from other countries? (`imbgeco`), 2) And… would you say that the
#' United Kingdom’s cultural life is generally undermined or enriched by people
#' coming to live here from other countries? (`imueclt`), and 3) Is the United
#' Kingdom made a worse or a better place to live by people coming to live here
#' from other countries? (`imwbcnt`). The `immigsent` variable simply adds all
#' three together, resulting in a scale that ranges from 0 to 30. You can
#' interpret 0 as being maximally anti-immigrant/immigration and 30 being
#' maximally pro-immigrant/immigration. Higher values = more pro-immigration
#' sentiment.
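#' If you don't want to take my word for it that `immigsent` is just the sum of
#' those three items, here's a quick sanity check you can run yourself. This is
#' only an illustrative sketch: the `check` column is a name I'm making up on
#' the fly, and the comparison should come back `TRUE`.

ESS9GB %>%
  # Manually re-create the additive index from its three components...
  mutate(check = imbgeco + imueclt + imwbcnt) %>%
  # ...and confirm it matches `immigsent` wherever both are observed.
  summarize(same = all(immigsent == check, na.rm = TRUE))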
#' Let's build toward multiple regression by starting in the simple bivariate
#' case, like we did in lecture. We'll start by exploring an uncontroversial
#' argument that we should expect pro-immigration sentiment to be much lower in
#' the North East than in other parts of the United Kingdom. Why "North East"?
#' This is the region where Newcastle is. Newcastle, rightly or wrongly, has a
#' reputation of being ["Brexit Central".](https://bylinetimes.com/2019/07/17/calling-the-north-east-of-england-brexit-central-is-simplistic-and-snobbish/)
#' We have the means to explore this with the `region` variable. Let's take a
#' look at it.

ESS9GB %>% distinct(region)

#' Perfect. I see 12 regions familiar to me in the data to explore, but I have
#' no "North East" dummy variable. No matter, it's a simple matter to create
#' one, like this.

ESS9GB %>%
  mutate(northeast = ifelse(region == "North East (England)", 1, 0)) -> Data

#' Note: I'm creating a new data frame here for convenience's sake.
#'
#' Now let's do a simple regression, regressing the immigration sentiment
#' variable on whether the respondent is in the North East.

M1 <- lm(immigsent ~ northeast, data=Data)
summary(M1)

#' By the way, in a super simple case like this where *y* is continuous and *x*
#' is binary, this is basically what the *t*-test will tell you as well.

t.test(immigsent ~ northeast, Data)

#' In this simple difference of means, the mean immigration sentiment where
#' `northeast` is 0 is 17.02. The mean immigration sentiment where `northeast`
#' is 1 is 14.65. The difference between those two numbers is basically the
#' regression coefficient. It's the "North East" effect, which basically
#' decreases immigration sentiment by about 2.37. That effect is highly unlikely
#' to be zero. You knew that because we went over statistical significance and
#' null hypothesis testing. You also know that, one assumes, from just kind of
#' eye-balling the United Kingdom with a polite curiosity. Again, "rightly or
#' wrongly."
#'
#' You can also do the same thing and look for differences between Scotland
#' and not-Scotland.
#'

Data %>%
  mutate(scotland = ifelse(region == "Scotland", 1, 0)) -> Data

M2 <- lm(immigsent ~ scotland, Data)
summary(M2)

#' The model here suggests that pro-immigration sentiment in not-Scotland is
#' 16.73. The "Scotland" effect increases pro-immigration sentiment by 1.76. The
#' *t*-statistic suggests that if the true difference between Scotland and
#' not-Scotland were 0, what we got would have been observed about .19% of the
#' time, on average.
#'
#' What if we wanted to tease out more information for all regions? This would
#' be a kind of "fixed effects" analysis, which is a somewhat misleading and
#' confusing way of saying "take a categorical variable, and create a series of
#' dummy variables from it." You could do this manually, or you can have R do it
#' for you.
#'

M3 <- lm(immigsent ~ factor(region), Data)
summary(M3)

#' I want to stop right here and explain what this procedure is doing. First,
#' you have 12 regions, but you get 11 regression coefficients. In this
#' procedure, something must be a baseline. Second, if you don't give R any
#' guidance, the "baseline" becomes whatever is first in alphabetical order. In
#' this case, that is "East Midlands (England)", which is the region where
#' Leicester and Nottingham are (if Premier League squads may help orient you).
#' What this "fixed effects" setup communicates, then, is the difference between
#' East of England (i.e. where Norwich, Essex, and Cambridge are) and East
#' Midlands (and if there is a discernible difference), the difference between
#' London and East Midlands (and if it's discernible), the difference between
#' North East and East Midlands (and if it's discernible), and so on. Here, we'd
#' suggest discernible differences, relative to the East Midlands baseline, for
#' London, North East, Scotland, and South East (i.e. where Brighton,
#' Canterbury, and Dover are). Pro-immigration sentiment is higher in London,
#' the South East, and Scotland relative to East Midlands and lower in the North
#' East relative to East Midlands.
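#' By the way, if you're curious what R is actually doing under the hood with
#' `factor(region)`, you can peek at the "model matrix" it builds from that
#' formula. This is purely illustrative: every column after the intercept is one
#' of those dummy variables, and "East Midlands (England)" has no column because
#' it's the baseline.

# Peek at the first few rows of the dummy variables R creates for M3.
model.matrix(~ factor(region), data = Data) %>% head()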
#' A few other things. For one, you could basically tease much of this out
#' through a comparison of means. This procedure doesn't demonstrate
#' "significance", but calculating standard errors and creating lower and
#' upper bounds around each mean will illustrate a little bit more about what
#' the regression model is telling you.
#'

Data %>%
  summarize(mean = mean(immigsent, na.rm=T),
            sd = sd(immigsent, na.rm=T),
            n = n(),
            se = sd/sqrt(n()),
            lwr = mean - se,
            upr = mean + se,
            .by = region) %>%
  data.frame

#' Graphing it will help.

Data %>%
  summarize(mean = mean(immigsent, na.rm=T),
            sd = sd(immigsent, na.rm=T),
            n = n(),
            se = sd/sqrt(n()),
            lwr = mean - se,
            upr = mean + se,
            .by = region) %>%
  # The next two lines are super-hacky ways of making the plot read in order.
  arrange(desc(region)) %>%
  mutate(region = fct_inorder(region)) %>%
  # Notice below: region is the x-axis, mean is the y-axis, and I've outlined
  # that I want the `lwr` and `upr` columns to be mins and maxes for what's
  # coming next...
  ggplot(., aes(region, mean, ymin=lwr, ymax=upr)) +
  # ...the point-range. This will create a dot at `mean` and bounds
  # that correspond with the ymin and ymax variables.
  geom_pointrange() +
  # Put your coords down, flip them, and reverse them.
  coord_flip()

#' What if you wanted to make London the baseline? You could manually create
#' your own dummy variables, or you can turn that `region` variable into a more
#' informative factor through the use of the `{forcats}` package (which is
#' loaded with `{tidyverse}`). Let's use `fct_relevel()` for this purpose. Using
#' `fct_relevel()` in the way I do below, I'm basically telling R that this
#' new `region2` variable is a factor and its specific order is 1) London and
#' 2) after that, I don't care, so just proceed alphabetically as you wish.
#'

Data %>%
  mutate(region2 = fct_relevel(region, "London")) -> Data

Data %>% summary

#' Now, let's do this again.
#'

M4 <- lm(immigsent ~ region2, Data)
summary(M4)

#' This time, you're benchmarking any comparisons to London. Just because
#' there's no difference between London and Scotland here doesn't mean there's
#' no difference between Scotland and East Midlands. What you do with so-called
#' "fixed effects" depends on what you want to accomplish. Does it matter that
#' something is the baseline? If so, make it the baseline and compare everything
#' else to it. Do you not care, and you're just doing this because you want to
#' partial out these potential categorical effects while you focus on some other
#' regressors? Then don't bother with the details, because it's mostly immaterial
#' to what you want to accomplish.
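#' One way to convince yourself that the choice of baseline is "mostly
#' immaterial": `M3` and `M4` are the same model, just reparameterized. The
#' coefficients change because the reference category changes, but the fitted
#' values do not. This little check should come back `TRUE`.

# Same model, different baseline: the fitted values are identical.
all.equal(unname(fitted(M3)), unname(fitted(M4)))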
#'
#' ## Multiple Regression
#'
#' Now, let's do a multiple regression. Let's see if we can't regress the
#' immigration sentiment variable on the Scotland dummy variable, whether the
#' respondent is a woman, and the respondent's ideology on a 0-10 scale where 0
#' is the most to the left and 10 is the most to the right. You knew how to do
#' this in the bivariate case, but what do you do to the formula when you have
#' multiple independent variables? Simple as this:
#'

M5 <- lm(immigsent ~ scotland + female + lrscale, Data)
summary(M5)

#' In other words: separate however many *x* variables you want in the equation
#' with a `+` sign. The formula above could be read as "using the `lm()`
#' function, regress `immigsent` on `scotland` and `female` and `lrscale` from
#' the `Data` data set." Let's walk through the basic takeaways from the model:
#'
#' - The estimated pro-immigration sentiment for a man, not residing in Scotland,
#'   who is furthest to the political left is 20.38. In this case, the
#'   y-intercept actually exists. This parameter isn't total nonsense, at least
#'   on paper.
#' - Partialing out everything else in the model ("ceteris paribus"), the effect
#'   of going from not-Scotland to "Scotland" increases pro-immigration sentiment
#'   by 1.45 points on our scale. If the true "Scotland effect" were 0, we'd
#'   expect to have observed this effect, on average, about 1.12% of the time.
#'   We'd suggest that the "Scotland effect" is statistically significant.
#' - The partial effect of being a woman versus a man ("ceteris paribus") is to
#'   decrease pro-immigration sentiment by about .43 points. However, the
#'   corresponding *t*-statistic and *p*-value suggest that such an effect would
#'   be observed about 19% of the time if there truly was no effect. Thus, the
#'   parameters are consistent with a null hypothesis of no relationship between
#'   pro-immigration sentiment and gender. We can't discern a difference between
#'   men and women here with respect to this outcome variable.
#' - Partialing out the Scotland effect and potential gender differences, we see
#'   an unsurprising effect of ideology on pro-immigration sentiment. Higher
#'   levels of ideological affinity with the right coincide with lower levels of
#'   pro-immigration sentiment. The unit effect is -.63, and that effect is
#'   incidentally the most precise in the model. There is a clear ideology effect
#'   in our model.
#'
#' ## Interactive Effects
#'
#' Let's walk through the issue of what to do when you think there might be an
#' interactive relationship between *x* and *z* in their relationship with *y*.
#' I'm going to keep this example simple. We noticed there were no gender
#' differences, but is that because it depends on ideology? Let's see, and notice
#' how we're going to see it below. When you want to see if two (or more, but
#' let's focus on just two) things interact, you put an asterisk (`*`) between
#' the variables on the right-hand side of the equation. Like this:
#'

M7 <- lm(immigsent ~ scotland + female*lrscale, Data)
summary(M7)

#' Notice that the two variables we interacted produced three regression
#' coefficients. The `female` coefficient is the gender difference when `lrscale`
#' is 0 (i.e. the respondent is furthest to the left). The `lrscale` coefficient
#' is the effect of increasing ideological affinity with the right when `female`
#' is 0 (i.e. the effect of ideology on men). The `female:lrscale` coefficient
#' communicates a way of thinking about how the two interact with each other, and
#' whether that interaction is "significant". Here, it is (albeit at the .10
#' level). The "Scotland effect" variable is not affected by this and is still an
#' "all else equal" partial effect.
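#' To make that concrete, you can recover the implied ideology slopes for men
#' and women straight from the model's coefficients. This is just arithmetic on
#' the output of `M7`: the slope for men is the `lrscale` coefficient by itself,
#' and the slope for women adds the interaction term to it.

coef(M7)["lrscale"] # the ideology slope for men (female = 0)
coef(M7)["lrscale"] + coef(M7)["female:lrscale"] # the ideology slope for women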
#' I want to use this as an illustrative case. *Be careful* in how you interpret
#' your interactions, and how you model them. We got lucky here that the data
#' have naturally occurring 0s, but what is "0"---and whether there's a 0 at
#' all---will influence what exactly the regression summary tells you. For
#' example, let's center the data so that 0 is the people in the middle of the
#' scale. Observe:
#'

Data %>%
  mutate(c_lrscale = lrscale - 5) -> Data

Data %>%
  distinct(lrscale, c_lrscale) %>%
  arrange(lrscale)

#' Take very quick inventory of what this did. We took the raw data and
#' subtracted 5 from them. Now, 0 is the people who would place themselves in
#' the ideological middle, 5 is the people furthest to the right, and -5 is the
#' people furthest to the left. We didn't really "change" 0 as much as we
#' shifted it.
#'
#' Now, watch.
#'

M8 <- lm(immigsent ~ scotland + female*c_lrscale, Data)
summary(M8)

#' Compare this to the output of `M7`. Notice how the gender dummy variable isn't
#' significant? Except, again, you have to be careful because the `female` dummy
#' is now the difference between men and women for those in the middle, not the
#' far left. The interactive coefficient didn't move, but the gender variable
#' did. The ideology constituent effect didn't move because men are still the
#' baseline. Just be careful about what's actually happening in an interaction.
#'
#' I wanted to close with a caveat about interactions. Just because the overall
#' interactive effect is insignificant, or just because component terms of the
#' interaction are insignificant, doesn't mean there isn't something potentially
#' interesting happening in your interactions. If the interaction is
#' statistically significant, you should *definitely* look at it to see what the
#' interaction looks like. If it's insignificant, you should still look at it.
#' After all, you're the one who felt compelled to interact two things. Go look
#' at what you did just to see for yourself.
#'
#' Here's what we'll do. We'll create a simple prediction grid. Let the ideology
#' variable be either 0 (furthest to the left), 5 (in the middle), or 10
#' (furthest to the right). Let the `female` variable range from 0 to 1. Then,
#' given the model we estimated, we're going to get predictions from the model
#' and standard errors around those predictions. Let the Scotland effect be 0
#' for convenience's sake. Then, we'll create conceivable lower and upper bounds
#' around these estimates and graph them.
#'
#' Note: it's killing me to not make you install `{modelr}` for this, but you
#' should absolutely install `{modelr}` on your own time. It makes this way less
#' tedious (I'll sketch that route below).

tibble(lrscale = c(0, 5, 10, 0, 5, 10),
       female = c(0, 0, 0, 1, 1, 1),
       scotland = 0) -> newdat

newdat # Look at our hypothetical data/prediction grid.

Preds <- predict(M7, newdata = newdat, se.fit = TRUE)

Preds # Look at our predictions.

Preds %>%
  as_tibble() %>% # Coerce to a tibble, and...
  # bind_cols to the prediction grid/hypothetical data, and...
  bind_cols(newdat, .) %>%
  # select just what we want, to look at it...
  select(lrscale:se.fit)
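#' As promised, here's a sketch of what the `{modelr}` route would look like,
#' assuming you've installed it. `data_grid()` builds the same kind of
#' prediction grid without you typing out every combination by hand. The
#' `requireNamespace()` guard is only there so this script doesn't fail for
#' those of you who haven't installed the package.

if (requireNamespace("modelr", quietly = TRUE)) {
  # All combinations of the values we supply: three ideology values by two
  # values of the gender dummy, holding the Scotland dummy at 0.
  modelr::data_grid(Data,
                    lrscale = c(0, 5, 10),
                    female = c(0, 1),
                    scotland = 0)
}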
#' Since I can see all of that performed how I wanted it to, let's make it
#' pretty.

Preds %>%
  as_tibble() %>%
  bind_cols(newdat, .) %>%
  select(lrscale:se.fit) %>%
  # Okay, there's going to be some hackiness (sic). First, we're going to create
  # lower and upper bounds that are 90% intervals around +/- the standard error.
  # Next, we're going to create a categorical variable from the `female` variable
  # in a way that's super obvious. Next, we're going to create a categorical
  # variable from the `lrscale` variable (ideo_lbl) that adds more intuitive
  # labels. The \n you see there forces a linebreak for legibility in the label.
  # You'll see what it does. Using fct_inorder() forces it to be ordered because,
  # otherwise, it would display 0, 10, 5 (i.e. furthest, furthest, middle)
  # because of the alphabetical order. Because our data are already ordered,
  # this is just a shortcut/cheat code to do what we want to do. In this case,
  # it's because it might help y'all to see the left in red and the right in
  # blue.
  mutate(lwr = fit - 1.645*se.fit,
         upr = fit + 1.645*se.fit,
         gndr_lbl = ifelse(female == 0, "Men", "Women"),
         ideo_lbl = case_when(
           lrscale == 0 ~ "Furthest\nLeft",
           lrscale == 5 ~ "Middle",
           lrscale == 10 ~ "Furthest\nRight"
         ),
         ideo_lbl = fct_inorder(ideo_lbl)) %>%
  ggplot(., aes(gndr_lbl, fit, ymin=lwr, ymax=upr, color=ideo_lbl)) +
  geom_pointrange(position = position_dodge(width = .1)) +
  coord_flip() +
  theme_minimal() +
  labs(x = "", y = "Estimated Pro-Immigration Sentiment (with 90% Bounds)",
       color = "")

#' What this tells me: the effect of increasing ideology on pro-immigration
#' sentiment is stronger for the men than it is for the women. For women furthest
#' to the left, their mean pro-immigration sentiment is 19.3. Furthest to the
#' right: 14.3. That's a min-max difference of about 5 points. For the men: the
#' estimated pro-immigration sentiment for non-Scottish dudes furthest to the
#' left is 21.2, which is interestingly higher than the women furthest to the
#' left. However, the estimated pro-immigration sentiment for the men furthest
#' to the right is 13.2. That's a min-max difference of 8 points. Kinda cool,
#' and kinda interesting as I'm doing this on the fly for the sake of this class.
#'
#' There is, I suppose, an alternate interpretation that is largely consistent
#' with the data but communicates a different mechanism happening here. In the
#' above interpretation, I largely channeled the effect of ideology through men
#' and women. There could also be a story to tell about why there is an apparent
#' gender difference among those furthest to the left but no real gender
#' difference among those furthest to the right. In this potential
#' interpretation, there is a gender mechanism that's being clustered on
#' ideology, and the question becomes why men and women are more like each other
#' on this matter (in this simple model) among those furthest to the right
#' whereas they are more different among those furthest to the left.
#'
#' I'll defer to you to make sense of the findings as you see fit, and why you
#' think you see what you see (provided you treat such an obviously simple model
#' as gospel truth about the UK population at this point in time). No matter,
#' always look at your data. You might find something cool in there.