#' ---
#' title: "Some Basics of Descriptive Statistics"
#' layout: lab
#' permalink: /lab-scripts/lab-2/
#' filename: descriptive-inference.R
#' active: lab-scripts
#' abstract: "This lab script will introduce students to some basic elements of
#' descriptive statistics. Nothing here will be too exotic and assumes at least
#' some rudimentary knowledge of measures of central tendency. The recurring
#' theme here is you should never trust a summary statistic without first
#' looking at your data. The best diagnostics are visual, and ultimately
#' anticipatory."
#' output:
#'    md_document:
#'      variant: gfm
#'      preserve_yaml: TRUE
#' ---

#+ setup, include=FALSE
knitr::opts_chunk$set(collapse = TRUE,
                      fig.path = "figs/lab-2/",
                      cache.path = "cache/lab-2/",
                      fig.width = 11,
                      comment = "#>")
#+

#' ## R Packages/Data for This Session
#'
#' You should've already installed the R packages for this lab session.
#' `{tidyverse}` will be for all things workflow and `{stevedata}` will be for
#' the toy data sets. Let's load those libraries now. The order doesn't really
#' matter here.

library(stevedata)
library(tidyverse)

#' If, for some reason, you don't want to load `{tidyverse}` (i.e. you might be
#' one of my past students that had some issues with `{systemfonts}`), here are
#' the component packages we'll be using. Just "uncomment" (sic) these four
#' lines of code and execute them.

# library(dplyr)
# library(tidyr)
# library(magrittr)
# library(ggplot2)

#' In your R session, please look up the documentation for the data sets to get
#' a better idea as to what each variable is. Here's how I want you to think
#' about this. You, the smart researcher, know that the `race` column in
#' `gss_spending` is a categorical variable where 1 = white people, 2 = black
#' people, and 3 = people who self-identify with another race. However, your
#' dumb computer and dumb R program don't know that. They just see numbers. You
#' know the underlying information they communicate. At least, you should know it.
#'
#' I like to think my toy data sets are well annotated. If you want to read the
#' documentation file for one of my toy data sets, put a `?` before the object
#' name in the console. This should fire up the documentation file in another
#' RStudio pane (i.e. where the plots typically go).
#'
#' We'll be using three toy data sets in this session. The first concerns
#' American attitudes toward national spending in the General Social Survey
#' (2018). The second is a sample of macroeconomic data from the Penn World Table
#' (from [the previous lab session](http://eh6127.svmiller.com/lab-scripts/lab-1/)).
#' The third concerns attitudes about the justifiability of taking a bribe
#' in the World Values Survey.

# Check out documentation.
?gss_spending
?pwt_sample
?wvs_justifbribe

#' ## Identifying Modes and Medians
#'
#' If you know a variable of interest is discrete (i.e. it only has a finite
#' set of responses), your tools for assessing central tendency are limited.
#' There are two types here: unordered-categorical and ordered-categorical.
#' Unordered-categorical variables are cases where numeric values communicate
#' only differences. While we all learned in elementary school that 2 is greater
#' than 1, we gloss over that it's also true that 2 does not equal 1. They are
#' distinct from each other. Sometimes, we have a variable where that's all the
#' underlying information available.
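#' Incidentally, you can see this "my computer just sees numbers" point for
#' yourself. A quick sketch, assuming `race` is stored as a plain numeric
#' vector (as its documentation implies): nothing about the object itself
#' announces that it's categorical.

# R reports a numeric vector with a handful of values; the "categorical"
# part lives in your head (and in the codebook).
class(gss_spending$race)
unique(gss_spending$race)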
#' Unordered-categorical variables are quite
#' common in the social sciences, even if we don't necessarily like to
#' acknowledge them as such (because they are information-poor).
#'
#' ### Getting Modes for Unordered-Categorical Variables
#'
#' For example, the `race` variable in the `gss_spending` data frame is
#' unordered-categorical. We can see from the documentation that 1 = white,
#' 2 = black, and 3 = other. However, we cannot say that 2 (black) is greater
#' than 1 (white), nor can we say that 3 (other) is two greater than 1 (white).
#' Numerically, we know that's true, but that's just not the information
#' communicated in this variable. All we can say is that 2 != 1 (which is also
#' true, but not something we really learned in grade school math). Likewise,
#' white != black.
#'
#' Under those conditions, the most we can do with unordered-categorical variables
#' is to get the mode as an information-poor measure of central tendency. The
#' mode refers to the most commonly occurring value. R doesn't have a built-in
#' mode function. That's understandable since modes aren't terribly informative
#' and you can get them through other means. However, we can write a
#' simple function to extract the mode from a variable.

# R doesn't have a built-in mode function. So let's write one.
getmode <- function(v) {
  uniqv <- unique(v)
  # tabulate() counts how often each unique value appears;
  # which.max() grabs the most frequent one.
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

getmode(gss_spending$race)

gss_spending %>% summarize(mode = getmode(race))

#' This tells us that there are more white people in the sample than there are
#' people identifying with any other race. This is not surprising.
#' [We already know](https://www.census.gov/quickfacts/fact/table/US/PST045219)
#' that white people are about 75% of the U.S. population (about 60%
#' disentangling out Hispanic [like me] as an ethnicity).
#'
#' There are more creative means to get the mode and get more information to boot.
#' For example, here's a "tidy" approach that leans on `count()` (itself a
#' shorthand for a `group_by()` and `summarize()` combination).
#'

gss_spending %>%
  count(race) %>%
  # ^ you could stop here, but let's have more fun.
  # create more information...
  mutate(race_lab = c("White", "Black", "Other"),
         prop = n/sum(n)) %>%
  # creatively arrange the order of columns
  select(race, race_lab, everything())

#' You could also graph this information. Do note I encourage using a fancier
#' `{ggplot2}` theme (like my `theme_steve()` in `{stevethemes}`),
#' but this will do for this purpose.
#'

gss_spending %>%
  count(race) %>%
  ggplot(., aes(as.factor(race), n)) +
  geom_bar(color="black", alpha=0.8, stat="identity") +
  scale_y_continuous(labels = scales::comma_format(big.mark = " ")) +
  scale_x_discrete(labels = c("White","Black","Other")) +
  geom_text(aes(label=n), vjust=-.5, colour="black",
            position=position_dodge(.9), size=4) +
  labs(x = "", y = "Number of Observations",
       caption = "Data: General Social Survey (2018) via ?gss_spending in {stevedata}",
       title = "A Bar Chart of Respondent Race in the General Social Survey (2018)")

#' ### Getting Medians for Ordered-Categorical Variables
#'
#' The second kind of discrete variable is called "ordered-categorical." These
#' are variables where increasing (or decreasing) values communicate "more" (or
#' "less") of some phenomenon being measured, but the distance is inexact or
#' even vague. Logically, ordered-categorical variables say 3 > 1 where
#' unordered-categorical variables say 3 != 1, though these variables still do
#' not tell us that 3 is two greater than 1.
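#' As an aside, base R can be told about this kind of ordering through ordered
#' factors. Here's a minimal sketch with made-up values: comparisons work, but
#' no distances between the levels are stored anywhere.

# An ordered factor knows that 3 > 1; it does not know (or care) by how much.
x <- factor(c(1, 3, 2, 1), levels = 1:3, ordered = TRUE)
x[2] > x[1] # TRUE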
#'
#' Ordered-categorical variables appear in several forms, but perhaps the most
#' common is a Likert item. Likert items give a statement to a respondent and
#' ask the respondent to state their level of agreement. This is often
#' a five-point scale with responses of "strongly disagree" (1), "disagree" (2),
#' "neither agree nor disagree" (3), "agree" (4), and "strongly agree" (5).
#' If someone says they "strongly agree" with the prompt, we can reasonably
#' say they are in more agreement with the prompt than someone who just
#' "agrees". But how much more are they in agreement with the prompt?
#' We can't say because the only difference between "strongly agree"
#' and "agree" is a vague magnifier of "strongly". In other words, 5 > 4,
#' but 5 is not one greater than 4 in this context.
#'
#' The most precise measure of central tendency for ordered-categorical
#' variables is the median. The median refers to the middlemost value, where
#' half of the observations are above it and half of the observations are below
#' it. Fortunately, R has a built-in `median()` function. One caution, though:
#' if you have NAs in the column for which you want the median, the output will
#' return an NA. You can suppress that at your discretion with `na.rm=T`.
#'
#' Here's an example of the highest degree obtained by the respondent in the
#' `gss_spending` data frame. Educational attainment is routinely measured
#' like this. Higher values = more educational attainment, which broadly
#' means "more education." In the `gss_spending` data frame, the `degree`
#' variable assumes values of 0 (did not graduate high school), 1 (completed
#' high school), 2 (completed junior college), 3 (completed four-year
#' bachelor's equivalent), and 4 (graduate degree). This would be the median of
#' that variable.
#'

median(gss_spending$degree, na.rm=T)

#' That's unsurprising, given that we know
#' [only about 35% of Americans have at least a four-year college diploma](https://www.census.gov/newsroom/press-releases/2020/educational-attainment.html).
#'
#' You can also calculate a cumulative percentage to get the same information.
#' The row where the cumulative percentage of responses exceeds
#' 50% is the row that contains the median.
#'

gss_spending %>%
  count(degree) %>%
  mutate(degree_lab = c("Did Not Graduate HS", "High School",
                        "Junior College", "Bachelor Degree",
                        "Graduate Degree"),
         prop = n/sum(n),
         cumperc = cumsum(prop)*100) %>%
  select(degree, degree_lab, everything())

#' ## Identifying the Mean (and Standard Deviation)
#'
#' Means are what you'd want in most applications, provided your
#' data are interval and could be plausibly described by the mean.
#' The mean is the statistical "average" of the data. It is reserved for
#' data where numerical values communicate exact differences.
#'
#' For example, let's go back to our `pwt_sample` data and calculate
#' the mean for the real GDP variable for each year, as well as the standard
#' deviation. Like the `median()` function, `mean()` and `sd()` throw up an
#' NA if there are missing observations. We can suppress that the same way.
#'

pwt_sample %>%
  summarize(mean_rgdpna = mean(rgdpna, na.rm=T),
            sd_rgdpna = sd(rgdpna, na.rm=T),
            .by = year)

#' Remember to always look at your data before saying anything with confidence
#' about them, but a standard deviation that is larger than the mean is usually
#' a good indicator of a very large spread in the data.
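#' If you want that comparison as a single number, the ratio of the standard
#' deviation to the mean (the so-called coefficient of variation) is a quick,
#' informal check. This is just a sketch building on the code above; values
#' well above 1 flag a lot of spread relative to the average.

pwt_sample %>%
  summarize(mean_rgdpna = mean(rgdpna, na.rm=T),
            sd_rgdpna = sd(rgdpna, na.rm=T),
            .by = year) %>%
  # coefficient of variation: the standard deviation scaled by the mean
  mutate(cv = sd_rgdpna/mean_rgdpna)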
#' Since we're dealing
#' with GDP data, you'll have the GDPs of the United States juxtaposed with
#' Greece and Iceland, at least in this sample.
#'
#' You can also graph this if you wanted to see mean real GDP over time for
#' these 22 countries.

pwt_sample %>%
  summarize(meanrgdp = mean(rgdpna, na.rm=T),
            .by = year) %>%
  mutate(meanrgdpb = meanrgdp/1000) %>%
  ggplot(.,aes(year, meanrgdpb)) +
  geom_line() +
  scale_x_continuous(breaks = seq(1950, 2020, by = 10)) +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(x = "",
       y = "Average Real GDP (in Constant 2017 National Prices, in Billion 2017 USD)",
       caption = "Data: Penn World Table (v. 10.0) via ?pwt_sample in {stevedata}",
       title = "Average Real GDP for 22 Rich Countries, 1950-2019",
       subtitle = "The average real GDP in 2017 was over 2.3 trillion dollars, which should seem super sketchy.")

#' Nothing looks terribly amiss in that line chart, per se. You see a little
#' recession effect. You see the "lostness" of the 1970s, at least a little.
#' However, something is amiss. Did you notice that the average real GDP for
#' these 22 countries in 2019 was over 2 trillion dollars? Something seems fishy
#' about that, because it is. These data are 22 select rich countries, but the
#' United States has an enormous GDP even among those select rich countries. I
#' use these data to underscore a basic problem with the arithmetic mean for
#' continuous variables: it is *very sensitive* to skew (here: right skew). The
#' U.S. is a clear anomaly among these select rich countries. Indeed, it looks
#' like the U.S. is four times the size of the next biggest economy in 2019 and
#' is about as big as the next seven biggest economies (in these data) combined.
#'

pwt_sample %>%
  filter(year == 2019) %>%
  mutate(rgdpb = rgdpna/1000) %>%
  select(country, year, rgdpb) %>%
  arrange(-rgdpb) %>%
  data.frame(.)

#' This is why economic indicators are often expressed by the median rather than
#' the mean. The mean is sensitive to outlier observations whereas the median
#' is robust to them. When you describe your interval-level variables, do a
#' quick comparison between the median and the mean to see how far off the
#' median is from the mean. Where the distance is large, it points to a
#' problem of outlier observations.

pwt_sample %>%
  filter(year == 2019) %>%
  mutate(rgdpb = rgdpna/1000) %>%
  ggplot(.,aes(rgdpb)) +
  geom_density() +
  geom_vline(aes(xintercept = median(rgdpb, na.rm = T)), linetype="dashed") +
  geom_vline(aes(xintercept = mean(rgdpb, na.rm = T))) +
  scale_x_continuous(breaks = seq(0, 20000, by=2000),
                     labels = scales::dollar_format()) +
  labs(title = "A Density Plot of Real GDP in 2019 for 22 Rich Countries",
       x = "Real GDP in Billions 2017 USD", y = "",
       caption = "Data: Penn World Table (v. 10) via ?pwt_sample in {stevedata}.\nMedian noted in the dashed line while the mean is noted in the solid line.",
       subtitle = "Suddenly, 'average' doesn't look so 'average'.")
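#' For a quick numeric version of what that plot shows, here's a small sketch
#' comparing the two statistics directly on the same 2019 slice of the data.

pwt_sample %>%
  filter(year == 2019) %>%
  mutate(rgdpb = rgdpna/1000) %>%
  # a mean that dwarfs the median is the skew talking
  summarize(median_rgdpb = median(rgdpb, na.rm=T),
            mean_rgdpb = mean(rgdpb, na.rm=T))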
#' Here's another way of looking at it: plotting the yearly mean against the
#' yearly median. The extent to which the mean further separates from the median
#' implies a skew problem that is getting worse over time.
#'

pwt_sample %>%
  mutate(rgdpb = rgdpna/1000) %>%
  summarize(Median = median(rgdpb, na.rm=T),
            Mean = mean(rgdpb, na.rm=T),
            .by = year) %>%
  gather(Category, value, Median:Mean) %>%
  ggplot(.,aes(year, value, color=Category, linetype=Category)) +
  geom_line(linewidth=1.1) +
  scale_x_continuous(breaks = seq(1950, 2020, by = 5)) +
  scale_y_continuous(labels = scales::comma) +
  theme(legend.position = "bottom") +
  labs(x = "", y = "Median and Mean Real GDP (in Constant Billion 2017 USD)",
       caption = "Data: Penn World Table (v. 10) via ?pwt_sample in {stevedata}",
       title = "Median and Mean Real GDP for 22 Rich Countries, 1950-2019",
       subtitle = "A mean that further separates from the median suggests a worsening skew problem.")

#' ## An Aside: On "Dummy" Variables
#'
#' I want to dedicate a special comment here to what are commonly called
#' "dummy" variables. "Dummy" variables (aka binary variables) are variables
#' that take on just two values---functionally 0 or 1. While some dummies have
#' an intuitive order (e.g. "democracy" vs. "not democracy" implies "more
#' democracy"), the phenomenon being measured is strictly "there" or "not
#' there." This means they are a unique subset of nominal variables,
#' effectively saying they are unordered-categorical with just two categories
#' ("there" or "not there", "success" vs. "fail").
#'
#' That said, I liken them to the carbon or water of the variable world in that
#' they behave in interesting ways and have some curious
#' peculiarities/properties. With respect to central tendency, the mode and
#' median of a dummy variable will be the same value and the mean will
#' communicate the proportion of 1s. This assumes that your dummies are recoded
#' to 0/1, which you should *always* do. Dummies are often provided to you as
#' 1/2. Always recode those (there's a quick sketch of such a recode at the end
#' of this section).
#'
#' Let's see what this looks like with the `sex` variable in `gss_spending`.
#' `sex` is a numeric vector for the respondent's sex, either female (1) or
#' male (0).
#'

gss_spending %>%
  # where sex = 1 for women, 0 for men
  summarize(mode = getmode(sex),
            median = median(sex),
            mean = mean(sex))

#' The above tells me that there are more women than men in the data (the mode),
#' the middlemost observation is a woman on this dummy variable (the median), and
#' that about 55.2% of the observations are women (the mean).
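#' As promised, here's what that 1/2-to-0/1 recode looks like. This is a
#' hypothetical sketch with made-up data (imagine a survey that shipped sex
#' coded as 1 = male, 2 = female); subtracting 1 does the job, though an
#' `ifelse()` would work just as well.

# hypothetical 1/2-coded dummy; subtract 1 to get a 0/1 dummy
tibble(sex_raw = c(1, 2, 2, 1, 2)) %>%
  mutate(sex_dummy = sex_raw - 1)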
#'
#' ## An Aside: Ordinal or Interval? Or Something Else?
#'
#' Practitioners love to treat technically ordinal data as interval.
#' There's more you can do with an interval-level variable than you can with
#' an ordinal (ordered-categorical) variable. However, you have to ask yourself
#' just how advisable this is. For example, take yourself back to college when
#' you just turned 21. Are you truly one year older than your
#' 20-year-old friend when it comes to identification at the off-campus bar?
#' Legally you are, but you might just be three weeks older than your 20-year-old
#' friend. Here, you have information loss in the form of days/weeks. Is someone
#' who lists their yearly income at $50,010 truly 10 dollars richer than someone
#' who lists their yearly income at $50,000? Legally they are (i.e. Uncle Sam
#' cares about dollars, not cents), but you have information loss in the form of
#' cents.
#'
#' That said, we often describe age by reference to an average and we can
#' (provided there are no major issues of skew) understand an average income.
#' Certainly, practitioners would go-go-gadget a mean on those variables. Here's
#' a general rule of thumb that I haven't seen anyone formalize out loud (even
#' if there's supporting scholarship you'll read for this) but that nonetheless
#' guides how we do this. First, check the number of possible values. Second,
#' check the skew. If you have at least 5 (I would caution at least 7, even 10)
#' distinct values, you can *start* to think of it as interval. However,
#' check the distribution of the variable. If you have a major skew problem,
#' *don't* treat it as interval and consider another approach to modeling your
#' variable.
#'
#' Age is technically ordinal, but there are typically a wide range of distinct
#' ages in survey data, spanning from 18 to the 80s and beyond. Therein,
#' differences between observations are granular enough that means with decimals
#' make some sense. Below, the mean is about 48.97. You can reasonably
#' conceptualize that as an average age of 48 that is 97% of the way to
#' turning 49.

gss_spending %>%
  summarize(mean = mean(age, na.rm=T),
            median = median(age, na.rm=T),
            distinct = n_distinct(age))

#' Caveat, naturally: don't try to explain age as some outcome if you're
#' thinking ahead to a linear model. Age is a linear function of time. :P
#'
#' The `sumnat` variable in my `gss_spending` data frame summarizes all
#' responses to spending toward all programs from the ordinal scale provided in
#' the original prompts (again: `?gss_spending` for more). It has a small skew
#' problem to the left, but it's cast across 33 different values.

gss_spending %>%
  summarize(mean = mean(sumnat, na.rm=T),
            median = median(sumnat, na.rm=T),
            distinct = n_distinct(sumnat))

#' A diagnostic graph doesn't suggest a major problem for skew either. It's not
#' a disservice to this variable to summarize it with a mean.
#'

gss_spending %>%
  ggplot(.,aes(sumnat)) +
  geom_bar(alpha=0.8, fill="#619cff", color="black") +
  labs(title = "A Histogram of Attitudes Toward Government Spending",
       subtitle = "There is some left skew, but the data still look normal.",
       caption = "Data: General Social Survey (2018), via ?gss_spending in {stevedata}",
       x = "", y = "Number of Observations in Each Bar")

#' Some variables, however, are too damn ugly to think of as interval, at least
#' if you want informative means from them. Here's an example from the
#' World Values Survey regarding the justifiability of taking a bribe. This is
#' part of a series of World Values Survey items where the researcher lists
#' various behaviors (e.g. abortion, divorce, homosexuality, here: taking a
#' bribe) and the respondent lists how justifiable they think it is on a
#' 1-10 scale. 10 = always justifiable. 1 = never justifiable. Here would be
#' the summary statistics.

wvs_justifbribe %>%
  na.omit %>%
  summarize(mean = mean(f117, na.rm=T),
            median = median(f117, na.rm=T),
            distinct = n_distinct(f117))

#' I'm anticipating a problem here. The mean is almost one off the median with
#' just 10 distinct values.
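#' Before graphing it, here's a quick sketch of a proportion table (built from
#' the same `count()` approach used earlier) that makes the pile-up concrete:
#' the bulk of the responses sit in the lone "never justifiable" category.

wvs_justifbribe %>%
  na.omit() %>%
  count(f117) %>%
  # share of (non-missing) responses in each category
  mutate(prop = n/sum(n))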
#' A graph will bring the problem in full light.

wvs_justifbribe %>%
  count(f117) %>%
  na.omit %>%
  # ^ count() will also summarize the NAs, which we don't want.
  ggplot(.,aes(as.factor(f117), n)) +
  geom_bar(stat="identity", alpha=0.8, color="black") +
  scale_x_discrete(labels=c("Never Justifiable", "2", "3", "4",
                            "5", "6", "7", "8", "9", "Always Justifiable")) +
  scale_y_continuous(labels = scales::comma_format(big.mark = " ")) +
  geom_text(aes(label=n), vjust=-.5, colour="black",
            position=position_dodge(.9), size=4) +
  labs(y = "Number of Observations in Particular Response", x = "",
       title = "The Justifiability of Taking a Bribe in the World Values Survey, 1981-2016",
       caption = "Data: World Values Survey (1981-2016), via ?wvs_justifbribe in {stevedata}",
       subtitle = "There are just 10 different responses in this variable with a huge right skew.")

#' Here's what I recommend doing in an admittedly extreme situation like this.
#' First, despite the 10 different values, *don't* ask for a mean from this.
#' A mean won't be terribly informative because the data have a huge skew
#' problem and a finite scale of possible responses. Second, the median is the
#' most you can get here, but I'd caution against treating it as useful
#' information as well. Again, there's a huge skew. Third, for an extreme
#' situation like this where more than 70% of the data are in just one
#' response, collapse the variable into a dummy variable. In this case, 0 =
#' taking a bribe is never justifiable and 1 = taking a bribe is at least
#' somewhat justifiable. I
#' [did something like this with the exact question](http://svmiller.com/research/external-territorial-threats-and-tolerance-of-corruption-a-privategovernment-distinction/);
#' others have too. Here's how you'd do it. Your data basically look
#' like this.
#'

wvs_justifbribe

wvs_justifbribe %>%
  mutate(never_justifiable = ifelse(f117 == 1, 1, 0),
         somewhat_justifiable = ifelse(f117 == 1, 0, 1)) %>%
  select(uid, f117, never_justifiable:ncol(.))

#' You want to use the `mutate()` function to create a dummy variable whereby if
#' `f117` is 1, it's 0. If it's greater than 1, code it as 1. Let's use the
#' `mutate()` wrapper to create a new `bribe_dummy` variable that has this
#' information. Pay careful attention to what the `case_when()` function is
#' doing in this context.

# Take our data frame, and...
wvs_justifbribe %>%
  # use the mutate() function to create a new column, using case_when(), where...
  mutate(bribe_dummy = case_when(
    # if f117 == 1, it's 0
    f117 == 1 ~ 0,
    # if f117 > 1 (or %in% c(2:10)), it's 1
    f117 > 1 ~ 1,
    # if it's anything else (here: NA), keep it as NA.
    TRUE ~ NA_real_
  )) -> Data # Create a new data object for now.

#' Let's see what this did.

Data %>% distinct(f117, bribe_dummy)

#' Congrats on recoding your first variable! Notice too how you created a new
#' column that did not overwrite the original, so you can always troubleshoot
#' what you did to see if you messed up somewhere.
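#' One more sanity check worth doing after a recode like this (a quick sketch,
#' nothing more): tabulate the new column to confirm the split looks like what
#' the bar chart led you to expect.

Data %>%
  count(bribe_dummy) %>%
  # proportions, with the NAs still visible as their own row
  mutate(prop = n/sum(n))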
#' Btw, since I know this is going to be something you'll need to know how to
#' do for a homework question I'll ask, let's assume we're interested in those
#' that are `NA` here. `NA` can mean multiple things, but what if we just
#' assume they're people who are withholding what they think about taking a
#' bribe, and we want to perhaps understand that as a kind of dependent
#' variable? Here's how we'd do that.

# Taking our data frame, and...
Data %>%
  # ...using mutate() to create a new column in the data where,
  # using case_when()...
  mutate(missing_f117 = case_when(
    # if f117 is missing, it's a 1
    is.na(f117) ~ 1,
    # if it's anything else (`TRUE` in case_when() speak), it's 0
    TRUE ~ 0
  )) -> Data # assign to object

#' Let's see what this did.

Data %>% distinct(f117, missing_f117)
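#' And a closing usage note: recall from the dummy variable aside that the mean
#' of a 0/1 dummy communicates the proportion of 1s. So a one-liner like this
#' (a small sketch) returns the share of respondents withholding an answer on
#' this item.

# no na.rm needed: the case_when() above left no NAs in missing_f117
Data %>% summarize(prop_missing = mean(missing_f117))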