#' ---
#' title: "Data Generation, Measurement, and Visualization"
#' layout: lab
#' permalink: /lab-scripts/data-viz/
#' filename: data-viz.R
#' active: lab-scripts
#' abstract: "This lab script eases students into the use of `{ggplot2}` for
#' the visualization of data. Along the way, it also introduces students to
#' how to create data for sake of illustration and how to get the most 
#' information out of graphs you can create. In the latter case, it means being
#' aware of `{ggplot2}`'s otherwise sensible defaults and using information the
#' researcher has about the phenomena being measured to improve a graph's
#' legibility and clarity."
#' output:
#'    md_document:
#'      variant: gfm
#'      preserve_yaml: TRUE
#' ---

#+ setup, include=FALSE
knitr::opts_chunk$set(collapse = TRUE, 
                      fig.path = "figs/data-viz/",
                      cache.path = "cache/data-viz/",
                      fig.width = 11,
                      comment = "#>")
#+
# =========================
# Data Generation, Measurement, and Visualization
#
# Steven V. Miller (EH 1903)
# =========================
#   \
#     \
#       \
#           |\___/|
#         ==) ^Y^ (==
#           \  ^  /
#            )=*=(
#           /     \
#           |     |
#          /| | | |\
#          \| | |_|/\
#          //_// ___/

#' If it's not installed, install it.

library(tidyverse)
library(stevedata)
library(stevethemes)

#' ## The Basics of `{ggplot2}` 

#' Let's start with the basics. Every foundation to a plot you make will look
#' like this. First, let's create some fake data.

tibble(x = rnorm(100),
       y = x+rnorm(100)) -> Example

Example

#' Now, let's use this basic information to create a ggplot object.

ggplot(Example, aes())

#' ^ Alternatively, as I'm inclined to do it for more general jobs.

Example %>% # pipe operator, and...
  ggplot(., aes()) 

#' ^ In the above function, the . there is just the more literal way of saying
#' "whatever is is active the command above the current command is what goes here."
#' Notice that is just the `Example` data frame.

#' Notice that the `ggplot()` function takes first an assumed data source 
#' (`Example`), followed by an aesthetic (`aes()`) argument contained in it. If 
#' you leave this blank, you get just a (default gray) canvas. If you specify 
#' the name of a column contained in the data source, you first get an x-axis. 
#' Observe.

ggplot(Example, aes(x))

#' If you specify another column after a comma, you get a y-axis. Observe.

ggplot(Example, aes(x, y))

#' Notice this hasn't plotted anything yet. It just created the canvas for you.
#' What comes next depends on what you want to communicate. In this simple case,
#' we have two variables (`x` and `y`) that are functionally continuous and `y` is a 
#' simple linear function of `x`. This seems like an easy call for a scatterplot.
#' If you want to declare what type of plot you want on your ggplot canvas, you
#' specify it with some relevant "geom", preceded by a plus sign. In this case,
#' `geom_point()` creates dots corresponding with the coordinates of x and y. This,
#' minimally, creates a scatterplot.

#' ## Scatterplot
ggplot(Example, aes(x, y)) + geom_point()

#' Notice you can stack other geoms on top of each other. For example, you can
#' illustrate the linear form of the data points with `geom_smooth()`. Do note there
#' is an optional "method" argument in `geom_smooth()`, which I'm using to tell
#' `{ggplot2}` that I want a linear smoother. The default is the LOESS smoother, 
#' which is situationally useful for teasing out non-linear relationships. If 
#' you want that default smoother, just don't specify the `method = 'lm'` 
#' argument.

ggplot(Example, aes(x, y)) + 
  geom_point() +
  geom_smooth() +
  geom_smooth(method = 'lm', linetype = 'dashed', color = 'black')

#' It's worth saying that these plots come with all sorts of customization options,
#' that you'll either want to use or not use. For example, what if I wanted the
#' dots to be hollow triangles? I'm not sure why in a simple case like this that
#' I would want that, but it's doable.

ggplot(Example, aes(x, y)) + 
  geom_point(shape = 2) +
  geom_smooth(method = "lm")

#' Feel free to explore options here. You can see them here:
#' 
#' - http://www.sthda.com/english/wiki/ggplot2-point-shapes
#' 
#' This would be a good time to introduce one other thing you should think to
#' have on all your plots: labels. As a matter of hygiene, I close all my plots
#' with a `labs()` argument for specifying useful information about your plots.
#' `labs()` takes a whole lot of arguments, some of which are contingent on your
#' plot's complexity. For ease of explanation, I'm just going to offer this code
#' with the idea being you can see what's happening here.

ggplot(Example, aes(x, y)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "The Title of My Plot",
       subtitle = "A subtitle, which may or may not be useful.",
       x = "An Informative x-Axis", y = "An Informative y-Axis",
       caption = "Data: Simulated Data. Hi Mom!")

#' That's really it. I'll only add that if you're preparing a graph for a journal,
#' you'll want to ignore plot titles and subtitles because that information is
#' typically communicated as a figure caption. That's on the document side of things
#' and not the plot side of things. 
#' 
#' I'll close with one plea here: resist the urge to roll out a default ggplot
#' theme and be done with it. Add a theme. What you choose is up to you, and there
#' is no shortage of themes out there. My preferred theme is `theme_steve()` 
#' from my `{stevethemes}` package.

ggplot(Example, aes(x, y)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "The Title of My Plot",
       subtitle = "A subtitle, which may or may not be useful.",
       x = "An Informative x-Axis", y = "An Informative y-Axis",
       caption = "Data: Simulated Data. Hi Mom!") +
  theme_steve(style='generic')

#' I want to add that you may want to explore some of the font options in this
#' `{stevethemes}` package. Type `?how_to_install_fonts()` for more information.

#' I never really think to do this, but you could use the `theme_set()` function,
#' preferably near the top of your script, to set a default theme. That way,
#' you can avoid having to manually specify a theme each time.

theme_set(theme_steve(style='generic'))

#' Observe what this does.

ggplot(Example, aes(x, y)) + 
  geom_point() +
  geom_smooth(method = "lm")

#' From here, though, everything else will be a simple matter of showing you
#' how to make different kinds of plots. Let's start with what I think to be
#' the most basic: the bar chart. Here, the data are discrete and we just want
#' a rough estimate of count. Let's use the `steves_clothes` data frame in 
#' `{stevedata}`. This is a simple data set on the country of origin for my dress
#' apparel that I used for teaching undergrads in the United States about the
#' globalization of the garment industry. I can show you that presentation if you
#' like, but here let's just note the data set.

steves_clothes
?steves_clothes

#' Notice that everything here is categorical. What if I wanted to get some kind
#' of information from this data set. For example, what is the country of origin
#' for most of my dress clothes? A simple bar chart will do here. Watch what 
#' happens here:

ggplot(steves_clothes, aes(origin)) + # Don't need a y-axis. geom_bar() will give us one.
  geom_bar() +  # create a simple bar chart
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) +
  # ^ create a label/count
  theme_steve(style='generic') +
  labs(caption = "Data: ?steves_clothes in {stevedata}. Hey, that's me!",
       x = "Country of Origin", y = "Count")

#' By default, the x-axis is ordered alphabetically. What if I wanted to order 
#' it from highest count to lowest? Unfortunately, you will need a few extra steps
#' here because the only axis supplied was the origin and the counts were done
#' on the fly. FWIW, it's why I typically do it this way.

steves_clothes %>% # start with the data, and...
  summarize(n = n(), .by=origin)  %>% # get a count by origin
  arrange(-n) %>% # arrange highest to lowest
  mutate(origin = fct_inorder(origin)) %>% # make a factor with that exact order
  ggplot(.,aes(origin, n)) + # notice we now have a y-axis
  geom_bar(stat='identity', fill=g_c("su_water"), alpha=.8, color='black') +
  # ^ stat='identity' is a fancy way of saying "I calculated this for you".
  # Also: let's make this look a little cuter. alpha = .8 gives a little transparency
  # color = 'black' adds a black border and the fill argument uses the g_c() 
  # function in {stevethemes} to add Stockholm University's branded "water" color.
  geom_text(aes(label=n), vjust=-.5) + # give it a text label, and...
  theme_steve(style = 'generic') +
  labs(caption = "Data: ?steves_clothes in {stevedata}. Hey, that's me!",
       x = "Country of Origin", y = "Count")

#' ## Histograms and density plots
#' 
#' Histograms and density plots are communicating the same basic
#' thing: the shape of the data. The density plot is a smoothed histogram, meaning
#' it's less sensitive to issues of binwidth. However, communicating "density" on
#' the y-axis isn't as intuitive to the reader.
#' 
#' Let's use the gss_wages data, which comes in {stevedata}. You can look at the
#' data here.

gss_wages
?gss_wages

#' I use these data here, which you can read about (if you'd like).
#' - http://svmiller.com/blog/2020/10/inference-permutations-gender-pay-gap-general-social-survey/
#' 
#' Let's keep this exercise simple, notwithstanding how you can use these data
#' to explore the gender pay gap in the United States. Let's use these data and
#' select on just those people who are in their so-called "peak earning years"
#' (which, in the U.S. context, is about 35-64) and we'll focus on just all 
#' observations since 2000.

gss_wages %>% 
  filter(between(age, 35, 64)) %>%
  filter(year >= 2000) -> Data

#' How might we summarize the distribution of incomes? One way is the basic
#' histogram, which would be this.

Data %>%
  ggplot(., aes(realrinc)) + 
  geom_histogram()

#' You can use a histogram to get a basic sense of the shape of the data, and 
#' you basically get that here. However, there are a few things that can be 
#' dissatisfying about the histogram. Notice the default number of bins is 30, 
#' which is often a sensible default. You can change that by specifying, say, 
#' bins = 60 in the  geom_histogram() function if you want to double the number 
#' of bins. Arguably a better thing to do here which will give you a more 
#' complete sense of the shape of the data is to ask for a density plot.

Data %>%
  ggplot(., aes(realrinc)) + 
  geom_density()

#' Because the density plot is simply a smoothed histogram, what you get to see
#' is not a function of how many bins you want or are given to you as a default. 
#' That's not to say it's not without some issues you'll want to consider. For 
#' example, it too has parameters on which it depends (namely the kernel, which
#' typically defaults to Gaussian) and the bandwidth (which is doing the bulk of
#' the smoothing). {ggplot2}'s use of the density estimate is generally fairly smart
#' and avoids an advanced issue you'll at least want to be aware of (namely: the
#' density plot may create the appearance of data where none exist, which is 
#' something you're more likely to spot if you have good knowledge of what your
#' data already look like or what issues you could anticipate. The biggest 
#' downside of the density plot, though, is the reader might be thrown a bit
#' for what is happening in the y-axis. In the histogram, it's an intuitive count
#' of observations in a particular bin. In the density plot, it's something called
#' a kernel density estimate following the probability density function of the data.
#' That sounds like it's not a lot of fun to communicate to the reader, and you
#' would be forgiven for not wanting to think too much about it. The basic takeaway
#' then is to think about the density plot as more for you as a diagnostic tool
#' and to perhaps think about the histogram more for public presentation, even
#' if it comes with a caveat that what exactly you see in the histogram is a
#' function of how many bins you ask for.
#' 
#' I want to close with a quick guide on adjusting the scales a bit to make 
#' them more readable. Notice something about what you see below. 

Data %>%
  ggplot(., aes(realrinc)) + 
  geom_histogram(bins = 50)

#' You know the variable in question is dollars, but R has no way of knowing that. 
#' It just sees large nominal numbers and is representing them in scientific 
#' notation that you don't enjoy looking at. You can see that the y-axis on the
#' left has counts that can be quite large and you may want to pretty them up 
#' a bit. How might you do that? Here's where you'll want to invest time into
#' reading about what the scale_ functions do in {ggplot2}. Observe:

Data %>%
  ggplot(., aes(realrinc)) + 
  geom_histogram(bins = 50) +
  scale_y_continuous(labels = scales::comma_format())

#' Here is where I'll impress that a lot of defaults are very American. Swedes
#' would prefer something like this.

Data %>%
  ggplot(., aes(realrinc)) + 
  geom_histogram(bins = 50) +
  scale_y_continuous(labels = scales::comma_format(big.mark = " "))

#' You can also be crazy if you want.

Data %>%
  ggplot(., aes(realrinc)) + 
  geom_histogram(bins = 50) +
  scale_y_continuous(labels = scales::comma_format(big.mark = "åäö"))

#' ^ Don't do this. I mean, you can. But don't.

#' Let's adjust the scale on the x-axis to add labels that communicate dollars.

Data %>%
  ggplot(., aes(realrinc)) + 
  geom_histogram(bins = 50) +
  scale_y_continuous(labels = scales::comma_format()) +
  scale_x_continuous(labels = scales::dollar_format())


#' Note that there might be some trial and error you want to experiment with here,
#' much of which is dependent on what you know about the data. For example, R just
#' sees large nominal numbers and is adjusting defaults accordingly. However, I know
#' that dollar increments of income in hundreds of thousands is awfully large. I
#' can adjust the scale on x to add more ticks for 50k increments like this, if I
#' so chose.

Data %>%
  ggplot(., aes(realrinc)) + 
  geom_histogram(bins = 50) +
  scale_y_continuous(labels = scales::comma_format()) +
  scale_x_continuous(labels = scales::dollar_format(),
                     breaks = seq(0, 500000, by=50000))

#' ## A comment on saving what you do.

#' There are somewhat advanced techniques I would love to teach you here, but I
#' think I've tabled much of it until the final computer lab. No matter, let's talk
#' about saving what you do. First, let's create a legible version of the plot
#' above, but with some additional aesthetic stuff that I hope is clear/intuitive
#' without me having to belabor it for the sake of this point.

Data %>%
  ggplot(., aes(realrinc)) + 
  # I'll leave it to you to play with what's happening here.
  geom_histogram(bins = 50,
                 alpha = 0.8,
                 color = 'black',
                 fill = "#619cff") +
  scale_y_continuous(labels = scales::comma_format()) +
  scale_x_continuous(labels = scales::dollar_format(),
                     breaks = seq(0, 500000, by=50000)) +
  labs(caption = "Data: General Social Survey (2000-2018)",
       y = "Number of Observations in Bin",
       x = "Respondent's Base Income (in 1986 USD)")

#' Notice your plot? See the "Export" tab there? Click it and go with "Save 
#' image", probably because you are not exporting to LaTeX and your Word document
#' wouldn't really want PDF graphics. Save what you see as a .png file wherever
#' it is you want. Note you can adjust the dimensions of the plot here, though you 
#' should be kinda careful (or experimental) with what you do here. If you adjust
#' the dimensions, it is *probably* advisable that you maintain the aspect ratio.
#' You can do whatever you want with what you just did, though you may be wanting
#' to drag it into a Word document. It's as simple as, well, dragging it into a 
#' Word document.
#' 
#' Now, here's the fun part that you should also be mindful to do. Remember the
#' `Data` object we created?

Data

#' Let's save it now. The exact format to which you want to save it is to your
#' discretion. If your data set is super small, then a .csv or a .tsv file is
#' sufficient. You can open those in Excel and look at it to no real problem.
#' However, the data we have here have more than 14,000 rows so that's not a 
#' productive thing to do. You could do it, but let's explore our options.
#' 
#' So, let's save it as an R serialized data frame. First, let's see where
#' our working directory is. It's important to note you and I are assuredly
#' going to have different working directories here.
#' 

getwd()

saveRDS(Data, "my-toy-data-set.rds")

#' Let's experiment with alternatives now.
haven::write_dta(Data, "my-toy-data-set.dta")
write_csv(Data, "my-toy-data-set.csv", na = '') # note, if you have NAs, do this.

# That's about it.