#' ---
#' title: "An Intro to R, RStudio, and {tidyverse}"
#' layout: lab
#' permalink: /lab-scripts/lab-1/
#' filename: intro-r-rstudio.R
#' active: lab-scripts
#' abstract: "This lab script offers what I think is a gentle introduction
#' to R and RStudio. It will try to acclimate students to R as a programming
#' language and to RStudio as an IDE for the R programming language. There will be
#' a few recurring themes here that are subtle but deceptively critical. 1) You
#' really must know where you are on your computer without referencing icons
#' you can click (i.e. know your working directory and the path to it). 2) You
#' can push any number of buttons in RStudio, but everything is still a command
#' in a terminal/console. Pay careful attention to that information as it's
#' communicated to you."
#' output:
#'   md_document:
#'     variant: gfm
#'     preserve_yaml: TRUE
#' ---
#+ setup, include=FALSE
knitr::opts_chunk$set(collapse = TRUE,
                      fig.path = "figs/lab-1/",
                      cache.path = "cache/lab-1/",
                      fig.width = 11,
                      comment = "#>")
#+
#' ## Elsewhere in My R Cinematic Universe
#'
#' Some of what I offer here will have to be aggressively plagiarized from
#' other resources I've made available. I started teaching graduate-level
#' methods at my previous employer and the bulk of
#' [what I wrote there](http://post8000.svmiller.com/lab-scripts/intro-r-rstudio.html)
#' will be ported over here. Likewise, I give basically
#' [the same tutorial](http://ir3-2.svmiller.com/lab-scripts/lab-1/) to our
#' third-semester BA students. There is a much, much older guide that I wrote
#' [back in 2014](http://svmiller.com/blog/2014/08/a-beginners-guide-to-using-r/)
#' that you may or may not find super useful. I don't know what else to say here.
#' When it comes to introducing students to R, you're going to repeat yourself
#' on loop.
#'
#' ## Configure RStudio
#'
#' When you're opening R for the very first time, it'll be useful to just get a
#' general sense of what's happening. I have
#' [a beginner's guide that I wrote in 2014](http://svmiller.com/blog/2014/08/a-beginners-guide-to-using-r/)
#' (where did the time go!). Notice that I built it around
#' [RStudio](https://rstudio.com/products/rstudio/), which you should download
#' as well. RStudio Desktop is free. Don't pay for a "pro" version. You're not
#' running a server. You won't need it.
#'
#' When you download and install RStudio *on top* of R, you should customize it
#' just a tiny bit to make the most of the graphical user interface. To do what
#' I recommend doing, select "Tools" in the menu. Scroll to "Global Options"
#' (which should be at the bottom). On the pop-up, select "Pane Layout."
#' Rearrange it so that "Source" is top left, "Console" is top right, and the
#' files/plots/packages/etc. pane is the bottom right. Thereafter: apply the changes.
#'
#' ![](http://post8000.svmiller.com/intro-r-rstudio/rstudio-global-options.png)
#'
#' You don't have to do this, but I think you should since it better economizes
#' space in RStudio. The other pane (environment/history, Git, etc.) is stuff
#' you can either learn to not need (e.g. what's in the environment) or will
#' only situationally need at an advanced level (e.g. Git information). Minimize
#' that outright. When you're in RStudio, much of what you'll be doing leans on
#' the script window and the console window. You'll occasionally be using the
#' file browser and plot panes as well.
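#'
#' If you want to convince yourself that the console really is just R waiting
#' on commands, type something small into it and hit Enter. What follows is a
#' minimal illustration (nothing you need to memorize), and the exact output
#' will depend on your installation and operating system.

# Two harmless commands you can try in the console. Your output will differ from mine.
R.version.string # which version of R you're running
Sys.time()       # the current date/time, as R sees it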
#'
#' If you have not done so already, open a new script (Ctrl-Shift-N in
#' Windows/Linux or Cmd-Shift-N in Mac).
#'
#' ## Get Acclimated in R
#'
#' Now that you've done that, let's get a general sense of where you are in an R session.
#'
#' ### Current Working Directory
#'
#' First, let's identify the current working directory. You should know where
#' you are, and this happens to be where I am, given the location of this script.

getwd()

#' Of note: by default, R's working directory is the system's "home" directory. This is somewhat
#' straightforward in Unix-derivative systems, where there is an outright "home" directory.
#' Assume your username is "steve"; then, in Linux, your home directory will be "/home/steve". On a Mac,
#' I think it's something like "/Users/steve". Windows users will invariably have something
#' clumsy like "C:/Users/steve/Documents". Notice the forward slashes.
#' R, like everything else in the world, uses forward slashes. The backslashes owe to
#' Windows' derivation from DOS.
#'
#' ### Create "Objects"
#'
#' Next, let's create some "objects." R is primarily an "object-oriented" programming language.
#' In as many words, inputs create outputs that may be assigned to objects in the workspace.
#' You can go nuts here. Of note: I've seen R programmers use `=`, `->`, and `<-`
#' interchangeably for object assignment, but I've seen instances where `=` doesn't work as
#' intended for object assignment. `->` is an option and I use it for assignment
#' for some complex objects in a "pipe" (more on that later).
#' For simple cases (and for beginners), lean on `<-`.

a <- 3
b <- 4
A <- 7

a + b
A + b

# what objects did we create?
# Notice we did not save a + b or A + b to an object
# Also notice how a pound sign creates a comment? Kinda cool, right?
# Always make comments to yourself.
ls()

#' Some caution, though. First, don't create objects with really complex names.
#' To call them back requires getting every character right in the console or
#' script. Why inconvenience yourself? Second, R comes with some default
#' objects that are kinda important and can seriously ruin things downstream.
#' I don't know off the top of my head all the default objects in R, but there
#' are some important ones like `T` and `F` (shorthand for `TRUE` and `FALSE`)
#' that you DO NOT want to overwrite. `pi` is another one you should not
#' overwrite, and `data` is a function that serves a specific purpose (even if
#' you probably won't be using it a whole lot). You can, however, assign some
#' built-in objects to new objects.

this_Is_a_long_AND_WEIRD_objEct_name_and_yOu_shoUld_not_do_this <- 5
pi # notice there are a few built-in functions/objects
d <- pi # you can assign one built-in object to a new object.
# pi <- 3.14 # don't do this....

#' If you do something dumb (like overwrite `T` with something), all hope
#' is not lost. Just remove the object in question with the `rm()` command.
#'
#' ### Install/Load Libraries
#'
#' R depends on user-created libraries for much of its functionality. This class will lean on just a
#' few R libraries. The first, `{tidyverse}`, is our workhorse for workflow. It'll also be
#' the longest to install because it comes with lots of dependencies to maximize its
#' functionality. [`{stevedata}`](http://svmiller.com/stevedata) contains toy
#' data sets that I use for in-class instruction, and we'll make use of these
#' data in these lab sessions (and in your problem sets).
#' [`{stevemisc}`](http://svmiller.com/stevemisc) contains assorted helper functions
#' that I wrote for my research, which we'll also use in this class.
#' [`{stevetemplates}`](http://svmiller.com/stevetemplates) is not strictly necessary,
#' but it will make doing your homework infinitely easier (even if you're not a LaTeX user).
#' `{lmtest}`, which is not a package I maintain, does various model diagnostics for OLS.
#'
#' I *may*---and probably will, to be honest---ask you to install various other packages
#' that I think you should have installed. Already, I can see that the last problem set
#' is going to be a "choose your own adventure" at the end, and will request that you have
#' either the `{fixest}` or `{modelr}` package installed. I hope to keep these
#' situations to a minimum.
#'
#' If any of these result in a "non-zero exit status", that's R's way of saying "I
#' couldn't install this." For you Mac users, the answer to this situation is *probably*
#' "update [Xcode](https://developer.apple.com/xcode/)." Xcode is a developer tool suite for Apple,
#' and many of the `{tidyverse}` packages require access to developmental libraries that,
#' to the best of my understanding, are available in Xcode. In all likelihood, you're a first-time
#' user who has not had to think about software development (and so you haven't updated Xcode since
#' you first got your Macbook). You might have to do that here.
#'
#' For you Windows users: I think I've figured out what this may look like for you based on my recent
#' foray into the university's computer labs. The Windows corollary to Xcode is Rtools, which you *don't* have
#' installed by default (because it's not a Microsoft program, per se). You'll need to install it.
#' First, take inventory of what version of R you have (for the university's computer labs, it should be
#' 4.0.5). [Go to this website](https://cran.r-project.org/bin/windows/Rtools/) and download the
#' version of Rtools that corresponds with the version of R you have. Just click through all the default
#' options so that it can install. Next, in RStudio, open a new blank file and copy-paste the following code
#' into it.
#'

# PATH="${RTOOLS40_HOME}\usr\bin;${PATH}"

#' I'll add the caveat that you should remove the hashtag and space preceding that line.
#'
#' Next, save the file as `.Renviron` in your default working directory, which is probably where you are
#' if you are using RStudio for the first time. The save prompt from RStudio will advise you that this is
#' no longer an `.R` file (and, duh, just tell it to save anyway). Afterwards, restart RStudio and try
#' again. This *should* fix it, based on my recent trial run in the university's computer labs.
#'
#' For you Linux users: you're awesome, have great hair, everyone likes you, and
#' you don't need to worry about a thing, *except* the various developmental
#' libraries you may have to install from your package repository. My flavor of
#' Linux is in the Debian/Ubuntu family, so
#' [here's an (incomplete) list of developmental libraries](http://svmiller.com/blog/2019/07/notes-to-self-new-linux-installation-r-ubuntu/)
#' based on my experience. Helpfully, most R packages that fail this way will
#' tell you what development library you need, whether you're in the Debian
#' or Red Hat family.
#'
#' If you have yet to install these packages (and you almost certainly have not
#' if you're opening R for the first time), install them as follows.
#' Note that I'm just commenting out this command so it doesn't run when
#' I compile this script on my end.

# Take out the comment...
# install.packages(c("tidyverse", "stevedata", "stevemisc", "stevetemplates", "lmtest"))

#' Once they're all installed, you can load the libraries with the `library()` command.
#' Of note: you only need to install a package once, but you'll need to load the library
#' for each R session. You won't really need to load `{stevetemplates}` for anything since
#' its core functionality is its integration with RStudio. Let's load `{tidyverse}` and
#' `{stevedata}` in this session, since they're what I'll typically use.

library(tidyverse)
library(stevedata)

#' For those of you having `{tidyverse}` installation issues because
#' `{systemfonts}` needs some font-related development libraries, try this.
#' Again, take out the comments if you want this to run.

# library(tibble)    # special data type we'll use
# library(magrittr)  # pipe operator
# library(dplyr)     # the workhorse
# library(readr)     # for reading particular data types.
# library(stevedata) # for data

#' These are the core packages in `{tidyverse}` that you should have
#' installed. Loading `{tidyverse}` loads all of these. It's basically a wrapper.
#' Here, you're just being explicit.
#'
#' ### Load Data
#'
#' Problem sets and lab scripts will lean on data I make available in `{stevedata}`.
#' However, you may often find that you want to download a data set from
#' somewhere else and load it into R. Example data sets would be stuff like the
#' European Values Survey, the European Social Survey, or the Varieties of Democracy
#' project, or whatever else. You can do this any number of ways, and it will depend
#' on what file format you downloaded. Here are some
#' commands you'll want to learn for these circumstances:
#'
#' - `haven::read_dta()`: for loading Stata .dta files
#' - `haven::read_spss()`: for loading SPSS binaries (typically .sav files)
#' - `read_csv()`: for loading comma-separated values (CSV) files
#' - `readxl::read_excel()`: for loading MS Excel spreadsheets
#' - `read_tsv()`: for loading tab-separated values (TSV) files
#' - `readRDS()`: for loading R serialized data frames, which are awesome for file compression/speed
#'
#' Notice that functions like `read_dta()`, `read_spss()`, and `read_excel()`
#' require some other packages that I didn't mention. However, these other
#' packages/libraries are part of the `{tidyverse}` and are just not loaded
#' directly with it. Under these conditions, you can avoid directly loading a
#' library into a session by referencing it first and grabbing the function
#' you want from within it, separated by two colons (`::`). Basically,
#' `haven::read_dta()` could be interpreted as a command saying "using the
#' `{haven}` library, grab the `read_dta()` command in it".
#'
#' These wrappers are also flexible with files on the internet. For example, the
#' following will work. Just remember to assign it to an object.

# Note: hypothetical data
Apply <- haven::read_dta("https://stats.idre.ucla.edu/stat/data/ologit.dta")

# Let's take a look at these data.
Apply

#' ## Learn Some Important R/"Tidy" Functions
#'
#' I want to spend most of our time in this lab session teaching you some basic commands
#' you should know to do basically anything in R. These are so-called "tidy" verbs. We'll be using
#' some data available in `{stevedata}`.
#' This is the `pwt_sample` data, which includes yearly
#' economic data for a handful of rich countries that are drawn from version 10.0 of the Penn
#' World Table. If you're in RStudio, you can learn more about these data by typing the following
#' command.

?pwt_sample

#' I want to dedicate the bulk of this section to learning some core functions that are part
#' of the `{tidyverse}`. My introduction here will inevitably be incomplete because
#' there's only so much I can teach within the limited time I have. That said,
#' I'm going to focus on the following functions available in the
#' `{tidyverse}` that totally rethink base R. These are the "pipe" (`%>%`),
#' `glimpse()` and `summary()`, `select()`, `group_by()`,
#' `summarize()`, `mutate()`, and `filter()`.
#'
#' ### The Pipe (`%>%`)
#'
#' I want to start with the pipe because I think of it as the most
#' important function in the `{tidyverse}`. The pipe---represented as `%>%`---allows
#' you to chain together a series of functions. The pipe is especially useful
#' if you're recoding data and you want to make sure you got everything
#' the way you wanted (and correct) before assigning the data to
#' another object. You can chain together *a lot* of `{tidyverse}`
#' commands with pipes, but we'll keep our introduction here rather minimal
#' because I want to use it to teach about some other things.
#'
#' ### `glimpse()` and `summary()`
#'
#' `glimpse()` and `summary()` will get you basic descriptions of your data.
#' Personally, I find `summary()` more informative than `glimpse()`, though `glimpse()`
#' is useful if your data have a lot of variables and you want to just peek
#' into the data without spamming the R console with output.
#'
#' Notice, here, the introduction of the pipe (`%>%`). In the commands below,
#' `pwt_sample %>% glimpse()` is equivalent to `glimpse(pwt_sample)`, but I like to
#' lean more on pipes than perhaps others would. My workflow starts with (data)
#' objects, applies various functions to them, and assigns them to objects. I think
#' you'll get a lot of mileage thinking that same way too.

pwt_sample %>% glimpse() # notice the pipe
pwt_sample %>% summary()

#' ### `select()`
#'
#' `select()` is useful for basic (but important) data management. You can use it to grab
#' (or omit) columns from data. For example, let's say I wanted to grab all the columns
#' in the data. I could do that with the following command.

pwt_sample %>% select(everything()) # grab everything

#' Do note this is kind of a redundant command. You could just as well spit the entire data
#' into the console and it would've done the same thing. Still, here's what it looks like if
#' I wanted everything except the labor share of income variable.

pwt_sample %>% select(-labsh) # grab everything, but drop the labsh variable.

#' Here's a more typical case. Assume you're working with a large data object and you
#' just want a handful of things. In this case, we have all these economic data on these
#' 21 countries (ed. we really don't, but roll with it), but we just want the GDP data
#' along with the important identifying information for country and year. Here's
#' how we'd do that in the `select()` function, again with some assistance from the pipe.

pwt_sample %>% select(country, year, rgdpna) # grab just these three columns.
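#' As an aside, `select()` can also rename columns as it grabs them, which is
#' occasionally handy for presentation. This is a minimal, purely illustrative
#' sketch; the `gdp` name below is just something I'm making up for the example.

pwt_sample %>%
  # grab country, year, and rgdpna, renaming rgdpna to "gdp" on the way out
  select(country, year, gdp = rgdpna)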
#' ### Grouping data for grouped functions (`group_by()`, or `.by=`)
#'
#' I think the pipe is probably the most important function in the `{tidyverse}` even as
#' a critical reader might note that the pipe is 1) a port from another package (`{magrittr}`)
#' and 2) now part of base R under slightly different notation. Thus, the critical reader
#' (and probably me, depending on my mood) may note that grouping functions---
#' whether through `group_by()` or `.by`---are probably the most important
#' component of the `{tidyverse}`. Basically, `group_by()` allows you to "split"
#' the data into various subsets, "apply" various functions to them, and
#' "combine" them into one output. You might see that terminology "split-apply-combine"
#' as you learn more about the `{tidyverse}` and its development.
#'
#' Here, let's do a simple `group_by()` exercise, while also introducing you to
#' another function: `slice()`. We're going to group by country in `pwt_sample` and "slice"
#' the first observation for each group/country. Notice how we can chain these together
#' with a pipe operator.

# Notice we can chain some pipes together
pwt_sample %>%
  # group by country
  group_by(country) %>%
  # Get me the first observation, by group.
  slice(1)

#' If you don't group by country first, `slice(., 1)` will just return the first
#' observation in the data set.

pwt_sample %>%
  # Get me the first observation for each country
  slice(1) # womp womp. Forgot to group_by()

#' I offer one caveat here. If you're applying a group-specific function (that you
#' need just once), it's generally advisable to "ungroup" (i.e. `ungroup()`) as the
#' next function in your pipe chain. As you build together chains/pipes, the intermediate
#' output you get will advise you of any "groups" you've declared in your data. Don't
#' lose track of those. This is incidentally why the `{tidyverse}` effectively
#' "retired" the `group_by()` function for `.by` as an argument in these functions.
#' `.by` will always return un-grouped data whereas `group_by()` always returns
#' grouped data.
#'
#' Observe:

pwt_sample %>%
  # group by country
  group_by(country) %>%
  # Get me the first observation, by group.
  slice(1)

pwt_sample %>%
  slice(1, .by=country)

#' ### `summarize()`
#'
#' `summarize()` creates condensed summaries of your data, for whatever it is
#' that you want. Here, for example, is a kind of dumb way of seeing how many
#' observations are in the data. `nrow(pwt_sample)` works just as well, but
#' alas...

pwt_sample %>%
  # How many observations are in the data?
  summarize(n = n())

#' More importantly, `summarize()` works wonderfully with `group_by()` or `.by=`.
#' For example, for each country (`group_by(country)`), let's get the
#' maximum GDP observed in the data.

pwt_sample %>%
  group_by(country) %>%
  # Give me the max real GDP observed in the data.
  summarize(maxgdp = max(rgdpna, na.rm=T))

#' `.by` does the same here.

pwt_sample %>%
  # Give me the max real GDP observed in the data, .by country.
  summarize(maxgdp = max(rgdpna, na.rm=T), .by=country)

#' One downside (or feature, depending on your perspective) to `summarize()` is that it
#' condenses data and discards stuff that's not necessary for creating the condensed output.
#' In the case above, notice we didn't ask for what year we observed the maximum GDP
#' for a given country. We just asked for the maximum. If you wanted something that would
#' also tell you what year that particular observation was, you'll probably want a `slice()` command
#' in lieu of `summarize()`.
#'
#' Observe:

pwt_sample %>%
  group_by(country) %>%
  # translated: give me the row, for each country, in which real GDP is the max (ignoring missing values).
  slice(which(rgdpna == max(rgdpna, na.rm=T)))

# or...
pwt_sample %>%
  # translated: give me the row, for each country, in which real GDP is the max (ignoring missing values).
  slice(which(rgdpna == max(rgdpna, na.rm=T)), .by=country)

#' This is a convoluted way of thinking about `summarize()`, but you'll probably
#' find yourself using it a lot.
#'
#' ### `mutate()`
#'
#' `mutate()` is probably the most important `{tidyverse}` function for data
#' management/recoding. It will allow you to create new columns while
#' retaining the original dimensions of the data. Consider it the sister
#' function to `summarize()`. But, where `summarize()` discards,
#' `mutate()` retains.
#'
#' Let's do something simple with `mutate()`. For example, the `rgdpna` column
#' is real GDP in million 2017 USD. What if we wanted to convert that from
#' millions to billions? This is simple with `mutate()`. Helpfully, you can create
#' a new/recoded column while keeping the original/raw column intact.
#' This is great for reproducibility in your data management. One thing I will
#' want to reiterate to you through our sessions is that you should *never* overwrite
#' raw data you have. Always create new columns if you're recoding something.
#'
#' Anyway, here's "Wonderw-"... sorry, here's that new real GDP in billions
#' variable we wanted.

pwt_sample %>%
  # Convert rgdpna from real GDP in millions to real GDP in billions
  mutate(rgdpnab = rgdpna/1000) %>%
  # select just what we want for presentation
  select(country:year, rgdpna, rgdpnab)

#' Let's assume we wanted to create a dummy variable for observations in the
#' data from the Great Recession forward. In other words, let's create
#' a dummy variable for all observations that were in 2008 or later.

pwt_sample %>%
  mutate(post_recession = ifelse(year >= 2008, 1, 0)) %>%
  select(country:year, post_recession)

#' Knowing these data go to 2019, we can do this another way as well.

pwt_sample %>%
  mutate(post_recession = ifelse(year %in% c(2008:2019), 1, 0)) %>%
  select(country:year, post_recession)

#' Economists typically care about GDP per capita, right? We can create that
#' kind of data ourselves based on information that we have in `pwt_sample`.

pwt_sample %>%
  mutate(rgdppc = rgdpna/pop) %>%
  select(country:year, rgdpna, pop, rgdppc)

#' Notice that `mutate()` also works beautifully with `group_by()`. For example,
#' you may recognize that these data are panel data. We have 21 countries
#' (cross-sectional units) across 70 years (time units). If you don't believe
#' me, check this out...

pwt_sample %>%
  summarize(n = n(),
            min = min(year),
            max = max(year),
            .by=country) %>%
  data.frame

#' You might know---or should know, as you progress---that some panel methods
#' look for "within" effects inside cross-sectional units by looking at the
#' value of some variable relative to the cross-sectional average for that
#' variable. Let's use the real GDP per capita variable we can create as an
#' example. Observe what's going to happen here.

pwt_sample %>%
  mutate(rgdppc = rgdpna/pop) %>%
  select(country:year, rgdpna, pop, rgdppc) %>%
  mutate(meanrgdppc = mean(rgdppc),
         diffrgdppc = rgdppc - mean(rgdppc),
         .by=country)

#' That `diffrgdppc` variable practically "centers" the real GDP per capita
#' variable, and its values communicate the difference from the mean.
#' This is a so-called "within" variable: a transformation of a variable such
#' that it now communicates changes in that variable "within" a cross-sectional unit.
#'
#' ### `filter()`
#'
#' `filter()` is a great diagnostic tool for subsetting your data to look at
#' particular observations. Notice one little thing, especially if you're new to
#' programming. The use of double equal signs (`==`) is for making logical
#' statements, whereas a single equal sign (`=`) is for object assignment or
#' column creation. If you're using `filter()`, you're probably wanting to find
#' cases where something equals something (`==`), is greater than something (`>`),
#' is equal to or greater than something (`>=`), is less than something (`<`), or
#' is less than or equal to something (`<=`).
#'
#' Here, let's grab just the American observations by filtering to where `isocode` is "USA".

pwt_sample %>%
  # give me just the USA observations
  filter(isocode == "USA")

#' We could also use `filter()` to select observations from the most recent year.

pwt_sample %>%
  # give me the observations from the most recent year.
  filter(year == max(year))

#' If we do this last part, we've converted the panel to a cross-sectional data set.
#'
#' ## Don't Forget to Assign!
#'
#' When you're done applying functions/doing whatever to your data, don't forget
#' to assign what you've done to an object. For simple cases, and for beginners,
#' I recommend thinking "left-handed" and using `<-` for object assignment
#' (as we did above). When you're doing stuff in the pipe, my "left-handed"
#' thinking prioritizes the starting data in the pipe chain. Thus, I tend to
#' use `->` for object assignment at the end of the pipe.
#'
#' Consider a simple example below. I'm starting with the original data
#' (`pwt_sample`). I'm using a simple pipe to create a new variable
#' (within `mutate()`) that converts the real GDP variable from millions to
#' billions. Afterward, I'm assigning it to a new object (`Data`) with `->`.

pwt_sample %>%
  # convert real GDP to billions
  mutate(rgdpnab = rgdpna/1000) -> Data

Data
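#'
#' One last, purely optional sanity check. The new object should now show up in
#' your workspace alongside the other objects we created earlier, and, if you
#' wanted to pick it up again in a later session, you could write it to disk.
#' This is a minimal sketch and the file names below are hypothetical; save to
#' whatever name and location make sense for your own working directory.

ls() # `Data` should now be in this list

# Take out the comments if you actually want to save these files...
# saveRDS(Data, "lab-1-data.rds")   # an R serialized data frame; read it back with readRDS()
# write_csv(Data, "lab-1-data.csv") # or a plain CSV, by way of {readr}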