# Introduction to R Workshop # UW Tacoma, Autumn 2018 # This script is a companion file to the workshop slides found here: # https://clanfear.github.io/Intro_R_Workshop/ # Overview # 1. R and RStudio Orientation # 2. Packages # 3. Creating and Using Objects # 4. Dataframes and Indexing # 5. Basic Analyses # 6. Resources for Further Learning ##### # 1. R and RStudio Orientation # R as a Calculator # In the **console**, type `123 + 456 + 789` and hit `Enter`. 123 + 456 + 789 # Now in your blank R document in the **editor**, try typing the line # sqrt(400) and either clicking *Run* or hitting `Ctrl+Enter` or `⌘+Enter`. sqrt(400) # `sqrt()` is an example of a **function** in R. # If we didn't have a good guess as to what `sqrt()` will do, we can type `?sqrt` in the console # and look at the **Help** panel on the right. ?sqrt # **Arguments** are the *inputs* to a function. In this case, the only argument to `sqrt()` # is `x` which can be a number or a vector of numbers. # Help files provide documentation on how to use functions and what functions produce. # Creating Objects # R stores *everything* as an **object**, including data, functions, models, and output. # Creating an object can be done using the **assignment operator**: `<-` new.object <- 144 # **Operators** like `<-` are functions that look like symbols but typically sit between their arguments # (e.g. numbers or objects) instead of having them inside `()` like in `sqrt(x)`. # .footnote[[1] We can actually call operators like other functions by stuffing them between backticks: \`+\`(x,y)] # We do math with operators, e.g., `x + y`. `+` is the addition operator! # Calling Objects # You can display or "call" an object simply by using its name. new.object # Object names can contain `_` and `.` in them, but cannot *begin* with numbers. Try # to be consistent in naming objects. RStudio auto-complete means *long names are better # than vague ones*! # *Good names1 save confusion later.* # .footnote[[1] "There are only two hard things in Computer Science: cache invalidation and naming things." - Phil Karlton] # Using Objects # An object's **name** represents the information stored in that **object**, so you can treat the object's name # as if it were the values stored inside. new.object + 10 new.object + new.object sqrt(new.object) # Creating Vectors # A **vector** is a series of **elements**, such as numbers. # You can create a vector and store it as an object in the same way. To do this, use the # function `c()` which stands for "combine" or "concatenate". new.object <- c(4, 9, 16, 25, 36) new.object # If you name an object the same name as an existing object, *it will overwrite it*. # You can provide a vector as an argument for many functions. sqrt(new.object) # Character Vectors # We often work with data that are categorical. To create a vector of text elements—**strings** in programming terms—we must place the text in quotes: string.vector <- c("Atlantic", "Pacific", "Arctic") string.vector # Categorical data can also be stored as a **factor**, which has an underlying numeric representation. Models will convert factors to dummies.1 factor.vector <- factor(string.vector) factor.vector # .footnote[[1] Factors have **levels** which you can use to set a reference category in models using `relevel()`.] # Saving and Loading Objects # You can save an R object on your computer as a file to open later: save(new.object, file="new_object.RData") # You can open saved files in R as well: load("new_object.RData") # But where are these files being saved and loaded from? # Working Directories # R saves files and looks for files to open in your current **working directory**1. # You can ask R what this is: getwd() # Similarly, we can set a working directory like so: setwd("C:/Users/cclan/Documents") # More Complex Objects # The same principles shown with vectors can be used with more complex objects # like **matrices**, **arrays**, **lists**, and **dataframes** (lists which look like matrices but can hold multiple data types at once). # Most data sets you will work with will be read into R and stored as a **dataframe**, so the remainder of this workshop will mainly focus on using these objects. # Loading Dataframes # Delimited Text Files # The easiest way to work with external data—that isn't in R format—is for it to be stored in a *delimited* text file, e.g. comma-separated values (**.csv**) or tab-separated values (**.tsv**). # R has a variety of built-in functions for importing data stored in text files, like `read.table()` and `read.csv()`.1 # .footnote[[1] Use "write" versions (e.g. `write.csv()`) to create these files from R objects.] # By default, these functions will read *character* (string) columns in as a *factor*. # To disable this, use the argument `stringsAsFactors = FALSE`, like so: new_df <- read.csv("some_spreadsheet.csv", stringsAsFactors = FALSE) # This will error since there is no spreadsheet here! # Data from Other Software # Working with **Stata**, **SPSS**, or **SAS** users? You can use a **package** to bring in their saved data files: # * `foreign` #+ Part of base R #+ Functions: `read.spss()`, `read.dta()`, `read.xport()` #+ Less complex but sometimes loses some metadata #* `haven` #+ Part of the `tidyverse` family #+ Functions: `read_spss()`, `read_dta()`, `read_sas()` #+ Keeps metadata like variable labels # For less common formats, Google it. I've yet to encounter a data format without an # R package to handle it (or at least a clever hack). # If you encounter an ambiguous file extension (e.g. `.dat`), try opening it with # a good text editor first (e.g. Atom, Sublime); there's a good chance it is actually raw text # with a delimiter or fixed format that R can handle! # Installing Packages # But what are packages? # Packages contain functions (and sometimes data) created by the community. The real power of R is found in add-on packages! # For the remainder of this workshop, we will work with data from the `gapminder` package. # These data are a panel data describing 142 countries observed every 5 years from 1952 to 2007. # We can install `gapminder` from the Comprehensive R Archive Network (CRAN): install.packages("gapminder") # You only need to install a package **once** for any given version of R. You need to reinstall packages after upgrading R. # Loading Packages # To load a package, use `library()`: library(gapminder) # Once a package is loaded, you can call on functions or data inside it. data(gapminder) # Places data in your global environment head(gapminder) # Displays first six elements of an object # Indexing and Subsetting # Indices and Dimensions # In base R, there are two main ways to access elements of objects: square brackets (`[]` or `[[]]`) and `$`. How you access an object depends on its *dimensions*. # Dataframes have *2* dimensions: **rows** and **columns**. Square brackets allow us to numerically **subset** in the format of `object[row, column]`. Leaving the row or column place empty selects *all* elements of that dimension. gapminder[1,] # First row gapminder[1:3, 3:4] # First three rows, third and fourth column #<< # The **colon operator** (`:`) generates a vector using the sequence of integers from its first argument to its second. `1:3` is equivalent to `c(1,2,3)`. # Dataframes and Names # Columns in dataframes can also be accessed using their names with the `$` extract operator. This will return the column as a vector: gapminder$gdpPercap[1:10] # Note here I *also* used brackets to select just the first 10 elements of that column. # You can mix subsetting formats! In this case I provided only a single value (no column index) because **vectors** have *only one dimension* (length). # If you try to subset something and get a warning about "incorrect number of dimensions", check your subsetting! # Indexing by Expression # We can also index using expressions—logical *tests*. gapminder[gapminder$year==1952, ] # How Expressions Work # What does `gapminder$year==1952` actually do? head(gapminder$year==1952, 50) # display first 50 elements # It returns a vector of `TRUE` or `FALSE` values. # When used with the subset operator (`[]`), elements for which a `TRUE` is given are returned while those corresponding to `FALSE` are dropped. # Logical Operators # We used `==` for testing "equals": `gapminder$year==1952`. # There are many other [logical operators](http://www.statmethods.net/management/operators.html): # * `!=`: not equal to # * `>`, `>=`, `<`, `<=`: less than, less than or equal to, etc. # * `%in%`: used with checking equal to one of several values # Or we can combine multiple logical conditions: # * `&`: both conditions need to hold (AND) # * `|`: at least one condition needs to hold (OR) # * `!`: inverts a logical condition (`TRUE` becomes `FALSE`, `FALSE` becomes `TRUE`) # Logical operators are one of the foundations of programming. You should experiment with these to become familiar with how they work! # Sidenote: Missing Values # Missing values are coded as `NA` entries without quotes: vector_w_missing <- c(1, 2, NA, 4, 5, 6, NA) # Even one `NA` "poisons the well": You'll get `NA` out of your calculations unless you remove them manually or use the extra argument `na.rm = TRUE` in some functions: mean(vector_w_missing) mean(vector_w_missing, na.rm=TRUE) # Finding Missing Values # **WARNING:** You can't test for missing values by seeing if they "equal" (`==`) `NA`: vector_w_missing == NA # But you can use the `is.na()` function: is.na(vector_w_missing) # We can use subsetting to get the equivalent of `na.rm=TRUE`: mean(vector_w_missing[!is.na(vector_w_missing)]) #<< #`!` *reverses* a logical condition. Read the above as "subset *not* `NA`" # Multiple Conditions Example # Let's say we want observations from Oman after 1980 and through 2000. gapminder[gapminder$country == "Oman" & gapminder$year > 1980 & gapminder$year <= 2000, ] # Note we always need to use the full object name in each subseting argument, rather than just `country == "Oman"` alone. You can subset one object using another this way (e.g. `gapminder[other_data$some_variable == x, ]`). # Saving a Subset # If we think a particular subset will be used repeatedly, we can save it and give it a name like any other object: China <- gapminder[gapminder$country == "China", ] head(China, 4) # Another Operator: `%in%` # A common thing we may want to do is subset rows to things in some *set*. # We can use `%in%` like `==` but it matches *any element* in the vector on its right. former_yugoslavia <- c("Bosnia and Herzegovina", "Croatia", "Macedonia", "Montenegro", "Serbia", "Slovenia") yugoslavia <- gapminder[gapminder$country %in% former_yugoslavia, ] head(yugoslavia) # Create New Columns # We can create new columns (variables) in a dataframe using the same subsetting functions. yugoslavia$pop_million <- yugoslavia$pop / 1000000 yugoslavia$life_exp_past_40 <- yugoslavia$lifeExp - 40 head(yugoslavia) # `ifelse()` # A common function used in general in R programming is `ifelse()`. This returns a value depending on logical tests. ifelse(test = x==y, yes = 1, no = 2) # This will error because it is just an example # Output from `ifelse()`: # * `ifelse()` returns the value assigned to `yes` (in this case, `1`) if `x==y` is `TRUE`. # * `ifelse()` returns `no` (in this case, `2`) if `x==y` is `FALSE`. # * `ifelse()` returns `NA` if `x==y` is neither `TRUE` nor `FALSE`. # Note we can omit explicitly typing function arguments like `test = ` if we enter them in the order of arguments shown in the function's help page. # `ifelse()` Example yugoslavia$short_country <- ifelse(yugoslavia$country == "Bosnia and Herzegovina", "B and H", as.character(yugoslavia$country)) yugoslavia[yugoslavia$year==1952, c(1,9)] # Selecting just columns 1 and 9 # Read this as "For each row, if `country` equals `"Bosnia and Herzegovina"`, # make `short_country` equal to `"B and H"`, otherwise make it equal to that row's # original value of `country` (as character, rather than factor, data)." # This is a simple way to change some values but not others! # Note that you can split arguments to a function into multiple lines for clarity, so long as lines end with an operator (like `+`) or comma (used to separate arguments). # Analyses ## Basic Graphics and Models # Histograms # We can use the `hist()` function to generate a histogram of a vector: hist(gapminder$lifeExp, xlab = "Life Expectancy (years)", #<< main = "Observed Life Expectancies of Countries") #<< # `xlab =` is used to set the label of the x-axis of a plot. # `main = ` is used to set the title of a plot. # Use `?hist` to see additional options available for customizing a histogram. # Scatter Plots # RUN ALL THREE OF THESE FUNCTONS TO GET FULL PLOT: THEY STACK! plot(lifeExp ~ gdpPercap, data = gapminder, #<< xlab = "ln(GDP per Capita)", ylab = "Life Expectancy (years)", main = "Life Expectancy and log GDP per Capita", pch = 16, log="x") # log="x" sets x axis to log scale! abline(h = mean(gapminder$lifeExp), col = "firebrick") #<< abline(v = mean(gapminder$gdpPercap), col = "cornflowerblue") #<< # Note that `lifeExp ~ gdpPercap` is a **formula** of the type `y ~ x`. The first element (`lifeExp) gets plotted on the y-axis and the second (`gdpPercap`) goes on the x-axis. # The `abline()` calls place horizontal (`h =`) or vertical (`v =`) lines at the means of the variables used in the plot. # Formulae # Most modeling functions in R use a common formula format—the same seen with the previous plot: new_formula <- y ~ x1 + x2 + x3 new_formula class(new_formula) # The dependent variable goes on the left side of `~` and independent variables go on the right. # See here for more on [formulae](https://www.datacamp.com/community/tutorials/r-formula-tutorial). # Simple Tables # `table()` creates basic cross-tabulations of vectors. table(mtcars$cyl, mtcars$am) # Chi-Square # We can give the output from `table()` to `chisq.test()` to perform a Chi-Square test of assocation. chisq.test(table(mtcars$cyl, mtcars$am)) # Note the warning here. You can use rescaled (`rescale.p=TRUE`) or simulated p-values (`simulate.p.value=TRUE`) if desired. # T Tests # T tests for mean comparisons are simple to do. gapminder$post_1980 <- ifelse(gapminder$year > 1980, 1, 2) t.test(lifeExp ~ post_1980, data=gapminder) # Linear Models # We can run an ordinary least squares linear regression using `lm()`: lm(lifeExp~pop + gdpPercap + year + continent, data=gapminder) # Note we get a lot less output here than you may have expected! This is because # we're only viewing a tiny bit of the information produced by `lm()`. We need to expore the object `lm()` creates! # Model Summaries # The `summary()` function provides Stata-like regression output: lm_out <- lm(lifeExp~pop + gdpPercap + year + continent, data=gapminder) summary(lm_out) # Model Objects # `lm()` produces a lot more information than what is shown by `summary()` however. We can see the **str**ucture of `lm()` output using `str()`: str(lm_out) #`lm()` actually has an enormous quantity of output! This is a type of object called a **list**. # Model Objects # We can access parts of `lm()` output using `$` like with dataframe names: lm_out$coefficients # We can also do this with `summary()`, which provides additional statistics: summary(lm_out)$coefficients # ANOVA # ANOVAs can be fit and summarized just like `lm()` summary(aov(lifeExp ~ continent, data=gapminder)) # More Complex Models # R supports many more complex models, for example: # * `glm()` has syntax similar to `lm()` but adds a `family =` argument to specify model families and link functions like logistic regression # + ex: `glm(x~y, family=binomial(link="logit"))` # * The `lme4` package adds hierarchical (multilevel) GLM models. # * `lavaan` fits structural equation models with intuitive syntax. # * `plm` and `tseries` fit time series models. # Most of these other packages support mode summaries with `summary()` and all create output objects which can be accessed using `$`. # Because R is the dominant environment for statisticians, the universe of modeling tools in R is *enormous*. If you need to do it, it is probably in a package somewhere.