---
title: "Investigating objects and data patterns using base R" # potentially push to header
subtitle: "EDUC 260A: Managing and Manipulating Data Using R"
author:
date:
classoption: dvipsnames # for colors
fontsize: 8pt
urlcolor: blue
output:
  beamer_presentation:
    keep_tex: true
    toc: false
    slide_level: 3
    theme: default # AnnArbor # push to header?
    number_sections: true
    #colortheme: "dolphin" # push to header?
    #fonttheme: "structurebold"
    highlight: tango # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header
    df_print: default #default # tibble # push to header?
    latex_engine: xelatex # Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts.
    includes:
      in_header: ../beamer_header.tex
      #after_body: table-of-contents.txt
---

```{r, echo=FALSE, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE)
#knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE)
#comment = "#>" makes it so results from a code chunk start with "#>"; default is "##"
```

```{r, echo=FALSE, include=FALSE}
#THIS CODE DOWNLOADS THE MOST RECENT VERSION OF THE FILE beamer_header.tex AND SAVES IT TO THE DIRECTORY ONE LEVEL UP FROM THIS .RMD LECTURE FILE
download.file(url = 'https://raw.githubusercontent.com/anyone-can-cook/rclass1/master/lectures/beamer_header.tex',
              destfile = '../beamer_header.tex',
              mode = 'wb')
```

```{r, echo=FALSE, include=FALSE, eval = FALSE}
# Download images saved on github site
imgs <- c('transform-logical.png','fp1.JPG', 'fp2.JPG')

for (i in imgs) {
  if(!file.exists(i)){
    download.file(url = paste0('https://raw.githubusercontent.com/anyone-can-cook/rclass1/master/lectures/patterns_base_r/', i),
                  destfile = i,
                  mode = 'wb')
  }
}

# download images from Advanced R book
# the 3 carriage train
download.file(url = 'https://d33wubrfki0l68.cloudfront.net/1f648d451974f0ed313347b78ba653891cf59b21/8185b/diagrams/subsetting/train.png',
              destfile = 'three_carriage_train.png',
              mode = 'wb')

# the 1 car train vs. contents of car 1
download.file(url = 'https://d33wubrfki0l68.cloudfront.net/aea9600956ff6fbbc29d8bd49124cca46c5cb95c/28eaa/diagrams/subsetting/train-single.png',
              destfile = 'one_carriage_train_vs_contents.png',
              mode = 'wb')

# different versions of smaller trains
download.file(url = 'https://d33wubrfki0l68.cloudfront.net/ef5798a60926462b9fc080afb0145977eca70b83/039f5/diagrams/subsetting/train-multiple.png',
              destfile = 'smaller_trains.png',
              mode = 'wb')
```

### Lecture outline

\tableofcontents

```{r, eval=FALSE, echo=FALSE}
#Use this if you want TOC to show level 2 headings
\tableofcontents

#Use this if you don't want TOC to show level 2 headings
\tableofcontents[subsectionstyle=hide/hide/hide]
```

# Investigate objects, base R

### Load .Rdata data frames we will use today

Data on off-campus recruiting events by public universities

- Data frame object `df_event`
    - One observation per university, recruiting event
- Data frame object `df_school`
    - One observation per high school (visited and non-visited)

```{r}
rm(list = ls()) # remove all objects in current environment

getwd()
#load dataset with one obs per recruiting event
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData"))
#load("../../data/recruiting/recruit_event_somevars.Rdata")

#load dataset with one obs per high school
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_somevars.RData"))
#load("../../data/recruiting/recruit_school_somevars.Rdata")
```

## Functions to describe objects

### Simple base R functions to describe objects

This section introduces some base R functions to describe objects (some of these you have seen before)

- list files and objects, `list.files()` and `ls()`
- remove objects, `rm()`
- object type, `typeof()`
- object length (number of elements), `length()`
- object structure, `str()`
- number of columns and rows, `ncol()` and `nrow()`

I use the functions `typeof()`, `length()`, `str()` anytime I encounter a new object

- Helps me understand the object before I start working with it

### Listing objects

__Files in your working directory__

`list.files()` function lists files in your current working directory

- if you run this code from an .Rmd file, the working directory is the location where the .Rmd file is stored
```{r}
getwd() # what is your current working directory
list.files()
```

### Objects currently open in your R session

__Listing objects currently open in your R session__

`ls()` function lists objects currently open in R
```{r}
x <- "hello!"
ls() # Objects open in R
```

__Removing objects currently open in your R session__

`rm()` function removes specified objects open in R
```{r}
rm(x)
ls()
```

Command to remove all objects open in R (I don't run it)
```{r, eval=FALSE}
#rm(list = ls())
```

### Base R functions to describe objects, `typeof()`

`typeof()` function determines the internal storage type of an object (e.g., logical vector, integer vector, list)

- syntax
    - `typeof(x)`
- arguments
    - `x`: any R object
- help:
```{r, eval = FALSE}
?typeof
```

Examples

- Recall that a data frame is an object where __type__ is a list
```{r}
typeof(c(TRUE,TRUE,FALSE,NA))
typeof(df_event)
typeof(x = df_event)
```

### Base R functions to describe objects, `length()`

`length()` function determines the length of an R object

- for atomic vectors and lists, `length()` is the number of elements in the object
- syntax
    - `length(x)`
- arguments
    - `x`: any R object
- help:
```{r, eval = FALSE}
?length
```

Example, length of an atomic vector is the number of elements
```{r}
length(c(TRUE,TRUE,FALSE,NA))
```

Example, length of a list or data frame

- length of a list is the number of elements
- data frame is a list
- length of a data frame = number of elements = number of variables
```{r}
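# Illustrative sketch, not part of the lecture data: toy_list and toy_df are
# hypothetical objects made up here to show what length() counts
toy_list <- list(1:3, "a", c(TRUE, FALSE)) # list with 3 elements
length(toy_list) # 3 = number of elements
toy_df <- data.frame(id = 1:4, score = c(90, 85, 70, 60)) # 4 obs, 2 variables
length(toy_df) # 2 = number of variables (not number of rows)
nrow(toy_df) # 4 = number of observations

# The lecture data frame follows the same rule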
length(df_event) # = num elements = num columns
```

### Base R functions to describe objects, `str()`

`str()` function compactly displays the structure of an R object

- "structure" includes type, length, and attributes of object and also nested objects
- syntax: `str(object)`
- arguments (partial)
    - `object`: any R object
    - `max.level`: max level of nesting to display nested structures; default `NA` = all levels
- help: `?str`

```{r, eval = FALSE, include = FALSE}
?str
```

Example, atomic vectors
```{r}
str(c(TRUE,TRUE,FALSE,NA))
str(object = c(TRUE,TRUE,FALSE,NA))
```

Example, lists/data frames (output omitted)
```{r, results = "hide"}
x <- list(c(1,2), list("apple", "orange"), list(2, 3)) # list
str(x)

str(df_event) # data frame
```

### Base R functions to describe objects, `ncol()` and `nrow()`

`ncol()`, `nrow()`, and `dim()` functions

- Description
    - `ncol()` = number of columns; `nrow()` = number of rows
- syntax: `ncol(x)`, `nrow(x)`, `dim(x)`
- arguments
    - `x`: a vector, array, data frame, or NULL
- value/return:
    - if object `x` is an atomic vector: `ncol()` and `nrow()` return `NULL`
    - if object `x` is a list but not a data frame: `ncol()` and `nrow()` return `NULL`
    - if object `x` is a data frame: `ncol()` and `nrow()` return an integer vector of length 1

Example, object is a data frame
```{r}
ncol(df_event) # num columns = num elements = num variables
nrow(df_event) # num rows = num observations

# can wrap ncol() or nrow() within str() to see what functions return
#str(ncol(df_event))
```

Example, object is atomic vector or list that is not a data frame (output omitted)
```{r, results = "hide"}
ncol(c(TRUE,TRUE,FALSE,NA)) # atomic vector

x <- list(c(1,2), list("apple", "orange"), list(2, 3)) # list
nrow(x)
```

### Base R functions to describe objects, `dim()`

`dim()` function returns the dimensions of an object (e.g., number of rows and columns)

- syntax: `dim(x)`
- arguments
    - `x`: a vector, array, data frame, or NULL
- value/return:
    - if object `x` is a data frame:
`dim()` returns an integer vector of length 2
        - first element = number of rows; second element = number of columns
    - if object `x` is an atomic vector: `dim()` returns `NULL`
    - if object `x` is a list but not a data frame: `dim()` returns `NULL`

Example, object is a data frame
```{r}
dim(df_event) # shows number rows by columns
str(dim(df_event)) # can wrap dim() within str() to see what functions return
```

Example, object is atomic vector or list that is not a data frame (output omitted)
```{r, results = "hide"}
dim(c(TRUE,TRUE,FALSE,NA)) # atomic vector

x <- list(c(1,2), list("apple", "orange"), list(2, 3)) # list
dim(x)
```

## Variable names

### `names()` function

`names()` function gets or sets the names of elements of an object

- syntax:
    - get the names of an object: `names(x)`
    - set the names of an object: `names(x) <- value`
- arguments (partial)
    - `x`: an R object
    - `value`: a character vector with same length as object `x` or `NULL`
- value/return
    - `names(x)` returns a character vector of length = `length(x)` in which each element is the name of the element of `x`

Example, get names (of atomic vector)
```{r}
a <- c(v1=1,v2=2,3,v4="hi!") # named atomic vector
a
length(a)
names(a)
length(names(a)) # investigate length of object names(a)
str(names(a)) # investigate structure of object names(a)
```

### `names()` function

`names()` function gets or sets the names of elements of an object

- syntax:
    - get the names of an object: `names(x)`
    - set the names of an object: `names(x) <- value`
- arguments (partial)
    - `x`: an R object
    - `value`: a character vector with same length as object `x` or `NULL`
- value/return
    - `names(x)` returns a character vector of length = `length(x)` in which each element is the name of the element of `x`

Example, set names (of atomic vector)
```{r}
names(a) <- NULL # set names of vector a to NULL
a
names(a)

names(a) <- c("var1","var2","var3","var4") # set names of vector a
a
names(a)
```

### Applying `names()` function to a data frame

Recall
that a data frame is an object where __type__ is a __list__ and each __element__ is __named__

- each element is a variable
- each element name is a variable name

Example (output omitted)
```{r, results = "hide"}
names(df_event)
```

Investigate the object `names(df_event)`
```{r}
typeof(names(df_event)) # type = character vector
length(names(df_event)) # length = number of variables in data frame
str(names(df_event)) # structure of names(df_event)
```

We can even assign a new object based on `names(df_event)`
```{r}
names_event <- names(df_event)

typeof(names_event) # type = character vector
length(names_event) # length = number of variables in data frame
str(names_event) # structure of names(df_event)
```

### Variable names

Refer to specific named elements of an object using this syntax:

- `object_name$element_name`

When object is data frame, refer to specific variables using this syntax:

- `data_frame_name$varname`
- __This approach to isolating variables is very useful for investigating data__

```{r}
#df_event$instnm
typeof(df_event$instnm)
typeof(df_event$med_inc)
```

### Variable names

\medskip

Data frames are lists with the following criteria:

- each element of the list is (usually) a vector; each element of list is a variable
- length of data frame = number of variables
```{r}
length(df_event)
nrow(df_event)
#str(df_event)
```
- each element of the list (i.e., variable) has the same length
    - Length of each variable is equal to number of observations in data frame
```{r}
typeof(df_event$event_state)
length(df_event$event_state)
str(df_event$event_state)

typeof(df_event$med_inc)
length(df_event$med_inc)
str(df_event$med_inc)
```

### Variable names

The object `df_school` has one obs per high school

- variable `visits_by_100751` shows the number of visits by University of Alabama to each high school
- like all variables in a data frame, the var `visits_by_100751` is just a vector
```{r}
typeof(df_school$visits_by_100751)
length(df_school$visits_by_100751) # num elements in vector = num obs
str(df_school$visits_by_100751)
sum(df_school$visits_by_100751) # sum of values of var across all obs
```

We perform calculations on a variable like we would on any vector of same type
```{r}
v <- c(2,4,6)
typeof(v)
length(v)
sum(v)
```

## View and print data

### Viewing and printing, data frames

Many ways to view/print a data frame object. Here are three ways:

1. Simply type the object name (output omitted)
    - number of observations and variables printed depends on YAML header settings and on object "attributes" (attributes discussed in future week)
```{r, results="hide"}
df_event
```

2. Use the `View()` function to view data in a spreadsheet-style data viewer
```{r eval=FALSE}
View(df_event)
```

3. `head()` to show the first _n_ rows. The default is 6 rows.
```{r results="hide"}
#?head
#head(df_event)
head(df_event, n=5)
```

### Viewing and printing, data frames

`obj_name[,]` to print specific rows and columns of data frame

- particularly powerful when combined with sequences (e.g., `1:10`)

\medskip

Examples (output omitted):

- Print first five rows, all vars
```{r results="hide"}
df_event[1:5, ]
```

- Print first five rows and first three columns
```{r results="hide"}
df_event[1:5, 1:3]
```

- Print first three columns of the 100th observation
```{r results="hide"}
df_event[100, 1:3]
```

- Print the 50th observation, all variables
```{r results="hide"}
df_event[50,]
```

### Viewing and printing, variables within data frames

Recall that:

- `obj_name$var_name` prints specific elements (i.e., variables) of a data frame
```{r results="hide"}
df_event$zip
```

- each element (i.e., variable) of data frame is an __atomic vector__ with __length__ = number of observations
```{r}
typeof(df_event$zip)
length(df_event$zip)
```
- each element of a variable is the value of the variable for one observation

\medskip

Print specific elements (i.e., observations) of variable based on element position

- syntax: `obj_name$var_name[]`
- vectors don't have "rows" or "columns"; they
just have elements
- syntax combined with sequences (e.g., print first 10 observations)

```{r}
df_event$event_state[1:10] # print obs 1-10 of variable "event_state"
df_event$event_type[6:10] # print obs 6-10 of variable "event_type"
```

### Viewing and printing, variables within data frames

Print specific elements (i.e., observations) of variable based on element position

- syntax: `obj_name$var_name[]`

Example, print individual elements
```{r}
df_event$zip[1:5] # print obs 1-5 of variable for event zip code
df_event$zip[1] # print obs 1 of variable for event zip code
df_event$zip[5] # print obs 5 of variable for event zip code
df_event$zip[c(1,3,5)] # print obs 1, 3, and 5 of variable for event zip code
```

Print specific elements of multiple variables using combine function `c()`

- syntax: `c(obj_name$var1_name[], obj_name$var2_name[],...)`
- Example: print first five observations of variables `"event_state"` and `"event_type"`

```{r}
c(df_event$event_state[1:5],df_event$event_type[1:5])
```

### Exercise

Printing exercise using the `df_school` data frame

1. Use the `obj_name[,]` syntax to print the first 5 rows and 3 columns of the `df_school` data frame
1. Use the `head()` function to print the first 4 observations
1. Use the `obj_name$var_name[1:10]` syntax to print the first 10 observations of a variable in the `df_school` data frame
1. Use the combine function `c()` to print the first 3 observations of variables `school_type` and `name`

### Solution

1. Use the `obj_name[,]` syntax to print the first 5 rows and 3 columns of the `df_school` data frame
```{r}
df_school[1:5,1:3]
```

### Solution

2. Use the `head()` function to print the first 4 observations
```{r}
head(df_school, n=4)
```

### Solution

3. Use the `obj_name$var_name[1:10]` syntax to print the first 10 observations of a variable in the `df_school` data frame
```{r}
df_school$name[1:10]
```

### Solution

4. Use the combine function `c()` to print the first 3 observations of variables `school_type` and `name`
```{r}
c(df_school$school_type[1:3],df_school$name[1:3])
```

## Missing values

### Missing values

Missing values have the value `NA`

- `NA` is a special keyword, not the same as the character string `"NA"`

Use `is.na()` function to determine if a value is missing

- `is.na()` returns a logical vector
```{r}
is.na(5)
is.na(NA)
is.na("NA")
typeof(is.na("NA")) # example of a logical vector

nvector <- c(10,5,NA)
is.na(nvector)
typeof(is.na(nvector)) # example of a logical vector

svector <- c("e","f",NA,"NA")
is.na(svector)
```

### Missing values are "contagious"

What does "contagious" mean?

- most operations involving a missing value yield a missing value
```{r}
7>5
7>NA
sum(1,2,NA)
0==NA
2*c(0,1,2,NA)
NA*c(0,1,2,NA)
```

### Functions and missing values example, `table()`

`table()` function is useful for investigating categorical variables
```{r}
str(df_event$event_type)
table(df_event$event_type)
```

### Functions and missing values example, `table()`

By default `table()` ignores `NA` values
```{r}
#?table
str(df_event$school_type_pri)
table(df_event$school_type_pri)
```

`useNA` argument controls if table includes counts of `NA`s.
Allowed values:

- never ("no") [DEFAULT VALUE]
- only if count is positive ("ifany");
- even for zero counts ("always")
```{r}
nrow(df_event)
table(df_event$school_type_pri, useNA="always")
```

Broader point: Most functions that create descriptive statistics have options about how to treat missing values

- When investigating data, good practice to _always_ show missing values

# Subsetting using subset operators

### Subsetting to Extract Elements

"Subsetting" refers to isolating particular elements of an object

\medskip

Subsetting operators can be used to select/exclude elements (e.g., variables, observations)

- there are three subsetting operators: `[]`, `$` , `[[]]`
- these operators function differently based on vector types (e.g., atomic vectors, lists, data frames)

### Wickham refers to number of "dimensions" in R objects

An atomic vector is a 1-dimensional object that contains n elements
```{r}
x <- c(1.1, 2.2, 3.3, 4.4, 5.5)
str(x)
```

Lists are multi-dimensional objects

- Contains n elements; each element may contain a 1-dimensional atomic vector or a multi-dimensional list. Below list contains 3 dimensions
```{r}
list <- list(c(1,2), list("apple", "orange"))
str(list)
```

Data frames are 2-dimensional lists

- each element is a variable (dimension=columns)
- within each variable, each element is an observation (dimension=rows)
```{r}
ncol(df_school)
nrow(df_school)
```

## Subset atomic vectors using []

### Subsetting elements of atomic vectors

"Subsetting" a vector refers to isolating particular elements of a vector

- I sometimes refer to this as "accessing elements of a vector"
- subsetting elements of a vector is similar to "filtering" rows of a data-frame
- `[]` is the subsetting function for vectors

Six ways to subset an atomic vector using `[]`

1. Using positive integers to return elements at specified positions
2. Using negative integers to exclude elements at specified positions
3. Using logicals to return elements where corresponding logical is `TRUE`
4. Empty `[]` returns original vector (useful for dataframes)
5. Zero vector `[0]`, useful for generating test data
6. If vector is "named," use character vectors to return elements with matching names

### 1. Using positive integers to return elements at specified positions (subset atomic vectors using [])

Create atomic vector `x`
```{r}
(x <- c(1.1, 2.2, 3.3, 4.4, 5.5))
str(x)
```

`[]` is the subsetting function for vectors

- contents inside `[]` can refer to element number (also called "position").
    - e.g., `[3]` refers to contents of 3rd element (or position 3)

```{r}
x[5] #return 5th element

x[c(3, 1)] #return 3rd and 1st element

x[c(4,4,4)] #return 4th element, 4th element, and 4th element

#Return 3rd through 5th element
x[3:5]
```

### 2. Using negative integers to exclude elements at specified positions (subset atomic vectors using [])

Before excluding elements based on position, investigate object
```{r}
x

length(x)
str(x)
```

Use negative integers to exclude elements based on element position
```{r}
x[-1] # exclude 1st element

x[c(3,1)] # 3rd and 1st element
x[-c(3,1)] # exclude 3rd and 1st element
```

### 3. Using logicals to return elements where corresponding logical is `TRUE` (subset atomic vectors using [])

```{r}
x
```

When using `x[y]` to subset `x`, good practice to have `length(x)==length(y)`
```{r}
length(x) # length of vector x
length(c(TRUE,FALSE,TRUE,FALSE,TRUE)) # length of y
length(x) == length(c(TRUE,FALSE,TRUE,FALSE,TRUE)) # condition true
x[c(TRUE,TRUE,FALSE,FALSE,TRUE)]
```

Recycling rules:

- in `x[y]`, if logical vector `y` is shorter than `x`, R "recycles" the values of `y` until it matches the length of `x`
```{r}
length(c(TRUE,FALSE))
x
x[c(TRUE,FALSE)]
```

### 3. Using logicals to return elements where corresponding logical is `TRUE` (subset atomic vectors using [])

```{r}
x
```

Note that a missing value (`NA`) in the index always yields a missing value in the output:
```{r}
x[c(TRUE, FALSE, NA, TRUE, NA)]
```

Return all elements of object `x` where element is greater than 3:
```{r}
x # print object x
x>3 # for each element of x, print T/F whether element value > 3
str(x>3)
x[x>3] # prints only the values that had TRUE at that position
```

### 3. Using logicals to return elements where corresponding logical is `TRUE` (subset atomic vectors using []) [cont.]

The `visits_by_100751` column shows how many visits the University of Alabama made to each school. Let's subset this to only include schools that received more than 2 visits:
```{r}
df_school$visits_by_100751[1:100]
df_school$visits_by_100751[1:100]>2
df_school$visits_by_100751[df_school$visits_by_100751>2]
```

### 4. Empty `[]` returns original vector (subset atomic vectors using [])

```{r}
x
x[]
```

This is useful for subsetting data frames, as we will show below

### 5. Zero vector [0] (subset atomic vectors using [])

Zero vector, `x[0]`

- there is no element 0, so R returns a zero-length vector
```{r}
x[0]
```

Wickham states:

- "This is not something you usually do on purpose, but it can be helpful for generating test data."

### 6. If vector is named, use character vectors to return elements with matching names (subset atomic vectors using [])

Create vector `y` that has values of vector `x` but each element is named
```{r}
x
(y <- c(a=1.1, b=2.2, c=3.3, d=4.4, e=5.5))
```

Return elements of vector based on name of element

- enclose element names in single `''` or double `""` quotes
```{r}
#show element named "a"
y["a"]

#show elements "a", "b", and "d"
y[c("a", "b", "d" )]
```

## Subsetting lists/data frames using []

### Subsetting lists using []

Using `[]` operator to subset lists works the same as subsetting atomic vector

- Using `[]` with a list always returns a list
```{r}
list_a <- list(list(1,2),3,"apple")
str(list_a)

#create new list that consists of elements 3 and 1 of list_a
list_b <- list_a[c(3, 1)]
str(list_b)

#show elements 3 and 1 of object list_a
#str(list_a[c(3, 1)])
```

### Subsetting data frames using []

Recall that a data frame is just a particular kind of list

- each element = a column = a variable

Using `[]` with a list always returns a list

- Using `[]` with a data frame returns a data frame
    - exception: with "double index" syntax, selecting a single column of a base R data frame (not a tibble) returns a vector unless `drop = FALSE` is specified

Two ways to use `[]` to extract elements of a data frame

1. use "single index" `df_name[]` to extract columns (variables) based on element position number (i.e., column number)
1. use "double index" `df_name[, ]` to extract particular rows and columns of a data frame

### Subsetting data frames using [] to extract columns (variables) based on element position

Use "single index" `df_name[]` to extract columns (variables) based on element number (i.e., column number)

\medskip

Examples [output omitted]
```{r, results="hide"}
names(df_event)

#extract elements 1 through 4 (elements=columns=variables)
df_event[1:4]
df_event[c(1,2,3,4)]
str(df_event[1:4])

#extract columns 13 and 7
df_event[c(13,7)]
```

### Subsetting Data Frames to extract columns (variables) and rows (observations) based on positionality

Use "double index" syntax `df_name[, ]` to extract particular rows and columns of a data frame

- often combined with sequences (e.g., `1:10`)
```{r}
#Return rows 1-3 and columns 1-4
df_event[1:3, 1:4]

#Return rows 50-52 and columns 10 and 20
df_event[50:52, c(10,20)]
```

### Subsetting Data Frames to extract columns (variables) and rows (observations) based on positionality

Use "double index" syntax `df_name[, ]` to extract particular rows and columns of a data frame

\medskip

Recall that empty `[]` returns original object (output omitted)
```{r results="hide"}
#return original data frame
df_event[]

#return specific rows and all columns (variables)
df_event[1:5, ]

#return all rows and specific columns (variables)
df_event[, c(1,2,3)]
```

### Use [] to extract data frame columns based on variable names

Select columns from a data frame by subsetting with `[]` and a vector of element names (i.e., variable names) enclosed in quotes

\medskip

"Single index" approach extracts specific variables, all rows (output omitted)
```{r, results="hide"}
df_event[c("instnm", "univ_id", "event_state")]
```

"Double index" approach extracts specific variables and specific rows

- syntax `df_name[, ]`
```{r}
df_event[1:5, c("instnm", "event_state", "event_type")]
```

### Student exercises

Use subsetting operators from base R in extracting columns (variables), observations:

1. Use both "single index" and "double index" in subsetting to create a new dataframe by extracting the columns `instnm`, `event_date`, `event_type` from the `df_event` data frame. And show what columns (variables) are in the newly created dataframe.
2. Use subsetting to return rows 1-5 of columns `state_code`, `name`, `address` from the `df_school` data frame.

### Solution to Student Exercises

Solution to 1

__base R__ using subsetting operators
```{r}
# single index
df_event_br <- df_event[c("instnm", "event_date", "event_type")]
#double index
df_event_br <- df_event[, c("instnm", "event_date", "event_type")]
names(df_event_br)
```

Solution to 2

__base R__ using subsetting operators
```{r}
df_school[1:5, c("state_code", "name", "address")]
```

## Subsetting lists/data frames using [[]] and $

### Subset single element from object using [[]] operator, atomic vectors

So far we have used `[]` to extract elements from an object

- Apply `[]` to atomic vector: returns atomic vector with elements you requested
- Apply `[]` to list: returns list with elements you requested

`[[]]` also extracts elements from an object

- Applying `[[]]` to atomic vector gives same result as `[]`; that is, an atomic vector with element you request
```{r}
(x <- c(1.1, 2.2, 3.3, 4.4, 5.5))
str(x[3])
str(x[[3]])
```
- Caveat: when applying `[[]]` to atomic vector, you can only subset a single element
```{r}
x[c(3,4)] # single bracket; this works
#x[[c(3,4)]] # double bracket; this won't work
```

### Subsetting lists using `[]` vs. `[[]]`, introduce "train metaphor"

Applying `[[]]` to a list

- Understanding what `[]` vs. `[[]]` does to a list is very important but requires some explanation!

_Advanced R_ [chapter 4.3](https://adv-r.hadley.nz/subsetting.html#subset-single) by Wickham uses the "train metaphor" to explain a list vs. **contents** of a list and how this relates to `[]` vs.
`[[]]`

Below code chunk makes a list named `list_x` that contains 3 elements
```{r}
list_x <- list(1:3, "a", 4:6) # create list object
list_x
```

In our train metaphor, object `list_x` is a train that contains 3 carriages

[![](three_carriage_train.png)](https://adv-r.hadley.nz/subsetting.html#subset-single)

### Subsetting lists using `[]` vs. `[[]]`, introduce "train metaphor"

List object `list_x` is a train that contains 3 carriages
```{r, out.width = "45%", echo = FALSE}
library(knitr)
include_graphics("three_carriage_train.png")
#[![](three_carriage_train.png)](https://adv-r.hadley.nz/subsetting.html#subset-single)
```

When we "subset a list" -- that is, extract one or more elements from the list -- we have two broad choices (image below)
```{r, out.width = "45%", echo = FALSE}
library(knitr)
include_graphics("one_carriage_train_vs_contents.png")
# [![](one_carriage_train_vs_contents.png)](https://adv-r.hadley.nz/subsetting.html#subset-single)
```

1. Extracting elements using `[]` always returns a list, usually one with fewer elements
    - you can think of this as a train with fewer carriages
```{r}
#str(list_x)
str(list_x[1]) # returns a list
```
2. Extracting element using `[[]]` returns **_contents_** of particular carriage
    - I say applying `[[]]` to a list or data frame returns a simpler object that moves up one level of hierarchy
```{r}
str(list_x[[1]]) # returns an atomic vector
```

### Subset lists using `[]` vs. `[[]]`, deepen understanding of `[]`

Rules about applying subset operator `[]` to a list

- Applying `[]` to a list always returns a list
- The number of elements in the resulting list depends on what is typed inside `[]`

Here is a list object named `list_x`
```{r}
list_x <- list(1:3, "a", 4:6)
```

Here is an image of a few "trains" that can be created by applying `[]` to `list_x`
```{r, out.width = "45%", echo = FALSE}
library(knitr)
include_graphics("smaller_trains.png")
```

And here is code to create the "trains" shown in above image (output omitted)
```{r, results = "hide"}
list_x[1:2]
list_x[-2]
list_x[c(1,1)]
list_x[0]

list_x[] # returns the original list; not shown in above train picture
```

### Subset lists using `[]` vs. `[[]]`, deepen understanding of `[[]]`

Rules about applying subset operator `[[]]` to a list

- Can apply `[[]]` to return the **contents** of a **single element** of a list

Create list `list_x` and show "train" image of applying `list_x[1]` vs. `list_x[[1]]`
```{r}
list_x <- list(1:3, "a", 4:6)
```

```{r, out.width = "45%", echo = FALSE}
library(knitr)
include_graphics("one_carriage_train_vs_contents.png")
```

Object created by `list_x[1]` is a list with one element (output omitted)
```{r, results = "hide"}
list_x[1]
str(list_x[1])
```

Object created by `list_x[[1]]` is a vector with 3 elements (output omitted)

- `list_x[[1]]` gives us "contents" of element 1
- Since element 1 contains a numeric vector, object created by `list_x[[1]]` is a numeric vector
```{r, results = "hide"}
list_x[[1]]
str(list_x[[1]])
```

### Subset lists using `[]` vs. `[[]]`, deepen understanding of `[[]]`

Rules about applying subset operator `[[]]` to a list

- Can apply `[[]]` to return the **contents** of a **single element** of a list
```{r}
list_x <- list(1:3, "a", 4:6) # create list
list_x
```

We cannot use `[[]]` to subset multiple elements of a list (output omitted)

- e.g., we could write `list_x[[2]]` but not `list_x[[2:3]]`
```{r, eval = FALSE}
list_x[[c(2)]] # this works, subset element 2 using [[]]
list_x[[c(2,3)]] # this doesn't work; subset element 2 and 3 using [[]]
list_x[c(2,3)] # this works; subset element 2 and 3 using []
```

### Subset lists using `[]` vs. `[[]]`, deepen understanding of `[[]]`

Like `[]`, can use `[[]]` to return contents of __named__ elements specified using quotes

- syntax: `obj_name[["element_name"]]`
```{r}
list_x <- list(var1=1:3, var2="a", var3=4:6) # create list with named elements
```

Subset list `list_x` using `[[]]` element names
```{r}
list_x[["var1"]]
# subset by element position: list_x[[1]]

str(list_x[["var1"]])
str(list_x["var1"]) # note: suggests var name is attribute of list, not atomic vector
```

Can do same thing with data frames because data frames are lists (output omitted)

- e.g., `df_event[["zip"]]` returns contents of element named `"zip"`
- object created by `df_event[["zip"]]` is character vector of length = 18,680
```{r, results='hide'}
# df_event[["zip"]] # this works but long output
str(df_event[["zip"]]) # character vector; length = number of obs in df_event
typeof(df_event[["zip"]])
length(df_event[["zip"]])

str(df_event["zip"]) # by contrast, this is a dataframe w/ one variable
```

### General rules of applying `[]` vs `[[]]` to (nested) objects

What we just learned about applying `[]` vs `[[]]` to lists applies more generally to "nested objects"

- "nested objects" are objects with a hierarchical structure such that an element of an object contains another object

General rules of applying `[]` vs.
`[[]]` to nested objects

- subsetting any object `x` using `[]` will return an object with the same data structure as `x`
- subsetting any object `x` using `[[]]` will return an object that may or may not have the same data structure as `x`
    - if object `x` is not a nested object, then applying `[[]]` to a single element of `x` will return an object with the same data structure as `x`
    - if object `x` has a nested data structure, then applying `[[]]` to a single element of `x` will "move up one level of hierarchy" to extract the **contents** of an element within the object `x`

When working w/ data frames, functions that calculate things expect to be working with atomic vectors (think `[[]]`) not lists (think `[]`)
```{r}
mean(df_event[['med_inc']], na.rm = TRUE)
# mean(df_event['med_inc'], na.rm = TRUE) # by contrast, this returns NA with a warning
```

### Subset lists/data frames using $

```{r}
list_x <- list(var1=1:3, var2="a", var3=4:6)
```

`obj_name$element_name` is shorthand operator for `obj_name[["element_name"]]`

These three lines of code all give the same result
```{r}
list_x[[1]]
list_x[["var1"]]
list_x$var1
```

`df_name$var_name`: easiest way in base R to refer to variable in a data frame

- these two lines of code are equivalent

```{r}
str(df_event[["zip"]])
str(df_event$zip)
```

## Subset Data frames by combining [] and $

### Subset Data Frames by combining `[]` and `$`, Motivation

Motivation

- When working with data frames we often want to isolate those observations that satisfy certain conditions
- This is often referred to as "filtering"
    - We filter observations based on the values of one or more variables
- Perhaps you have seen "filtering" in Microsoft Excel
    - open some spreadsheet that contains variables (columns) and observations (rows)
    - click on `Data` >> `Filter` and then filter observations based on values of variable(s)

Filtering example using data frame `df_school`

- Observations:
    - One observation per high school (public and private)
- Variables:
    - high school
characteristics; number of off-campus recruiting visits from particular universities
    - NCES ID for UC Berkeley is `110635`
    - variable `visits_by_110635` shows number of visits a high school received from UC Berkeley
- **Task**:
    - Isolate observations where the high school received at least 1 visit from UC Berkeley

### Subset Data Frames by combining `[]` and `$`

**Task**:

- Isolate obs where school received at least 1 visit from UC Berkeley

General syntax: `df_name[df_name$var_name <condition>, ]`

- where `<condition>` is something that evaluates to `TRUE` or `FALSE` for each element of the atomic vector (i.e., variable)
- Note that syntax uses "double index" `df_name[<rows>, <columns>]` syntax
- Therefore, the `<condition>` in above syntax is isolating `<rows>`
- __Cannot__ use "single index" syntax `df_name[<columns>]`

Solution to task (output omitted)

- Note: below code filters observations but keeps all variables

```{r results="hide"}
df_school[df_school$visits_by_110635 >= 1, ]
```

### Subset Data Frames by combining `[]` and `$`, decompose syntax

**Task**: Isolate obs where school received at least 1 visit from UC Berkeley

- general syntax: `df_name[df_name$var_name <condition>, ]`
- solution: `df_school[df_school$visits_by_110635 >= 1, ]`

```{r results="hide", include = FALSE}
df_school[df_school$visits_by_110635 >= 1, ]
```

Decomposing syntax `df_school[df_school$visits_by_110635 >= 1, ]`

- `df_school$visits_by_110635 >= 1`: returns a logical (`TRUE`/`FALSE`) atomic vector with length equal to number of obs in `df_school`

```{r results="hide"}
typeof(df_school$visits_by_110635 >= 1)
length(df_school$visits_by_110635 >= 1)
str(df_school$visits_by_110635 >= 1)
```

- `df_school[df_school$visits_by_110635 >= 1, ]`
    - uses "double index" `df_name[<rows>, <columns>]` syntax to extract rows, columns
    - rows: extract rows where `df_school$visits_by_110635 >= 1` is `TRUE`
    - columns: since `<columns>` is empty, extracts all columns
    - __key point__: `df_name[df_name$var_name <condition>, ]` is "subset a vector approach #3": "Using logicals to return elements where condition
`TRUE`"
    - example using atomic vectors (output omitted)

```{r results="hide"}
x <- c(1.1, 2.2, 3.3, 4.4, 5.5)
x[x>3]
```

### Subset Data Frames by combining `[]` and `$`, keep desired columns

- General syntax to filter desired observations (rows) and variables (columns) of data frame:
    - `df_name[df_name$var_name <condition>, <columns>]`

__Tasks__ (output omitted)

- Extract observations where the high school received at least 1 visit from UC Berkeley and the first three columns

```{r results="hide"}
df_school[df_school$visits_by_110635 >= 1, 1:3]
```

- Extract observations where the high school received at least 1 visit from UC Berkeley and variables "state_code" "school_type" "name"

```{r results="hide"}
df_school[df_school$visits_by_110635 >= 1, c("state_code","school_type","name")]
```

### Subset Data Frames by combining `[]` and `$`, more examples

Syntax:

- filter based on one variable:
    - `df_name[df_name$var_name <condition>, ]`
- Example syntax to filter based on two conditions being true
    - `df_name[df_name$var_name <condition> & df_name$var_name <condition>, ]`

Pro tip:

- wrap above syntax within `nrow()` function to count how many observations (rows) satisfy the condition (as opposed to printing all rows that satisfy condition)

__Tasks__

- Count obs where high schools received at least 1 visit by Bama (100751) **and** at least one visit by Berkeley (110635)

```{r}
nrow(df_school[df_school$visits_by_110635 >= 1 &
               df_school$visits_by_100751 >= 1, ])

# Equivalently:
# nrow(df_school[df_school[["visits_by_110635"]] >= 1 &
#                df_school[["visits_by_100751"]] >= 1, ])
```

- Count obs where schools received 1+ visit by Bama **or** 1+ visit by Berkeley

```{r}
nrow(df_school[df_school$visits_by_110635 >= 1 | df_school$visits_by_100751 >= 1, ])
```

### Logical operators for comparisons

- Logical operators to isolate/filter observations of data frame

Symbol | Meaning
-------|-------
`==` | Equal to
`!=` | Not equal to
`>` | Greater than
`>=` | Greater than or equal to
`<` | Less than
`<=` | Less than or equal to
`&` | AND
`|` | OR
`%in%` | Is in (matches any value in a vector)

\medskip

- Visualization of "Boolean" operators (e.g., AND, OR, AND NOT)

!["Boolean" operations, x=left circle, y=right circle, from Wickham (2018)](transform-logical.png){width=40%}

### Subset Data Frames by combining `[]` and `$`, more examples

**Example**: Count the number of out-of-state high schools that UC Berkeley visited.

\smallskip

```{r}
# The `inst_110635` variable contains the home state of UC Berkeley
unique(df_school$inst_110635)

# Filter for schools visited by UC Berkeley AND whose state is not "CA"
nrow(df_school[df_school$visits_by_110635 >= 1 &
               df_school$state_code != df_school$inst_110635, ])
```

\bigskip

**Example**: Count the number of schools in the Northeast that received a visit from either UC Berkeley, U of Alabama, or CU Boulder.

\smallskip

```{r}
# Vector containing states located in the Northeast region
northeast_states <- c('CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY', 'PA')

# Filter for schools in the Northeast AND visited by any of the 3 univs
nrow(df_school[df_school$state_code %in% northeast_states &
               (df_school$visits_by_110635 >= 1 |
                df_school$visits_by_100751 >= 1 |
                df_school$visits_by_126614 >= 1), ])
```

### Subset Data Frames by combining `[]` and `$`, `NA` Observations

Filtering observations of data frame using `[]` combined with `$` is more complicated in the presence of missing values (`NA` values)

\medskip

The next few slides will explain

- why it is more complicated
- how to filter correctly when `NA`s are present

### Subset Data Frames by combining `[]` and `$`, `NA` Observations

When sub-setting via `[]` combined with `$`, result will include:

- rows where condition is `TRUE`
- __as well as__ rows where `<condition>` evaluates to `NA` (missing)

__Task__ (using `df_event`, which has one obs per university-recruiting event)

- How many events at public HS with at least $50k median household income?
```{r}
sum(is.na(df_event$med_inc)) # number of observations (all events) with missing values for med_inc

#num obs event_type=="public hs" and med_inc is missing
nrow(df_event[df_event$event_type == "public hs" & is.na(df_event$med_inc)==1 , ]) # note TRUE evaluates to 1

#num obs event_type=="public hs" & med_inc is not NA & med_inc >= $50,000
nrow(df_event[df_event$event_type == "public hs" & is.na(df_event$med_inc)==0 &
              df_event$med_inc>=50000 , ]) # note FALSE evaluates to 0

#num obs event_type=="public hs" and (med_inc >= $50,000 or med_inc is NA)
nrow(df_event[df_event$event_type == "public hs" & df_event$med_inc>=50000 , ])
```

### Subset Data Frames by combining `[]` and `$`, `NA` Observations

To exclude rows where condition is `NA` if subset using `[]` combined w/ `$`

- use `which()` to ask only for values where condition evaluates to `TRUE`
- `which()` returns position numbers for elements where condition is `TRUE`

```{r}
#?which
c(TRUE,FALSE,NA,TRUE)
str(c(TRUE,FALSE,NA,TRUE))
which(c(TRUE,FALSE,NA,TRUE))
```

__Task__: Count events at public HS with at least $50k median household income
```{r}
#Base R, `[]` combined with `$`; without which()
nrow(df_event[df_event$event_type == "public hs" & df_event$med_inc>=50000, ])

#Base R, `[]` combined with `$`; with which()
nrow(df_event[which(df_event$event_type == "public hs" & df_event$med_inc>=50000), ])
```

### Student Exercises

Subsetting Data Frames with `[]` and `$`:

1. Using `df_school`, count the public high schools in California with at least 50% Latinx (`pct_hispanic` in data) student enrollment.
2. Using `df_event`, count the out-of-state events at public high schools with median household income above $30K (do not forget to exclude missing values).
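
### Subset Data Frames by combining `[]` and `$`, `NA` toy example

A minimal sketch of why `NA` rows show up when filtering with `[]` and `$`, and how `which()` removes them. The data frame `df_toy_na` and its values are made up for illustration (they are not from `df_event`); only the variable names mimic the lecture data.

```{r}
# hypothetical toy data: 5 events, two with missing med_inc
df_toy_na <- data.frame(
  event_type = c("public hs", "public hs", "private hs", "public hs", "private hs"),
  med_inc = c(60000, NA, 80000, 40000, NA),
  stringsAsFactors = FALSE
)

# condition is TRUE, NA, FALSE, FALSE, FALSE
# (TRUE & NA evaluates to NA; FALSE & NA evaluates to FALSE)
cond <- df_toy_na$event_type == "public hs" & df_toy_na$med_inc >= 50000
cond

n_keep_na <- nrow(df_toy_na[cond, ])                # 2: the TRUE row plus one all-NA row
n_which   <- nrow(df_toy_na[which(cond), ])         # 1: which() keeps positions of TRUE only
n_not_na  <- nrow(df_toy_na[cond & !is.na(cond), ]) # 1: same result without which()
c(n_keep_na, n_which, n_not_na)
```

- Only the public HS event with missing `med_inc` produces an `NA` row; the private HS event with missing `med_inc` is dropped because its condition is `FALSE`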
### Solution to Student Exercises

Solution to 1

__base R__ using `[]` and `$`
```{r}
df_school_br1<- df_school[df_school$school_type == "public" &
                          df_school$pct_hispanic >= 50 &
                          df_school$state_code == "CA", ]
nrow(df_school_br1)
```

### Solution to Student Exercises

Solution to 2:

__base R__ using `[]` and `$`
```{r}
# use is.na to exclude NA
nrow(df_event[df_event$event_type == "public hs" & df_event$event_inst =="Out-State" &
              df_event$med_inc > 30000 & is.na(df_event$med_inc) ==0, ])

# use which to exclude NA
nrow(df_event[which(df_event$event_type == "public hs" & df_event$event_inst =="Out-State" &
                    df_event$med_inc > 30000 ), ])
```

# Subset using subset() function

### Subset function

`subset()` is a base R function to "filter" observations from some object `x`

- object `x` can be a matrix, data frame, list
- `subset()` automatically excludes elements/rows where the condition evaluates to `NA`
- Can also use `subset()` to select variables
- what `subset()` function returns:
    - "An object similar to x containing just the selected \ldots rows and columns (for a matrix or data frame)"
- `subset()` can be combined with:
    - assignment (`<-`) to create new objects
    - `nrow()` to count number of observations that satisfy criteria

```{r, eval=FALSE}
?subset
```

\medskip

Syntax [when object is data frame]: __subset(x, subset, select, drop = FALSE)__

- `x` is object to be subset
- `subset` is the logical expression(s) (evaluates to `TRUE/FALSE`) indicating elements (rows) to keep
- `select` indicates columns to select from data frame (if argument is not used, default will keep all columns)
- `drop` to preserve original __dimensions__ [SKIP]

### Subset function, examples

Recall the previous example where we count events at public HS with at least $50k median household income.

- _Note_.
`subset()` automatically excludes rows where condition is `NA`:

```{r}
#Base R, `[]` combined with `$`, without which(); includes `NA`
nrow(df_event[df_event$event_type == "public hs" & df_event$med_inc>=50000, ])

#Base R, `[]` combined with `$`, with which(); excludes `NA`
nrow(df_event[which(df_event$event_type == "public hs" & df_event$med_inc>=50000), ])

#Base R, `subset()`; excludes `NA`
nrow(subset(df_event, event_type == "public hs" & med_inc>=50000))

#Base R, `subset()`; excludes `NA`; explicitly name arguments of subset()
nrow(subset(x = df_event, subset = event_type == "public hs" & med_inc>=50000))
```

### Subset function, examples

Using `df_school`, show all public high schools in California with at least 50% Latinx (var=`pct_hispanic`) student enrollment

- Using base R, `subset()` [output omitted]

```{r, results="hide"}
#public high schools with at least 50% Latinx student enrollment
subset(x= df_school, subset = school_type == "public" & pct_hispanic >= 50 &
         state_code == "CA")
```

- Can wrap `subset()` within `nrow()` to count number of observations that satisfy criteria

```{r}
nrow(subset(df_school, school_type == "public" & pct_hispanic >= 50 &
              state_code == "CA"))
```

### Subset function, examples

Note that `subset()` keeps the observations for which the condition is `TRUE`
```{r}
nrow(subset(x = df_school, subset = TRUE))
nrow(subset(x = df_school, subset = FALSE))
```

### Subset function, examples

Count all CA public high schools that are at least 50% Latinx and received at least 1 visit from UC Berkeley (var=`visits_by_110635`)

```{r}
nrow(subset(df_school, school_type == "public" & pct_hispanic >= 50 &
              state_code == "CA" & visits_by_110635 >= 1))
```

### Subset function, examples

`subset()` can also use the `%in%` operator, which is a more concise alternative to chaining several `==` comparisons with the __OR__ operator `|`

- Count number of schools from MA, ME, or VT that received at least one visit from University of Alabama (var=`visits_by_100751`)

```{r}
nrow(subset(df_school, state_code %in% c("MA","ME","VT") & visits_by_100751 >= 1))
```

### Subset function, examples

Use the `select` argument within `subset()` to keep selected variables

- syntax: `select = c(var_name1,var_name2,...,var_name_n)`

Subset all CA public high schools that are at least 50% Latinx __AND__ only keep variables `name` and `address`

```{r}
subset(x = df_school, subset = school_type == "public" & pct_hispanic >= 50 &
         state_code == "CA", select = c(name, address))
```

### Subset function, examples

Combine `subset()` with assignment (`<-`) to create a new data frame

Create a new data frame of all CA public high schools that are at least 50% Latinx __AND__ only keep variables `name` and `address`

```{r}
df_school_v2 <- subset(df_school, school_type == "public" & pct_hispanic >= 50 &
                         state_code == "CA", select = c(name, address))

head(df_school_v2, n=5)
nrow(df_school_v2)
```

### Student Exercises

Using `subset()` from base R:

1. Create a new dataframe by extracting the columns `instnm`, `event_date`, `event_type` from the `df_event` data frame. Then show which columns (variables) are in the newly created dataframe.
2. Create a new dataframe from the `df_school` data frame that includes out-of-state public high schools with 50%+ Latinx student enrollment that received at least one visit by the University of California Berkeley (var=`visits_by_110635`). Then count the number of observations.
3. Count the number of public schools from CA, FL or MA that received one or two visits from UC Berkeley from the `df_school` data frame.
4. Subset all public out-of-state high schools visited by University of California Berkeley that enroll at least 50% Black students, and only keep variables `state_code`, `name` and `zip_code`.
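
### Subset function, toy example

A minimal sketch of how `subset()` combines a condition with `select`, and how its treatment of `NA` differs from `[]` combined with `$`. The data frame `df_toy_sub` and its values are made up for illustration; only the variable names mimic `df_school`.

```{r}
# hypothetical toy data: 4 schools, one with missing pct_hispanic
df_toy_sub <- data.frame(
  name = c("A", "B", "C", "D"),
  state_code = c("CA", "CA", "NY", "CA"),
  pct_hispanic = c(55, NA, 70, 20),
  stringsAsFactors = FALSE
)

# subset(): rows where condition is NA (school "B") are dropped
toy_sub <- subset(df_toy_sub, state_code == "CA" & pct_hispanic >= 50,
                  select = c(name, pct_hispanic))
toy_sub
nrow(toy_sub) # 1

# `[]` combined with `$`: same condition keeps an all-NA row for school "B"
n_bracket <- nrow(df_toy_sub[df_toy_sub$state_code == "CA" &
                             df_toy_sub$pct_hispanic >= 50, c("name", "pct_hispanic")])
n_bracket # 2
```

- Inside `subset()` we refer to variables by bare name (`state_code`), whereas `[]` needs `df_toy_sub$state_code`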
### Solution to Student Exercises

Solution to 1
```{r}
df_event_br <- subset(df_event, select=c(instnm, event_date, event_type))
names(df_event_br)
```

Solution to 2
```{r}
df_school_br <- subset(df_school, state_code != "CA" & school_type == "public" &
                         pct_hispanic >= 50 & visits_by_110635 >=1 )
nrow(df_school_br)
```

Solution to 3
```{r}
nrow(subset(df_school, state_code %in% c("CA", "FL", "MA") & school_type == "public" &
              visits_by_110635 %in% c(1,2) ))
```

### Solution to Student Exercises

Solution to 4
```{r}
subset(df_school, school_type == "public" & state_code != "CA" &
         visits_by_110635 >= 1 & pct_black >= 50,
       select = c(state_code, name, zip_code))
```

# Creating variables

### Create new data frame based on `df_school_all`

Data frame `df_school_all` has one obs per US high school and then variables identifying number of visits by particular universities
```{r}
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_allvars.RData"))
names(df_school_all)
```

### Create new data frame based on `df_school_all`

Create new version of data frame, called `school_v2`, which we'll use to introduce how to create new variables
```{r, results='hide'}
library(tidyverse) # below code uses tidyverse functions and pipe operator

school_v2 <- df_school_all %>%
  select(-contains("inst_")) %>% # remove vars whose names contain "inst_"
  rename( # rename selected variables
    visits_by_berkeley = visits_by_110635,
    visits_by_boulder = visits_by_126614,
    visits_by_bama = visits_by_100751,
    visits_by_stonybrook = visits_by_196097,
    visits_by_rutgers = visits_by_186380,
    visits_by_pitt = visits_by_215293,
    visits_by_cinci = visits_by_201885,
    visits_by_nebraska = visits_by_181464,
    visits_by_georgia = visits_by_139959,
    visits_by_scarolina = visits_by_218663,
    visits_by_ncstate = visits_by_199193,
    visits_by_irvine = visits_by_110653,
    visits_by_kansas = visits_by_155317,
    visits_by_arkansas = visits_by_106397,
    visits_by_sillinois = visits_by_149222,
    visits_by_umass =
visits_by_166629,
    num_took_read = num_took_rla,
    num_prof_read = num_prof_rla,
    med_inc = avgmedian_inc_2564
  )

glimpse(school_v2)
```

### Base R approach to creating new variables

Create new variables using the assignment operator `<-` together with the subsetting operators `[]` and `$`, which let us set conditions on the input variables

\medskip

Pseudo syntax: `df$newvar <- ...`

- where `...` argument is expression(s)/calculation(s) used to create new variables
- expressions can include subsetting operators and/or other base R functions

\medskip

__Task__: Create measure of percent of students on free-reduced lunch (computed as a proportion)

__base R approach__
```{r}
school_v2_temp<- school_v2 #create copy of dataset; not necessary

school_v2_temp$pct_fr_lunch <- school_v2_temp$num_fr_lunch/school_v2_temp$total_students

#investigate variable you created
str(school_v2_temp$pct_fr_lunch)
school_v2_temp$pct_fr_lunch[1:5] # print first 5 obs
```

__tidyverse approach (with pipes)__
```{r}
school_v2_temp <- school_v2 %>%
  mutate(pct_fr_lunch = num_fr_lunch/total_students)
```

### Base R approach to creating new variables

Creating a new variable based on the condition/values of input variables is basically the base R equivalent of tidyverse `mutate()` __with__ `if_else()` or `recode()`

\medskip

- Pseudo syntax: `df$newvar[logical condition]<- new value`
- `logical condition`: a condition that evaluates to `TRUE` or `FALSE`

### Base R approach to creating new variables

__Task__: Create 0/1 indicator if school has median income greater than $100k

__tidyverse approach (using pipes)__
```{r}
school_v2_temp %>% select(med_inc) %>%
  mutate(inc_gt_100k= if_else(med_inc>100000,1,0)) %>%
  count(inc_gt_100k) # note how NA values of med_inc treated
```

__Base R approach__
```{r}
school_v2_temp$inc_gt_100k<-NA #initialize an empty column with NAs
  # otherwise you'll get warning
school_v2_temp$inc_gt_100k[school_v2_temp$med_inc>100000] <- 1
school_v2_temp$inc_gt_100k[school_v2_temp$med_inc<=100000] <- 0
count(school_v2_temp,
inc_gt_100k)
```

### Creating variables

__Task__: Using data frame `wwlist` and input vars `state` and `firstgen`, create a 4-category var with following categories:

- "instate_firstgen"; "instate_nonfirstgen"; "outstate_firstgen"; "outstate_nonfirstgen"

__tidyverse approach (using pipes)__
```{r}
load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData"))

wwlist_temp <- wwlist %>%
  mutate(state_gen = case_when(
    state == "WA" & firstgen =="Y" ~ "instate_firstgen",
    state == "WA" & firstgen =="N" ~ "instate_nonfirstgen",
    state != "WA" & firstgen =="Y" ~ "outstate_firstgen",
    state != "WA" & firstgen =="N" ~ "outstate_nonfirstgen")
  )

str(wwlist_temp$state_gen)
wwlist_temp %>% count(state_gen)
```

### Base R approach to creating new variables

__Task__: Using `wwlist` and input vars `state` and `firstgen`, create a 4-category var

__base R approach__
```{r}
wwlist_temp <- wwlist

wwlist_temp$state_gen <- NA
wwlist_temp$state_gen[wwlist_temp$state == "WA" & wwlist_temp$firstgen =="Y"] <- "instate_firstgen"
wwlist_temp$state_gen[wwlist_temp$state == "WA" & wwlist_temp$firstgen =="N"] <- "instate_nonfirstgen"
wwlist_temp$state_gen[wwlist_temp$state != "WA" & wwlist_temp$firstgen =="Y"] <- "outstate_firstgen"
wwlist_temp$state_gen[wwlist_temp$state != "WA" & wwlist_temp$firstgen =="N"] <- "outstate_nonfirstgen"

str(wwlist_temp$state_gen)
count(wwlist_temp, state_gen)
```

# Appendix

## Sorting data

### Base R `sort()` for vectors

`sort()` is a base R function that sorts vectors

Syntax: `sort(x, decreasing=FALSE, ...)`

- where x is object being sorted
- By default it sorts in ascending order (low to high)
- Need to set decreasing argument to `TRUE` to sort from high to low

```{r}
#?sort()
x<- c(31, 5, 8, 2, 25)
sort(x)
sort(x, decreasing = TRUE)
```

### Base R `order()` for dataframes

`order()` is a base R function that returns the positions that would sort a vector; use it inside `[]` to sort the rows of a data frame

- Syntax: `order(..., na.last = TRUE, decreasing = FALSE)`
- where `...` are variable(s) to sort by
- By default
it sorts in ascending order (low to high)
- Need to set decreasing argument to `TRUE` to sort from high to low

The `decreasing` argument only works when we want either one (and only one) variable descending or all variables descending (when sorting by multiple vars)

- use `-` in front of a numeric variable when you want to indicate which variables are descending while using the default ascending sorting

```{r results="hide"}
df_event[order(df_event$event_date), ]
df_event[order(df_event$event_date, df_event$total_12), ]

#sort descending via argument
df_event[order(df_event$event_date, decreasing = TRUE), ]
df_event[order(df_event$event_date, df_event$total_12, decreasing = TRUE), ]

#sorting by both ascending and descending variables
df_event[order(df_event$event_date, -df_event$total_12), ]
```

### Example, sorting

- Create a new dataframe from `df_event` that sorts by ascending `event_date`, ascending `event_state`, and descending `pop_total`.

__base R__ using `order()` function:
```{r results="hide"}
df_event_br1 <- df_event[order(df_event$event_date, df_event$event_state,
                               -df_event$pop_total), ]
```
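
### Example, sorting with a descending character variable

The `-` trick above only works for numeric variables. A minimal sketch of one alternative for character variables: pass a vector to `decreasing` and set `method = "radix"`, which base R `order()` supports. The data frame `df_toy_sort` and its values are made up for illustration; only the variable names mimic `df_event`.

```{r}
# hypothetical toy data
df_toy_sort <- data.frame(
  event_state = c("CA", "NY", "CA", "NY"),
  instnm = c("Bama", "Pitt", "Rutgers", "Cinci"),
  pop_total = c(100, 300, 200, 400),
  stringsAsFactors = FALSE
)

# ascending event_state, descending numeric pop_total via `-`
by_num <- df_toy_sort[order(df_toy_sort$event_state, -df_toy_sort$pop_total), ]
by_num$pop_total # 200 100 400 300

# ascending event_state, descending character instnm
# (-df_toy_sort$instnm would raise an error, so use one decreasing value per variable)
by_chr <- df_toy_sort[order(df_toy_sort$event_state, df_toy_sort$instnm,
                            decreasing = c(FALSE, TRUE), method = "radix"), ]
by_chr$instnm # "Rutgers" "Bama" "Pitt" "Cinci"
```

- Note: `method = "radix"` sorts character variables in the C locale (uppercase letters before lowercase), which may differ from the default ordering for mixed-case text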