--- title: 'Lecture 3: Investigating data patterns using Base R' author: "" date: "" output: beamer_presentation: df_print: tibble highlight: tango includes: in_header: ../beamer_header.tex keep_tex: yes latex_engine: xelatex slide_level: 3 theme: default toc: no pdf_document: toc: no fontsize: 8pt subtitle: Managing and Manipulating Data Using R classoption: dvipsnames urlcolor: blue --- ```{r, echo=FALSE, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##" ``` # Introduction ### What we will do today \tableofcontents ```{r, eval=FALSE, echo=FALSE} #Use this if you want TOC to show level 2 headings \tableofcontents #Use this if you don't want TOC to show level 2 headings \tableofcontents[subsectionstyle=hide/hide/hide] ``` ### Load libraries and .Rdata data frames we will use today Data on off-campus recruiting events by public universities - Data frame object `df_event` - One observation per university, recruiting event - Data frame object `df_school` - One observation per high school (visited and non-visited) ```{r} rm(list = ls()) # remove all objects in current environment library(tidyverse) #load tidyverse library #load dataset with one obs per recruiting event load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData")) #load dataset with one obs per high school load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_somevars.RData")) ``` ### Why learn to "wrangle" data both via tidyverse and base R? __Tidyverse__ has become the leading way many people clean and manipulate data in R - these packages make data wrangling simpler than core base R commands (most times) - tidyverse commands can be more more efficient (less lines of code, consolidate steps) \medskip But you will inevitably run into edge cases where tidyverse commands don't work the way you expect them to and you'll need to use __base R__ \medskip It's good to have a basic foundation on both approaches and then decide which you prefer for most data tasks! - this class will primarily use tidyverse approach - future data science seminar will provide examples of edge cases where base R is necessary ### Tidyverse vs. base R functions tidyverse | base R | operation -------|-------|------- `select()` | `subset()` __OR__ `[ ]`+ `c()` | "extract" variables `filter()` | `subset()` __OR__ `[ ]`+ `$` | "extract" observations `arrange()` | `order()` | sorting data # Subsetting using subset() function ### Subset function The `subset()` is a base R function and easiest way to "filter" observations - can also used `subset()` to select variables - Like tidyverse `filter()`, `subset()` can be combined with: - with assignment (`<-`) to create new objects - with `count()` to count number of observations that satisfy criteria ```{r, eval=FALSE} ?subset ``` \medskip Syntax [when object is data frame]: __subset(x, subset, select, drop = FALSE)__ - `x` is object to be subset - `subset` is the logical expression(s) (evaluates to `TRUE/FALSE`) indicating elements (rows) to keep - `select` indicates columns to select from data frame (if argument is not used default will keep all columns) - `drop` to preserve original __dimensions__ [SKIP] - cane take values `TRUE` or `FALSE`; default is `FALSE` - only need to worry about dataframes when subset output is single column ### Subset function, examples Using `df_school`, show all public high schools that are at least 50% Latinx (var=`pct_hispanic`) student enrollment in California - Using tidyverse `filter()` [output omitted] ```{r, results="hide"} filter(df_school, school_type == "public", pct_hispanic >= 50, state_code == "CA") filter(df_school, school_type == "public" & pct_hispanic >= 50 & state_code == "CA") # same as above ``` - Using base R, `subset()` [output omitted] ```{r, results="hide"} #public high schools with at least 50% Latinx student enrollment subset(df_school, school_type == "public" & pct_hispanic >= 50 & state_code == "CA") ``` ### Subset function, examples Count all CA public high schools that are at least 50% Latinx - Can wrap `filter()` or `subset()` within `count()` to count number of observations that satisfy criteria ```{r} #filter() count(filter(df_school, school_type == "public", pct_hispanic >= 50, state_code == "CA")) count(filter(df_school, school_type == "public" & pct_hispanic >= 50 & state_code == "CA")) #subset() count(subset(df_school, school_type == "public" & pct_hispanic >= 50 & state_code == "CA")) ``` ### Subset function, examples Note that both `filter()` and `subset()` identify the number of observations for which the condition is `TRUE` ```{r} count(filter(df_school, TRUE)) count(subset(df_school, TRUE)) count(filter(df_school, FALSE)) count(subset(df_school, FALSE)) ``` ### Subset function, examples Count all CA public high schools that are at least 50% Latinx and received at least 1 visit from UC Berkeley (var=`visits_by_110635`) ```{r} #filter() count(filter(df_school, school_type == "public", pct_hispanic >= 50, state_code == "CA", visits_by_110635 >= 1)) #subset() count(subset(df_school, school_type == "public" & pct_hispanic >= 50 & state_code == "CA" & visits_by_110635 >= 1)) ``` ### Subset function, examples `subset()` can also use %in% operator, which is more efficient version of __OR__ operator `|` - Count number of schools from MA, ME, or VT that received at least one visit from University of Alabama (var=`visits_by_100751`) ```{r} #filter() count(filter(df_school, state_code %in% c("MA","ME","VT"), visits_by_100751 >= 1)) #subset() count(subset(df_school, state_code %in% c("MA","ME","VT") & visits_by_100751 >= 1)) ``` ### Subset function, examples Use the `select` argument within `subset()` to keep selected variables - syntax: `select = c(var_name1,var_name2,...,var_name_n)` Subset all CA public high schools that are at least 50% Latinx __AND__ only keep variables `name` and `address` ```{r} subset(df_school, school_type == "public" & pct_hispanic >= 50 & state_code == "CA", select = c(name, address)) ``` ### Subset function, examples Combine `subset()` with assignment (`<-`) to create a new data frame Create a new date frame of all CA public high schools that are at least 50% Latinx __AND__ only keep variables `name` and `address` ```{r} df_school_v2 <- subset(df_school, school_type == "public" & pct_hispanic >= 50 & state_code == "CA", select = c(name, address)) head(df_school_v2, n=5) nrow(df_school_v2) ``` ### Student Exercises Compare tidyverse to subset() from base R in extracting columns (variables), observations: 1. Use both base R and tidyverse to create a new dataframe by extracting the columns `instnm`, `event_date`, `event_type` from df_event. And show what columns (variables) are in the newly created dataframe. 2. Use both base R and tidyverse to create a new dataframe from df_school that includes out-of-state public high schools with 50%+ Latinx student enrollment that received at least one visit by the University of California Berkeley (var= visits_by_110635). And count the number of observations. 3. Use both base R and tidyverse to count the number of public schools from CA, FL or MA that received one or two visits from UC Berkeley from df_school. 4. Use base R to subset all public out-of-state high schools visited by University of California Berkeley that enroll at least 50% Black students, and only keep variables "state_code", "name" and "zip_code" . ### Solution to Student Exercises Solution to 1 __base R__ using `subset()` function ```{r} df_event_br <- subset(df_event, select=c(instnm, event_date, event_type)) names(df_event_br) ``` __tidyverse__ using `select()` function ```{r} df_event_tv <- select(df_event, instnm, event_date, event_type) names(df_event_tv) ``` Solution to 2 __base R__ using `subset()` function ```{r} df_school_br <- subset(df_school, state_code != "CA" & school_type == "public" & pct_hispanic >= 50 & visits_by_110635 >=1 ) nrow(df_school_br) ``` __tidyverse__ using `filter()` function ```{r} df_school_tv <- filter(df_school, state_code != "CA" & school_type == "public" & pct_hispanic >= 50 & visits_by_110635 >=1 ) nrow(df_school_tv) ``` ### Solution to Student Exercises Solution to 3 __base R__ using `subset()` function ```{r} count(subset(df_school, state_code %in% c("CA", "FL", "MA") & school_type == "public" & visits_by_110635 %in% c(1,2) )) ``` __tidyverse__ using `filter()` function ```{r} count(filter(df_school, state_code %in% c("CA", "FL", "MA") & school_type == "public" & visits_by_110635 %in% c(1,2) )) ``` ### Solution to Student Exercises Solution to 4 __base R__ using `subset()` function ```{r} subset(df_school, school_type == "public" & state_code != "CA" & visits_by_100751 >= 1 & pct_hispanic >= 50, select = c(state_code, name, zip_code)) ``` # Subsetting using subsetting operators ### Subsetting to Extract Elements "Subsetting" refers to isolating particular elements of an object \medskip Subsetting operators can be used to select/exclude elements (e.g., variables, observations) - there are three subsetting operators: `[]`, `$` , `[[]]` - these operators function differently based on vector types (e.g, atomic vectors, lists, data frames) ### Wichham refers to number of "dimensions" in R objects An atomic vector is a 1-dimensional object that contains n elements ```{r} x <- c(1.1, 2.2, 3.3, 4.4, 5.5) str(x) ``` Lists are multi-dimensional objects - Contains n elements; each element may contain a 1-dimensional atomic vector or a multi-dimensional list. Below list contains 3 dimensions ```{r} list <- list(c(1,2), list("apple", "orange")) str(list) ``` Data frames are 2-dimensional lists - each element is a variable (dimension=columns) - within each variable, each element is an observation (dimension=rows) ```{r} ncol(df_school) nrow(df_school) ``` ## Subset atomic vectors using [] ### Subsetting elements of atomic vectors "Subsetting" a vector refers to isolating particular elements of a vector - I sometimes refer to this as "accessing elements of a vector" - subsestting elements of a vector is similar to "filtering" rows of a data-frame - `[]` is the subsetting function for vectors Six ways to subset an atomic vector using `[]` 1. Using positive integers to return elements at specified positions 2. Using negative integers to exclude elements at specified positions 3. Using logicals to return elements where corresponding logical is `TRUE` 4. Empty `[]` returns original vector (useful for dataframes) 5. Zero vector [0], useful for testing data 6. If vector is "named," use character vectors to return elements with matching names ### 1. Using positive integers to return elements at specified positions (subset atomic vectors using []) Create atomic vector `x` ```{r} (x <- c(1.1, 2.2, 3.3, 4.4, 5.5)) str(x) ``` `[]` is the subsetting function for vectors - contents inside `[]` can refer to element number (also called "position"). - e.g., `[3]` refers to contents of 3rd element (or position 3) ```{r} x[5] #return 5th element x[c(3, 1)] #return 3rd and 1st element x[c(4,4,4)] #return 4th element, 4th element, and 4th element #Return 3rd through 5th element str(x) x[3:5] ``` ### 2. Using negative integers to exclude elements at specified positions (subset atomic vectors using []) Before excluding elements based on position, investigate object ```{r} x length(x) str(x) ``` Use negative integers to exclude elements based on element position ```{r} x[-1] # exclude 1st element x[c(3,1)] # 3rd and 1st element x[-c(3,1)] # exclude 3rd and 1st element ``` ### 3. Using logicals to return elements where corresponding logical is `TRUE` (subset atomic vectors using []) ```{r} x ``` When using `x[y]` to subset `x`, good practice to have `length(x)==length(y)` ```{r} length(x) # length of vector x length(c(TRUE,FALSE,TRUE,FALSE,TRUE)) # length of y length(x) == length(c(TRUE,FALSE,TRUE,FALSE,TRUE)) # condition true x[c(TRUE,TRUE,FALSE,FALSE,TRUE)] ``` Recycling rules: - in `x[y]`, if `x` is different length than `y`, R "recycles" length of shorter to match length of longer ```{r} length(c(TRUE,FALSE)) x x[c(TRUE,FALSE)] ``` ### 3. Using logicals to return elements where corresponding logical is `TRUE` (subset atomic vectors using []) ```{r} x ``` Note that a missing value (`NA`) in the index always yields a missing value in the output ```{r} x[c(TRUE, FALSE, NA, TRUE, NA)] ``` Return all elements of object `x` where element is greater than 3 ```{r} x x[x>3] ``` ### 4. Empty `[]` returns original vector (subset atomic vectors using []) ```{r} x x[] ``` This is useful for sub-setting data frames, as we will show below ### 5. Zero vector [0] (subset atomic vectors using []) Zero vector, `x[0]` - R interprets this as returning element 0 ```{r} x[0] ``` Wickham states: - "This is not something you usually do on purpose, but it can be helpful for generating test data." ### 6. If vector is named, character vectors to return elements with matching names (subset atomic vectors using []) Create vector `y` that has values of vector `x` but each element is named ```{r} x (y <- c(a=1.1, b=2.2, c=3.3, d=4.4, e=5.5)) ``` Return elements of vector based on name of element - enclose element names in single `''` or double `""` quotes ```{r} #show element named "a" y["a"] #show elements "a", "b", and "d" y[c("a", "b", "d" )] ``` ## Subsetting lists/data frames using [] ### Subsetting lists using [] Using `[]` operator to subset lists works the same as subsetting atomic vector - Using `[]` with a list always returns a list ```{r} list_a <- list(list(1,2),3,"apple") str(list_a) #create new list that consists of elements 3 and 1 of list_a list_b <- list_a[c(3, 1)] str(list_b) #show elements 3 and 1 of object list_a #str(list_a[c(3, 1)]) ``` ### Subsetting data frames using [] Recall that a data frame is just a particular kind of list - each element = a column = a variable Using `[]` with a list always returns a list - Using `[]` with a data frame always returns a data frame Two ways to use `[]` to extract elements of a data frame 1. use "single index" `df_name[]` to extract columns (variables) based on element position number (i.e., column number) 1. use "double index" `df_name[, ]` to extact particular rows and columns of a data frame ### Subsetting data frames using [] to extract columns (variables) based on element position Use "single index" `df_name[]` to extract columns (variables) based on element number (i.e., column number) \medskip Examples [output omitted] ```{r, results="hide"} names(df_event) #extract elements 1 through 4 (elements=columns=variables) df_event[1:4] df_event[c(1,2,3,4)] str(df_event[1:4]) #extract columns 13 and 7 df_event[c(13,7)] ``` ### Subsetting Data Frames to extract columns (variables) and rows (observations) based on positionality use "double index" syntax `df_name[, ]` to extact particular rows and columns of a data frame - often combined with sequences (e.g., `1:10`) ```{r} #Return rows 1-3 and columns 1-4 df_event[1:3, 1:4] #Return rows 50-52 and columns 10 and 20 df_event[50:52, c(10,20)] ``` ### Subsetting Data Frames to extract columns (variables) and rows (observations) based on positionality use "double index" syntax `df_name[, ]` to extact particular rows and columns of a data frame \medskip recall that empty `[]` returns original object (output omitted) ```{r results="hide"} #return original data frame df_event[] #return specific rows and all columns (variables) df_event[1:5, ] #return all rows and specific columns (variables) df_event[, c(1,2,3)] ``` ### Use [] to extract data frame columns based on variable names Selecting columns from a data frame by subsetting with `[]` and list of element names (i.e., variable names) enclose in quotes \medskip "single index" approach extracts specific variables, all rows (output omittted) ```{r, results="hide"} df_event[c("instnm", "univ_id", "event_state")] select(df_event,instnm,univ_id,event_state) # same same ``` "Double index" approach extracts specific variables and specific rows - syntax `df_name[, ]` ```{r} df_event[1:5, c("instnm", "event_state", "event_type")] ``` ### Student exercises Use subsetting operators from base R in extracting columns (variables), observations: 1. Use both "single index" and "double index" in subsetting to create a new dataframe by extracting the columns `instnm`, `event_date`, `event_type` from df_event. And show what columns (variables) are in the newly created dataframe. 2. Use subsetting to return rows 1-5 of columns `state_code`, `name`, `address` from df_school. ### Solution to Student Exercises Solution to 1 __base R__ using subsetting operators ```{r} # single index df_event_br <- df_event[c("instnm", "event_date", "event_type")] #double index df_event_br <- df_event[, c("instnm", "event_date", "event_type")] names(df_event_br) ``` Solution to 2 __base R__ using subsetting operators ```{r} df_school[1:5, c("state_code", "name", "address")] ``` ## Subsetting lists/data frames using [[]] and $ ### Subset single element from object using [[]] operator So far we have used `[]` to excract elements from an object - Applying `[]` to an atomic vector returns an atomic vector with specific elements you requested - Applying `[]` to a list returns a shorter list that contains the specific elements you requested `[[]]` also extract elements from an object - Applying `[[]]` gives same result as `[]`; that is, an atomic vector with element you request ```{r} (x <- c(1.1, 2.2, 3.3, 4.4, 5.5)) str(x[3]) str(x[[3]]) ``` - Applying `[[]]` to list gives the "contents" of the list, rather than list itself ```{r} list_a <- list(1:3, "a", 4:6) str(list_a) str(list_a[1]) str(list_a[[1]]) ``` ### Subset single element from object using [[]] operator Wickham "Advanced R" chapter 4.3 [[LINK HERE](https://adv-r.hadley.nz/subsetting.html#subset-single)] uses "Train Metaphor" to differentiate list vs. contents of list The list is the entire train. Create a list with three elements (three "carriages") ```{r} list_a <- list(1:3, "a", 4:6) str(list_a) ``` When extracting element(s) of a list you have two options: 1. Extracting elements using `[]` always returns a smaller list (smaller train) ```{r} str(list_a[1]) # returns a list ``` 2. Extracting element using `[[]]` returns contents of particular carriage - I say applying `[[]]` to a list or data frame returns a simpler object that moves up one level of hierarchy ```{r} str(list_a[[1]]) # returns an atomic vector ``` ### Subset single element from object using [[]] operator In contrast to `[]`, we use `[[]]` to extract individual elements rather than multiple elements - we could write `x[4]` or `x[4:6]` - we could write `x[[4]]` but not `x[[4:6]]` ### Subset single element from object using [[]] operator Just like `[]` can use `[[]]` to return contents of __named__ elements, specified using quotes - syntax: `obj_name[["element_name"]]` ```{r} list_b <- list(var1=1:3, var2="a", var3=4:6) str(list_b) str(list_b["var1"]) str(list_b[["var1"]]) ``` Works the same with data frames ```{r} str(df_event["zip"]) str(df_event[["zip"]]) ``` ### Subset lists/data frames using $ `obj_name$element_name` shorthand operator for `obj_name[["element_name"]]` ```{r} str(list_b) list_b[["var1"]] list_b$var1 str(list_b[["var1"]]) str(list_b$var1) ``` `df_name$var_name`: easiest way in base R to refer to variable in a data frame ```{r} str(df_event[["zip"]]) str(df_event$zip) ``` ## Subsetting data frames with [] combined with $ ### Subsetting Data Frames with [] combined with $ Combine `[]` with `$` to subset data frame same as `filter()` or `subset()` Syntax: `df_name[df_name$var_name , ]` - Note: Uses "double index" `df_name[, ]` syntax - __Cannot__ use "single index" `df_name[]` Examples (output omitted) - All observations where the hich school received at least 1 visit from UC Berkeley (var=`visits_by_110635`) and all columns ```{r results="hide"} df_school[df_school$visits_by_110635 >= 1, ] ``` - All obs where the high school received at least 1 visit from UC Berkeley and the first three columns ```{r results="hide"} df_school[df_school$visits_by_110635 == 1, 1:3] ``` - All obs where the high school received at least 1 visit from UC Berkeley and variables "state_code" "school_type" "name" ```{r results="hide"} df_school[df_school$visits_by_110635 == 1, c("state_code","school_type","name")] ``` ### Subsetting Data Frames with [] combined with $ Combine `[]` with `$` to subset data frame same as `filter()` or `subset()` - Syntax: `df_name[df_name$var_name , ]` - Can be combined with `count()` or `nrow()` to avoid printing many rows \medskip Count obs where high schools received at least 1 visit by Bama (100751) and at least one visit by Berkeley (110635) - compare with `filter()` and `subset()` approaches ```{r} #[] combined with $ approach count(df_school[df_school$visits_by_110635 >= 1 & df_school$visits_by_100751 >= 1, ]) count(df_school[df_school[["visits_by_110635"]] >= 1 & df_school[["visits_by_100751"]] >= 1, ]) df_school[] #filter() approach nrow(filter(df_school, visits_by_110635 >= 1, visits_by_100751 >= 1)) #subset() approach nrow(subset(df_school, visits_by_110635 >= 1 & visits_by_100751 >= 1)) ``` ### Subsetting Data Frames with [] and $, NA Observations When sub-setting via `[]` combined with `$`, result will include: - rows where condition is `TRUE` - __as well as__ rows with `NA` (missing) values for condition. Task: How many events at public high schools with at least $50k median household income - extracting observations via `[]` combined with `$` ```{r} #num obs event_type=="public hs" and med_inc is missing nrow(df_event[df_event$event_type == "public hs" & is.na(df_event$med_inc)==1 , ]) #num obs event_type=="public hs" & med_inc is not NA & med_inc >= $50,000 nrow(df_event[df_event$event_type == "public hs" & is.na(df_event$med_inc)==0 & df_event$med_inc>=50000 , ]) #num obs event_type=="public hs" and med_inc >= $50,000 nrow(df_event[df_event$event_type == "public hs" & df_event$med_inc>=50000 , ]) ``` ### Subsetting Data Frames with [] and $, NA Observations subset using `[]` combined with `$`, result includes: - rows where condition `TRUE`; __AND__ rows with `NA` for condition Base R filter using `subset()` excludes rows with `NA` for condition ```{r} #num obs event_type=="public hs" and med_inc is missing nrow(subset(df_event, event_type == "public hs" & is.na(med_inc)==1)) #num obs event_type=="public hs" & med_inc is not NA & med_inc >= $50,000 nrow(subset(df_event, event_type == "public hs" & is.na(med_inc)==0 & med_inc>=50000)) #num obs event_type=="public hs" & med_inc >= $50,000 nrow(subset(df_event, event_type == "public hs" & med_inc>=50000)) ``` Tidyverse `filter()` excludes rows with `NA` for condition. ```{r} #num obs event_type=="public hs" and med_inc is missing nrow(filter(df_event, event_type == "public hs", is.na(med_inc)==1)) #num obs event_type=="public hs" & med_inc is not NA & med_inc >= $50,000 nrow(filter(df_event, event_type == "public hs", is.na(med_inc)==0, med_inc>=50000)) #num obs event_type=="public hs" & med_inc >= $50,000 nrow(filter(df_event, event_type == "public hs", med_inc>=50000)) ``` ### Subsetting Data Frames with [] and $, NA Observations To exclude rows where condition is `NA` if subset using `[]` combined w/ `$` - use `which()` to ask only for values where condition evaluates to `TRUE` - `which()` returns position numbers for elements where condition is `TRUE` ```{r} #?which c(TRUE,FALSE,NA,TRUE) str(c(TRUE,FALSE,NA,TRUE)) which(c(TRUE,FALSE,NA,TRUE)) ``` Task: Count events at public HS with at least $50k median household income? ```{r} #Tidyverse, filter() nrow(filter(df_event, event_type == "public hs" & med_inc>=50000)) #Base R, `[]` combined with `$`; without which() nrow(df_event[df_event$event_type == "public hs" & df_event$med_inc>=50000, ]) #Base R, `[]` combined with `$`; with which() nrow(df_event[which(df_event$event_type == "public hs" & df_event$med_inc>=50000), ]) ``` ### Student Exercises Subsetting Data Frames with (1) [] and $; (2) subset() and filter(): 1. Show how many public high schools in California with at least 50% Latinx (hispanic in data) student enrollment from df_school. 2. Show how many out-state events at public high schools with more than $30K median from df_event (do not forget to exclude missing values). ### Solution to Student Exercises Solution to 1 __base R__ using [] and $ ```{r} df_school_br1<- df_school[df_school$school_type == "public" & df_school$pct_hispanic >= 50 & df_school$state_code == "CA", ] nrow(df_school_br1) ``` __base R__ using `subset()` function ```{r} df_school_br2 <- subset(df_school, school_type == "public" & pct_hispanic >= 50 & state_code == "CA" ) nrow(df_school_br2) ``` __tidyverse__ using `filter()` function ```{r} df_school_tv <- df_school %>% filter(school_type == "public" & pct_hispanic >= 50 & state_code == "CA" ) nrow(df_school_tv) ``` ### Solution to Student Exercises Solution to 2: __base R__ using [] and $ (NA included) ```{r} # use is.na to exclude NA nrow(df_event[df_event$event_type == "public hs" & df_event$event_inst =="Out-State" & df_event$med_inc > 30000 & is.na(df_event$med_inc) ==0, ]) # use which to exclude NA nrow(df_event[which(df_event$event_type == "public hs" & df_event$event_inst =="Out-State" & df_event$med_inc > 30000 ), ]) ``` __base R__ using `subset()` function (NA excluded) ```{r} nrow(subset(df_event, event_type == "public hs" & event_inst =="Out-State"& df_event$med_inc > 30000 )) ``` __tidyverse__ using `filter()` function (NA excluded) ```{r} count(filter(df_event, event_type == "public hs" & event_inst =="Out-State" & df_event$med_inc > 30000 )) ``` # Sorting data ### Base R `sort()` for vectors `sort()` is a base R function that sorts vectors Syntax: `sort(x, decreasing=FALSE, ...)` - where x is object being sorted - By default it sorts in ascending order (low to high) - Need to set decreasing argument to `TRUE` to sort from high to low ```{r} #?sort() x<- c(31, 5, 8, 2, 25) sort(x) sort(x, decreasing = TRUE) ``` ### Base R `order()` for dataframes `order()` is a base R function that sorts vectors - Syntax: `order(..., na.last = TRUE, decreasing = FALSE)` - where `...` are variable(s) to sort by - By default it sorts in ascending order (low to high) - Need to set decreasing argument to `TRUE` to sort from high to low Descending argument only works when we want either one (and only) variable descending or all variables descending (when sorting by multiple vars) - use `-` when you want to indicate which variables are descending while using the default ascending sorting ```{r results="hide"} df_event[order(df_event$event_date), ] df_event[order(df_event$event_date, df_event$total_12), ] #sort descending via argument df_event[order(df_event$event_date, decreasing = TRUE), ] df_event[order(df_event$event_date, df_event$total_12, decreasing = TRUE), ] #sorting by both ascending and descending variables df_event[order(df_event$event_date, -df_event$total_12), ] ``` ### Compare tidyverse to base r, sorting -Create a new dataframe from df_events that sorts by ascending by `event_date`, ascending `event_state`, and descending `pop_total` . __tidyverse__ ```{r results="hide"} df_event_tv <- arrange(df_event, event_date, event_state, desc(pop_total)) ``` __base R__ using `order()` function ```{r results="hide"} df_event_br1 <- df_event[order(df_event$event_date, df_event$event_state, -df_event$pop_total), ] ```