---
title: "Lecture 10: Accessing object elements and looping"
subtitle: "Managing and Manipulating Data Using R"
author: 
date: 
fontsize: 8pt
classoption: dvipsnames  # for colors
urlcolor: blue
output:
  beamer_presentation:
    keep_tex: true
    toc: false
    slide_level: 3
    theme: default # AnnArbor # push to header?
    #colortheme: "dolphin" # push to header?
    #fonttheme: "structurebold"
    highlight: default # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header
    df_print: default #default # tibble # push to header?
    latex_engine: xelatex # Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts.
    includes:
      in_header: ../beamer_header.tex
      #after_body: table-of-contents.txt
---

```{r, echo=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE)
```

# Introduction

### Logistics

__Course evaluations__

Please take a few minutes to complete instructor evaluations

- Should have received an email!

\medskip

__Remaining Class Sessions__

No class next week

11/27: Happy Thanksgiving!
- Problem set #12 still due

12/4 Writing Functions

- No problem set due

12/11 Intro to GitHub + Wrap-Up

- Problem set #13 (functions) due

### Logistics

__Problem set 12__

- Due next Wednesday at noon; focuses on accessing vector elements and basics of looping
- We give big hints; less emphasis on "figure it out on your own"

\medskip

__Reading:__

- Grolemund and Wickham 20.4 - 20.5 (Chapter 20 is on "Vectors")
- Grolemund and Wickham 21.1 - 21.3 (Chapter 21 is on "Iteration")
- [OPTIONAL] Any slides from lecture we don't cover
    - This lecture is written knowing we won't have time to get through all sections
    - Slides we don't cover are mainly for your future reference

### What we will do today

\medskip

\tableofcontents

```{r, eval=FALSE, echo=FALSE}
#Use this if you want TOC to show level 2 headings
\tableofcontents
#Use this if you don't want TOC to show level 2 headings
\tableofcontents[subsectionstyle=hide/hide/hide]
```

### Libraries and data

```{r}
library(tidyverse)
```

Data frame

```{r}
#load dataset with one obs per recruiting event
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData"))
```

# Accessing elements of vectors and lists

## Review: Types of vectors

### Review: Types of vectors

Recall that there are two broad types of vectors, __atomic vectors__ and __lists__

1. __Atomic vectors__. There are six types:
    - logical, integer, double, character, complex, and raw
2. __Lists__. "sometimes called recursive vectors because lists can contain other lists"

\medskip

Main difference between atomic vectors and lists:

- atomic vectors are "homogeneous," meaning each element in vector must have same type (e.g., integer, logical, character)
- lists are "heterogeneous," meaning that data type can differ across elements within a list

Link to figure of data structures overview [HERE](http://r4ds.had.co.nz/diagrams/data-structures-overview.png)

### Review: Types of atomic vectors

\medskip

1. logical.
each element can take one of three potential values: `TRUE`, `FALSE`, `NA`

```{r}
typeof(c(TRUE,FALSE,NA))
typeof(c(1==1,1==2))
```

2. Numeric (integer or double)

```{r}
typeof(c(1.5,2,1))
typeof(c(1,2,1))
```

    - Numbers are doubles by default. To make integer, place `L` after number:

```{r}
typeof(c(1L,2L,1L))
```

3. character

```{r}
typeof(c("element of character vector","another element"))
length(c("element of character vector","another element"))
```

### `TRUE/FALSE` functions that identify __type__ of vector

\medskip

Identifying vector __type__, Grolemund and Wickham:

- "Sometimes you want to do different things based on the type of vector. One option is to use `typeof()`. Another is to use a test function which returns a `TRUE` or `FALSE`"

`is_*()` functions are provided by `purrr` package within `Tidyverse`:

Function | logical | int | dbl | chr | list
---------|---------|-----|-----|-----|-----
`is_logical()` | X | | | |
`is_integer()` | | X | | |
`is_double()` | | | X | |
`is_numeric()` | | X | X | |
`is_character()` | | | | X |
`is_atomic()` | X | X | X | X |
`is_list()` | | | | | X
`is_vector()` | X | X | X | X | X

```{r}
is_numeric(c(5,6,7))
```

### `TRUE/FALSE` functions that identify __type__ of vector

Recall that elements of a vector must have the same type

- If vector contains elements of different type, type will be most "complex"
- From simplest to most complex: logical, integer, double, character

```{r, results="hide"}
is_logical(c(TRUE,TRUE,NA))
is_logical(c(TRUE,TRUE,NA,1))

typeof(c(TRUE,1L))
is_integer(c(TRUE,1L))

typeof(c(TRUE,1L,1.5,"b"))
is_character(c(TRUE,1L,1.5,"b"))
```

### `TRUE/FALSE` functions that identify __type__ vs. __class__ of vector

\medskip

Comparing `is_*()` vs.
`is.*()` functions

- `is_*()` functions (e.g., `is_numeric()`) identify vector __type__
    - They are the `TRUE/FALSE` versions of `typeof()` function
- `is.*()` functions (e.g., `is.numeric()`) refer to both __type__ and __class__
    - Review: __class__ is an object __attribute__ that defines how object can be treated by object oriented programming language (e.g., which functions you can apply)
    - Recall that R functions care about __class__, not __type__

\medskip

```{r, results="hide", warning= FALSE}
df_event %>% select(instnm,univ_id,event_date,med_inc,titlei_status_pub) %>% str()
```

Variable = `univ_id`

```{r, results="hide", warning= FALSE}
typeof(df_event$univ_id)
class(df_event$univ_id)
is_numeric(df_event$univ_id)
is.numeric(df_event$univ_id)
```

Variable = `event_date`

```{r, results="hide", warning= FALSE}
typeof(df_event$event_date)
class(df_event$event_date)
is_numeric(df_event$event_date)
is.numeric(df_event$event_date)
```

### `TRUE/FALSE` functions that identify __type__ vs. __class__ of vector

Comparing `is_*()` vs.
`is.*()` functions

- `is_*()` functions (e.g., `is_numeric()`) identify vector __type__
- `is.*()` functions (e.g., `is.numeric()`) refer to both __type__ and __class__

\medskip

Variable = `med_inc`

```{r, results="hide", warning= FALSE}
typeof(df_event$med_inc)
class(df_event$med_inc)
is_numeric(df_event$med_inc)
is.numeric(df_event$med_inc)
```

Variable = `titlei_status_pub`

```{r, results="hide", warning= FALSE}
typeof(df_event$titlei_status_pub)
class(df_event$titlei_status_pub)
is_numeric(df_event$titlei_status_pub)
is.numeric(df_event$titlei_status_pub)
```

## Accessing elements of (atomic) vectors

### Subsetting elements of vector

"Subsetting" a vector refers to isolating particular elements of a vector

- I sometimes refer to this as "accessing elements of a vector"
- subsetting elements of a vector is similar to "filtering" rows of a data-frame

`[]` is the subsetting function for vectors

- contents inside `[]` can refer to element number (also called "position").
    - e.g., `[3]` refers to contents of 3rd element (or position 3)
- contents inside `[]` can also refer to name of element
    - e.g., `["a"]` refers to contents inside an element named "a"

### Subsetting elements of vector, based on position number

__Examples of referring to elements based on element position number__

```{r}
x <- c("a","b","c","d","e")
x # all elements
x[1] # 1st element
x[5] # 5th element
c(x[1],x[2],x[2]) # 1st, 2nd, and 2nd element
x[c(1,2,2)] # 1st, 2nd, and 2nd element
```

### Subsetting elements of vector, based on position number

__Examples of referring to elements based on position number, continued__

```{r}
y <- c(4,5,10,29,15,12)
length(y)
y[c(1,3,6)]
y[c(3,6,1)]
```

While subsetting with positive numbers keeps elements in those positions, subsetting with negative numbers drops elements at those positions

```{r}
y
y[c(-3,-4,-5,-6)] # show elements except 3rd, 4th, 5th and 6th elements
```

### Subsetting elements of named vector, by element name

__Review: naming vectors__

- All vectors can be "named" (i.e., you name individual elements within the vector)

Unnamed vector

```{r}
x <- c(1,2,3,"hi!")
x
str(x)
```

Named vector

```{r}
y <- c(a=1,b=2,3,c="hi!")
y
str(y)
```

### Subsetting elements of named vector, by element name

If you have a __named vector__, you can subset it with a character vector:

- i.e., access __element values__ based on __element names__

```{r}
x <- c(abc = 1, def = 2, xyz = 5)
x
x["xyz"] # show value of element named "xyz"
x[c("xyz", "def")] # show value of element named "xyz" and element named "def"
```

### Subsetting elements of vector, with a logical vector

Subsetting elements with a logical vector

- i.e., access elements that satisfy some `TRUE/FALSE` condition

```{r, warning=FALSE}
(x <- c(10, 3, NA, 5, 8, 1, NA,"Hi!"))
typeof(x)
x[is_character(x)] # since vector type is character, all elements are character
x[is_numeric(x)] # since vector type is character, no elements are numeric

(y <- c(10, 3, NA, 5, 8, 1, NA))
typeof(y)
y[is_numeric(y)]
```

## Accessing elements of lists

### Review: Lists

\medskip

Like atomic vectors, lists are objects that contain elements.
However:

- "type" of elements can vary within a list
- elements of a list can contain another list

Examples:

```{r}
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
```

`str()` function is helpful for understanding structure and contents of a list

```{r}
str(x1)
str(x2)
```

### Review: Data frames are lists

\medskip

Recall the relationship between "lists" and "data frames"

- data frames have "type==list"
- data frames are lists with these additional structure requirements
    - each element of data frame must be a vector (not a list)
    - each element (i.e., vector) in data frame must have the same length
- data frames have additional attributes (e.g., each vector is named)

```{r}
(df <- tibble(x = 1:3, y = 3:1))
typeof(df)
str(df)
typeof(df_event)
```

### Subsetting/accessing elements of a list

__Accessing elements of a list is important for writing loops, writing functions, and many other applications in R__

\medskip

I will demonstrate accessing elements of a list using two lists:

1. \medskip A list that has more complicated structure than a data frame (from Grolemund and Wickham example)

```{r, results="hide"}
list_a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
typeof(list_a)
str(list_a)
```

2. \medskip List that is 7 variables and first 5 obs of `df_event`, corresponding to University of Alabama

```{r, results="hide"}
df_bama <- df_event %>% arrange(univ_id,event_date) %>%
  select(instnm,univ_id,event_date,event_type,event_state,zip,med_inc) %>%
  filter(row_number()<6)
typeof(df_bama)
str(df_bama)
```

### Subsetting/accessing elements of a list

__Accessing elements of a list is important for writing loops, writing functions, and many other applications in R__

\medskip

GW [20.5.2](https://r4ds.had.co.nz/vectors.html#subsetting-1) ("Subsetting"): 3 ways to "subset" (i.e., access elements of) list

1. `[]` "extracts a sub-list.
The result will always be a list"
    - like subsetting vectors, you can subset with a logical, integer, or character vector
2. `[[]]` "extracts a single component from a list. It removes a level of hierarchy from the list"
3. `$` "shorthand for extracting named elements of a list. It works similarly to `[[]]` except that you don’t need to use quotes."

### Subset a list using `[]`

`[]` "extracts a sub-list"

- contents of `[]` can be position number, name of element in list, logical vector, etc.

```{r, results="hide"}
str(list_a)
length(list_a)
list_a[1] # extract first element of list
str(list_a[1]) # extract first element of list
str(list_a["a"]) # extract element named "a"
str(list_a[1:2]) # extract first two elements of list
str(list_a[c(1,2)]) # extract first two elements of list
str(list_a[c("a","c")]) # extract element named "a" and element named "c"
```

Key takeaway about subsetting a list using `[]`: __The result will always be a list__

- that is, `[]` does not remove a level of hierarchy
- structure and attributes of object you isolate using `[]` will be the same as its structure and attributes in the list it is taken from

### Subset a list using `[]`: Student task

Applying `[]` to the object `df_bama`:

- Isolate the 1st element of `df_bama`
- Isolate the 3rd through 5th element of `df_bama`
- Isolate the 3rd, 7th, and 1st element of `df_bama`
- Isolate the element named `"event_type"`
- Isolate the elements named `"event_type"` and `"med_inc"`

### Subset a list using `[]`: Student task [SOLUTIONS]

Applying `[]` to the object `df_bama`:

```{r, results="hide"}
#- Isolate the 1st element of `df_bama`
df_bama[1]
str(df_bama[1])

#- Isolate the 3rd through 5th element of `df_bama`
df_bama[3:5]
str(df_bama[3:5])

#- Isolate the 3rd, 7th, and 1st element of `df_bama`
df_bama[c(3,7,1)]

#- Isolate the element named `"event_type"`
df_bama["event_type"]
str(df_bama["event_type"])

#- Isolate the elements named `"event_type"` and `"med_inc"`
df_bama[c("event_type","med_inc")]
```

### Subset a list using `[[]]`

GW: "`[[]]` extracts a single component from a list." More specifically, `[[]]`:

1. Extracts a __single__ element of the list __AND__
1. Removes a "level of hierarchy" from the list

```{r, results="hide"}
str(list_a)

str(list_a[1]) # []; result is a one-element list [length=1]; this list contains a numeric vector with 3 elements
str(list_a[[1]]) # [[]]; result is a numeric vector with 3 elements

str(list_a["a"]) # []; result is a one-element list [length=1]; this list contains a numeric vector with 3 elements
str(list_a[["a"]]) # [[]]; result is a numeric vector with 3 elements

str(list_a[4]) # []; result is a one-element list, which contains a two-element list
str(list_a[[4]]) # [[]]; result is a two-element list
```

### Subset a list using `[[]]`, data frames

\medskip

Comparing `[]` to `[[]]` when working with lists that are data frames

- Data frame object always has type=list and each element is a vector
    - If you subset using `[]` the result will always have type==list
    - If you subset using `[[]]` the result will always have type==vector

`[]` vs. `[[]]`: Subsetting data frame using __element position number__

```{r, results="hide"}
df_bama[1]
df_bama[[1]]

str(df_bama[1])
str(df_bama[[1]])

typeof(df_bama[1])
typeof(df_bama[[1]])

class(df_bama[1])
class(df_bama[[1]])

attributes(df_bama[3])
attributes(df_bama[[3]])
```

### Subset a list using `[[]]`, data frames

\medskip

Comparing `[]` to `[[]]` when working with lists that are data frames

- If you subset using `[]` the result will always have type==list
- If you subset using `[[]]` the result will always have type==vector

\medskip

`[]` vs.
`[[]]`: Subsetting data frame using __element name__ (i.e., variable name)

- note: whether using `[]` or `[[]]`, element name must be in quotes

```{r, results="hide"}
str(df_bama["event_type"]) # a "tibble" data frame with one variable
str(df_bama[["event_type"]]) # a character vector with 5 elements

attributes(df_bama["event_type"]) # contains attributes (e.g., variable name)
attributes(df_bama[["event_type"]]) # no attributes; just the data

str(df_bama["event_date"]) # a "tibble" data frame with one variable
str(df_bama[["event_date"]]) # a numeric "date" vector with 5 elements

attributes(df_bama["event_date"]) # attributes of the data frame
attributes(df_bama[["event_date"]]) # class=date
```

### Subset a list using `$`

`$` is a shorthand for extracting __named__ elements of a list.

- works similarly to `[[]]` except that you don’t need to use quotes.
- Like `[[]]`, subsetting using `$` removes a level of hierarchy

Note: we have been using this method of subsetting variables in a data frame all term!
```{r, results="hide"}
str(list_a)
list_a["a"] # list of one element, which contains integer vector of 3 elements
list_a[["a"]] # integer vector of 3 elements
list_a$a # integer vector of 3 elements
```

These two approaches yield the same result:

```{r}
df_bama[["med_inc"]]
df_bama$med_inc
```

### Extracting multiple elements of a list using `[[]]` and `$`

Each instance of `[[]]` or `$` can only extract a __single__ element from the list

\medskip

Using `[[]]` to extract multiple elements of list

```{r}
c(df_bama[["med_inc"]],df_bama[["event_type"]])
```

Using `$` to extract multiple elements of list

```{r}
c(df_bama$med_inc,df_bama$event_type)
```

\medskip

By contrast, `[]` can extract multiple elements within each instance of `[]`

```{r, results="hide"}
str(df_bama[c("instnm","med_inc")]) # "tibble" data frame with two variables
#df_bama[[c("instnm","med_inc")]] # this code will yield an error
```

### Subset a list using `[[]]`: Student task

Applying `[[]]` to the object `df_bama`:

- Isolate the 5th element of `df_bama`
- Isolate the element `"event_type"` of `df_bama`
- Isolate the element `"zip"` of `df_bama` using `$`
- Isolate the elements named `"event_date"` and `"event_type"`

### Subset a list using `[[]]`: Student task [SOLUTIONS]

Applying `[[]]` to the object `df_bama`:

```{r, results="hide"}
#- Isolate the 5th element of `df_bama`
df_bama[[5]]
str(df_bama[[5]])

#- Isolate the element `"event_type"` of `df_bama`
df_bama[["event_type"]]
str(df_bama[["event_type"]])

#- Isolate the element `"zip"` of `df_bama` using `$`
df_bama$zip

#- Isolate the elements named `"event_date"` and `"event_type"`
c(df_bama[["event_date"]],df_bama[["event_type"]])
str(c(df_bama[["event_date"]],df_bama[["event_type"]]))
```

## Review key concepts for loops

### Sequences

(Loose) definition

- a sequence is a list of numbers in ascending or descending order

Creating sequences using colon operator

```{r}
-5:5
5:-5
```

Creating sequences using `seq()` function

- basic syntax:
```{r, eval=FALSE}
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)
```

- examples:

```{r}
seq(10,15)
seq(from=10,to=15,by=1)
seq(from=100,to=150,by=10)
```

### Length of atomic vectors

\medskip

Definition: __length__ of an object is its number of elements

\medskip

Length of vectors, using `length()` function

```{r}
x <- c(1,2,3,4,"ha ha"); length(x)
y <- seq(1,10); length(y)
z <- c(seq(1,10),"ho ho"); length(z)
```

Once vector length known, isolate element contents based on position number using `[]`

```{r}
x[5]
z[1]
```

For atomic vectors, applying `[[]]` to vector gives same result as `[]`

```{r}
x[[5]]
z[[1]]
```

### Length of lists

\medskip

Definition: __length__ of an object is its number of elements

```{r}
typeof(df_bama); length(df_bama)
```

Once list length known, isolate element contents based on position number using `[]` or `[[]]`

- subset one element of list with `[]` yields list w/ length==1

```{r}
typeof(df_bama[7]); length(df_bama[7])
```

- subset one element of list with `[[]]` yields vector w length== # rows

```{r}
df_bama[[7]]; typeof(df_bama[[7]]); length(df_bama[[7]])
```

subset one element of list with `$` is same as `[[]]`

```{r}
df_bama$med_inc; typeof(df_bama$med_inc); length(df_bama$med_inc)
```

### Combine sequences and length

\medskip

When writing loops, very common to create a sequence from 1 to the length (i.e., number of elements) of an object

\medskip

Here, we do this with a vector object

```{r}
(x <- c("a","b","c","d","e"))
length(x)
1:length(x)
seq(from=1,to=length(x),by=1)
```

Can do same thing with list object

```{r}
length(df_bama)
1:length(df_bama)
seq(2,length(df_bama))
```

# Loop basics

### Simple loop example

\medskip

What are loops?: __Loops__ execute some set of commands multiple times

- We build loops using the `for()` function
- Each time the loop executes the set of commands is an __iteration__
- The below loop iterates 4 times

__Example__

- Create loop that prints
each value of vector `c(1,2,3,4)`, one at a time

```{r}
c(1,2,3,4)
for(i in c(1,2,3,4)) { # Loop sequence
  print(i) # Loop body
}
```

I use loops to perform practical tasks more efficiently (e.g., read in data)

- But we'll introduce loop concepts by doing things that aren't very useful

### Components of a loop

```{r}
for(i in c(1,2,3,4)) { # Loop sequence
  print(i) # Loop body
}
```

Components of a loop

1. __Sequence__. Determines what to "loop over" (e.g., from 1 to 4 by 1)
    - sequence in above loop is `for(i in c(1,2,3,4))`
    - this creates an object named `i`; could name it anything
    - after the loop finishes running, `i` still exists and holds the value from the last iteration
    - each iteration of loop will assign a different value to `i`
    - `c(1,2,3,4)` is the set of values that will be assigned to `i`
        - in first iteration, value of `i` is `1`
        - in second iteration, value of `i` is `2`, etc.
2. __Body__. What commands to execute for each iteration through the loop
    - Body in above loop is `print(i)`
    - Each time (i.e., iteration) through the loop, body prints the value of object `i`

### Using `cat()` to print value of sequence var for each iteration

\medskip

__When building a loop, I always include a line like `cat("z=",z, fill=TRUE)` to help me understand what loop is doing__

\medskip

Below two loops are essentially the same; I prefer second approach.
Why?:

- Writing name of sequence var object (here `z`) and seeing value of sequence var object for each iteration helps me understand loop better

```{r}
for(z in c(1,2,3)) { # Loop sequence
  print(z) # Loop body
}
for(z in c(1,2,3)) { # Loop sequence
  cat("object z=",z, fill=TRUE) # "fill=TRUE" forces line break after each iteration
}
```

Without `fill=TRUE` [not recommended]

```{r}
for(z in c(1,2,3)) { # Loop sequence
  cat("object z=",z) # Loop body
}
```

### Components of a loop

\medskip

Note that these three loops all do the same thing

- __Loop body__ is the same in each loop
- __Loop sequence__ written slightly differently in each loop

```{r}
for(z in c(1,2,3)) { # Loop sequence
  cat("object z=",z, fill=TRUE) # Loop body
}
for(z in 1:3) { # Loop sequence
  cat("object z=",z, fill=TRUE) # Loop body
}
num_sequence <- 1:3
for(z in num_sequence) { # Loop sequence
  cat("object z=",z, fill=TRUE) # Loop body
}
```

### Student exercise

Try on your own or just follow along.

\medskip

__Task__

1. Create a numeric vector that has year of birth of members of your family
    - you decide who to include
    - e.g., `birth_years <- c(1944,1950,1981,2016)`
2. Write a loop that calculates current year minus birth year and prints this number for each member of your family
    - Within this loop, you will create a new variable that calculates current year minus birth year

\medskip

Note: multiple correct ways to complete this task

### Student exercise [SOLUTION]

1. Create a numeric vector that has year of birth of members of your family (you decide who to include)
2. Write a loop that calculates current year minus birth year and prints this number for each member of your family

```{r}
birth_years <- c(1944,1950,1981,2016)
birth_years

for(y in birth_years) { # Loop sequence
  cat("object y=",y, fill=TRUE) # Loop body
  z <- 2018-y
  cat("value of",2018,"minus",y,"is",z, fill=TRUE)
}
```

# Three ways to loop over a vector (atomic vector or a list)

### Plan for learning more about loops

Rest of lecture on loops will proceed as follows:

1. Describe the three different ways to "loop over" a vector
2. Describe the two broad sorts of tasks to accomplish within body of a loop
    1. Modify an existing object (e.g., vector or list/data frame)
    2. Create a new object

Throughout, I'll try to give you lots of examples and practice

### Three ways to loop over an object

\medskip

There are 3 ways to loop over elements of an object

1. __Loop over the elements__ [approach we have used so far]
2. __Loop over names of the elements__
3. __Loop over numeric indices associated with element position__ [approach recommended by Grolemund and Wickham]

Will demonstrate 3 approaches on a named atomic vector and list/data frame

- Create named vector

```{r}
vec=c("a"=5,"b"=-10,"c"=30)
vec
```

- Create data frame with fictitious data, 3 columns (vars) and 4 rows (obs)

```{r}
set.seed(12345) # so we all get the same variable values
df <- tibble(a = rnorm(4),b = rnorm(4),c = rnorm(4))
str(df)
```

## Loop over elements

### Approach 1: loop over elements of object [object=atomic vector]

\medskip

- \medskip __sequence__ syntax: `for (i in object_name)`
    - Sequence iterates through each element of the object
    - That is, __sequence iterates through _value_ of each element, rather than _name_ or _position_ of element__
- in __body__.
    - value of `i` is equal to the contents of the `ith` element of the object

```{r}
vec # print atomic vector object
for (i in vec) {
  cat("value of object i=",i, fill=TRUE)
  cat("object i has: type=",typeof(i),"; length=",length(i),"; class=",class(i),
      "; attributes=",attributes(i),"\n",sep="",fill=TRUE) # "\n" adds line break
}
```

### Approach 1: loop over elements of object [object=list]

\medskip

- \medskip __sequence__ syntax: `for (i in object_name)`
    - Sequence iterates through each element of the object
    - That is, __sequence iterates through _value_ of each element__
- in __body__: value of `i` is equal to __contents__ of `ith` element of object

```{r}
df # print list/data frame object
#class(df)
#attributes(df)
for (i in df) {
  cat("value of object i=",i, fill=TRUE)
  cat("object type=",typeof(i),"; length=",length(i),"; class=",class(i),
      "; attributes=",attributes(i),"\n",sep="",fill=TRUE)
}
```

### Approach 1: loop over elements of object

\medskip

__Example task__:

- calculate mean value of each element of list object `df`

```{r}
df # print list/data frame object
for (i in df) { # sequence
  cat("value of object i=",i, fill=TRUE)
  cat("mean value of object i=",mean(i, na.rm = TRUE), "\n", fill=TRUE)
}
```

## Loop over element names

### Approach 2: loop over names of object elements

\medskip

To use this approach, elements in object must have name attributes

__sequence__ syntax: `for (i in names(object_name))`

- Sequence iterates through the _name_ of each element in object

in __body__, value of `i` is equal to _name_ of `ith` element in object

- Access element contents using `object_name[i]`
    - same object type as `object_name`; retains attributes (e.g., _name_)
- Access element contents using `object_name[[i]]`
    - removes level of hierarchy, thereby removing attributes
    - Approach recommended by Wickham because isolates value of element

Example: Object= atomic vector

```{r}
vec # print atomic vector object
names(vec)
```

```{r, results="hide"}
for (i in names(vec)) {
  cat("\n","value of object i=",i,"; type=",typeof(i),sep="",fill=TRUE)
  print(str(vec[i])) # "Access element contents using []"
  print(str(vec[[i]])) # "Access element contents using [[]]"
}
```

### Approach 2: loop over names of object elements [object = list]

\medskip

__sequence__ syntax: `for (i in names(object_name))`

- Sequence iterates through the _name_ of each element in object

in __body__, value of `i` is equal to _name_ of `ith` element in object

- Access element contents using `object_name[i]`
    - Same object type as `object_name`; retains attributes (e.g., _name_)
- Access element contents using `object_name[[i]]`
    - Removes level of hierarchy, thereby removing attributes
    - Approach recommended by Wickham because isolates value of element

\medskip

Example, object is a list

```{r}
names(df)
```

```{r, results="hide"}
for (i in names(df)) {
  cat("\n","value of object i=",i,"; type=",typeof(i),sep="",fill=TRUE)
  print(str(df[i])) # "Access element contents using []"
  print(str(df[[i]])) # "Access element contents using [[]]"
}
```

### Approach 2: loop over names of elements in object

\medskip

__Example task__: calculate mean value of each element of list object `df`, using `[[]]` to access element contents

```{r}
str(df)
for (i in names(df)) {
  cat("mean of element named",i,"is",mean(df[[i]], na.rm = TRUE), fill=TRUE)
}
```

What if we try to complete the task using `[]` to access element contents?
```{r, eval=FALSE}
for (i in names(df)) {
  cat("mean of element named",i,"is",mean(df[i],na.rm = TRUE), fill=TRUE)
  #print(typeof(df[i]))
  #print(class(df[i]))
}
#?mean # mean function only works for particular *classes* of objects
```

## Loop over element position number

### Approach 3: Loop over numeric indices of element position

\medskip

First explain sequence syntax, using atomic vector `vec` as object

- __sequence__ syntax: `for (i in 1:length(object_name))`

```{r}
vec # print named atomic vector vec
length(vec)
1:length(vec)
for (i in 1:length(vec)) { # loop sequence
  cat("value of object i=",i,fill=TRUE) # loop body
}
```

Note: These two approaches yield same result as above

```{r, results="hide"}
for (i in c(1,2,3)) {
  cat("value of object i=",i,fill=TRUE)
}
for (i in 1:3) {
  cat("value of object i=",i,fill=TRUE)
}
```

### Approach 3: Loop over numeric indices of element position

\medskip

Loop over element position number: Simple sequence syntax

```{r}
for (i in 1:length(vec)) {
  cat("value of object i=",i,fill=TRUE)
}
```

__Wickham's preferred sequence syntax__: `for (i in seq_along(object_name))`

- `seq_along(x)` function returns a sequence from 1 to the value of `length(x)`

```{r}
length(vec)
seq_along(vec)
for (i in seq_along(vec)) {
  cat("value of object i=",i,fill=TRUE)
}
```

### Approach 3: Loop over numeric indices [SKIP]

\medskip

Why Wickham prefers `seq_along(object_name)` over `1:length(object_name)`

- `seq_along` handles zero-length vectors correctly, and is therefore the "safe" version of `1:length(object_name)`

```{r}
# create vector of length=0
y <- vector("double", 0)
length(y)
1:length(y)
for (i in 1:length(y)) {
  cat("value of object i=",i,fill=TRUE)
}
seq_along(y)
for (i in seq_along(y)) {
  cat("value of object i=",i,fill=TRUE)
}
```

Personally, I find `1:length(object_name)` much more intuitive

### Approach 3: Loop over numeric indices of element position

\medskip

__sequence__ syntax: `for (i in 1:length(object_name))` __OR__ `for (i in
seq_along(object_name))`

- Sequence iterates through _position number_ of each element in the object

In __body__, value of `i` equals the _position number_ of `ith` element in object

- Access element contents using `object_name[i]`
    - Same object type as `object_name`; retains attributes (e.g., _name_)
- Access element contents using `object_name[[i]]` [RECOMMENDED]
    - Removes level of hierarchy, thereby removing attributes

__Example, object is atomic vector__

```{r}
vec
```

```{r, results="hide"}
for (i in 1:length(vec)) {
  cat("\n","value of object i=",i,"; type=",typeof(i),sep="",fill=TRUE)
  print(str(vec[i])) # "Access element contents using []"
  print(str(vec[[i]])) # "Access element contents using [[]]"
}
```

### Approach 3: Loop over numeric indices of element position

__Example, object is a list__

```{r}
df %>% head(n=3)
```

```{r, results="hide"}
for (i in 1:length(df)) {
  cat("\n","value of object i=",i,"; type=",typeof(i),sep="",fill=TRUE)
  print(str(df[i])) # "Access element contents using []"
  print(str(df[[i]])) # "Access element contents using [[]]"
}
```

### Approach 3: Loop over numeric indices of element position

__Example task__:

- Calculate mean value of each element of list object `df`, using `for (i in seq_along(df))` to create sequence and using `[[]]` to access element contents

```{r}
for (i in seq_along(df)) {
  cat("mean of element at position",i,"is",mean(df[[i]], na.rm = TRUE), fill=TRUE)
}
```

What happens if we try to complete the task using `[]` to access element contents?
```{r, eval=FALSE}
for (i in seq_along(df)) {
  cat("mean of element number",i,"is",mean(df[i],na.rm = TRUE), fill=TRUE)
  #print(typeof(df[i]))
  #print(class(df[i]))
}
#?mean # mean(object) requires object to be numeric or logical
```

### Approach 3: Loop over numeric indices of element position

\medskip

__When looping over numeric indices, you can extract element names based on element position__

- First, let's experiment w/ `attributes()` and `names()` functions

`attributes()` function [output omitted]

```{r, results="hide"}
attributes(df)
attributes(df[1]) # not null
attributes(df[[1]]) # null: removing level of hierarchy removes attributes
```

`names()` function

```{r}
names(df)
names(df[1]) # not null
names(df[[1]]) # null: object df[[1]] has no attributes; just values
names(df)[[1]] # not null: we extract names of df, then select first element
```

### Approach 3: Loop over numeric indices of element position

\medskip

__When looping over numeric indices, you can extract element names based on element position__

- First, experiment w/ `names()` function

```{r}
names(df)
names(df)[[1]] # not null: we extract names of df, then select first element
```

- Second, apply what we learned to loop

```{r}
for (i in seq_along(df)) {
  #print(names(df)[[i]])
  cat("i=",i,"; names=",names(df)[[i]],sep="",fill=TRUE)
}
```

### Summary: Three ways to loop over object

1. Loop over elements
1. Loop over element names
1. Loop over numeric indices of element position

Why Wickham prefers "loop over numeric indices of element" approach [3]:

- given element position number, can extract element name [approach 2] and value [approach 1]

```{r}
for (i in seq_along(df)) {
  cat("i=",i,sep="",fill=TRUE)
  name <- names(df)[[i]] # value of object "name" is what we loop over in approach 2
  cat("name=",name,sep="",fill=TRUE)
  value <- df[[i]] # value of object "value" is what we loop over in approach 1
  cat("value=",value,"\n",sep=" ",fill=TRUE)
}
```

# Modifying vs. Creating new object

### Modify object or create new object

Grolemund and Wickham differentiate between two types of tasks loops accomplish: (1) modify existing object; and (2) create new object

1. __Modify an existing object__
    - Example: looping through a set of variables in a data frame to:
        - Modify these variables OR
        - Create new variables (within the existing data frame object)
    - When writing loops in Stata/SAS/SPSS, we are usually modifying an existing object because these programs typically only have one object - a dataset - open at a time
2. __Create a new object__
    - Example: Create an object that has summary statistics for each variable; this object will be the basis for a table or graph
    - Often the new object will be a vector of results based on looping through elements of a data frame
    - In R (as opposed to Stata/SAS/SPSS) creating a new object is very common because R can hold many objects at the same time

## Loops that create new object

### Creating a new object

So far our loops have two components:

1. sequence
1. body

When we create a new object to store the results of a loop, our loops have three components

1. sequence
1. body
1. output
    - this is the new object that will store results created from your loop

Grolemund and Wickham recommend creating this new object __prior__ to writing the loop (rather than creating the new object within the loop)

> "Before you start loop...allocate sufficient space for the output. This is very important for efficiency: if you grow the for loop at each iteration using c() (for example), your for loop will be very slow."

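### Creating a new object

A minimal sketch of the point made in the quote, assuming a small made-up data frame. The names `demo_df`, `grown`, and `prealloc` are hypothetical and introduced only for this illustration. The first loop grows the output with `c()` at each iteration; the second allocates the output before the loop and fills it in by position.

```{r}
# Hypothetical data frame used only for this illustration
set.seed(1)
demo_df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))

# (1) Grow the output at each iteration: R copies the vector every time c() is called
grown <- c()
for (i in seq_along(demo_df)) {
  grown <- c(grown, mean(demo_df[[i]], na.rm = TRUE))
}

# (2) Allocate the output before the loop, then assign by position
prealloc <- vector("double", length(demo_df))
for (i in seq_along(demo_df)) {
  prealloc[[i]] <- mean(demo_df[[i]], na.rm = TRUE)
}

identical(grown, prealloc) # both approaches store the same values
```

Both loops produce the same result; the second avoids repeated copying, which is likely to matter only when the number of iterations is large.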
### Creating a new object

Create sample data frame named `df`

```{r}
set.seed(54321)
df <- tibble(a = rnorm(10),b = rnorm(10),c = rnorm(10),d = rnorm(10))
```

__Task__:

- Using the data frame `df`, which contains data on four numeric variables, create a new object that contains the mean value of each variable

In a previous example, we calculated mean for each variable

```{r}
for (i in seq_along(df)) {
  cat("mean of element number",i,"is",mean(df[[i]], na.rm = TRUE),fill=TRUE)
}
```

Now we just have to create an object to store these results

### Creating a new object

__Task__: Create a new object that contains mean value of each variable in `df`

\medskip

Wickham recommends creating new object __prior__ to creating loop

- You must specify type and length of new object
- New object will contain mean for each variable; should be numeric vector with number of elements (length) equal to number of variables in `df`

\medskip

Create object to hold output; we'll name this object `output`

```{r}
output <- vector("double", ncol(df)) # create object
typeof(output)
length(output)
length(df)
```

Create loop; use position number to assign variable means to elements of vector `output`

```{r}
for (i in seq_along(df)) {
  #cat("i=",i,fill=TRUE)
  output[[i]] <- mean(df[[i]], na.rm = TRUE) # mean of df[[1]] assigned to output[[1]], etc.
}
output
```

## Loops that modify existing object

### Example of modifying an object: z-score loop

__Task__ (from Christenson lecture):

- Write a loop that calculates z-score for a set of variables in a data frame and then replaces the original variables with the z-score variables

The z-score for observation _i_ is number of standard deviations from mean: $z_i = \frac{x_i - \bar{x}}{sd(x)}$

Task: calculate z-score for first 4 observations of `df$a`

```{r}
(df$a[1] - mean(df$a, na.rm=TRUE))/sd(df$a, na.rm=TRUE)
(df$a[2] - mean(df$a, na.rm=TRUE))/sd(df$a, na.rm=TRUE)
(df$a[3] - mean(df$a, na.rm=TRUE))/sd(df$a, na.rm=TRUE)
(df$a[4] - mean(df$a, na.rm=TRUE))/sd(df$a, na.rm=TRUE)
```

### Example of modifying an object: z-score loop

__Task__: write loop that replaces variables with z-scores of those variables

\medskip

When modifying existing object, we only need to write __sequence__ and __body__

- __sequence__.
    - data frame `df` has 4 variables and all are quantitative
    - so write a sequence that loops across each element of `df`
    - `for (i in seq_along(df))`
- __body__.
    - body of z-score function:
        - `(x - mean(x, na.rm=TRUE))/sd(x, na.rm=TRUE)`
    - Substitute `df[[i]]` for `x`:
        - `(df[[i]] - mean(df[[i]], na.rm=TRUE))/sd(df[[i]], na.rm=TRUE)`
    - Assign (replace) each observation the value of its z-score:
        - `df[[i]] <- (df[[i]] - mean(df[[i]], na.rm=TRUE))/sd(df[[i]], na.rm=TRUE)`

```{r, results="hide"}
set.seed(54321)
(df <- tibble(a = rnorm(10),b = rnorm(10),c = rnorm(10),d = rnorm(10)))

for (i in seq_along(df)) {
  cat("i=",i,"; mean=",mean(df[[i]], na.rm=TRUE),"; sd=",sd(df[[i]], na.rm=TRUE),sep="",fill=TRUE)
  #print((df[[i]] - mean(df[[i]], na.rm=TRUE))/sd(df[[i]], na.rm=TRUE)) # show z-score for each obs
  df[[i]] <- (df[[i]] - mean(df[[i]], na.rm=TRUE))/sd(df[[i]], na.rm=TRUE) # modify values
}
str(df)
```

### Modify z-score loop to work with non-numeric variables

What happens if we apply our loop to the data frame `df_bama`, which has both string and numeric variables?
\medskip

Create data frame `df_bama`

```{r, results="hide"}
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData"))

df_bama <- df_event %>% arrange(univ_id,event_date) %>%
  select(instnm,univ_id,event_date,event_type,event_state,zip,med_inc) %>%
  filter(row_number()<6)
str(df_bama)
```

Attempt to run loop; what went wrong?

```{r, eval=FALSE}
for (i in seq_along(df_bama)) {
  cat("i=",i,"; mean=",mean(df_bama[[i]], na.rm=TRUE),"; sd=",sd(df_bama[[i]], na.rm=TRUE),sep="",fill=TRUE)
  #print((df_bama[[i]] - mean(df_bama[[i]], na.rm=TRUE))/sd(df_bama[[i]], na.rm=TRUE))
  df_bama[[i]] <- (df_bama[[i]] - mean(df_bama[[i]], na.rm=TRUE))/sd(df_bama[[i]], na.rm=TRUE)
}
df_bama
```

### Modify z-score loop to work with non-numeric variables

What happens if we apply our loop to the data frame `df_bama`, which has both string and numeric variables?

\medskip

Let's modify our loop so that it calculates z-score only for non-integer, numeric variables

```{r, results="hide"}
str(df_bama)
for (i in seq_along(df_bama)) {
  cat("i=",i,"; var name=",names(df_bama)[[i]],"; type=",typeof(df_bama[[i]]),
      "; class=",class(df_bama[[i]]),sep="",fill=TRUE)

  # use && (not &) inside if(): condition must be a single TRUE/FALSE
  if(is.numeric(df_bama[[i]]) && (!is.integer(df_bama[[i]]))) {
    df_bama[[i]] <- (df_bama[[i]] - mean(df_bama[[i]], na.rm=TRUE))/sd(df_bama[[i]], na.rm=TRUE)
  } else {
    # do nothing
  }
}
str(df_bama)
```

### Modify object: embed z-score loop in function [SKIP]

Recreate `df` and `df_bama` [output and code omitted]

```{r, results="hide", echo=FALSE}
#recreate df
set.seed(12345)
df <- tibble(a = rnorm(10),b = rnorm(10),c = rnorm(10),d = rnorm(10))

#recreate df_bama
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_event_somevars.RData"))
df_bama <- df_event %>% arrange(univ_id,event_date) %>%
  select(instnm,univ_id,event_date,event_type,event_state,zip,med_inc) %>%
  filter(row_number()<6)
```

\medskip

Can we embed this loop in a function that takes the data frame as an argument so we don't
have to modify loop for each data frame?

```{r, results="hide"}
z_score <- function(x) {
  for (i in seq_along(x)) {
    cat("i=",i,"; var name=",names(x)[[i]],"; type=",typeof(x[[i]]),
        "; class=",class(x[[i]]),sep="",fill=TRUE)

    if(is.numeric(x[[i]]) && (!is.integer(x[[i]]))) {
      x[[i]] <- (x[[i]] - mean(x[[i]], na.rm=TRUE))/sd(x[[i]], na.rm=TRUE)
    } else {
      #do nothing
    }
  }
  x # return the modified data frame; without this line the function returns NULL
}

#apply to data frame df
df_z <- z_score(df)
df; df_z

#apply to data frame df_bama
df_bama_z <- z_score(df_bama)
df_bama; df_bama_z
```

# When to write a loop; recipe for writing loops

### When to write a loop

__Broadly, rationale for writing loop__:

- Do not duplicate code
- Can make changes to code in one place rather than many

\medskip

__When to write a loop__:

- Grolemund and Wickham say __don't copy and paste more than twice__
- If you find yourself doing this, consider writing a loop or function

\medskip

__Don't worry about knowing all the situations you should write a loop__

- Rather, you'll be creating analysis dataset or analyzing data and you will notice there is some task that you are repeating over and over
- Then you'll think "oh, I should write a loop or function for this"

### When to write a loop vs. a function [SKIP - for next week]

\medskip

Usually obvious when you are duplicating code, but unclear whether you should write a loop or whether you should write a function.

- Often, a repeated task can be completed with a loop or a function

In my experience, loops are better for repeated tasks when the individual tasks are __very__ similar to one another

- e.g., a loop that reads in data sets from individual years; each dataset you read in differs only by directory and name
- e.g., a loop that converts negative values to `NA` for a set of variables

Because functions can have many arguments, functions are better when the individual tasks differ substantially from one another

- Example: function that runs regression and creates formatted results table
    - function allows you to specify (as function arguments): dependent variable; independent variables; what model to run, etc.

__Note__

- Can embed loops within functions; can call functions within loops
- But for now, just try to understand basics of functions and loops

### Recipe for how to write loop

The general recipe for how to write a loop:

1. Complete the task for one instance outside a loop (this is akin to writing the __body__ of the loop)
2. Write the __sequence__
3. Determine which parts of the body need to change with each iteration
4. _If_ you are creating a new object to store output of the loop, create this object outside of (before) the loop
5. Construct the loop

# Practice: How well do public universities cover in-state public high schools?
### Load recruiting data

Load data frame with one observation per high school and variables for visits by each public research university in sample

- Note: this data frame has more vars than previous data frame we used

```{r, results="hide"}
rm(list = ls()) # remove all objects
load(url("https://github.com/ozanj/rclass/raw/master/data/recruiting/recruit_school_allvars.RData"))
```

We are interested in creating measures of how good a job public universities are doing visiting in-state public high schools

- Create data frame with one observation for each public high school

```{r, results="hide"}
#names(df_school_all)
df_school_all %>% str()

df_pubhs <- df_school_all %>% # Create data-frame that keeps public high schools
  filter(school_type=="public") %>%
  select(-school_type)
rm(df_school_all)
```

Create standalone objects (output and code omitted)

1. Character vector containing ID for each public university
2. A named list: name of each element is the university name; value of each element is the university ID

```{r, results="hide", echo=FALSE}
ids<- c( '196097', '186380', '215293', '201885', '181464', '139959', '218663', '100751',
         '199193', '110635', '110653', '126614', '155317', '106397', '149222', '166629')
str(ids)

#create named list
#value of each element is university name
#name assigned to each element is the university ID
instnm<- list( "196097"="Stony Brook", "186380"="Rutgers", "215293"="Pitt", "201885"="Cinci",
               "181464"="U Nebraska Lincoln", "139959"="U Georgia", "218663"="U South Carolina",
               "100751"="U Alabama", "199193"="NC State", "110635"="UC Berkeley",
               "110653"="UC Irvine", "126614"="CU Boulder", "155317"="U Kansas",
               "106397"="U Arkansas", "149222"="U S Illinois", "166629"="UMass Amherst")

#overwrite instnm with the version used by the loops below
#value of each element is the university ID
#name assigned to each element is the university name
instnm<- list( "Stony Brook"="196097", "Rutgers"="186380", "Pitt"="215293", "Cinci"="201885",
               "U Nebraska Lincoln"="181464", "U Georgia"="139959", "U South Carolina"="218663",
               "U Alabama"="100751", "NC State"="199193", "UC Berkeley"="110635",
               "UC Irvine"="110653", "CU Boulder"="126614", "U Kansas"="155317",
               "U Arkansas"="106397", "U S Illinois"="149222", "UMass Amherst"="166629")
str(instnm)
```

### How well do public universities cover in-state public high schools

\medskip

__Task__: for each public research university, calculate the number and percent of public high schools in the university's home state that received a visit

First, let's accomplish task outside of a loop for one university [Tidyverse]

- let's choose `"U of South Carolina"`, `ID==218663`

```{r, results="hide"}
#"state_code" is the 2-letter high school state code
df_pubhs %>% select(state_code) %>% str()

#variables starting with "inst_" identify state the university is located in
df_pubhs %>% select(inst_218663) %>% str()
df_pubhs %>% select(inst_218663) %>% count(inst_218663) # these vars don't vary

#variables starting with "visits_by_" indicate number of visits HS got in 2017
df_pubhs %>% select(visits_by_218663) %>% str()
df_pubhs %>% select(visits_by_218663) %>% count(visits_by_218663)

#filter only obs where HS state code equals home state of university
df_pubhs %>% filter(state_code==inst_218663) %>% count() # count pub HS in SC

#Create measures: number pub HS in SC; number w/ visit; pct w/ visit
df_pubhs %>% filter(state_code==inst_218663) %>%
  select(visits_by_218663) %>%
  mutate(got_visit=ifelse(visits_by_218663>0,1,0)) %>%
  summarise(n_hs=n(),n_visit=sum(got_visit),pct_visit=sum(got_visit)/n())
```

### How well do public universities cover in-state public high schools [SKIP- BASE R]

\medskip

__Task__: for each public research university, calculate the number and percent of public high schools in the university's home state that received a visit

First, let's accomplish task outside of a loop for one university [Base R]

- let's choose `"U of South Carolina"`, `ID==218663`

```{r, results="hide"}
#"state_code" is the 2-letter high school state code
str(df_pubhs$state_code)

#variables starting with "inst_" identify state the university is located in
str(df_pubhs$inst_218663)
table(df_pubhs$inst_218663,
      useNA='ifany') # these vars don't vary

#variables starting with "visits_by_" indicate number of visits HS got in 2017
str(df_pubhs$visits_by_218663)
table(df_pubhs$visits_by_218663, useNA='ifany')

#filter only obs where HS state code equals home state of university
tempdf <- subset(df_pubhs,df_pubhs[["state_code"]]==df_pubhs[["inst_218663"]])
#tempdf <- subset(df_pubhs,df_pubhs$state_code==df_pubhs$inst_218663) # same as above
#tempdf <- subset(df_pubhs,state_code==inst_218663) # same as above

#Create 0/1 indicator of whether got visit
tempdf$got_visit <- ifelse(tempdf$visits_by_218663>0,1,0)

#frequency count of schools that got visits vs. not
table(tempdf$got_visit, useNA='ifany')

#create objects for count table and proportion table
ct_table <- table(tempdf$got_visit, useNA='ifany')
ct_table
typeof(ct_table); length(ct_table); str(ct_table) # named vector with 2 elements

prop.table(ct_table)
pr_table <-prop.table(ct_table)
pr_table
typeof(pr_table); length(pr_table); str(pr_table) # named vector with 2 elements
```

### How well do public universities cover in-state public high schools

\medskip

__Task__: for each public research university, calculate the number and percent of public high schools in the university's home state that received a visit

Build loop

- first, loop through each value of list `instnm`

```{r, results="hide"}
instnm
for (i in seq_along(instnm)) {
  cat("\n","i=",i,sep="",fill=TRUE)
  name <- names(instnm)[[i]] # name of element
  cat("name=",name,sep="",fill=TRUE)
  value <- instnm[[i]] # value of element
  cat("value=",value,sep="",fill=TRUE)
}
```

### How well do public universities cover in-state public high schools

\medskip

__Task__: for each public research university, calculate the number and percent of public high schools in the university's home state that received a visit

Build loop

- next, create "inst_..." and "visits_by_..."
vars for each id
- keep obs in same state as university
- create 0/1 variable of whether high school got a visit

```{r, results="hide"}
for (i in seq_along(instnm)) {
  cat("\n","i=",i,"; ",names(instnm)[[i]],sep="",fill=TRUE)

  #create object called inst_var; value is "inst_id" (e.g., "inst_166629")
  cat("inst_",instnm[[i]],sep="",fill=TRUE)
  inst_var <- paste("inst_",instnm[[i]],sep="")
  print(inst_var)

  #create object called visits_by_var; value is "visits_by_id" (e.g., "visits_by_166629")
  visits_by_var <- paste("visits_by_",instnm[[i]],sep="")
  print(visits_by_var)

  #create subset of data with high schools in same state as the university
  tempdf <- subset(df_pubhs,df_pubhs[["state_code"]]==df_pubhs[[inst_var]])
  # code df_pubhs[[inst_var]] evaluates to df_pubhs[["inst_166629"]] or whatever current value of inst_var is
  # this is same as instnm[[i]] evaluating to instnm[[16]] or whatever current value of i is

  #Create 0/1 indicator of whether got visit
  tempdf$got_visit <- ifelse(tempdf[[visits_by_var]]>0,1,0)
}
```

### How well do public universities cover in-state public high schools

\medskip

__Task__: for each public research university, calculate the number and percent of public high schools in the university's home state that received a visit

Build loop

- next, create count of number of visited and non-visited in-state schools

```{r, results="hide"}
for (i in seq_along(instnm)) {
  cat("\n","i=",i,"; ",names(instnm)[[i]],sep="",fill=TRUE)

  inst_var <- paste("inst_",instnm[[i]],sep="")
  visits_by_var <- paste("visits_by_",instnm[[i]],sep="")

  tempdf <- subset(df_pubhs,df_pubhs[["state_code"]]==df_pubhs[[inst_var]]) # keep obs in same state as university
  tempdf$got_visit <- ifelse(tempdf[[visits_by_var]]>0,1,0) # create 0/1 indicator

  #create frequency table of number of schools with and without visits
  print(table(tempdf$got_visit, useNA='ifany'))
  ct_table <- table(tempdf$got_visit, useNA='ifany') # named vector with 2 elements
  str(ct_table)

  #create proportion table
  print(prop.table(ct_table))
  pr_table <- prop.table(ct_table) # named vector with 2 elements
  str(pr_table)
}
```

### How well do public universities cover in-state public high schools [SKIP]

\medskip

__Task__: for each public research university, calculate the number and percent of public high schools in the university's home state that received a visit

Here is tidyverse approach to loop, which uses some programming concepts we haven't covered

```{r, results="hide",eval=FALSE}
# note: the underscored verbs filter_(), select_(), mutate_() are deprecated in current versions of dplyr
for (i in seq_along(instnm)) {
  cat("\n","i=",i,"; ",names(instnm)[[i]],sep="",fill=TRUE)

  #create object called inst_var; value is "inst_id" (e.g., "inst_166629")
  inst_var <- paste("inst_",instnm[[i]],sep="")

  #create object called visits_by_var; value is "visits_by_id" (e.g., "visits_by_166629")
  visits_by_var <- paste("visits_by_",instnm[[i]],sep="")

  #Create measures: number pub HS in home state; number w/ visit; pct w/ visit
  df_pubhs %>% filter_(glue::glue("state_code=={inst_var}")) %>%
    select_(visits_by_var) %>%
    mutate_(got_visit=glue::glue("ifelse({visits_by_var}>0,1,0)")) %>%
    summarise(n_hs=n(),n_visit=sum(got_visit),pct_visit=sum(got_visit)/n()) %>%
    print() # results are not auto-printed inside a loop
}
```
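
### How well do public universities cover in-state public high schools [SKIP]

\medskip

One possible way to write the same loop without the underscored verbs is to refer to columns by name through the `.data` pronoun. The sketch below is only an illustration under stated assumptions: `toy_pubhs`, `toy_ids`, and the IDs `111` and `222` are made-up stand-ins for `df_pubhs` and `instnm`, and `results` is a hypothetical output list created before the loop.

```{r}
library(dplyr)

# Hypothetical toy data standing in for df_pubhs (IDs 111 and 222 are made up)
toy_pubhs <- data.frame(
  state_code    = c("SC", "SC", "SC", "AL", "AL"),
  inst_111      = "SC",
  visits_by_111 = c(2, 0, 1, 0, 0),
  inst_222      = "AL",
  visits_by_222 = c(0, 0, 0, 3, 0),
  stringsAsFactors = FALSE
)

# Hypothetical named list: name is university name, value is university ID
toy_ids <- list("Univ A" = "111", "Univ B" = "222")

# Create output object before the loop; one element per university
results <- vector("list", length(toy_ids))
names(results) <- names(toy_ids)

for (i in seq_along(toy_ids)) {
  inst_var      <- paste0("inst_", toy_ids[[i]])
  visits_by_var <- paste0("visits_by_", toy_ids[[i]])

  # .data[[string]] looks up the column whose name is stored in the string
  results[[i]] <- toy_pubhs %>%
    filter(state_code == .data[[inst_var]]) %>%
    mutate(got_visit = ifelse(.data[[visits_by_var]] > 0, 1, 0)) %>%
    summarise(n_hs = n(), n_visit = sum(got_visit), pct_visit = sum(got_visit) / n())
}
results
```

Storing each summary in `results` follows the "create the output object before the loop" recipe from earlier in the lecture; the elements could later be combined with `bind_rows(results, .id = "university")`.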