---
title: "Augmented vectors, factor + labelled class"
subtitle: "EDUC 263: Introduction to Programming and Data Management Using R"
author:
date:
fontsize: 8pt
classoption: dvipsnames # for colors
urlcolor: blue
output:
  beamer_presentation:
    keep_tex: true
    toc: false
    slide_level: 3
    theme: default # AnnArbor # push to header?
    #colortheme: "dolphin" # push to header?
    #fonttheme: "structurebold"
    highlight: default # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header
    df_print: default #default # tibble # push to header?
    latex_engine: xelatex # Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts.
    includes:
      in_header: ../beamer_header.tex
      #after_body: table-of-contents.txt
---

```{r, echo=FALSE, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE)
  #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##"
```

```{r, echo=FALSE, include=FALSE}
#THIS CODE DOWNLOADS THE MOST RECENT VERSION OF THE FILE beamer_header.tex AND SAVES IT TO THE DIRECTORY ONE LEVEL UP FROM THIS .RMD LECTURE FILE
download.file(url = 'https://raw.githubusercontent.com/anyone-can-cook/rclass1/master/lectures/beamer_header.tex',
              destfile = '../beamer_header.tex',
              mode = 'wb')
```

```{r, echo=FALSE, include=FALSE}
# Download images
imgs <- c('data-structures-overview.png')

for (i in imgs) {
  if(!file.exists(i)){
    download.file(url = paste0('https://raw.githubusercontent.com/anyone-can-cook/rclass1/master/lectures/intro_to_r/', i),
                  destfile = i,
                  mode = 'wb')
  }
}
```

\tableofcontents

```{r, eval=FALSE, echo=FALSE}
#Use this if you want TOC to show level 2 headings
\tableofcontents
#Use this if you don't want TOC to show level 2 headings
\tableofcontents[subsectionstyle=hide/hide/hide]
```

### Libraries we will use

__Load the packages__ we will use by running this code chunk:

```{r, message=FALSE}
library(tidyverse)
library(haven)
library(labelled)
library(lubridate)
```

__If package not yet installed__, then must install before you load. Install in "console" rather than .Rmd file:

- Generic syntax: `install.packages("package_name")`
- Install "tidyverse": `install.packages("tidyverse")`

__Note__: When we load package, name of package is not in quotes; but when we install package, name of package is in quotes:

- `install.packages("tidyverse")`
- `library(tidyverse)`

### Dataset we will use

```{r}
rm(list = ls()) # remove all objects
load(url("https://github.com/anyone-can-cook/rclass1/raw/master/data/prospect_list/wwlist_merged.RData"))
```

# Attributes and augmented vectors

## Review data types and structures

### Review data structures: __Vectors__

\medskip

Two types of vectors:

1. __Atomic vectors__
1. __Lists__

\medskip

![Overview of data structures (Grolemund and Wickham, 2018)](data-structures-overview.png){width=60%}

### Review data structures: atomic vectors

\medskip

An __atomic vector__ is a collection of values

- Each value in an atomic vector is an __element__
- All elements within vector must have same __data type__

```{r}
(a <- c(1,2,3)) # parentheses () assign and print object in one step
length(a) # length = number of elements
typeof(a) # numeric atomic vector, type=double
str(a) # investigate structure of object
```

Can assign __names__ to vector elements, creating a __named atomic vector__

```{r}
(b <- c(v1=1,v2=2,v3=3))
length(b)
typeof(b)
str(b)
```

### Review data structures: lists

\medskip

- Like atomic vectors, __lists__ are objects that contain __elements__
- However, __data type__ can differ across elements within a list
    - E.g., an element of a list can be another list

```{r}
list_a <- list(1,2,"apple")
typeof(list_a)
length(list_a)
str(list_a)

list_b <- list(1, c("apple", "orange"), list(1, 2))
length(list_b)
str(list_b)
```

### Review data structures: lists

Like atomic vectors, elements within a list can be named, thereby creating a __named list__

```{r}
# not named
str(list_b)

# named
list_c <- list(v1=1, v2=c("apple", "orange"), v3=list(1, 2, 3))
str(list_c)
```

### Review data structures: data frames

\medskip

A __data frame__ is a list with the following characteristics:

- All the elements must be __vectors__ with the same __length__
- Data frames are __augmented lists__ because they have additional __attributes__

```{r}
# a regular list
(list_d <- list(col_a = c(1,2,3), col_b = c(4,5,6), col_c = c(7,8,9)))
typeof(list_d)
attributes(list_d)
```

### Review data structures: data frames

```{r}
# a data frame
(df_a <- data.frame(col_a = c(1,2,3), col_b = c(4,5,6), col_c = c(7,8,9)))
typeof(df_a)
attributes(df_a)
```

## Attributes and augmented vectors

### Atomic vectors versus augmented vectors

__Atomic vectors__ [our focus so far]

- I think of atomic vectors as "just the data"
- Atomic vectors are the building blocks for augmented vectors

\medskip

__Augmented vectors__

- __Augmented vectors__ are atomic vectors with additional __attributes__ attached

__Attributes__

- __Attributes__ are additional "metadata" that can be attached to any object (e.g., vector or list)

Example: Variables of a dataset

- A data frame is a list
- Each element in the list is a variable, which consists of:
    - Atomic vector ("just the data")
    - Any attributes we want to attach to each element/variable
- Variable __name__, an attribute of the data frame object

Other examples of attributes in R

- Value __labels__: Character labels (e.g., "Charter School") attached to numeric values
- Object __class__: Specifies how object is treated by object oriented programming language

__Main takeaway__:

- __Augmented vectors__ are __atomic vectors__ (just the data) with additional __attributes__ attached

### Attributes and functions to identify/modify attributes

Description
of attributes from Grolemund and Wickham 20.6

- "Any vector can contain arbitrary additional __metadata__ through its __attributes__"
- "You can think of __attributes__ as named list of vectors that can be attached to any object"

Functions to identify and modify attributes

- `attributes()` function to describe all attributes of an object
- `attr()` to see individual attribute of an object or set/change an individual attribute of an object

### attributes() function: describes all attributes of an object

```{r, eval=FALSE}
# pull up help file for the attributes() function
?attributes
```

Attributes of a __named atomic vector__:

```{r}
# create named atomic vector
(vector1 <- c(a = 1, b = 2, c = 3, d = 4))
attributes(vector1)
attributes(vector1) %>% str() # a named list of vectors!

# remove all attributes from the object
attributes(vector1) <- NULL
vector1
attributes(vector1)
```

### attributes() function, attributes of a variable in a data frame

\medskip

__Accessing variable using `[[]]` subset operator__

- Recall `object_name[["element_name"]]` accesses contents of the element
- If object is a data frame, `df_name[["var_name"]]` accesses contents of variable
- For simple vars like `firstgen`, syntax yields an atomic vector ("just the data")
- Shorthand syntax for `df_name[["var_name"]]` is `df_name$var_name`

```{r}
str(wwlist[["firstgen"]])
attributes(wwlist[["firstgen"]])

str(wwlist$firstgen) # same result as the [[]] syntax
attributes(wwlist$firstgen)
```

__Accessing variable using `[]` subset operator__

- `object_name["element_name"]` creates object of same type as `object_name`
- If object is a data frame, `df_name["var_name"]` returns a data frame containing just the `var_name` column

```{r, results="hide"}
str(wwlist["firstgen"])
attributes(wwlist["firstgen"])
```

### attributes() function, attributes of lists and data frames

\medskip

Attributes of a __named list__:

```{r}
list2 <- list(col_a = c(1,2,3), col_b = c(4,5,6))
str(list2)
attributes(list2)
```

Note that the `names`
attribute is an attribute of the list, not an attribute of the elements within the list (which are atomic vectors)

```{r}
list2[['col_a']] # the element named 'col_a'
str(list2[['col_a']]) # structure of the element named 'col_a'
attributes(list2[['col_a']]) # attributes of element named 'col_a'
```

### attributes() function, attributes of lists and data frames

Attributes of a __data frame__:

```{r}
list3 <- data.frame(col_a = c(1,2,3), col_b = c(4,5,6))
str(list3)
attributes(list3)
```

Note: attributes `names`, `class` and `row.names` are attributes of the data frame

- they are not attributes of the elements (variables) within the data frame, which are atomic vectors (i.e., just the data)

```{r}
str(list3[['col_a']]) # structure of the element named 'col_a'
attributes(list3[['col_a']]) # attributes of element named 'col_a'
```

### attr() function: get or set specific attributes of an object

```{r, eval=FALSE, include=FALSE}
?attr
```

Syntax

- Get: `attr(x, which, exact = FALSE)`
- Set: `attr(x, which) <- value`

Arguments

- `x`: an object whose attributes are to be accessed
- `which`: a non-empty character string specifying which attribute is to be accessed
- `exact` (logical): should `which` be matched exactly?
default is `exact = FALSE`
- `value`: an object, new value of attribute, or `NULL` to remove attribute

\medskip

__Using `attr()` to get specific attribute of an object__

```{r}
vector1 <- c(a = 1, b= 2, c= 3, d = 4)
attributes(vector1)
attr(x=vector1, which = "names", exact = FALSE)
attr(vector1, "names")
attr(vector1, "name") # partial name; still matches "names" because exact = FALSE by default
attr(vector1, "name", exact = TRUE) # partial name with exact = TRUE; no match, returns NULL
```

### attr() function: get or set specific attributes of an object

```{r, eval=FALSE, include=FALSE}
?attr
```

Syntax

- Get: `attr(x, which, exact = FALSE)`
- Set: `attr(x, which) <- value`

Arguments

- `x`: an object whose attributes are to be accessed
- `which`: a non-empty character string specifying which attribute is to be accessed
- `exact` (logical): should `which` be matched exactly? default is `exact = FALSE`
- `value`: an object, new value of attribute, or `NULL` to remove attribute

\medskip

__Using `attr()` to set specific attribute of an object__ (output omitted)

```{r, results = "hide"}
(vector1 <- c(a = 1, b= 2, c= 3, d = 4))
attributes(vector1) # see all attributes

attr(x=vector1, which = "greeting") <- "Hi!" # create new attribute
attr(x=vector1, which = "greeting") # see attribute

attr(vector1, "farewell") <- "Bye!"
# create attribute

attr(x=vector1, which = "names") # see names attribute
attr(x=vector1, which = "names") <- NULL # delete names attribute

attributes(vector1) # see all attributes
```

### attr() function, apply on data frames

\medskip

__Using `wwlist`, create data frame with three variables__

```{r, results = "hide"}
wwlist_small <- wwlist[1:25, ] %>% select(hs_state,firstgen,med_inc_zip)
str(wwlist_small)
attributes(wwlist_small)
attributes(wwlist_small) %>% str()
```

__Get/set attribute of a data frame__

```{r, results = "hide"}
#get/examine names attribute
attr(x=wwlist_small, which = "names")
str(attr(x=wwlist_small, which = "names")) # names attribute is character atomic vector, length=3

#add new attribute to data frame
attr(x=wwlist_small, which = "new_attribute") <- "contents of new attribute"
attributes(wwlist_small)
```

__Get/set attribute of a variable in data frame__

```{r, results = "hide"}
str(wwlist_small$med_inc_zip)
attributes(wwlist_small$med_inc_zip)

#create attribute for variable med_inc_zip
attr(wwlist_small$med_inc_zip, "inc attribute") <- "inc attribute contents"

#investigate attribute for variable med_inc_zip
attributes(wwlist_small$med_inc_zip)
str(wwlist_small$med_inc_zip)
attr(wwlist_small$med_inc_zip, "inc attribute")
```

### Why add attributes to data frame or variables of data frame?

Pedagogical reasons

- Important to know how you can apply `attributes()` and `attr()` to data frames and to variables within data frames

\medskip

Example practical application: interactive dashboards

- When creating "dashboard" you might want to add "tooltips"
- "Tooltip" is a message that appears when cursor is positioned over an icon
- The text in the tooltip is the contents of an attribute
- Example dashboard: [LINK](https://jkcf.shinyapps.io/dashboard/)

### Student exercises

1. Using `wwlist`, create data frame of 30 observations with three variables: `state`, `zip5`, `pop_total_zip`
2. Return all attributes of this new data frame using `attributes()`.
Then, get the `names` attribute of the data frame using `attr()`.
3. Add a new attribute to the data frame called `attribute_data` whose content is `"new attribute of data"`. Then, return all attributes of the data frame as well as get the value of the newly created `attribute_data`.
4. Return the attributes of the variable `pop_total_zip` in the data frame.
5. Add a new attribute to the variable `pop_total_zip` called `attribute_variable` whose content is `"new attribute of variable"`. Then, return all attributes of the variable as well as get the value of the newly created `attribute_variable`.

### Solution to student exercises

```{r, results = "hide"}
# Part 1
wwlist_exercise <- wwlist[1:30, ] %>% select(state, zip5, pop_total_zip)

# Part 2
attributes(wwlist_exercise)
attr(x=wwlist_exercise, which = "names")

# Part 3
attr(x=wwlist_exercise, which = "attribute_data") <- "new attribute of data"
attributes(wwlist_exercise)
attr(wwlist_exercise, which ="attribute_data")

# Part 4
attributes(wwlist_exercise$pop_total_zip)

# Part 5
attr(wwlist_exercise$pop_total_zip, "attribute_variable") <- "new attribute of variable"
attributes(wwlist_exercise$pop_total_zip)
attr(wwlist_exercise$pop_total_zip, "attribute_variable")
```

# Object class

### Object class

\medskip

Every object in R has a __class__

- Class is an __attribute__ of an object
- Object class controls how functions work and defines the rules for how objects can be treated by object oriented programming language
    - E.g., which functions you can apply to object of a particular class
    - E.g., what the function does to one object class, what it does to another object class

You can use the `class()` function to identify object class:

```{r}
(vector2 <- c(a = 1, b= 2, c= 3, d = 4))
typeof(vector2)
class(vector2)
```

When I encounter a new object I often investigate object by applying `typeof()`, `class()`, and `attributes()` functions:

```{r}
typeof(vector2)
class(vector2)
attributes(vector2)
```

### Why is object class important?

Functions care about object __class__, not object __type__

\medskip

Specific functions usually work with only particular __classes__ of objects

- "Date" functions usually only work on objects with a date class
- "String" functions usually only work on objects with a character class
- Functions that do mathematical computation usually work on objects with a numeric class

### Functions care about object __class__, not object __type__

\medskip

__Example__: `sum()` applies to __numeric__, __logical__, or __complex__ class objects

```{r, eval=FALSE, include=FALSE}
?sum
```

Apply `sum()` to object with class = __logical__:

```{r}
x <- c(TRUE, FALSE, NA, TRUE)
typeof(x)
class(x)
sum(x, na.rm = TRUE)
```

Apply `sum()` to object with class = __numeric__:

```{r}
typeof(wwlist$med_inc_zip)
class(wwlist$med_inc_zip)
wwlist$med_inc_zip[1:5]
sum(wwlist$med_inc_zip[1:5], na.rm = TRUE)
```

What happens when we try to apply `sum()` to an object with class = __character__?

```{r, eval=FALSE}
typeof(wwlist$hs_city)
class(wwlist$hs_city)
wwlist$hs_city[1:5]
sum(wwlist$hs_city[1:5], na.rm = TRUE)
```

### Functions care about object __class__, not object __type__

__Example__: `year()` from `lubridate` package applies to date-time objects

```{r, include=FALSE, results = "hide"}
library(lubridate)
```

```{r, eval=FALSE, include=FALSE}
?year
```

Apply `year()` to object with class = __Date__:

```{r}
wwlist$receive_date[1:5]
typeof(wwlist$receive_date)
class(wwlist$receive_date)
year(wwlist$receive_date[1:5])
```

What happens when we try to apply `year()` to an object with class = __numeric__?
```{r, eval=FALSE}
typeof(wwlist$med_inc_zip)
class(wwlist$med_inc_zip)
year(wwlist$med_inc_zip[1:10])
```

### Functions care about object __class__, not object __type__

__Example__: `tolower()` applies to __character__ class objects

- Syntax: `tolower(x)`
    - `x` is "a character vector, or an object that can be coerced to character by `as.character()`"

Most string functions are intended to apply to objects with a __character__ class

- __type__ = character
- __class__ = character

```{r, eval=FALSE, include=FALSE}
?tolower
```

Apply `tolower()` to object with class = __character__:

```{r}
str(wwlist$hs_city)
typeof(wwlist$hs_city)
class(wwlist$hs_city)
wwlist$hs_city[1:6]
tolower(wwlist$hs_city[1:6])
```

### Class and object-oriented programming

R is an object-oriented programming language

\medskip

Definition of object oriented programming from this [LINK](https://www.webopedia.com/TERM/O/object_oriented_programming_OOP.html)

\medskip

> "Object-oriented programming (OOP) refers to a type of computer programming in which programmers define not only the data type of a data structure, but also the types of operations (functions) that can be applied to the data structure."
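
\medskip

To make the definition above concrete, here is a minimal sketch of one function doing different things to objects of different classes. The function `describe()` and its three methods are hypothetical names invented for this illustration; they are not part of base R or any package used in this lecture.

```{r}
# describe() is a hypothetical generic written only for this illustration
describe <- function(x) UseMethod("describe")

# fallback used when no class-specific method exists
describe.default <- function(x) "no class-specific method; using the default"

# method used for objects with class = factor
describe.factor <- function(x) paste("factor with", nlevels(x), "levels")

# method used for objects with class = Date
describe.Date <- function(x) paste("dates starting", format(min(x)))

describe(c(1, 2, 3))                  # class numeric -> default method
describe(factor(c("a", "b", "a")))    # class factor  -> factor method
describe(as.Date("2020-01-15"))       # class Date    -> Date method
```

The same call, `describe(x)`, runs different code depending on `class(x)`. This mechanism is R's S3 system, one of several object systems in R.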
\medskip

Object __class__ is fundamental to object oriented programming because:

- Object class determines which functions can be applied to the object
- Object class also determines what those functions do to the object
    - E.g., a specific function might do one thing to objects of __class__ A and another thing to objects of __class__ B
    - What a function does to objects of different class is determined by whoever wrote the function

\medskip

Many different object classes exist in R

- You can also create your own classes
    - Example: the `labelled` class is an object class created by Hadley Wickham when he created the `haven` package
- In this course we will work with classes that have been created by others

# Class == factor

### Recoding variable `ethn_code` from data frame `wwlist`

Let's first recode the `ethn_code` variable:

```{r, results= "hide"}
wwlist <- wwlist %>%
  mutate(ethn_code =
    recode(ethn_code,
      "american indian or alaska native" = "nativeam",
      "asian or native hawaiian or other pacific islander" = "api",
      "black or african american" = "black",
      "cuban" = "latinx",
      "mexican/mexican american" = "latinx",
      "not reported" = "not_reported",
      "other-2 or more" = "multirace",
      "other spanish/hispanic" = "latinx",
      "puerto rican" = "latinx",
      "white" = "white"
    )
  )

str(wwlist$ethn_code)
wwlist %>% count(ethn_code)
```

### Factors

__Factors__ are an object _class_ used to represent categorical data (e.g., marital status)

- A factor is an __augmented vector__ built by attaching a __levels__ attribute to an (atomic) integer vector

Usually, we would prefer a categorical variable (e.g., race, school type) to be a factor variable rather than a character variable

- So far in the course I have made all categorical variables character variables because we had not introduced factors yet

__Create factor version of character variable `ethn_code` using base R `factor()` function__:

```{r}
str(wwlist$ethn_code)
class(wwlist$ethn_code)

# create factor var; tidyverse approach
wwlist <-
wwlist %>% mutate(ethn_code_fac = factor(ethn_code))
#wwlist$ethn_code_fac <- factor(wwlist$ethn_code) # base r approach

str(wwlist$ethn_code)
str(wwlist$ethn_code_fac)
```

### Factors

__Character variable `ethn_code`__:

```{r}
typeof(wwlist$ethn_code)
class(wwlist$ethn_code)
attributes(wwlist$ethn_code)
str(wwlist$ethn_code)
```

__Factor variable `ethn_code_fac`__:

```{r}
typeof(wwlist$ethn_code_fac)
class(wwlist$ethn_code_fac)
attributes(wwlist$ethn_code_fac)
str(wwlist$ethn_code_fac)
```

### Working with factor variables

Main things to note about variable `ethn_code_fac`

- __type__ = integer
- __class__ = factor, and the variable has a __levels__ attribute
- Underlying data are integers, but the values of the __levels__ attribute are what is displayed:

```{r}
# Print first few obs of ethn_code_fac
wwlist$ethn_code_fac[1:5]

# Print count for each category in ethn_code_fac
wwlist %>% count(ethn_code_fac)
```

### Working with factor variables

Apply `as.integer()` to display underlying integer values of factor variable

```{r, eval=FALSE, include=FALSE}
?as.integer
```

Investigate `as.integer()` function:

```{r}
typeof(wwlist$ethn_code_fac)
class(wwlist$ethn_code_fac)

typeof(as.integer(wwlist$ethn_code_fac))
class(as.integer(wwlist$ethn_code_fac))
```

Display underlying integer values of variable `ethn_code_fac`:

```{r}
wwlist %>% count(as.integer(ethn_code_fac))
```

### Working with factor variables

\medskip

Refer to categories of a factor (e.g., when filtering obs) using values of __levels__ attribute rather than underlying values of variable

- Values of __levels__ attribute for `ethn_code_fac` (output omitted)

```{r, results='hide'}
attributes(wwlist$ethn_code_fac)
```

\medskip

__Example__: Count the number of prospects in `wwlist` who identify as "white"

```{r}
# referring to underlying integer value; this doesn't work
wwlist %>% filter(ethn_code_fac==7) %>% count()

# referring to value of levels attribute; this works
wwlist %>% filter(ethn_code_fac=="white") %>% count()
```

### Working with factor variables

__Example__: Count the number of prospects in `wwlist` who identify as "white"

- To refer to underlying integer values, apply `as.integer()` function to factor variable

```{r}
attributes(wwlist$ethn_code_fac)

wwlist %>% filter(as.integer(ethn_code_fac)==7) %>% count()
```

### How to identify the variable values associated with factor levels

Create a factor version of the character variable `psat_range`

```{r, results="hide", warning = FALSE}
wwlist %>% count(psat_range)

wwlist <- wwlist %>% mutate(psat_range_fac = factor(psat_range))

wwlist %>% count(psat_range_fac)
attributes(wwlist$psat_range_fac)
```

Investigate values associated with factor levels using `levels()` and `nlevels()`

```{r, results="hide"}
levels(wwlist$psat_range_fac) #starts at 1
nlevels(wwlist$psat_range_fac) #7 levels total
levels(wwlist$psat_range_fac)[1:3] #prints levels 1-3
```

Once values associated with factor levels are known:

- Can filter based on underlying integer values using `as.integer()`

```{r}
wwlist %>% filter(as.integer(psat_range_fac)==4) %>% count()
```

- Or filter based on value of factor __levels__

```{r}
wwlist %>% filter(psat_range_fac=="1270-1520") %>% count()
```

### Creating factor variables from character variables or from integer variables

See Appendix

### Factor student exercise

1. After running the code below, use `typeof()`, `class()`, `str()`, and `attributes()` functions to check the new variable `receive_year`
2. Create a factor variable from the input variable `receive_year` and name it `receive_year_fac`
3. Run the same functions (`typeof()`, `class()`, etc.) from the first question using the new variable you created
4. Get a count of `receive_year_fac`.
(__hint:__ you could also run this in the console to see values associated with each factor)

Run this code to create a year variable from the input variable `receive_date`:

```{r, results="hide", message=FALSE}
# wwlist %>% glimpse()

library(lubridate) # load library if you haven't already
wwlist <- wwlist %>% mutate(receive_year = year(receive_date)) # create year variable with lubridate

# Check variable
wwlist %>% count(receive_year)
wwlist %>% group_by(receive_year) %>% count(receive_date)
```

### Factor student exercise solutions

1. After running the code below, use `typeof()`, `class()`, `str()`, and `attributes()` functions to check the new variable `receive_year`

```{r}
typeof(wwlist$receive_year)
class(wwlist$receive_year)
str(wwlist$receive_year)
attributes(wwlist$receive_year)
```

### Factor student exercise solutions

2. Create a factor variable from the input variable `receive_year` and name it `receive_year_fac`

```{r}
# create factor var; tidyverse approach
wwlist <- wwlist %>% mutate(receive_year_fac = factor(receive_year))
```

### Factor student exercise solutions

3. Run the same functions (`typeof()`, `class()`, etc.) from the first question using the new variable you created

```{r}
typeof(wwlist$receive_year_fac)
class(wwlist$receive_year_fac)
str(wwlist$receive_year_fac)
attributes(wwlist$receive_year_fac)
```

### Factor student exercise solutions

4. Get a count of `receive_year_fac`. (__hint:__ you could also run this in the console to see values associated with each factor)

```{r}
wwlist %>% count(receive_year_fac)
```

# Class == labelled

### Data we will use to introduce `labelled` class

High school longitudinal surveys from National Center for Education Statistics (NCES)

- Follow U.S.
students from high school through college, labor market

\medskip

We will be working with [High School Longitudinal Study of 2009 (HSLS:09)](https://nces.ed.gov/surveys/hsls09/index.asp)

- Follows 9th graders from 2009
- Data collection waves
    - Base Year (2009)
    - First Follow-up (2012)
    - 2013 Update (2013)
    - High School Transcripts (2013-2014)
    - Second Follow-up (2016)

### Using `haven` package to read SAS/SPSS/Stata datasets into R

[`haven`](https://haven.tidyverse.org/), which is part of __tidyverse__, "enables R to read and write various data formats" from the following statistical packages:

- SAS
- SPSS
- Stata

\medskip

When using `haven` to read data, resulting R objects have these characteristics:

- Data frames are __tibbles__, Tidyverse's preferred __class__ of data frames
- Transform variables with "value labels" into the `labelled()` class
    - `labelled` is an object class, just like `factor` is an object class
    - `labelled` is an object __class__ created by folks who created `haven` package
    - `labelled` and `factor` classes are both viable alternatives for categorical variables
    - Helpful description of `labelled` class [HERE](https://haven.tidyverse.org/articles/semantics.html)
- Dates and times converted to R date/time classes
- Character vectors not converted to factors

### Using `haven` package to read SAS/SPSS/Stata datasets into R

Use `read_dta()` function from `haven` package to import Stata dataset into R

```{r, results="hide"}
hsls <- read_dta(file="https://github.com/ozanj/rclass/raw/master/data/hsls/hsls_stu_small.dta")
```

__Must__ run this code chunk; permanently changes uppercase variable names to lowercase

```{r, results="hide"}
names(hsls)
names(hsls) <- tolower(names(hsls)) # convert names to lowercase
names(hsls) # names now lowercase

str(hsls) # ugh
```

__Investigate variable `s3classes` from data frame `hsls`__

- Identifies whether respondent taking postsecondary classes as of 11/1/2013

```{r, results="hide"}
typeof(hsls$s3classes)
class(hsls$s3classes)
str(hsls$s3classes)
```

__Investigate attributes of `s3classes`__

```{r, results="hide"}
attributes(hsls$s3classes) # all attributes

#specific attributes: using syntax: attr(x, which, exact = FALSE)
attr(x=hsls$s3classes, which = "label") # label attribute
attr(x=hsls$s3classes, which = "labels") # labels attribute
```

### What is object class = `labelled`?

\medskip

__Variable labels__ are labels attached to a specific variable (e.g., marital status)

__Value labels__ [in Stata] are labels attached to specific values of a variable, e.g.:

- Var value `1` attached to value label "married", `2`="single", `3`="divorced"

\medskip

`labelled` is object class for importing vars with __value labels__ from SAS/SPSS/Stata

- `labelled` object class created by `haven` package
- Characteristics of variables in R data frame with `class==labelled`:
    - Data `type` can be numeric(double) or character
    - To see `value labels` associated with each value:
        - `attr(df_name$var_name,"labels")`
        - E.g., `attr(hsls$s3classes,"labels")`

Investigate the attributes of `hsls$s3classes`

```{r, results="hide"}
typeof(hsls$s3classes)
class(hsls$s3classes)
str(hsls$s3classes)
attributes(hsls$s3classes)
```

Use `attr(object_name,"attribute_name")` to refer to each attribute

```{r, results="hide"}
attr(hsls$s3classes,"label")
attr(hsls$s3classes,"format.stata")
attr(hsls$s3classes,"class")
attr(hsls$s3classes,"labels")
```

### `labelled` package

Purpose of the `labelled` package is to work with data imported from SPSS/Stata/SAS using the `haven` package

- `labelled` package contains functions to work with objects that have `labelled` class
- From package documentation:
    - "purpose of the `labelled` package is to provide functions to manipulate _metadata_ as variable labels, value labels and defined missing values using the `labelled` class and the `label` attribute introduced in `haven` package."
- More info on the `labelled` package: [LINK](https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html)

Functions in `labelled` package

- [Full list](https://www.rdocumentation.org/packages/labelled/versions/1.1.0)

```{r, eval=FALSE, include=FALSE}
?labelled
```

## Get variable and value labels

### Functions to get variable labels and value labels

\medskip

__Get variable labels__ using `var_label()`

```{r}
hsls %>% select(s3classes) %>% var_label()
```

\medskip

__Get value labels__ using `val_labels()`

```{r}
hsls %>% select(s3classes) %>% val_labels()
```

### Working with `labelled` class data

\medskip

Create frequency tables with `labelled` class variables using `count()`

- Default setting is to show variable __values__ not __value labels__

```{r}
hsls %>% count(s3classes)
```

\medskip

To make frequency table show __value labels__ add `%>% as_factor()` to pipe

- `as_factor()` is function from `haven` that converts an object to a factor

```{r}
hsls %>% count(s3classes) %>% as_factor()
```

### Working with `labelled` class data

To isolate values of `labelled` class variables in `filter()` function:

- Refer to variable __value__, not the __value label__

__Task__

- How many observations in var `s3classes` associated with "Unit non-response"
- How many observations in var `s3classes` associated with "Yes"

General steps to follow:

1. Investigate object
1. Use `filter()` to isolate desired observations

Investigate object

```{r, results="hide"}
class(hsls$s3classes)
hsls %>% select(s3classes) %>% var_label() #show variable label
hsls %>% select(s3classes) %>% val_labels() #show value labels
hsls %>% count(s3classes) # freq table, values
hsls %>% count(s3classes) %>% as_factor() # freq table, value labels
```

Filter specific values

```{r, results="hide"}
hsls %>% filter(s3classes==-8) %>% count() # -8 = unit non-response
hsls %>% filter(s3classes==1) %>% count() # 1 = yes
```

## Set variable and value labels

### Functions to set variable labels and value labels

\medskip

__Set variable labels__ using `var_label()` or `set_variable_labels()`

```{r, eval=F}
# Set one variable label
var_label(df_name$var_name) <- 'variable label'

# Set multiple variable labels
df_name <- df_name %>%
  set_variable_labels(
    var_name_1 = 'variable label 1',
    var_name_2 = 'variable label 2',
    var_name_3 = 'variable label 3'
  )
```

\medskip

__Set value labels__ using `val_label()` or `set_value_labels()`

```{r, eval=F}
# Set one value label
val_label(df_name$var_name, 'variable_value') <- 'value_label'

# Set multiple value labels
df_name <- df_name %>%
  set_value_labels(
    var_name_1 = c('value_label_1' = 'variable_value_1',
                   'value_label_2' = 'variable_value_2'),
    var_name_2 = c('value_label_3' = 'variable_value_3',
                   'value_label_4' = 'variable_value_4')
  )
```

### Create example data frame

```{r}
df <- tribble(
  ~id, ~edu, ~sch,
  #--|--|----
  1, 2, 2,
  2, 1, 1,
  3, 3, 2,
  4, 4, 2,
  5, 1, 2
)
df
str(df)
```

### Set variable labels

Use `set_variable_labels()` or `var_label()` to manually set variable labels

```{r}
str(df$sch)
var_label(df$sch)

# Using set_variable_labels()
df <- df %>%
  set_variable_labels(
    id = "Unique identification number",
    edu = "Education level"
  )

# Using var_label()
var_label(df$sch) <- 'Type of school attending'

str(df$sch)
var_label(df$sch)
```

### Set value labels

Use `set_value_labels()` or `val_label()` to manually set value labels

```{r}
val_labels(df$sch)

# Using set_value_labels()
df <- df %>%
  set_value_labels(
    edu = c('High School' = 1, 'AA degree' = 2, 'BA degree' = 3, 'MA or higher' = 4),
    sch = c('Private' = 1))

# Using val_label()
val_label(df$sch, 2) <- 'Public'

str(df$sch)
val_labels(df$sch)
```

### View the set variable and value labels

```{r}
# View variable and value labels using attributes()
attributes(df$sch)

# View variable label
var_label(df$sch)
attr(df$sch, 'label')

# View value labels
val_labels(df$sch)
attr(df$sch, 'labels')
```

### `labelled` student exercise

1. Get variable and value labels of the variable `s3hs` in the `hsls` data frame
2. Get a count of the variable `s3hs` showing the values and the value labels (__hint__: use `as_factor()`)
3. Get a count of the rows whose value for `s3hs` is associated with "Missing" (__hint__: use `filter()`)
4. Get a count of the rows whose value for `s3hs` is associated with "Missing" or "Unit non-response"
5. Add variable label for `pop_asian_zip` & `pop_asian_state` in data frame `wwlist`
6. Add value labels for `ethn_code` in data frame `wwlist`

### `labelled` student exercise solutions

1. Get variable and value labels of the variable `s3hs` in the `hsls` data frame

```{r}
hsls %>% select(s3hs) %>% var_label()
hsls %>% select(s3hs) %>% val_labels()
```

### `labelled` student exercise solutions

2. Get a count of the variable `s3hs` showing the values and the value labels (__hint__: use `as_factor()`)

```{r}
hsls %>% count(s3hs)
hsls %>% count(s3hs) %>% as_factor()
```

### `labelled` student exercise solutions

3. Get a count of the rows whose value for `s3hs` is associated with "Missing" (__hint__: use `filter()`)

```{r}
hsls %>% filter(s3hs== -9) %>% count()
```

### `labelled` student exercise solutions

4. Get a count of the rows whose value for `s3hs` is associated with "Missing" or "Unit non-response"

```{r}
hsls %>% filter(s3hs== -9 | s3hs== -8) %>% count()
```

### `labelled` student exercise solutions

5. 
Add variable label for `pop_asian_zip` & `pop_asian_state` in data frame `wwlist`

```{r}
# variable labels
wwlist %>% select(pop_asian_zip, pop_asian_state) %>% var_label()

# set variable labels
wwlist <- wwlist %>%
  set_variable_labels(
    pop_asian_zip = "total asian population in zip",
    pop_asian_state = "total asian population in state"
  )

# attribute of variable
attributes(wwlist$pop_asian_zip)
attributes(wwlist$pop_asian_state)
```

### `labelled` student exercise solutions

6. Add value labels for `ethn_code` in data frame `wwlist`

```{r, results="hide"}
# count
wwlist %>% count(ethn_code)

# value labels
wwlist %>% select(ethn_code) %>% val_labels()

# set value labels to ethn_code variable
wwlist <- wwlist %>%
  set_value_labels(
    ethn_code = c("asian or native hawaiian or other pacific islander" = "api",
                  "black or african american" = "black",
                  "cuban or mexican/mexican american or other spanish/hispanic or puerto rican" = "latinx",
                  "other-2 or more" = "multirace",
                  "american indian or alaska native" = "nativeam",
                  "not reported" = "not_reported",
                  "white" = "white"
    )
  )
```

# Comparing labelled class to factor class

### Comparing `class==labelled` to `class==factor`

| | `class==labelled` | `class==factor` |
|---|----------|-------------|
| data type | numeric or character | integer |
| name of value label attribute | labels | levels |
| refer to data using | variable values | levels attribute |

\bigskip

So should you work with `class==labelled` or `class==factor`?

- No right or wrong answer; this is a subjective decision
- Personally, I prefer the `labelled` class
    - Easier to control the underlying variable value
    - Feels more suited to working with survey data variables, where there are usually several different values that represent different kinds of "missing" values

### Converting `class==labelled` to `class==factor`

The `as_factor()` function from the `haven` package converts variables with `class==labelled` to `class==factor`

- Can be used for descriptive statistics

```{r, results="hide"}
hsls %>% select(s3classes) %>% count(s3classes)
hsls %>% select(s3classes) %>% count(s3classes) %>% as_factor()
```

- Can create object with some or all `labelled` vars converted to `factor`

```{r}
hsls_f <- as_factor(hsls, only_labelled = TRUE)
```

Let's examine this object

```{r, results="hide"}
glimpse(hsls_f)
hsls_f %>% select(s3classes,s3clglvl) %>% str()
typeof(hsls_f$s3classes)
class(hsls_f$s3classes)
attributes(hsls_f$s3classes)
hsls_f %>% select(s3classes) %>% var_label()
hsls_f %>% select(s3classes) %>% val_labels()
```

### Working with `class==factor` data

Showing factor levels associated with a factor variable

```{r}
hsls_f %>% count(s3classes)
```

Showing variable values associated with a factor variable

```{r}
hsls_f %>% count(as.integer(s3classes))
```

### Working with `class==factor` data

When subsetting observations (e.g., filtering), refer to the `levels` attribute, not the underlying variable value

```{r}
hsls_f %>% filter(s3classes=="Yes") %>% count(s3classes)
```

# Appendix: Creating factor variables

### Create factors [from string variables]

To create a factor variable from a string variable:

1. Create a character vector containing underlying data
1. Create a vector containing valid levels
3. 
Attach levels to the data using the `factor()` function

```{r}
# Underlying data: months in which my family members were born
x1 <- c("Jan", "Aug", "Apr", "Mar")

# Create vector with valid levels
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Attach levels to data
x2 <- factor(x1, levels = month_levels)
```

Note how attributes differ:

```{r}
str(x1)
str(x2)
```

Sorting also differs:

```{r}
sort(x1)
sort(x2)
```

### Create factors [from string variables]

Let's create a character version of variable `hs_state` and then turn it into a factor:

```{r, eval=FALSE}
#wwlist %>%
#  count(hs_state)

# Subset obs to West Coast states
wwlist_temp <- wwlist %>%
  filter(hs_state %in% c("CA", "OR", "WA"))

# Create character version of high school state for West Coast states only
wwlist_temp$hs_state_char <- as.character(wwlist_temp$hs_state)

# Investigate character variable
str(wwlist_temp$hs_state_char)
table(wwlist_temp$hs_state_char)

# Create new variable that assigns levels
wwlist_temp$hs_state_fac <- factor(wwlist_temp$hs_state_char, levels = c("CA","OR","WA"))
str(wwlist_temp$hs_state_fac)
attributes(wwlist_temp$hs_state_fac)

#wwlist_temp %>%
#  count(hs_state_fac)

rm(wwlist_temp)
```

### Create factors [from string variables]

How the `levels` argument works when underlying data is character:

- Matches each value of the underlying data to a value of the levels attribute
- Converts underlying data to integer, with the levels attribute attached

\medskip

See [Chapter 15 of Wickham](https://r4ds.had.co.nz/factors.html) for more on factors (e.g., modifying factor order, modifying factor levels)

### Creating factors [from integer vectors]

Factors are just integer vectors with level attributes attached to them. So, to create a factor:

1. Create a vector for the underlying data
1. Create a vector that has level attributes
3. 
Attach levels to the data using the `factor()` function

```{r}
a1 <- c(1,1,1,0,1,1,0) # A vector of data
a2 <- c("zero","one") # A vector of labels

# Attach labels to values
a3 <- factor(a1, labels = a2)
a3
str(a3)
```

Note: By default, `factor()` assigns labels in the order of the sorted unique values of `a1`, so "zero" (the first element of `a2`) is attached to the lowest value of `a1`

### Creating factors [from integer vectors]

Let's turn an integer variable into a factor variable in the `wwlist` data frame

Create integer version of `receive_year`:

```{r}
#typeof(wwlist$receive_year)
wwlist$receive_year_int <- as.integer(wwlist$receive_year)
str(wwlist$receive_year_int)
typeof(wwlist$receive_year_int)
```

Assign levels to values of integer variable:

```{r, eval=FALSE}
wwlist$receive_year_fac <- factor(wwlist$receive_year_int,
                                  labels=c("Twenty-sixteen","Twenty-seventeen","Twenty-eighteen"))
str(wwlist$receive_year_fac)
str(wwlist$receive_year)

# Check variable
wwlist %>% count(receive_year_fac)
wwlist %>% count(receive_year)
```
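
### Creating factors [from integer vectors]

The examples above rely on `factor()` matching labels to the sorted unique values of the data. A minimal sketch of making that mapping explicit, by passing `levels` and `labels` together, might look like the chunk below. The object names `f_implicit`, `f_explicit`, and `f_reordered` are hypothetical and used only for illustration; the chunk uses base R only.

```{r}
# Hypothetical data: a 0/1 indicator stored as a numeric vector
a1 <- c(1, 1, 1, 0, 1, 1, 0)

# Implicit mapping: labels are matched to the sorted unique values (0, then 1)
f_implicit <- factor(a1, labels = c("zero", "one"))

# Explicit mapping: levels lists the data values, labels gives one label per level
f_explicit <- factor(a1, levels = c(0, 1), labels = c("zero", "one"))

# Both approaches should produce the same factor
identical(f_implicit, f_explicit)

# Listing the levels in a different order changes the underlying integer codes,
# but each data value keeps its own label
f_reordered <- factor(a1, levels = c(1, 0), labels = c("one", "zero"))
levels(f_reordered)
as.integer(f_reordered)
as.character(f_reordered)
```

Stating `levels` explicitly is arguably safer when a value might be absent from the data, since the label-to-value mapping then no longer depends on which values happen to appear.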