---
title: "Lecture 6: Augmented vectors, factor + labelled class"
subtitle: ""
author:
date:
fontsize: 8pt
classoption: dvipsnames  # for colors
urlcolor: blue
output:
  beamer_presentation:
    keep_tex: true
    toc: false
    slide_level: 3
    theme: default # AnnArbor # push to header?
    #colortheme: "dolphin" # push to header?
    #fonttheme: "structurebold"
    highlight: default # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header
    df_print: default #default # tibble # push to header?
    latex_engine: xelatex # Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts.
    includes:
      in_header: ../beamer_header.tex
      #after_body: table-of-contents.txt
---

```{r, echo=FALSE, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE)
  #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##"
```

```{r, echo=FALSE, include=FALSE}
#THIS CODE DOWNLOADS THE MOST RECENT VERSION OF THE FILE beamer_header.tex AND SAVES IT TO THE DIRECTORY ONE LEVEL UP FROM THIS .RMD LECTURE FILE
download.file(url = 'https://raw.githubusercontent.com/ozanj/rclass/master/lectures/beamer_header.tex',
              destfile = '../beamer_header.tex',
              mode = 'wb')
```

```{r, echo=FALSE, include=FALSE}
#DO NOT WORRY ABOUT THIS
if(!file.exists('data-structures-overview.png')){
  download.file(url = 'https://raw.githubusercontent.com/ozanj/rclass/master/lectures/lecture5/data-structures-overview.png',
                destfile = 'data-structures-overview.png',
                mode = 'wb')
}
```

```{r, echo=FALSE, include=FALSE, eval=FALSE}
```

# Introduction

### What we will do today

\tableofcontents

```{r, eval=FALSE, echo=FALSE}
#Use this if you want TOC to show level 2 headings
\tableofcontents

#Use this if you don't want TOC to show level 2 headings
\tableofcontents[subsectionstyle=hide/hide/hide]
```

### Libraries we will use today

"Load" the packages we will use today (output omitted)

- __you must run this code chunk after installing these packages__

```{r, message=FALSE}
library(tidyverse)
library(haven)
library(labelled)
library(lubridate)
```

__If a package is not yet installed__, you must install it before you load it. Install in the "console" rather than in the .Rmd file

- Generic syntax: `install.packages("package_name")`
- Install "tidyverse": `install.packages("tidyverse")`

Note: when we load a package, the name of the package is not in quotes; but when we install a package, the name of the package is in quotes:

- `install.packages("tidyverse")`
- `library(tidyverse)`

### Data we will use to introduce augmented vectors

```{r}
rm(list = ls()) # remove all objects

#load("../../data/prospect_list/western_washington_college_board_list.RData")
load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData"))

# will this get rid of warnings "Unknown or uninitialised column: 'ethn_code_fac'"?
wwlist <- wwlist %>% mutate(ethn_code_fac = NULL) # remove variable
```

# Attributes and augmented vectors

## Review data types and structures

### Review data structures: __Vectors__

\medskip

Two types of vectors:

1. __atomic vectors__
1. __lists__

\medskip

![Overview of data structures (Grolemund and Wickham, 2018)](data-structures-overview.png){width=60%}

### Review data structures: atomic vectors

\medskip

An __atomic vector__ is a collection of values

- each value in an atomic vector is an __element__
- all elements within vector must have same __data type__

```{r}
(a <- c(1,2,3)) # parentheses () assign and print object in one step
length(a) # length = number of elements
typeof(a) # numeric atomic vector, type=double
str(a) # investigate structure of object
```

Can assign __names__ to vector elements, creating a __named atomic vector__

```{r}
(b <- c(v1=1,v2=2,v3=3))
length(b)
typeof(b)
str(b)
```

### Review data structures: lists

\medskip

- Like atomic vectors, __lists__ are objects that contain __elements__
- However, __data type__ can differ across elements within a list
    - an element of a list can be another list

```{r}
list_a <- list(1,2,"apple")
typeof(list_a)
length(list_a)
str(list_a)

list_b <- list(1, c("apple", "orange"), list(1, 2))
length(list_b)
str(list_b)
```

### Review data structures: lists

Like atomic vectors, elements within a list can be named, thereby creating a __named list__

```{r}
# not named
str(list_b)

# named
list_c <- list(v1=1, v2=c("apple", "orange"), v3=list(1, 2, 3))
str(list_c)
```

### Review data structures: a data frame is a list

A __data frame__ is a list with the following characteristics:

- All the elements must be __vectors__ with the same __length__
- Data frames are __augmented lists__ because they have additional __attributes__

```{r}
#a regular list
(list_d <- list(col_a = c(1,2,3), col_b = c(4,5,6), col_c = c(7,8,9)))
typeof(list_d)
attributes(list_d)

#a data frame
(df_a <- data.frame(col_a = c(1,2,3), col_b = c(4,5,6), col_c = c(7,8,9)))
typeof(df_a)
attributes(df_a)
```

## Attributes and augmented vectors

### Atomic vectors versus augmented vectors

__Atomic vectors__ [our focus so far]

- I think of atomic vectors as "just the data"
- Atomic vectors
are the building blocks for augmented vectors

\medskip

__Augmented vectors__

- __Augmented vectors__ are atomic vectors with additional __attributes__ attached

__Attributes__

- __Attributes__ are additional "metadata" that can be attached to any object (e.g., vector or list)

Example: variables of a dataset

- a data frame is a list
- each element in the list is a variable, which consists of:
    - atomic vector ("just the data");
    - variable __name__, which is an attribute we attach to the element/variable
    - any other attributes we want to attach to element/variable

Other examples of attributes in R

- __value labels__: character labels (e.g., "Charter School") attached to numeric values
- __object class__: specifies how an object is treated by an object-oriented programming language

__Main takeaway__:

- Augmented vectors are atomic vectors (just the data) with additional attributes attached

### Attributes and functions to identify/modify attributes

Description of attributes from Grolemund and Wickham 20.6

- "Any vector can contain arbitrary additional __metadata__ through its __attributes__"
- "You can think of __attributes__ as named list of vectors that can be attached to any object"

Functions to identify and modify attributes

- `attributes()` function to describe all attributes of an object
- `attr()` to see an individual attribute of an object or set/change an individual attribute of an object

### attributes() function describes all attributes of an object

```{r, eval=FALSE}
?attributes
```

An atomic vector

```{r}
#vector with name attributes
(vector1 <- c(a = 1, b= 2, c= 3, d = 4))
attributes(vector1)

#remove all attributes from object
attributes(vector1) <- NULL
vector1
attributes(vector1)
```

### attributes() function, attributes of a variable in a dataset

\medskip

Accessing variable using `[[]]` subset operator

- recall `object_name[["element_name"]]` accesses contents of the element
- If object is a data frame, `df_name[["var_name"]]` accesses contents of variable
    - for
simple vars like `firstgen` syntax yields an atomic vector ("just the data")
- shorthand syntax for `df_name[["var_name"]]` is `df_name$var_name`

```{r}
str(wwlist[["firstgen"]])
attributes(wwlist[["firstgen"]])

str(wwlist$firstgen) # same result as wwlist[["firstgen"]]
attributes(wwlist$firstgen)
```

Accessing variable using `[]` subset operator

- `object_name["element_name"]` creates object of same type as `object_name`
- contains attributes of `object_name`, atomic vector associated with `element_name`, and any attributes associated with `element_name`

```{r, results="hide"}
str(wwlist["firstgen"])
attributes(wwlist["firstgen"])
```

### Attributes of lists and data frames

\medskip

```{r}
#attributes of a named list
list2 <- list(col_a = c(1,2,3), col_b = c(4,5,6))
str(list2)
attributes(list2)

#attributes of a data frame
list3 <- data.frame(col_a = c(1,2,3), col_b = c(4,5,6))
str(list3)
attributes(list3)
```

### attr() function: get or set specific attributes of an object

```{r, eval=FALSE, include=FALSE}
?attr
```

Syntax

- Get: `attr(x, which, exact = FALSE)`
- Set: `attr(x, which) <- value`

Arguments

- `x` an object whose attributes are to be accessed.
- `which` a non-empty character string specifying which attribute is to be accessed
- `exact` logical: should `which` be matched exactly? default is `exact = FALSE`
- `value` an object, new value of attribute, or NULL to remove attribute.

\medskip

Using `attr()` to __get__ specific attribute of an object

```{r}
(vector1 <- c(a = 1, b= 2, c= 3, d = 4))
attributes(vector1)

attr(x=vector1, which = "names", exact = FALSE)
attr(vector1, "names")
attr(vector1, "name") # partial name still matches "names" because exact = FALSE
attr(vector1, "name", exact = TRUE) # returns NULL: no attribute is named exactly "name"
```

### attr() function: get or set specific attributes of an object

```{r, eval=FALSE, include=FALSE}
?attr
```

Syntax

- Get: `attr(x, which, exact = FALSE)`
- Set: `attr(x, which) <- value`

Arguments

- `x` an object whose attributes are to be accessed.
- `which` a non-empty character string specifying which attribute is to be accessed
- `exact` logical: should `which` be matched exactly? default is `exact = FALSE`
- `value` an object, new value of attribute, or NULL to remove attribute.

\medskip

Using `attr()` to __set__ specific attribute of an object (output omitted)

```{r, results = "hide"}
(vector1 <- c(a = 1, b= 2, c= 3, d = 4))
attributes(vector1) # see all attributes

attr(x=vector1, which = "greeting") <- "Hi!" # create new attribute
attr(x=vector1, which = "greeting") # see attribute

attr(vector1, "farewell") <- "Bye!" # create attribute

attr(x=vector1, which = "names") # see names attribute
attr(x=vector1, which = "names") <- NULL # delete names attribute

attributes(vector1) # see all attributes
```

### Applying attr() to data frames

\medskip

Using `wwlist`, create data frame with three variables

```{r, results = "hide"}
wwlist_small <- wwlist[1:25, ] %>% select(hs_state,firstgen,med_inc_zip)
str(wwlist_small)
attributes(wwlist_small)
```

Get/set attribute of a data frame

```{r, results = "hide"}
#get/examine names attribute
attr(x=wwlist_small, which = "names")
str(attr(x=wwlist_small, which = "names")) # names attribute is character atomic vector, length=3

#add new attribute to data frame
attr(x=wwlist_small, which = "new_attribute") <- "contents of new attribute"
attributes(wwlist_small)
```

Get/set attribute of a variable in data frame

```{r, results = "hide"}
str(wwlist_small$med_inc_zip)
attributes(wwlist_small$med_inc_zip)

#create attribute for variable med_inc_zip
attr(wwlist_small$med_inc_zip, "inc attribute") <- "inc attribute contents"

#investigate attribute for variable med_inc_zip
attributes(wwlist_small$med_inc_zip)
str(wwlist_small$med_inc_zip)
attr(wwlist_small$med_inc_zip, "inc attribute")
```

### Why add attributes to data frame or variables of data frame?
Pedagogical reasons

- Important to know how you can apply `attributes()` and `attr()` to data frames and to variables within data frames

\medskip

Example practical application: interactive dashboards

- When creating a "dashboard" you might want to add "tooltips"
    - a "tooltip" is a message that appears when the cursor is positioned over an icon
    - The text in the tooltip message is the contents of an attribute
- Example dashboard: [LINK](https://jkcf.shinyapps.io/dashboard/)

### Student exercises

1. Using "wwlist", create a data frame of 30 observations with three variables: "state", "zip5", "pop_total_zip".
2. Describe all attributes of the new data frame; get the names attribute of the new data frame.
3. Add a new attribute to the data frame: name: "attribute_data", content: "new attribute of data"; then investigate the attributes and get the new attribute of the data frame.
4. Get the attributes of the variable pop_total_zip.
5. Add a new attribute to the variable pop_total_zip: name: "attribute_variable", content: "new attribute of variable"; then investigate the attributes and get the new attribute of the variable.
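
### Sketch: tooltip text stored in an attribute

The dashboard application described earlier can be sketched in a few lines. This is only an illustration under stated assumptions: the data frame `dash_df`, its variable `med_inc`, and the attribute name `"tooltip"` are all hypothetical and are not taken from `wwlist` or from the linked dashboard.

```{r}
# Hypothetical data frame and attribute name; not from wwlist
dash_df <- data.frame(med_inc = c(52000, 61000, 48000))

# attach the tooltip message to the variable as an attribute
attr(dash_df$med_inc, "tooltip") <- "Median household income in zip code"

# a dashboard could later retrieve the message by attribute name
tooltip_text <- attr(dash_df$med_inc, "tooltip", exact = TRUE)
tooltip_text

# subsetting an atomic vector with [ ] drops custom attributes
attributes(dash_df$med_inc[1:2])
```

One caveat if you rely on this pattern: custom attributes are easy to lose, so it is worth checking `attributes()` again after transforming a variable.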
### Solution to student exercises

```{r, results = "hide"}
wwlist_exercise <- wwlist[1:30, ] %>% select(state,zip5,pop_total_zip)

attributes(wwlist_exercise)
attr(x=wwlist_exercise, which = "names")

attr(x=wwlist_exercise, which = "attribute_data") <- "new attribute of data"
attributes(wwlist_exercise)
attr(wwlist_exercise, which ="attribute_data")

attributes(wwlist_exercise$pop_total_zip)

attr(wwlist_exercise$pop_total_zip, "attribute_variable") <- "new attribute of variable"
attributes(wwlist_exercise$pop_total_zip)
attr(wwlist_exercise$pop_total_zip, "attribute_variable")
```

# Object class

### Object class

\medskip

Every object in R has a __class__

- class is an __attribute__ of an object
- Object class controls how functions work; defines rules for how object can be treated by object oriented programming language
    - e.g., which functions you can apply to object of a particular class
    - e.g., what the function does to one object class, what it does to another object class

Many ways to identify object class

- Simplest is `class()` function

```{r}
(vector2 <- c(a = 1, b= 2, c= 3, d = 4))
typeof(vector2)
class(vector2)
```

When I encounter a new object I often investigate object by applying `typeof()`, `class()`, and `attributes()` functions

```{r}
typeof(vector2)
class(vector2)
attributes(vector2)
```

### Why is object class important?
Functions care about object __class__, not object __type__

\medskip

Specific functions usually work with only particular __classes__ of objects

- e.g., "date" functions usually only work on objects with a date class
- "string" functions usually only work on objects with a character class
- Functions that do mathematical computation usually work on objects with a numeric class

### Functions care about object __class__, not object __type__

\medskip

Example: `sum()` applies to __numeric__, __logical__, or __complex__ class objects

```{r, eval=FALSE, include=FALSE}
?sum
```

- Apply `sum()` to __logical__ and __numeric__ class

```{r}
(x <- c(TRUE,FALSE,NA,TRUE)) # class = logical
typeof(x)
class(x)
sum(x, na.rm = TRUE)

# class = numeric
typeof(wwlist$med_inc_zip)
class(wwlist$med_inc_zip)
wwlist$med_inc_zip[1:5]
sum(wwlist$med_inc_zip[1:5], na.rm = TRUE)
```

- What happens when we apply `sum()` to an object with class = __character__?

```{r, eval=FALSE}
typeof(wwlist$hs_city)
class(wwlist$hs_city)
wwlist$hs_city[1:5]
sum(wwlist$hs_city[1:5], na.rm = TRUE)
```

### Functions care about object __class__, not object __type__

Date functions can be applied to objects with a date-time class

- date-time objects have __type__ = numeric
- date-time objects have __class__ = date or date-time

Example: `year()` function from `lubridate` package

```{r, include=FALSE, results = "hide"}
library(lubridate)
```

```{r, eval=FALSE, include=FALSE}
?year
```

- apply `year()` to object with __class__ = date

```{r}
wwlist$receive_date[1:5]
typeof(wwlist$receive_date)
class(wwlist$receive_date)
year(wwlist$receive_date[1:5])
```

- apply `year()` to object with __class__ = numeric

```{r, eval=FALSE}
typeof(wwlist$med_inc_zip)
class(wwlist$med_inc_zip)
year(wwlist$med_inc_zip[1:10])
```

### Functions care about object __class__, not object __type__

Most string functions are intended to apply to objects with a __character__ class.
- __type__ = character
- __class__ = character

Example: `tolower()` function

- syntax: `tolower(x)`
- where argument `x` is "a character vector, or an object that can be coerced to character by `as.character()`"

```{r, eval=FALSE, include=FALSE}
?tolower
```

Apply `tolower()` to character class object

```{r}
str(wwlist$hs_city)
typeof(wwlist$hs_city)
class(wwlist$hs_city)

wwlist$hs_city[1:6]
tolower(wwlist$hs_city[1:6])
```

### Class and object-oriented programming

R is an object-oriented programming language

\medskip

Definition of object oriented programming from this [LINK](https://www.webopedia.com/TERM/O/object_oriented_programming_OOP.html)

\medskip

> "Object-oriented programming (OOP) refers to a type of computer programming in which programmers define not only the data type of a data structure, but also the types of operations (functions) that can be applied to the data structure."

\medskip

Object __class__ is fundamental to object oriented programming because:

- object class determines which functions can be applied to the object
- object class also determines what those functions do to the object
    - e.g., a specific function might do one thing to objects of __class__ A and another thing to objects of __class__ B
    - What a function does to objects of different class is determined by whoever wrote the function

\medskip

Many different object classes exist in R

- You can also create your own classes
    - Example: the `labelled` class is an object class created by Hadley Wickham when he created the `haven` package
- In this course we will work with classes that have been created by others

# Class == factor

### Recoding variable ethn_code from data frame wwlist

Want to recode variable `ethn_code` so that it has fewer categories

```{r, results= "hide"}
wwlist <- wwlist %>%
  mutate(ethn_code =
    recode(ethn_code,
      "american indian or alaska native" = "nativeam",
      "asian or native hawaiian or other pacific islander" = "api",
      "black or african american" = "black",
      "cuban" = "latinx",
      "mexican/mexican american" = "latinx",
      "not reported" = "not_reported",
      "other-2 or more" = "multirace",
      "other spanish/hispanic" = "latinx",
      "puerto rican" = "latinx",
      "white" = "white"
    )
  )

str(wwlist$ethn_code)
wwlist %>% count(ethn_code)
```

### Factors

__Factors__ are an object _class_ used to display categorical data (e.g., marital status)

- A factor is an __augmented vector__ built by attaching a "levels" attribute to an (atomic) integer vector

Usually, we would prefer a categorical variable (e.g., race, school type) to be a factor variable rather than a character variable

- So far in the course I have made all categorical variables character variables because we had not introduced factors yet

Create factor version of character variable `ethn_code` using base R `factor()` function

```{r}
str(wwlist$ethn_code)
class(wwlist$ethn_code)

# create factor var; tidyverse approach
wwlist <- wwlist %>% mutate(ethn_code_fac = factor(ethn_code))
#wwlist$ethn_code_fac <- factor(wwlist$ethn_code) # base r approach

str(wwlist$ethn_code)
str(wwlist$ethn_code_fac)
```

### Factors

A factor is an __augmented vector__ built by attaching a "levels" attribute to an (atomic) integer vector

Compare (character) `ethn_code` to (factor) `ethn_code_fac` (output omitted)

- Character variable `ethn_code`

```{r, results="hide"}
typeof(wwlist$ethn_code)
class(wwlist$ethn_code)
attributes(wwlist$ethn_code)
str(wwlist$ethn_code)
```

- Factor variable `ethn_code_fac`

```{r, results="hide"}
typeof(wwlist$ethn_code_fac)
class(wwlist$ethn_code_fac)
attributes(wwlist$ethn_code_fac)
str(wwlist$ethn_code_fac)
```

Main things to note about variable `ethn_code_fac`

- has `type=integer`
- has `class=factor` (stored in a "class" attribute) and a "levels" attribute
- Underlying data are integers, but the levels attribute is used to display the data.
```{r}
wwlist$ethn_code_fac[1:5] # print first few obs of ethn_code_fac
```

### Working with factor variables

When displaying data, factor variables display values of the __levels attribute__ rather than underlying integer values of the variable

```{r}
wwlist %>% count(ethn_code_fac)
```

### Working with factor variables

Apply `as.integer()` function to display underlying integer values of factor variable

```{r, eval=FALSE, include=FALSE}
?as.integer
```

Investigate `as.integer()` function

```{r}
typeof(wwlist$ethn_code_fac)
class(wwlist$ethn_code_fac)

typeof(as.integer(wwlist$ethn_code_fac))
class(as.integer(wwlist$ethn_code_fac))
```

Display underlying integer values of variable `ethn_code_fac`

```{r}
wwlist %>% count(as.integer(ethn_code_fac))
```

### Working with factor variables

\medskip

Refer to categories of a factor (e.g., when filtering obs) using values of the __levels attribute__ rather than underlying values of the variable

- values of the __levels attribute__ for `ethn_code_fac` (output omitted)

```{r, results='hide'}
attributes(wwlist$ethn_code_fac)
```

\medskip

__Task__: count the number of prospects in `wwlist` who identify as "white"

```{r}
# referring to variable value; this doesn't work
wwlist %>% filter(ethn_code_fac==7) %>% count()

#referring to value of level attribute; this works
wwlist %>% filter(ethn_code_fac=="white") %>% count()
```

### Working with factor variables

__Task__: count the number of prospects in `wwlist` who identify as "white"

- To refer to underlying integer values, apply `as.integer()` function to factor variable

```{r}
attributes(wwlist$ethn_code_fac)
wwlist %>% filter(as.integer(ethn_code_fac)==7) %>% count
```

### How to identify the variable values associated with factor levels

Create a factor version of the character variable `psat_range`

```{r, results="hide", warning = FALSE}
wwlist %>% count(psat_range)

wwlist <- wwlist %>% mutate(psat_range_fac = factor(psat_range))
wwlist %>% count(psat_range_fac)
attributes(wwlist$psat_range_fac)
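
# Sketch added for illustration; not part of the original lecture code.
# level_key() is a hypothetical helper that pairs each factor level with the
# underlying integer value that as.integer() returns for that level.
level_key <- function(f) {
  data.frame(value = seq_len(nlevels(f)), level = levels(f))
}
level_key(factor(c("white", "api", "black"))) # toy factor, not from wwlist
# the same helper could be applied to a lecture variable, e.g.,
# level_key(wwlist$psat_range_fac)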
```

Investigate values associated with factor levels using `levels()` and `nlevels()`

```{r, results="hide"}
levels(wwlist$psat_range_fac) #starts at 1
nlevels(wwlist$psat_range_fac) #7 levels total
levels(wwlist$psat_range_fac)[1:3] #prints levels 1-3
```

Once values associated with factor levels are known:

- Can filter based on underlying integer values using `as.integer()`

```{r}
wwlist %>% filter(as.integer(psat_range_fac)==4) %>% count()
```

- Or filter based on value of __factor levels__

```{r}
wwlist %>% filter(psat_range_fac=="1270-1520") %>% count()
```

### Creating factor variables from character variables or from integer variables

See Appendix

### Factor student exercise

1. After running the code below, use `typeof`, `class`, `str`, and `attributes` functions to check the new variable `receive_year`
2. Create a factor variable from the input variable `receive_year` and name it `receive_year_fac`
3. Run the same functions (`typeof`, `class`, etc.) from the first question using the new variable you created
4. Get a count of `receive_year_fac`. __hint:__ you could also run this in the console to see values associated with each factor

Run this code to create a year variable from the input variable "receive_date"

```{r, results="hide", message=FALSE}
#wwlist %>% glimpse()
library(lubridate) #load library if you haven't already

wwlist <- wwlist %>% mutate(receive_year = year(receive_date)) #creating year variable with the lubridate package

#Check variable
wwlist %>% count(receive_year)
wwlist %>% group_by(receive_year) %>% count(receive_date)
```

### Factor student exercise solutions

1. Use `typeof`, `class`, `str`, and `attributes` functions to check the new variable `receive_year`

```{r}
typeof(wwlist$receive_year)
class(wwlist$receive_year)
str(wwlist$receive_year)
attributes(wwlist$receive_year)
```

### Factor student exercise solutions

2. Now create a factor variable from the input variable `receive_year` and name it `receive_year_fac`

```{r}
# create factor var; tidyverse approach
wwlist <- wwlist %>% mutate(receive_year_fac = factor(receive_year))
```

### Factor student exercise solutions

3. Run the same functions (`typeof`, `class`, etc.) from the first question using the new variable you created

```{r}
typeof(wwlist$receive_year_fac)
class(wwlist$receive_year_fac)
str(wwlist$receive_year_fac)
attributes(wwlist$receive_year_fac)
```

### Factor student exercise solutions

4. Get a count of `receive_year_fac`. __hint:__ you could also run this in the console to see values associated with each factor

```{r}
wwlist %>% count(receive_year_fac)
```

# Class == labelled

### Data we will use to introduce `labelled` class

High school longitudinal surveys from National Center for Education Statistics (NCES)

- Follow U.S. students from high school through college, labor market

\medskip

We will be working with [High School Longitudinal Study of 2009 (HSLS:09)](https://nces.ed.gov/surveys/hsls09/index.asp)

- Follows 9th graders from 2009
- Data collection waves
    - Base Year (2009)
    - First Follow-up (2012)
    - 2013 Update (2013)
    - High School Transcripts (2013-2014)
    - Second Follow-up (2016)

### Using `haven` package to read SAS/SPSS/Stata datasets into R

[`haven`](https://haven.tidyverse.org/), which is part of __tidyverse__, "enables R to read and write various data formats" from the following statistical packages:

- SAS
- SPSS
- Stata

\medskip

When using `haven` to read data, resulting R objects have these characteristics:

- Data frames are __tibbles__, Tidyverse's preferred __class__ of data frames
- Variables with "value labels" are transformed into the `labelled` class
    - `labelled` is an object class, just like `factor` is an object class
    - `labelled` is an object __class__ created by folks who created `haven` package
    - `labelled` and `factor` classes are both viable alternatives for categorical variables
    - Helpful description of `labelled` class [HERE](https://haven.tidyverse.org/articles/semantics.html)
- Dates and times converted to R date/time classes
- Character vectors not converted to factors

### Using `haven` package to read SAS/SPSS/Stata datasets into R

Use `read_dta()` function from `haven` package to import Stata dataset into R

```{r, results="hide"}
hsls <- read_dta(file="https://github.com/ozanj/rclass/raw/master/data/hsls/hsls_stu_small.dta")
```

__Must__ run this code chunk; permanently changes uppercase variable names to lowercase

```{r, results="hide"}
names(hsls)
names(hsls) <- tolower(names(hsls)) # convert names to lowercase
names(hsls) # names now lowercase

str(hsls) # ugh
```

Investigate variable `s3classes` from data frame `hsls`

- identifies whether respondent was taking postsecondary classes as of 11/1/2013

```{r, results="hide"}
typeof(hsls$s3classes)
class(hsls$s3classes)
str(hsls$s3classes)
```

Investigate attributes of `s3classes`

```{r, results="hide"}
attributes(hsls$s3classes) # all attributes

#specific attributes: using syntax: attr(x, which, exact = FALSE)
attr(x=hsls$s3classes, which = "label") # label attribute
attr(x=hsls$s3classes, which = "labels") # labels attribute
```

### What is object class = `labelled`?
\medskip

__value labels__ [in Stata] are labels attached to specific values of a variable, e.g.:

- var value `1` attached to value label "married", `2`="single", `3`="divorced"

\medskip

`labelled` is object class for importing vars with __value labels__ from SAS/SPSS/Stata

- `labelled` object class created by `haven` package
- Characteristics of variables in R data frame with `class==labelled`:
    - data `type` can be numeric (double) or character
    - To see `value labels` associated with each value:
        - `attr(df_name$var_name,"labels")`
        - e.g., `attr(hsls$s3classes,"labels")`

Investigate the attributes of `hsls$s3classes`

```{r, results="hide"}
typeof(hsls$s3classes)
class(hsls$s3classes)
str(hsls$s3classes)
attributes(hsls$s3classes)
```

Use `attr(object_name,"attribute_name")` to refer to each attribute

```{r, results="hide"}
attr(hsls$s3classes,"label")
attr(hsls$s3classes,"format.stata")
attr(hsls$s3classes,"class")
attr(hsls$s3classes,"labels")
```

### `labelled` package

Purpose of the `labelled` package is to work with data imported from SPSS/Stata/SAS using the `haven` package.

- `labelled` package contains functions to work with objects that have `labelled` class
- From package documentation:
    - "purpose of the `labelled` package is to provide functions to manipulate _metadata_ as variable labels, value labels and defined missing values using the `labelled` class and the `label` attribute introduced in `haven` package."
- More info on the `labelled` package: [LINK](https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html) Functions in `labelled` package - [Full list](https://www.rdocumentation.org/packages/labelled/versions/1.1.0) ```{r, eval=FALSE, include=FALSE} ?labelled ``` - A couple relevant functions - `val_labels`: get or set variable _value labels_ - `var_label`: get or set a _variable label_ ```{r, results="hide"} attributes(hsls$s3classes) hsls %>% select(s3classes) %>% var_label() hsls %>% select(s3classes) %>% val_labels() ``` ### Working with `labelled` class data \medskip Variable `type` and `class` ```{r} typeof(hsls$s3classes) class(hsls$s3classes) ``` \medskip Show variable labels using `var_label()` function ```{r} hsls %>% select(s3classes,s3clglvl) %>% var_label ``` \medskip Show value labels associated with variable values using `val_labels()` (output omitted) ```{r, results="hide"} hsls %>% select(s3classes,s3clglvl) %>% val_labels ``` ### Working with `labelled` class data \medskip Create frequency tables with `labelled` class variables using `count()` - Default setting is to show variable __values__ not __value labels__ ```{r} hsls %>% count(s3classes) ``` \medskip To make frequency table show __value labels__ add `%>% as_factor()` to pipe - `as_factor()` is function from `haven` that converts an object to a factor ```{r} hsls %>% count(s3classes) %>% as_factor() ``` ### Working with `labelled` class data To isolate values of `labelled` class variables in `filter()` function: - refer to variable __value__, not the __value label__ __Task__ - how many observations in var `s3classes` associated with "Unit non-response" - how many observations in var `s3classes` associated with "Yes" General steps to follow: 1. investigate object 1. 
use filter to isolate desired observations Investigate object ```{r, results="hide"} class(hsls$s3classes) hsls %>% select(s3classes) %>% var_label #show variable label hsls %>% select(s3classes) %>% val_labels #show value label hsls %>% count(s3classes) # freq table, values hsls %>% count(s3classes) %>% as_factor() # freq table, value labels ``` filter specific values ```{r, results="hide"} hsls %>% filter(s3classes==-8) %>% count() # -8 = unit non-response hsls %>% filter(s3classes==1) %>% count() # 1 = yes ``` ## Set variable and value labels ### Functions to set variable labels and value labels `set_variable_labels()` function from the `labelled` package sets variable labels ```{r, eval=FALSE} ?set_variable_labels ``` \bigskip `set_value_labels()` function from `labelled` package sets value labels ```{r, eval=FALSE} ?set_value_labels ``` ### Set variable and value labels Let's create a tibble first ```{r} df <- tribble( ~id, ~edu, ~sch, #--|--|---- 1, 2, 2, 2, 1, 1, 3, 3, 2, 4, 4, 2, 5, 1, 2 ) df str(df) ``` ### Set variable labels Use `set_variable_labels` function to manually set variable labels - syntax: `set_variable_labels(variable = "Variable label")` ```{r} str(df) class(df$sch) #set variabel labels df <- df %>% set_variable_labels( id = "Unique identification number", edu = "Education level", sch = "Type of school attending" ) str(df) class(df$sch) ``` ### Set value labels Use `set_value_labels` function to manually set value labels - syntax: `set_value_labels(var_name = c("val label" = 1, "val label" = 2))` ```{r} class(df$sch) df <- df %>% set_value_labels( edu = c("High School" = 1, "AA degree" = 2, "BA degree" = 3, "MA or higher" = 4), sch = c("Private" = 1, "Public" = 2)) attributes(df$sch) ``` ### Set value labels Now we can analyze data using tools we already introduced \medskip Create frequency tables with `labelled` class variables using `count()` - Default setting is to show variable __values__ not __value labels__ ```{r} df %>% count(edu) df 
%>% select(edu) %>% val_labels ``` \medskip To make frequency table show __value labels__ add `%>% as_factor()` to pipe - `as_factor()` is function from `haven` that converts an object to a factor ```{r} df %>% count(edu) %>% as_factor() ``` ### Examples: Set Variable labels ```{r} # see variable labels wwlist %>% select(pop_total_zip, pop_total_state) %>% var_label # set variable labels wwlist <- wwlist %>% set_variable_labels( pop_total_zip = "total population in zip", pop_total_state ="total population in state" ) # attribute of variable attributes(wwlist$pop_total_zip) attributes(wwlist$pop_total_state) ``` ### Examples: Set value labels ```{r} # see value labels str(wwlist$sex) wwlist %>% select(sex) %>% val_labels # set value labels to sex varaible wwlist <- wwlist %>% set_value_labels( sex = c("Female" = "F", "Male" = "M" ) ) # attribute of sex variable attributes(wwlist$sex) ``` ### Labelled student exercise 1. Get variable and value labels of `s3hs` 2. Get a count of the variable showing the values and the value labels. __hint__ use factor() 3. Filter if value is associated with "Missing" 4. Filter if value is associated with "Missing" or "Unit non-response" 5. Add variable label of `pop_asian_zip` &`pop_asian_state` in data frame "wwlist" 6. Add value label of `ethn_code` in data frame "wwlist" ### Labelled student exercise solutions 1. Get variable and value labels of `s3hs` ```{r} hsls %>% select(s3hs) %>% var_label() hsls %>% select(s3hs) %>% val_labels() ``` ### Labelled student exercise solutions 2. Get a count of the variable `s3hs` showing the value labels. __hint__ use factor() ```{r} hsls %>% count(s3hs) hsls %>% count(s3hs) %>% as_factor() ``` ### Labelled student exercise solutions 3. Filter if value is associated with "Missing" ```{r} hsls %>% filter(s3hs== -9) %>% count() ``` ### Labelled student exercise solutions 4. 
Filter if value is associated with "Missing" or "Unit non-response"

```{r}
hsls %>% filter(s3hs== -9 | s3hs== -8) %>% count()
```

### Labelled student exercise solutions

5. Add variable labels for `pop_asian_zip` & `pop_asian_state` in data frame "wwlist"

```{r}
# variable labels
wwlist %>% select(pop_asian_zip, pop_asian_state) %>% var_label

# set variable labels
wwlist <- wwlist %>%
  set_variable_labels(
    pop_asian_zip = "total asian population in zip",
    pop_asian_state = "total asian population in state"
  )

# attribute of variable
attributes(wwlist$pop_asian_zip)
attributes(wwlist$pop_asian_state)
```

### Labelled student exercise solutions

6. Add value labels for `ethn_code` in data frame "wwlist"

```{r, results="hide"}
# count
wwlist %>% count(ethn_code)

# value labels
wwlist %>% select(ethn_code) %>% val_labels

# set value labels to ethn_code variable
wwlist <- wwlist %>%
  set_value_labels(
    ethn_code = c("asian or native hawaiian or other pacific islander" = "api",
                  "black or african american" = "black",
                  "cuban or mexican/mexican american or other spanish/hispanic or puerto rican" = "latinx",
                  "other-2 or more" = "multirace",
                  "american indian or alaska native" = "nativeam",
                  "not reported" = "not_reported",
                  "white" = "white"
    )
  )
```

# Comparing labelled class to factor class

### Comparing `class==labelled` to `class==factor`

| | `class==labelled` | `class==factor` |
|---|----------|-------------|
| data type | numeric or character | integer |
| name of value label attribute | labels | levels |
| refer to data using | variable values | levels attribute |

\bigskip
So should you work with `class==labelled` or `class==factor`?

- no right or wrong; this is a subjective decision
- personally, I prefer `labelled` class
    - easier to control underlying variable value
    - feels more suited to working with survey data variables, in which several different values represent different kinds of "missing"

### Converting `class==labelled` to `class==factor`

The `as_factor()` function from the `haven` package converts variables with `class==labelled` to `class==factor`

- Can be used for descriptive statistics

```{r, results="hide"}
hsls %>% select(s3classes) %>% count(s3classes)
hsls %>% select(s3classes) %>% count(s3classes) %>% as_factor()
```

- Can create object with some or all `labelled` vars converted to `factor`

```{r}
hsls_f <- as_factor(hsls,only_labelled = TRUE)
```

Let's examine this object
```{r, results="hide"}
glimpse(hsls_f)
hsls_f %>% select(s3classes,s3clglvl) %>% str()
typeof(hsls_f$s3classes)
class(hsls_f$s3classes)
attributes(hsls_f$s3classes)

hsls_f %>% select(s3classes) %>% var_label()
hsls_f %>% select(s3classes) %>% val_labels()
```

### Working with `class==factor` data

Showing factor levels associated with a factor variable
```{r}
hsls_f %>% count(s3classes)
```

Showing variable values associated with a factor variable
```{r}
hsls_f %>% count(as.integer(s3classes))
```

### Working with `class==factor` data

When sub-setting observations (e.g., filtering), refer to the `levels` attribute, not the variable value

```{r}
hsls_f %>% filter(s3classes=="Yes") %>% count(s3classes)
```

### Converting `class==factor` to `class==labelled`

I haven't figured out how to do this yet; the code below is an unevaluated attempt using `to_labelled()`.

```{r, eval=FALSE}
glimpse(hsls)
glimpse(hsls_f)
hsls_f_l <- to_labelled(x = hsls_f)

to_labelled(to_factor(hsls))
```

```{r, eval=FALSE}
glimpse(hsls_f_l)
```

# Appendix. Creating factor variables

### Create factors [from string variables]

To create a factor variable from a string variable

1. create a character vector containing underlying data
2. create a vector containing valid levels
3. 
Attach levels to the data using the `factor()` function

```{r}
#underlying data: months in which my family members were born
x1 <- c("Jan", "Aug", "Apr", "Mar")
#create vector with valid levels
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
#attach levels to data
x2 <- factor(x1, levels = month_levels)
```

Note how attributes differ
```{r}
str(x1)
str(x2)
```

Sorting differs
```{r}
sort(x1)
sort(x2)
```

### Create factors [from string variables]

Let's create a character version of variable `hs_state` and then turn it into a factor

```{r, eval=FALSE}
#wwlist %>%
#  count(hs_state)

#Subset obs to West Coast states
wwlist_temp <- wwlist %>%
  filter(hs_state %in% c("CA", "OR", "WA"))

#Create character version of high school state for West Coast states only
wwlist_temp$hs_state_char <- as.character(wwlist_temp$hs_state)

#investigate character variable
str(wwlist_temp$hs_state_char)
table(wwlist_temp$hs_state_char)

#create new variable that assigns levels
wwlist_temp$hs_state_fac <- factor(wwlist_temp$hs_state_char, levels = c("CA","OR","WA"))
str(wwlist_temp$hs_state_fac)
attributes(wwlist_temp$hs_state_fac)

#wwlist_temp %>%
#  count(hs_state_fac)

rm(wwlist_temp)
```

### Create factors [from string variables]

How the `levels` argument works when underlying data is character

- Matches value of underlying data to value of the level attribute
- Converts underlying data to integer, with level attribute attached

\medskip
See chapter 15 of Wickham for more on factors (e.g., modifying factor order, modifying factor levels)

### Creating factors [from integer vectors]

Factors are just integer vectors with level attributes attached to them. So, to create a factor:

1. create a vector for the underlying data
2. create a vector that has level attributes
3. 
Attach levels to the data using the `factor()` function

```{r}
a1 <- c(1,1,1,0,1,1,0) #a vector of data
a2 <- c("zero","one") #a vector of labels

#attach labels to values
a3 <- factor(a1, labels = a2)
a3
str(a3)
```

Note: By default, `factor()` sorts the unique values of `a1` and assigns the labels in that order, so "zero" (the first element of `a2`) was attached to the lowest value of `a1`

### Creating factors [from integer vectors]

Let's turn an integer variable into a factor variable in the `wwlist` data frame

Create integer version of `receive_year`
```{r}
#typeof(wwlist_temp$receive_year)
wwlist$receive_year_int <- as.integer(wwlist$receive_year)
str(wwlist$receive_year_int)
typeof(wwlist$receive_year_int)
```

Assign levels to values of integer variable
```{r, eval=FALSE}
wwlist$receive_year_fac <- factor(wwlist$receive_year_int,
  labels=c("Twenty-sixteen","Twenty-seventeen","Twenty-eighteen"))
str(wwlist$receive_year_fac)
str(wwlist$receive_year)

#Check variable
wwlist %>%
  count(receive_year_fac)

wwlist %>%
  count(receive_year)
```
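
### Creating factors [from integer vectors]

A minimal sketch of one way to make the value-to-label mapping explicit rather than relying on sort order: supply both `levels` and `labels` to base R `factor()`. The vector `yr` and its values are hypothetical, made up for illustration, and are not taken from `wwlist`.

```{r}
# hypothetical data for illustration only (not taken from wwlist)
yr <- c(2017L, 2016L, 2018L, 2016L)

# levels lists the underlying values; labels names them in the same order
yr_fac <- factor(yr,
  levels = c(2016L, 2017L, 2018L),
  labels = c("Twenty-sixteen", "Twenty-seventeen", "Twenty-eighteen"))

yr_fac
levels(yr_fac)      # the labels become the levels attribute
as.integer(yr_fac)  # integer codes give each value's position in levels
```

With an explicit `levels` argument, a value that is not listed becomes `NA` instead of changing which label is paired with which value.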