--- title: "Lecture 7: Guidelines for Investigating, cleaning, and creating variables" # potentially push to header subtitle: "Managing and Manipulating Data Using R" author: date: classoption: dvipsnames # for colors fontsize: 8pt urlcolor: blue output: beamer_presentation: keep_tex: true toc: false slide_level: 3 theme: default # AnnArbor # push to header? #colortheme: "dolphin" # push to header? #fonttheme: "structurebold" highlight: tango # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header df_print: tibble #default # tibble # push to header? latex_engine: xelatex # Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts. includes: in_header: ../beamer_header.tex #after_body: table-of-contents.txt --- ```{r, echo=FALSE, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##" ``` # Introduction ### What we will do today \tableofcontents ### Libraries "Load" the package we will use today (output omitted) - __you must run this code chunk after installing these packages__ ```{r, message=FALSE} library(tidyverse) library(haven) library(labelled) ``` __If package not yet installed__, then must install before you load. Install in "console" rather than .Rmd file - Generic syntax: `install.packages("package_name")` - Install "tidyverse": `install.packages("tidyverse")` Note: when we load package, name of package is not in quotes; but when we install package, name of package is in quotes: - `install.packages("tidyverse")` - `library(tidyverse)` ### Data Use `read_dta()` function from `haven` to import Stata dataset into R ```{r, results="hide"} hsls <- read_dta(file="https://github.com/ozanj/rclass/raw/master/data/hsls/hsls_stu_small.dta") ``` Let's examine the data [you __must__ run this code chunk] ```{r, results="hide"} hsls %>% names() hsls %>% names() %>% str() hsls %>% names() %>% tolower() %>% str() names(hsls) <- tolower(names(hsls)) # convert names to lowercase names(hsls) str(hsls) # ugh str(hsls$s3classes) attributes(hsls$s3classes) typeof(hsls$s3classes) class(hsls$s3classes) ``` Download the HSLS Codebook: https://nces.ed.gov/pubs2014/2014361_AppendixI.pdf # Exploratory data analysis (EDA) ### What is exploratory data analysis (EDA)? The [Towards Data Science](https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15) website has a nice definition of EDA: > "Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns,to spot anomalies,to test hypothesis and to check assumptions with the help of summary statistics" __This course focuses on "data management":__ - investigating and cleaning data for the purpose of creating analysis variables - Basically, everything that happens __before__ you conduct analyses __I think about "exploratory data analysis for data quality"__ - Investigating values and patterns of variables from "input data" - Identifying and cleaning errors or values that need to be changed - Creating analysis variables - Checking values of analysis variables agains values of input variables ### How we will teach exploratory data analysis Will teach exploratory data analysis (EDA) in two sub-sections: 1. Introduce "Tools of EDA": - Demonstrate code to investigate variables and relatioship between variables - Most of these tools are just the application of programming skills you have already learned 1. Provide "Guidelines for EDA" - Less about coding, more about practices you should follow and mentality necessary to ensure high data quality ## Tools for EDA ### Tools of EDA __To do EDA for data quality, must master the following tools:__ - \medskip __Select, sort, filter, and print__ in order to see data patterns, anomolies - Select and sort particular values of particular variables - Print particular values of particular variables - __One-way descriptive analyses__ (i.e,. focus on one variable) - Descriptive analyses for continuous variables - Descriptive analyses for discreet/categorical variables - __Two-way descriptive analyses__ (relationship between two variables) - Categorical by categorical - Categorical by continuous - Continuous by continuous Whenever using any of these tools, __pay close attention to missing values and how they are coded__ - Often, the "input" variables don't code missing values as `NA` - Especially when working with survey data, missing values coded as a negative number (e.g., `-9`,`-8`,`-4`) with different negative values representing different reasons for data being missing - sometimes missing values coded as very high positive numbers - Therefore, important to investigate input vars prior to creating analysis vars ### Tools of EDA First, Let's create a smaller version of the HSLS:09 dataset ```{r} #hsls %>% var_label() hsls_small <- hsls %>% select(stu_id,x3univ1,x3sqstat,x4univ1,x4sqstat,s3classes, s3work,s3focus,s3clgft,s3workft,s3clgid,s3clgcntrl, s3clglvl,s3clgsel,s3clgstate,s3proglevel,x4evrappclg, x4evratndclg,x4atndclg16fb,x4ps1sector,x4ps1level, x4ps1ctrl,x4ps1select,x4refsector,x4reflevel,x4refctrl, x4refselect, x2sex,x2race,x2paredu,x2txmtscor,x4x2ses,x4x2sesq5) ``` ```{r, results="hide"} names(hsls_small) hsls_small %>% var_label() ``` ### Tools of EDA: select, sort, filter, and print We've already know `select()`, `arrange()`, `filter()` \medskip Select, sort, and print specific vars ```{r, results="hide"} #sort and print hsls_small %>% arrange(desc(stu_id)) %>% select(stu_id,x3univ1,x3sqstat,s3classes,s3clglvl) #investigate variable attributes hsls_small %>% arrange(desc(stu_id)) %>% select(stu_id,x3univ1,x3sqstat,s3classes,s3clglvl) %>% str() #print observations with value labels rather than variable values hsls_small %>% arrange(desc(stu_id)) %>% select(stu_id,x3univ1,x3sqstat,s3classes,s3clglvl) %>% as_factor() ``` Sometimes helpful to increase the number of observations printed ```{r, results="hide"} class(hsls_small) #it's a tibble, which is the "tidyverse" version of a data frame options(tibble.print_min=50) # execute this in console hsls_small %>% arrange(desc(stu_id)) %>% select(stu_id,x3univ1,x3sqstat,s3classes,s3clglvl) options(tibble.print_min=10) # set default printing back to 10 lines ``` ### One-way descriptive stats for continuous vars, Base R approach [SKIP] ```{r, results="hide"} mean(hsls_small$x2txmtscor) sd(hsls_small$x2txmtscor) #Careful: summary stats include value of -8! min(hsls_small$x2txmtscor) max(hsls_small$x2txmtscor) ``` Be careful with `NA` values ```{r, results="hide"} #Create variable replacing -8 with NA hsls_small_temp <- hsls_small %>% mutate(x2txmtscorv2=ifelse(x2txmtscor==-8,NA,x2txmtscor)) hsls_small_temp %>% filter(is.na(x2txmtscorv2)) %>% count(x2txmtscorv2) mean(hsls_small_temp$x2txmtscorv2) mean(hsls_small_temp$x2txmtscorv2, na.rm=TRUE) rm(hsls_small_temp) ``` ### One-way descriptive stats for continuous vars, Tidyverse approach Use `summarise_at()`, a variation of `summarise()`, to make descriptive stats - `.args=list(na.rm=TRUE)`= a named list of additional arguments to be added to all function calls __Task__: - calculate descriptive stats for `x2txmtscor`, math test score ```{r, warning=FALSE} #?summarise_at hsls_small %>% select(x2txmtscor) %>% var_label() hsls_small %>% summarise_at( .vars = vars(x2txmtscor), .funs = funs(mean, sd, min, max, .args=list(na.rm=TRUE)) ) ``` ### One-way descriptive stats for continuous vars, Tidyverse approach Can calculate descriptive stats for more than one variable at a time __Task__: - calculate descriptive stats for `x2txmtscor`, math test score, and `x4x2ses`, socioeconomic index score ```{r} hsls_small %>% select(x2txmtscor,x4x2ses) %>% var_label() hsls_small %>% summarise_at( .vars = vars(x2txmtscor,x4x2ses), .funs = funs(mean, sd, min, max, .args=list(na.rm=TRUE)) ) ``` ### One-way descriptive stats for continuous vars, Tidyverse approach "Input vars" in survey data often have negative values for missing/skips ```{r, results="hide"} hsls_small %>% filter(x2txmtscor<0) %>% count(x2txmtscor) ``` R includes those negative values when calculating stats; you don't want this - Solution: create version of variable that replaces negative values with `NA` ```{r} hsls_small %>% mutate(x2txmtscor_na=ifelse(x2txmtscor<0,NA,x2txmtscor)) %>% summarise_at( .vars = vars(x2txmtscor_na), .funs = funs(mean, sd, min, max, .args=list(na.rm=TRUE)) ) ``` What if you didn't include `.args=list(na.rm=TRUE)`? ```{r} hsls_small %>% mutate(x2txmtscor_na=ifelse(x2txmtscor<0,NA,x2txmtscor)) %>% summarise_at( .vars = vars(x2txmtscor_na), .funs = funs(mean, sd, min, max)) ``` ### One-way descriptive stats for continuous vars, Tidyverse approach How to identify these missing/skip values if you don't have a codebook? - `count()` combined with `filter()` helpful for finding extreme values of continuous vars, which are often associated with missing or skip ```{r} #variable x2txmtscor hsls_small %>% filter(x2txmtscor<0) %>% count(x2txmtscor) #variable s3clglvl hsls_small %>% select(s3clglvl) %>% var_label() hsls_small %>% filter(s3clglvl<0) %>% count(s3clglvl) ``` ### One-way descriptive stats student exercise 1. Using the object `hsls`, identify variable type, variable class, and check the variable values and value labels of `x4ps1start` - variable `x4ps1start` identifies month and year student first started postsecondary education - **Note**: This variable is a bit counterintuitive. - e.g., the value `201105` refers to May 2011 2. Get a frequency count of the variable `x4ps1start` 3. Get a frequency count of the variable, but this time only observations that have negative values __hint__: use filter() 4. Create a new version of the variable `x4ps1start_na` that replaces negative values with NAs and use `summarise_at()` to get the min and max value. ### One-way descriptive stats student exercise solutions \medskip 1. Using the object `hsls`, identify variable type, variable class, and check the variable vakyes and value labels of `x4ps1start` ```{r} typeof(hsls$x4ps1start) class(hsls$x4ps1start) hsls %>% select(x4ps1start) %>% var_label() hsls %>% select(x4ps1start) %>% val_labels() ``` ### One-way descriptive stats student exercise solutions 2. Get a frequency count of the variable `x4ps1start` ```{r} hsls %>% count(x4ps1start) ``` ### One-way descriptive stats student exercise solutions 3. Get a frequency count of the variable, but this time only observations that have negative values __hint__: use filter() ```{r} hsls %>% filter(x4ps1start<0) %>% count(x4ps1start) ``` ### One-way descriptive stats student exercise solutions 4. Create a new version `x4ps1start_na` of the variable `x4ps1start` that replaces negative values with NAs and use `summarise_at()` to get the min and max value. ```{r} hsls %>% mutate(x4ps1start_na=ifelse(x4ps1start<0,NA,x4ps1start)) %>% summarise_at( .vars = vars(x4ps1start_na), .funs = funs(min, max, .args=list(na.rm=TRUE)) ) ``` ### One-way descriptive stats for discrete/categorical vars, Tidyverse approach Use `count()` to investigate values of discrete or categorical variables For variables where `class==labelled` ```{r, results="hide"} class(hsls_small$s3classes) attributes(hsls_small$s3classes) #show counts of variable values hsls_small %>% count(s3classes) #print in console to show both #show counts of value labels hsls_small %>% count(s3classes) %>% as_factor() ``` - I like `count()` because the default setting is to show `NA` values too! ```{r, results="hide"} hsls_small %>% mutate(s3classes_na=ifelse(s3classes<0,NA,s3classes)) %>% count(s3classes_na) ``` Simultaneously show both values and value labels on count tables for `class==labelled` if entered into console - This requires some concepts/functions we haven't introduced [SKIP] ```{r, eval=FALSE} hsls_small %>% count(s3classes) y <- hsls_small %>% count(s3classes) %>% as_factor() bind_cols(x[,1], y) #wont show in updated R ``` ### Relationship between variables, categorical by categorical Two-way frequency table, called "cross tabulation", important for data quality - When you create categorical analysis var from single categorical "input" var - Two-way tables show us whether we did this correctly - Two-way tables helpful for understanding skip patterns in surveys __key to syntax__ - `df_name %>% group_by(var1) %>% count(var2)` **OR** - `df_name %>% count(var1,var2)` - play around with which variable is `var1` and which variable is `var2` ### Relationship between variables, categorical by categorical __Task__: Create a two-way table between `s3classes` and `s3clglvl` - Investigate variables ```{r, results="hide"} hsls_small %>% select(s3classes,s3clglvl) %>% var_label() hsls_small %>% select(s3classes,s3clglvl) %>% val_labels() ``` - Create two-way table ```{r, results = "hide"} hsls_small %>% group_by(s3classes) %>% count(s3clglvl) # show values hsls_small %>% count(s3classes,s3clglvl) #hsls_small %>% group_by(s3classes) %>% count(s3clglvl) %>% as_factor() # show value labels ``` - Are these objects the same? ```{r} hsls_small %>% group_by(s3classes) %>% count(s3clglvl) %>% glimpse() hsls_small %>% count(s3classes,s3clglvl) %>% glimpse() ``` ### Relationship between variables, categorical by categorical Two-way frequency table, also called "cross tabulation" __Task__: - Create a version of `s3classes` called `s3classes_na` that changes negative values to `NA` - Create a two-way table between `s3classes_na` and `s3clglvl` ```{r, results="hide"} hsls_small %>% mutate(s3classes_na=ifelse(s3classes<0,NA,s3classes)) %>% group_by(s3classes_na) %>% count(s3clglvl) hsls_small %>% mutate(s3classes_na=ifelse(s3classes<0,NA,s3classes)) %>% count(s3classes_na, s3clglvl) #example where we create some NA obs in the second variable hsls_small %>% mutate(s3classes_na=ifelse(s3classes<0,NA,s3classes), s3clglvl_na=ifelse(s3clglvl==-7,NA,s3clglvl)) %>% group_by(s3classes_na) %>% count(s3clglvl_na) hsls_small %>% mutate(s3classes_na=ifelse(s3classes<0,NA,s3classes), s3clglvl_na=ifelse(s3clglvl==-7,NA,s3clglvl)) %>% count(s3classes_na, s3clglvl_na) ``` ### Relationship between variables, categorical by categorical [SKIP] Tables above are pretty ugly Use the `spread()` function from `tidyr` package to create table with one variable as columns and the other variable as rows - The variable you place in `spread()` will be columns - We learn `spread()` function next week ```{r, results="hide"} hsls_small %>% group_by(s3classes) %>% count(s3clglvl) %>% spread(s3classes, n) hsls_small %>% group_by(s3classes) %>% count(s3clglvl) %>% as_factor() %>% spread(s3classes, n) hsls_small %>% group_by(s3classes) %>% count(s3clglvl) %>% as_factor() %>% spread(s3clglvl, n) ``` ### Relationship between variables, categorical by continuous Investigating relationship between multiple variables is a little tougher when at least one of the variables is continuous __Conditional mean__ (like regression with continuous Y and one categorical X): - Shows average values of continous variables within groups - Groups are defined by your categorical variable(s) __key to syntax__ - `group_by(categorical_var) %>% summarise_at(.vars = vars(continuous_var)` ### Relationship between variables, categorical by continuous __Task__ - Calculate mean math score, `x2txmtscor`, for each value of parental education, `x2paredu` ```{r, results="hide"} #first, investigate parental education [print in console] hsls_small %>% count(x2paredu) ``` ```{r} # using dplyr to get average math score by parental education level [print in console] hsls_small %>% group_by(x2paredu) %>% summarise_at(.vars = vars(x2txmtscor), .funs = funs(mean, .args = list(na.rm = TRUE))) ``` ### Relationship between variables, categorical by continuous __Task__ - Calculate mean math score, `x2txmtscor`, for each value of `x2paredu` For checking data quality, helpful to calculate other stats besides mean ```{r, results="hide"} hsls_small %>% group_by(x2paredu) %>% #[print in console] summarise_at(.vars = vars(x2txmtscor), .funs = funs(mean, min, max, .args = list(na.rm = TRUE))) ``` Always Investigate presence of missing/skip values ```{r, results="hide"} hsls_small %>% filter(x2paredu<0) %>% count(x2paredu) hsls_small %>% filter(x2txmtscor<0) %>% count(x2txmtscor) ``` Replace `-8` with `NA` and re-calculate conditional stats ```{r, results="hide"} hsls_small %>% mutate(x2paredu_na=ifelse(x2paredu<0,NA,x2paredu), x2txmtscor_na=ifelse(x2txmtscor<0,NA,x2txmtscor)) %>% group_by(x2paredu_na) %>% summarise_at(.vars = vars(x2txmtscor_na), .funs = funs(mean, min, max, .args = list(na.rm = TRUE))) %>% as_factor() hsls_small %>% count(s3classes,s3clglvl) %>% as_factor ``` ### Student exercise Can use same approach to calculate conditional mean by multiple `group_by()` variables - Just add additional variables within `group_by()` 1. Calculate mean math test score (`x2txmtscor`), for each combination of parental education (`x2paredu`) and sex (`x2sex`). ### Student exercise solution 1. Calculate mean math test score (`x2txmtscor`), for each combination of parental education (`x2paredu`) and sex (`x2sex`) ```{r, results="hide"} #hsls_small %>% count(x2sex) hsls_small %>% group_by(x2paredu,x2sex) %>% summarise_at(.vars = vars(x2txmtscor), .funs = funs(mean, .args = list(na.rm = TRUE))) %>% as_factor() ``` ## Guidelines for EDA ### Guidelines for "EDA for data quality" Assme that your goal in "EDA for data quality" is to investigate "input" data sources and create "analysis variables" - Usually, your analysis dataset will incorporate multiple sources of input data, including data you collect (primary data) and/or data collected by others (secondary data) While this is not a linear process, these are the broad steps I follow 1. Understand how input data sources were created - e.g., when working with survey data, have survey questionnaire and codebooks on hand 1. For each input data source, identify the "unit of analysis" and which combination of variables uniquely identify observations 1. Investigate patterns in input variables 1. Create analysis variable from input variable(s) 1. Verify that analysis variable is created correctly through descriptive statistics that compare values of input variable(s) against values of the analysis variable __Always be aware of missing values__ - They will not always be coded as `NA` in input variables ### "Unit of analysis" and which variables uniquely identify observations "Unit of analysis" refers to "what does each observation represent" in an input data source - If each obs represents a student, you have "student level data" - If each obs represents a student-course, you have "student-course level data" - If each obs represents a school, you have "school-level data" - If each obs represents a school-year, you have "school-year level data" How to identify unit of analysis - data documentation - investigating the data set We will go over syntax for identifying unit of analysis in subsequent weeks ### Rules for variable creation Rules I follow for variable creation 1. \medskip Never modify "input variable"; instead create new variable based on input variable(s) - Always keep input variables used to create new variables 1. Investigate input variable(s) and relationship between input variables 1. Developing a plan for creation of analysis variable - e.g., for each possible value of input variables, what should value of analysis variable be? 1. Write code to create analysis variable 1. Run descriptive checks to verify new variables are constructed correctly - Can "comment out" these checks, but don't delete them 1. Document new variables with notes and labels ### Rules for variable creation __Task__: - Create analysis for variable ses qunitile called `sesq5` based on `x4x2sesq5` that converts negative values to `NAs` ```{r, results="hide"} #investigate input variable hsls_small %>% select(x4x2sesq5) %>% var_label() hsls_small %>% select(x4x2sesq5) %>% val_labels() hsls_small %>% select(x4x2sesq5) %>% count(x4x2sesq5) hsls_small %>% select(x4x2sesq5) %>% count(x4x2sesq5) %>% as_factor() #create analysis variable hsls_small_temp <- hsls_small %>% mutate(sesq5=ifelse(x4x2sesq5==-8,NA,x4x2sesq5)) # approach 1 hsls_small_temp <- hsls_small %>% mutate(sesq5=ifelse(x4x2sesq5<0,NA,x4x2sesq5)) # approach 2 #verify hsls_small_temp %>% group_by(x4x2sesq5) %>% count(sesq5) ``` ## Skip patterns in survey data ### What are skip patterns Pretty easy to create an analysis variable based on a single input variable Harder to create analysis variables based on multiple input variables - When working with survey data, even seemingly simple analysis variables require multiple input variables due to "skip patterns" What are "skip patterns"? - Response on a particular survey item determines whether respondent answers some set of subsequent questions - What are some examples of this? Key to working with skip patterns - Have the survey questionnaire on hand - Sometimes it appears that analysis variable requires only one input variable, but really depends on several input variables because of skip patterns - Don't just blindly turn "missing" and "skips" from survey data to `NAs` in your analysis variable - Rather, trace why these "missing" and "skips" appear and decide how they should be coded in your analysis variable # Problem Set 7 ### Overview of problem set due next week __Assignment__: - create GPA from postsecondary transcript student-course level data __Data source__: [National Longitudinal Study of 1972 (NLS72)](https://nces.ed.gov/surveys/nls72/) - Follows 12th graders from 1972 - Base year: 1972 - Follow-up surveys in: 1973, 1974, 1976, 1979, 1986 - Postsecondary transcripts collected in 1984 __Why use such an old survey for this assignment?__ - NLS72 predates data privacy agreements; transcript data publicly available __What we do to make assignment more manageable__ - last week's problem set created the input var: numgrade - we give you some hints/guidelines - but you are responsible for developing plan to create GPA vars and for executing plan (rather than us giving you step-by-step quations) __Why this assignment?__ 1. Give you more practice investigating data, cleaning data, creating variables that require processing across rows 1. Real world example of "simple" task with complex data management needs -->