--- title: "Data validation and exploration" author: Abhijit Dasgupta date: BIOF 339 --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, comment = '',warning=T, message=F) library(tidyverse) library(here) library(magick) imgdir <- here('slides/lectures/img') datadir <- here('slides/lectures/data') source(here('lib/R/update_header.R')) library(xaringanExtra) use_extra_styles(hover_code_line=TRUE, mute_unhighlighted_code = TRUE) use_tile_view() use_share_again() style_share_again(share_buttons = 'none') ``` ```{r, echo=FALSE, results='asis'} update_header() ``` --- ## Plan today - Dynamic exploration of data - Data validation - Missing data evaluation --- class: middle, center # Why go back to this? --- ## This is important!! + Most of the time in an analysis is spent understanding and cleaning data + Recognize that unless you've ended up with good-quality data, the rest of the analyses are moot + This is tedious, careful, non-sexy work - Hard to tell your boss you're still fixing the data - No real results yet - But essential to understanding the appropriate analyses and the tweaks you may need. --- ## What does a dataset look like? .pull-left[ ```{r v1, eval = F, echo = T} library(tidyverse) library(visdat) beaches <- rio::import('../data/sydneybeaches3.csv') vis_dat(beaches) ``` ] .pull-right[ ```{r 12-CleaningExploring-1, eval = T, echo = F, ref.label = "v1"} ``` ] --- ## What does a dataset look like? .pull-left[ ```{r v2, eval = F, echo = T} brca <- rio::import('../data/clinical_data_breast_cancer_modified.csv') vis_dat(brca) ``` ] .pull-right[ ```{r 12-CleaningExploring-2, eval = T, echo = F, ref.label = "v2"} ``` ] --- ## What does a dataset look like? .pull-left[ ```{r v3, eval = F, echo = T} vis_dat(airquality) ``` These plots give a nice insight into 1. data types 1. Missing data patterns (more on this later) ] .pull-right[ ```{r 12-CleaningExploring-3, eval = T, echo = F, ref.label = "v3"} ``` ] --- class: middle, center # Let's get a bit more quantitative --- ## `summary` and `str`/`glimpse` are a first pass .pull-left[ ```{r 12-CleaningExploring-4} summary(airquality) ``` ] .pull-right[ ```{r 12-CleaningExploring-5} glimpse(airquality) ``` ] --- ## Validating data values + We can certainly be reactive by just describing the data and looking for anomalies. + For larger data or multiple data files it makes sense to be proactive and catch errors that you want to avoid, before exploring for new errors. + The `assertthat` package provides nice tools to do this -- > **Note to self:** I don't do this enough. This is a good defensive programming technique that can catch crucial problems that aren't always automatically flagged as errors --- ## Being assertive ```{r 12-CleaningExploring-6, error=T} library(assertthat) assert_that(all(between(airquality$Day, 1, 31) )) assert_that(is.factor(mpg$manufacturer)) assert_that(all(beaches$season_name %in% c('Summer','Winter','Spring', 'Fall'))) ``` --- ## Being assertive + `assert_that` generates an error, which will stop things + `see_if` does the same validation, but just generates a `TRUE/FALSE`, which you can capture ```{r 12-CleaningExploring-7} see_if(is.factor(mpg$manufacturer)) ``` + `validate_that` generates `TRUE` if the assertion is true, otherwise generates a string explaining the error ```{r 12-CleaningExploring-8} validate_that(is.factor(mpg$manufacturer)) validate_that(is.character(mpg$manufacturer)) ``` --- ## Being assertive You can even write your own validation functions and custom messages ```{r 12-CleaningExploring-9, error=TRUE} is_odd <- function(x){ assert_that(is.numeric(x), length(x)==1) x %% 2 == 1 } assert_that(is_odd(2)) on_failure(is_odd) <- function(call, env) { paste0(deparse(call$x), " is even") # This is a R trick } assert_that(is_odd(2)) is_odd(1:2) ``` --- class: middle,center # Missing data --- ## Missing data R denotes missing data as `NA`, and supplies several functions to deal with missing data. The most fundamental is `is.na`, which gives a TRUE/FALSE answer ```{r 12-CleaningExploring-10} is.na(NA) is.na(25) ``` --- ## Missing data When we get a new dataset, it's useful to get a sense of the missingness ```{r 12-CleaningExploring-11} mpg %>% summarize(across(everything(), function(x) sum(is.na(x)))) ``` --- ## Missing data The `naniar` package allows a tidyverse-compatible way to deal with missing data ```{r 12-CleaningExploring-12} library(naniar) weather <- rio::import('../data/weather.csv') all_complete(mpg) all_complete(weather) weather %>% summarize_all(pct_complete) ``` --- ## Missing data ```{r 12-CleaningExploring-13} gg_miss_case(weather, show_pct = T) ``` --- ## Missing data ```{r 12-CleaningExploring-14} gg_miss_var(weather, show_pct=T) ``` --- ## Are there patterns to the missing data + Most analyses assume that data is either - Missing completely at random (MCAR) - Missing at random (MAR) + MCAR means - The missing data is just a random subset of the data + MAR means - Whether data is missing is related to values of some other variable(s) - If we control for those variable(s), the missing data would form a random subset of each of those data subsets defined by unique values of these variables --- ## Are there patterns to the missing data #### MAR or MCAR allows us to ignore the missing data, since it doesn't bias our analyses #### If data are **not** MCAR or MAR, we really need to understand the issing data mechanism and how that might affect our results. --- ## Co-patterns of missingness .pull-left[ ```{r v4, eval = T, echo = T} gg_miss_upset(airquality) ``` ] .pull-right[ ```{r v5, eval = T, echo = T} gg_miss_upset(riskfactors) ``` ] --- ## Co-patterns of missingness .pull-left[ ```{r d1, fig.height=3, message=T} ggplot(airquality, aes(x = Ozone, y = Solar.R)) + geom_point() ``` ] .pull-right[ ```{r 12-CleaningExploring-15, fig.height=3} ggplot(airquality, aes(x = Ozone, y = Solar.R)) + geom_miss_point() ``` ] --- ## Co-patterns of missingness ```{r 12-CleaningExploring-16, warning=F} gg_miss_fct(x = riskfactors, fct = marital) ``` --- ## Replacing missing data `tidyr` has a function `replace_na` which will replace all missing values with some particular value. In the weather dataset, values are missing generally because there wasn't recorded rainfall on a day. So these values should really be 0 ```{r 12-CleaningExploring-17} weather1 <- weather %>% mutate(d1 = replace_na(d1, 0)) pct_miss(weather1$d1) ``` --- ### Question: How would you replace all the missing values with 0? -- ```{r 12-CleaningExploring-18, results = 'hide'} weather %>% mutate(across(everything(),function(x) replace_na(x, 0))) ``` -- ### How would you replace the missing values with the mean of the variable? -- ```{r 12-CleaningExploring-19, results = 'hide'} weather %>% mutate(across(where(is.numeric), function(x) replace_na(x, mean(x, na.rm=T)))) ``` ---