--- title: "Missing Data (Part 1)" description: | A look at how visualization can help characterize missing data. author: - name: Kris Sankaran affiliation: UW Madison date: 02-21-2021 output: distill::distill_article: self_contained: false --- _[Reading](https://naniar.njtierney.com/articles/getting-started-w-naniar.html), [Recording](https://mediaspace.wisc.edu/media/Week%205%20%5B1%5D%20Missing%20Data%20(Part%201)/1_xjmq3zmu), [Rmarkdown](https://raw.githubusercontent.com/krisrs1128/stat479/master/_posts/2021-01-30-week5-1/week5-1.Rmd)_ ```{r setup, include=FALSE} knitr::opts_chunk$set(cache = FALSE, message = FALSE, warning = FALSE, echo = TRUE) ``` ```{r} library("dplyr") library("ggplot2") library("naniar") library("UpSetR") theme_set(theme_bw()) ``` 1. Visualization is about more than simply communicating results at the end of a data analysis. It can be used to improve the process of data analysis itself. This is the subject of this week’s notes. 2. When data science workflows are opaque — when sequences of commands are followed blindly — that’s when mistakes are most likely to be made^[Prof. Broman has noted a compendium of [horror stories](http://www.eusprig.org/horror-stories.htm)]. Visualization can help improve the transparency of different steps across the workflow. 3. To make this idea concrete, let’s consider missing data. We can encounter missing data for a variety of reasons. Perhaps a sensor was broken or no response was submitted. Maybe several datasets were merged, but some fields are only present in the more recent versions. 4. If there are missing data, and if we fail to account for it, we might accidentally draw incorrect conclusions. Or, if we are going to attempt to impute the missing values, we should verify the assumptions of whatever imputation algorithm we intend to use. 5. A first plot to make is to look at the proportion of missing data within each field. Let's consider the `riskfactors` data, which is a subset from the [Behavioral Risk Factor Surveillance System](https://www.rdocumentation.org/packages/naniar/versions/0.6.0/topics/riskfactors). ```{r} gg_miss_var(riskfactors) ``` 6. This histogram doesn’t tell us whether missing data tend to co-occur across a few different dimensions. For that, we can count the different missingness patterns. ```{r} gg_miss_upset(riskfactors) ``` 7. Neither of these approaches tell us whether missing data tend to occur in clumps along rows of the data. We might expect this if our sensor broke down within a particular time interval, for example. For this, we can make a type of heatmap across all the data. ```{r} vis_miss(riskfactors) ``` There doesn't seem to be much clumping in that dataset, but consider the `airquality` dataset below, whose rows are sorted by time. ```{r} vis_miss(airquality) ``` 8. This visualization actually has a formal name, the shadow matrix^[Which when you think about it sounds pretty dystopian / scifi.], and it comes up in theoretical analysis of missing data.