--- title: "Visualizing Longitudinal Data Using brolgar" author: "Dr. Hua Zhou @ UCLA" date: "Jan 28, 2020" subtitle: Biostat 203B output: # ioslides_presentation: default html_document: toc: true toc_depth: 4 --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` Adapted form the tutorial at . ## Longitudinal data [broglar](http://brolgar.njtierney.com) is an R package that helps **br**owse **o**ver **l**ongitudinal **d**ata **g**raphically and **a**nalytically in **R**. ## Installing `brolgar` Install from GitHub: ```{r} # install.packages("remotes") remotes::install_github("njtierney/brolgar") ``` ## An example longitudinal data Load `tidyverse`, `brolgar` and `gghighlight`: ```{r} library(tidyverse) library(brolgar) library(gghighlight) ``` The csv file `wages_pp.csv` contains a data set from the textbook _Applied Longitudical Data Analysis_ (2003) by Singer and Willett. This data contains measurements on hourly wages by years in the workforce, with education and race as covariates. ```{bash} head wages_pp.csv ``` Read following variables into a tibble: - `id`: 1-888, for each subject. - `lnw`: natural log of wages, adjusted for inflation, to 1990 dollars. - `exper`: Experience - the length of time in the workforce (in years). The number of time points and values of time points for each subject can differ. - `ged`: when/if a graduate equivalency diploma is obtained. - `postexp`: change in experience since getting a ged (if they get one). - `black`: categorical indicator of race = black. - `hispanic`: categorical indicator of race = hispanic. - `hgc`: highest grade completed. - `uerate`: unemployment rates in the local geographic region at each measurement time. ```{r} wages <- read_csv("wages_pp.csv") %>% select(1:8, 10) wages %>% print(width = Inf) ``` ## Turn data frame into a `tsibble` We turn a regular data frame/tibble into a tidy time series data frame `tsibble` by identifying 1. The **key** variable is the identifier of individuals. 2. The **index** variable is the time component of your data. 3. The **regularity** of the time interveral (index). Longitudinal data typically has irregular time periods between measurements, but can have regular measurements. ```{r} wages <- as_tsibble(x = wages, key = id, index = exper, regular = FALSE) wages ``` ## Display trajectories of a few individuals `sample_n_keys()` and `sample_frac_keys()` to take a random sample of keys. ```{r} set.seed(203) wages %>% sample_n_keys(size = 5) %>% ggplot(aes(x = exper, y = lnw, group = id)) + geom_line() ``` `facet_sample()` allows you to specify the number of keys per facet, and the number of facets with `n_per_facet` and `n_facets`. By default, it splits the data into 12 facets with 3 per facet: ```{r} set.seed(203) ggplot(wages, aes(x = exper, y = lnw, group = id)) + geom_line() + facet_sample() ``` ## Find features in longitudinal data Five number summaries of wages for each individual: ```{r} wages %>% features(lnw, feat_five_num) ``` Other feature functions include `feat_monotonic`, `feat_ranges`, `feat_spread`, `feat_three_num`, `n_obs`. For example, find those whose wages only increase or decrease with feat_monotonic: ```{r} wages %>% features(lnw, feat_monotonic) ``` Find the number of observations per individual and summarize by bar plot: ```{r} wages %>% features(lnw, n_obs) %>% ggplot(aes(x = n_obs)) + geom_bar() ``` ## Linking individuals back to the data Once you have created these features, you can join them back to the data with a `left_join`: ```{r} wages %>% features(lnw, feat_monotonic) %>% left_join(wages, by = "id") %>% ggplot(aes(x = exper, y = lnw, group = id)) + geom_line() + gghighlight(increase) ```