--- title: "Biostat 203B Homework 2" subtitle: Due ~~Feb 10~~ Feb 15 @ 11:59PM author: YOUR NAME and UID format: html: theme: cosmo number-sections: true toc: true toc-depth: 4 toc-location: left code-fold: false knitr: opts_chunk: cache: false echo: true fig.align: 'center' fig.width: 6 fig.height: 4 message: FALSE --- Display machine information for reproducibility: ```{r} #| eval: false sessionInfo() ``` Load necessary libraries (you can add more as needed). ```{r setup} library(data.table) library(lubridate) library(R.utils) library(tidyverse) ``` MIMIC data location ```{r} mimic_path <- "~/mimic" ``` In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the [MIMIC-IV](https://mimic.mit.edu/docs/iv/) data introduced in [homework 1](https://ucla-biostat-203b.github.io/2023winter/hw/hw1/hw1.html) and to build a cohort of ICU stays. Display the contents of MIMIC data folder. ```{r} system(str_c("ls -l ", mimic_path, "/"), intern = TRUE) system(str_c("ls -l ", mimic_path, "/core"), intern = TRUE) system(str_c("ls -l ", mimic_path, "/hosp"), intern = TRUE) system(str_c("ls -l ", mimic_path, "/icu"), intern = TRUE) ``` ## Q1. `read.csv` (base R) vs `read_csv` (tidyverse) vs `fread` (data.table) There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, `admissions.csv.gz`, by three programs: `read.csv` in base R, `read_csv` in tidyverse, and `fread` in the popular data.table package. Which function is fastest? Is there difference in the (default) parsed data types? (Hint: R function `system.time` measures run times.) For later questions, we stick to the `read_csv` in tidyverse. ## Q2. ICU stays `icustays.csv.gz` () contains data about Intensive Care Units (ICU) stays. The first 10 lines are ```{r} system( str_c( "zcat < ", str_c(mimic_path, "/icu/icustays.csv.gz"), " | head" ), intern = TRUE ) ``` 1. Import `icustatys.csv.gz` as a tibble `icustays_tble`. 2. How many unique `subject_id`? Can a `subject_id` have multiple ICU stays? 3. Summarize the number of ICU stays per `subject_id` by graphs. 4. For each `subject_id`, let's only keep the first ICU stay in the tibble `icustays_tble`. (Hint: `slice_min` and `slice_max` may take long. Think alternative ways to achieve the same function.) ## Q3. `admission` data Information of the patients admitted into hospital is available in `admissions.csv.gz`. See for details of each field in this file. The first 10 lines are ```{r} system( str_c( "zcat < ", str_c(mimic_path, "/core/admissions.csv.gz"), " | head" ), intern = TRUE ) ``` 1. Import `admissions.csv.gz` as a tibble `admissions_tble`. 2. Let's only keep the admissions that have a match in `icustays_tble` according to `subject_id` and `hadmi_id`. 3. Summarize the following variables by graphics. - admission year - admission month - admission month day - admission week day - admission hour (anything unusual?) - admission minute (anything unusual?) - length of hospital stay (anything unusual?) ## Q4. `patients` data Patient information is available in `patients.csv.gz`. See for details of each field in this file. The first 10 lines are ```{r} system( str_c( "zcat < ", str_c(mimic_path, "/core/patients.csv.gz"), " | head" ), intern = TRUE ) ``` 1. Import `patients.csv.gz` () as a tibble `patients_tble` and only keep the patients who have a match in `icustays_tble` (according to `subject_id`). 2. Summarize variables `gender` and `anchor_age`, and explain any patterns you see. ## Q5. Lab results `labevents.csv.gz` () contains all laboratory measurements for patients. The first 10 lines are ```{r} system( str_c( "zcat < ", str_c(mimic_path, "/hosp/labevents.csv.gz"), " | head" ), intern = TRUE ) ``` `d_labitems.csv.gz` is the dictionary of lab measurements. ```{r} system( str_c( "zcat < ", str_c(mimic_path, "/hosp/d_labitems.csv.gz"), " | head" ), intern = TRUE ) ``` 1. Find how many rows are in `labevents.csv.gz`. 2. We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of `labevents.csv.gz` only containing these items for the patients in `icustays_tble` as a tibble `labevents_tble`. Hint: `labevents.csv.gz` is a data file too big to be read in by the `read_csv` function in its default setting. Utilize the `col_select` option in the `read_csv` function to reduce the memory burden. It took my computer 5-10 minutes to ingest this file. If your computer really has trouble importing `labevents.csv.gz`, you can import from the reduced data file `labevents_filtered_itemid.csv.gz`. 3. Further restrict `labevents_tble` to the first lab measurement during the ICU stay. 4. Summarize the lab measurements by appropriate numerics and graphics. ## Q6. Vitals from charted events `chartevents.csv.gz` () contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The `itemid` variable indicates a single measurement type in the database. The `value` variable is the value measured for `itemid`. The first 10 lines of `chartevents.csv.gz` are ```{r} system( str_c( "zcat < ", str_c(mimic_path, "/icu/chartevents.csv.gz"), " | head"), intern = TRUE ) ``` `d_items.csv.gz` () is the dictionary for the `itemid` in `chartevents.csv.gz`. ```{r} system( str_c( "zcat < ", str_c(mimic_path, "/icu/d_items.csv.gz"), " | head"), intern = TRUE ) ``` 1. We are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of `chartevents.csv.gz` only containing these items for the patients in `icustays_tble` as a tibble `chartevents_tble`. Hint: `chartevents.csv.gz` is a data file too big to be read in by the `read_csv` function in its default setting. Utilize the `col_select` option in the `read_csv` function to reduce the memory burden. It took my computer >15 minutes to ingest this file. If your computer really has trouble importing `chartevents.csv.gz`, you can import from the reduced data file `chartevents_filtered_itemid.csv.gz`. 2. Further restrict `chartevents_tble` to the first vital measurement during the ICU stay. 3. Summarize these vital measurements by appropriate numerics and graphics. ## Q7. Putting things together Let us create a tibble `mimic_icu_cohort` for all ICU stays, where rows are the first ICU stay of each unique adult (age at admission > 18) and columns contain at least following variables - all variables in `icustays.csv.gz` - all variables in `admission.csv.gz` - all variables in `patients.csv.gz` - first lab measurements during ICU stay - first vital measurements during ICU stay - an indicator variable `thirty_day_mort` whether the patient died within 30 days of hospital admission (30 day mortality) ## Q8. Exploratory data analysis (EDA) Summarize following information using appropriate numerics or graphs. - `thirty_day_mort` vs demographic variables (ethnicity, language, insurance, marital_status, gender, age at hospital admission) - `thirty_day_mort` vs first lab measurements - `thirty_day_mort` vs first vital measurements - `thirty_day_mort` vs first ICU unit