---
title: "Biostat 203B Homework 2"
subtitle: Due ~~Feb 10~~ Feb 15 @ 11:59PM
author: YOUR NAME and UID
format:
  html:
    theme: cosmo
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
knitr:
  opts_chunk:
    cache: false
    echo: true
    fig.align: 'center'
    fig.width: 6
    fig.height: 4
    message: FALSE
---
Display machine information for reproducibility:
```{r}
#| eval: false
sessionInfo()
```
Load necessary libraries (you can add more as needed).
```{r setup}
library(data.table)
library(lubridate)
library(R.utils)
library(tidyverse)
```
MIMIC data location
```{r}
mimic_path <- "~/mimic"
```
In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the [MIMIC-IV](https://mimic.mit.edu/docs/iv/) data introduced in [homework 1](https://ucla-biostat-203b.github.io/2023winter/hw/hw1/hw1.html) and to build a cohort of ICU stays.
Display the contents of the MIMIC data folder.
```{r}
system(str_c("ls -l ", mimic_path, "/"), intern = TRUE)
system(str_c("ls -l ", mimic_path, "/core"), intern = TRUE)
system(str_c("ls -l ", mimic_path, "/hosp"), intern = TRUE)
system(str_c("ls -l ", mimic_path, "/icu"), intern = TRUE)
```
## Q1. `read.csv` (base R) vs `read_csv` (tidyverse) vs `fread` (data.table)
There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderately sized compressed csv file, `admissions.csv.gz`, with three functions: `read.csv` in base R, `read_csv` in tidyverse, and `fread` in the popular data.table package.
Which function is fastest? Is there any difference in the (default) parsed data types? (Hint: R function `system.time` measures run times.)
For later questions, we stick to `read_csv` in tidyverse.
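
The timing comparison hinted at above can be sketched on a small synthetic file. This is only an illustration: the toy columns and the name `toy_path` are made up here and are not part of the MIMIC data, and timings on a tiny file will not reflect the ranking on `admissions.csv.gz`.

```{r}
#| eval: false
library(readr)
library(data.table)

# Toy data standing in for admissions.csv.gz (hypothetical columns).
toy <- data.frame(
  id = 1:1000,
  admittime = format(
    as.POSIXct("2020-01-01", tz = "UTC") + 3600 * (1:1000),
    "%Y-%m-%d %H:%M:%S"
  ),
  type = sample(c("EW", "URGENT"), 1000, replace = TRUE)
)
toy_path <- tempfile(fileext = ".csv.gz")
write_csv(toy, toy_path)  # readr compresses based on the .gz extension

# system.time() reports user, system, and elapsed seconds.
t_base  <- system.time(df_base  <- read.csv(toy_path))
t_readr <- system.time(df_readr <- read_csv(toy_path, show_col_types = FALSE))
t_dt    <- system.time(df_dt    <- fread(toy_path))  # fread needs R.utils for .gz

# Elapsed time of each reader.
c(
  read.csv = t_base[["elapsed"]],
  read_csv = t_readr[["elapsed"]],
  fread    = t_dt[["elapsed"]]
)

# Default parsed class of each column, one column of output per reader.
sapply(
  list(read.csv = df_base, read_csv = df_readr, fread = df_dt),
  function(d) sapply(d, function(col) class(col)[1])
)
```

The last call makes it easy to spot columns that one reader keeps as character while another parses as a date-time.
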
## Q2. ICU stays
`icustays.csv.gz` contains data about Intensive Care Unit (ICU) stays. The first 10 lines are
```{r}
system(
  str_c(
    "zcat < ",
    str_c(mimic_path, "/icu/icustays.csv.gz"),
    " | head"
  ),
  intern = TRUE
)
```
1. Import `icustays.csv.gz` as a tibble `icustays_tble`.
2. How many unique `subject_id` values are there? Can a `subject_id` have multiple ICU stays?
3. Summarize the number of ICU stays per `subject_id` by graphs.
4. For each `subject_id`, let's only keep the first ICU stay in the tibble `icustays_tble`. (Hint: `slice_min` and `slice_max` may take a long time. Think of alternative ways to achieve the same result.)
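
One possible alternative to `slice_min` is to sort and then keep the first row per `subject_id`. The sketch below uses a made-up tibble `toy_stays`; only `subject_id`, `stay_id`, and `intime` are assumed to mirror columns of `icustays.csv.gz`, and the values are invented.

```{r}
#| eval: false
library(dplyr)

# Toy stand-in for icustays_tble (values are made up).
toy_stays <- tibble(
  subject_id = c(1, 1, 2, 3, 3, 3),
  stay_id    = c(11, 12, 21, 31, 32, 33),
  intime     = as.POSIXct(
    c("2020-01-05", "2020-01-01", "2020-02-01",
      "2020-03-03", "2020-03-01", "2020-03-02"),
    tz = "UTC"
  )
)

# Number of ICU stays per subject (input for a bar plot or histogram).
stays_per_subject <- count(toy_stays, subject_id, name = "n_stays")

# Keep the earliest stay per subject: sort by time within subject,
# then keep the first row seen for each subject_id.
first_stays <- toy_stays %>%
  arrange(subject_id, intime) %>%
  distinct(subject_id, .keep_all = TRUE)
```

Sorting once and de-duplicating avoids per-group slicing, which is one reason it may run faster on a large table.
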
## Q3. `admissions` data
Information on the patients admitted to the hospital is available in `admissions.csv.gz`. See the MIMIC-IV documentation for details of each field in this file. The first 10 lines are
```{r}
system(
  str_c(
    "zcat < ",
    str_c(mimic_path, "/core/admissions.csv.gz"),
    " | head"
  ),
  intern = TRUE
)
```
1. Import `admissions.csv.gz` as a tibble `admissions_tble`.
2. Let's only keep the admissions that have a match in `icustays_tble` according to `subject_id` and `hadm_id`.
3. Summarize the following variables by graphics.
- admission year
- admission month
- admission day of the month
- admission day of the week
- admission hour (anything unusual?)
- admission minute (anything unusual?)
- length of hospital stay (anything unusual?)
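
The matching and date-part extraction above could look roughly like the following on toy data. The tibbles `toy_adm` and `toy_icu` and their values are invented; the column names `admittime` and `dischtime` are assumed to match the admissions table.

```{r}
#| eval: false
library(dplyr)
library(lubridate)

# Toy stand-ins for admissions_tble and icustays_tble (made-up values).
toy_adm <- tibble(
  subject_id = c(1, 2, 4),
  hadm_id    = c(101, 201, 401),
  admittime  = ymd_hms(c("2020-01-05 13:15:00", "2020-02-29 00:00:00",
                         "2020-07-04 07:30:00")),
  dischtime  = ymd_hms(c("2020-01-08 13:15:00", "2020-03-02 12:00:00",
                         "2020-07-05 07:30:00"))
)
toy_icu <- tibble(subject_id = c(1, 2), hadm_id = c(101, 201))

adm_matched <- toy_adm %>%
  # semi_join keeps rows of toy_adm with a match, without adding columns.
  semi_join(toy_icu, by = c("subject_id", "hadm_id")) %>%
  mutate(
    adm_year   = year(admittime),
    adm_month  = month(admittime, label = TRUE),
    adm_mday   = mday(admittime),
    adm_wday   = wday(admittime, label = TRUE),
    adm_hour   = hour(admittime),
    adm_minute = minute(admittime),
    # Length of stay in days, as a plain number for plotting.
    los_days   = as.numeric(difftime(dischtime, admittime, units = "days"))
  )
```

Each derived column can then be passed to `ggplot()` with `geom_bar()` or `geom_histogram()`.
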
## Q4. `patients` data
Patient information is available in `patients.csv.gz`. See the MIMIC-IV documentation for details of each field in this file. The first 10 lines are
```{r}
system(
  str_c(
    "zcat < ",
    str_c(mimic_path, "/core/patients.csv.gz"),
    " | head"
  ),
  intern = TRUE
)
```
1. Import `patients.csv.gz` as a tibble `patients_tble` and only keep the patients who have a match in `icustays_tble` (according to `subject_id`).
2. Summarize variables `gender` and `anchor_age`, and explain any patterns you see.
## Q5. Lab results
`labevents.csv.gz` contains all laboratory measurements for patients. The first 10 lines are
```{r}
system(
  str_c(
    "zcat < ",
    str_c(mimic_path, "/hosp/labevents.csv.gz"),
    " | head"
  ),
  intern = TRUE
)
```
`d_labitems.csv.gz` is the dictionary of lab measurements.
```{r}
system(
  str_c(
    "zcat < ",
    str_c(mimic_path, "/hosp/d_labitems.csv.gz"),
    " | head"
  ),
  intern = TRUE
)
```
1. Find how many rows are in `labevents.csv.gz`.
2. We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of `labevents.csv.gz` only containing these items for the patients in `icustays_tble` as a tibble `labevents_tble`.
Hint: `labevents.csv.gz` is a data file too big to be read in by the `read_csv` function in its default setting. Utilize the `col_select` option in the `read_csv` function to reduce the memory burden. It took my computer 5-10 minutes to ingest this file. If your computer really has trouble importing `labevents.csv.gz`, you can import from the reduced data file `labevents_filtered_itemid.csv.gz`.
3. Further restrict `labevents_tble` to the first lab measurement during the ICU stay.
4. Summarize the lab measurements by appropriate numerics and graphics.
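
As a rough illustration of the `col_select` hint, the sketch below reads only the needed columns from a small synthetic file and then filters by `itemid` and by the ICU patients. The file, the tibbles `toy_lab` and `toy_icu`, and the `comments` column are made up for the example; the column names kept are assumed to match `labevents.csv.gz`.

```{r}
#| eval: false
library(readr)
library(dplyr)

# Toy file standing in for labevents.csv.gz (made-up rows).
toy_lab <- tibble(
  labevent_id = 1:6,
  subject_id  = c(1, 1, 2, 2, 3, 3),
  itemid      = c(50912, 50000, 50971, 50912, 50912, 50983),
  charttime   = "2020-01-01 08:00:00",
  valuenum    = c(1.1, 9, 4.2, 0.9, 1.3, 140),
  comments    = "free text we do not need"
)
toy_lab_path <- tempfile(fileext = ".csv.gz")
write_csv(toy_lab, toy_lab_path)

# Lab items of interest, taken from the question text.
item_ids <- c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)
# Toy stand-in for the subjects in icustays_tble.
toy_icu <- tibble(subject_id = c(1, 2))

labs_small <- read_csv(
  toy_lab_path,
  # Only these columns are kept in the result, which lowers memory use.
  col_select = c(subject_id, itemid, charttime, valuenum),
  show_col_types = FALSE
) %>%
  filter(itemid %in% item_ids) %>%
  semi_join(toy_icu, by = "subject_id")
```

Filtering immediately after the read keeps the intermediate object small, which matters far more on the real file than on this toy one.
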
## Q6. Vitals from charted events
`chartevents.csv.gz` contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The `itemid` variable indicates a single measurement type in the database. The `value` variable is the value measured for `itemid`. The first 10 lines of `chartevents.csv.gz` are
```{r}
system(
  str_c(
    "zcat < ",
    str_c(mimic_path, "/icu/chartevents.csv.gz"),
    " | head"
  ),
  intern = TRUE
)
```
`d_items.csv.gz` is the dictionary for the `itemid` in `chartevents.csv.gz`.
```{r}
system(
  str_c(
    "zcat < ",
    str_c(mimic_path, "/icu/d_items.csv.gz"),
    " | head"
  ),
  intern = TRUE
)
```
1. We are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of `chartevents.csv.gz` only containing these items for the patients in `icustays_tble` as a tibble `chartevents_tble`.
Hint: `chartevents.csv.gz` is a data file too big to be read in by the `read_csv` function in its default setting. Utilize the `col_select` option in the `read_csv` function to reduce the memory burden. It took my computer >15 minutes to ingest this file. If your computer really has trouble importing `chartevents.csv.gz`, you can import from the reduced data file `chartevents_filtered_itemid.csv.gz`.
2. Further restrict `chartevents_tble` to the first vital measurement during the ICU stay.
3. Summarize these vital measurements by appropriate numerics and graphics.
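
Restricting to the first measurement during the ICU stay might be sketched as follows. This assumes "during the ICU stay" means `intime <= charttime <= outtime`, which is an interpretation and not stated in the assignment; the tibbles and the `item_` column prefix are made up for the example.

```{r}
#| eval: false
library(dplyr)
library(tidyr)
library(lubridate)

# Toy stand-ins for icustays_tble and chartevents_tble (made-up values).
toy_icu <- tibble(
  subject_id = c(1, 2),
  intime  = ymd_hms(c("2020-01-01 10:00:00", "2020-02-01 10:00:00")),
  outtime = ymd_hms(c("2020-01-03 10:00:00", "2020-02-02 10:00:00"))
)
toy_chart <- tibble(
  subject_id = c(1, 1, 1, 1, 2, 2),
  itemid     = c(220045, 220045, 220045, 220210, 220045, 220210),
  charttime  = ymd_hms(c("2020-01-01 09:00:00", "2020-01-01 11:00:00",
                         "2020-01-01 12:00:00", "2020-01-01 10:30:00",
                         "2020-02-01 10:05:00", "2020-02-03 08:00:00")),
  valuenum   = c(70, 80, 90, 18, 100, 22)
)

first_vitals <- toy_chart %>%
  inner_join(toy_icu, by = "subject_id") %>%
  # Assumption: keep only measurements charted within the stay window.
  filter(charttime >= intime, charttime <= outtime) %>%
  # Earliest measurement per subject and item.
  arrange(subject_id, itemid, charttime) %>%
  distinct(subject_id, itemid, .keep_all = TRUE) %>%
  select(subject_id, itemid, valuenum) %>%
  # One row per subject, one column per vital.
  pivot_wider(
    names_from = itemid,
    values_from = valuenum,
    names_prefix = "item_"
  )
```

A subject with no measurement of an item inside the window gets `NA` in that column after `pivot_wider`.
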
## Q7. Putting things together
Let us create a tibble `mimic_icu_cohort` for all ICU stays, where rows are the first ICU stay of each unique adult (age at admission > 18) and columns contain at least the following variables
- all variables in `icustays.csv.gz`
- all variables in `admissions.csv.gz`
- all variables in `patients.csv.gz`
- first lab measurements during ICU stay
- first vital measurements during ICU stay
- an indicator variable `thirty_day_mort` whether the patient died within 30 days of hospital admission (30 day mortality)
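
The age filter and the mortality indicator could be derived along these lines. This is a sketch under assumptions: it takes the death timestamp from a column named `deathtime` and computes age at admission from `anchor_age` and `anchor_year`; check the MIMIC-IV documentation for which columns are appropriate. The tibble `toy_cohort` is made up.

```{r}
#| eval: false
library(dplyr)
library(lubridate)

# Toy stand-in for the joined cohort (made-up values).
toy_cohort <- tibble(
  subject_id  = 1:4,
  anchor_age  = c(17, 40, 65, 80),
  anchor_year = c(2018, 2015, 2019, 2020),
  admittime   = ymd_hms(c("2018-05-01 08:00:00", "2017-03-01 08:00:00",
                          "2019-06-01 08:00:00", "2020-01-01 08:00:00")),
  deathtime   = ymd_hms(c(NA, NA, "2019-06-20 08:00:00",
                          "2020-03-01 08:00:00"))
)

toy_cohort <- toy_cohort %>%
  mutate(
    # Assumption: age at admission = anchor age plus years since anchor year.
    age_at_adm = anchor_age + year(admittime) - anchor_year,
    # TRUE only when a death time exists and falls within 30 days.
    thirty_day_mort = !is.na(deathtime) &
      deathtime <= admittime + days(30)
  ) %>%
  filter(age_at_adm > 18)
```

Checking `!is.na(deathtime)` first makes survivors evaluate to `FALSE` instead of `NA`.
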
## Q8. Exploratory data analysis (EDA)
Summarize the following information using appropriate numerics or graphs.
- `thirty_day_mort` vs demographic variables (ethnicity, language, insurance, marital_status, gender, age at hospital admission)
- `thirty_day_mort` vs first lab measurements
- `thirty_day_mort` vs first vital measurements
- `thirty_day_mort` vs first ICU unit
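
For comparisons of this kind, one common pattern is a grouped rate for categorical variables and a boxplot for numeric ones. The sketch below uses a made-up tibble `toy`; the `heart_rate` column name is hypothetical.

```{r}
#| eval: false
library(dplyr)
library(ggplot2)

# Toy stand-in for mimic_icu_cohort (made-up values).
toy <- tibble(
  thirty_day_mort = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  gender          = c("F", "F", "M", "M", "M", "F"),
  heart_rate      = c(110, 80, 75, 120, 90, 85)
)

# Numeric summary: mortality rate within each category.
mort_by_gender <- toy %>%
  group_by(gender) %>%
  summarise(n = n(), mort_rate = mean(thirty_day_mort))

# Categorical variable: stacked proportions of the outcome per category.
p_bar <- ggplot(toy, aes(x = gender, fill = thirty_day_mort)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")

# Numeric variable: distribution of the vital by outcome.
p_box <- ggplot(toy, aes(x = thirty_day_mort, y = heart_rate)) +
  geom_boxplot()
```

Printing `p_bar` or `p_box` renders the plot; the same two patterns can be reused for the other demographic, lab, and vital variables.
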