---
title: "Biostat 203B Homework 3"
subtitle: "Due Feb 23 @ 11:59PM"
author: "Your Name and UID"
format:
html:
theme: cosmo
embed-resources: true
number-sections: false
toc: true
toc-depth: 4
toc-location: left
code-fold: false
execute:
eval: false
---
Display machine information for reproducibility:
```{r}
sessionInfo()
```
Load necessary libraries (you can add more as needed).
```{r setup}
library(arrow)
library(gtsummary)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)
```
Display your machine memory.
```{r}
memuse::Sys.meminfo()
```
In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the [MIMIC-IV](https://mimic.mit.edu/docs/iv/) data introduced in [homework 1](https://ucla-biostat-203b.github.io/2024winter/hw/hw1/hw1.html) and to build a cohort of ICU stays.
## Q1. Visualizing patient trajectory
Visualizing a patient's encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient's ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.
### Q1.1 ADT history
A patient's ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with `subject_id` 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient's demographic information and the subtitle shows top 3 diagnoses.

Do a similar visualization for the patient with `subject_id` 10013310 using ggplot.
Hint: We need to pull information from data files `patients.csv.gz`, `admissions.csv.gz`, `transfers.csv.gz`, `labevents.csv.gz`, `procedures_icd.csv.gz`, `diagnoses_icd.csv.gz`, `d_icd_procedures.csv.gz`, and `d_icd_diagnoses.csv.gz`. For the big file `labevents.csv.gz`, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder `labevents_pq` available at the current working directory `hw3`, for example, by a symbolic link. Make your code reproducible.
### Q1.2 ICU stays
ICU stays are a subset of ADT history. This figure shows the vitals of the patient `10001217` during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Do a similar visualization for the patient `10013310`.
## Q2. ICU stays
`icustays.csv.gz` () contains data about Intensive Care Units (ICU) stays. The first 10 lines are
```{bash}
zcat < ~/mimic/icu/icustays.csv.gz | head
```
### Q2.1 Ingestion
Import `icustays.csv.gz` as a tibble `icustays_tble`.
### Q2.2 Summary and visualization
How many unique `subject_id`? Can a `subject_id` have multiple ICU stays? Summarize the number of ICU stays per `subject_id` by graphs.
## Q3. `admissions` data
Information of the patients admitted into hospital is available in `admissions.csv.gz`. See for details of each field in this file. The first 10 lines are
```{bash}
zcat < ~/mimic/hosp/admissions.csv.gz | head
```
### Q3.1 Ingestion
Import `admissions.csv.gz` as a tibble `admissions_tble`.
### Q3.2 Summary and visualization
Summarize the following information by graphics and explain any patterns you see.
- number of admissions per patient
- admission hour (anything unusual?)
- admission minute (anything unusual?)
- length of hospital stay (from admission to discharge) (anything unusual?)
According to the [MIMIC-IV documentation](https://mimic.mit.edu/docs/iv/about/concepts/#date-shifting),
> All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.
## Q4. `patients` data
Patient information is available in `patients.csv.gz`. See for details of each field in this file. The first 10 lines are
```{bash}
zcat < ~/mimic/hosp/patients.csv.gz | head
```
### Q4.1 Ingestion
Import `patients.csv.gz` () as a tibble `patients_tble`.
### Q4.2 Summary and visualization
Summarize variables `gender` and `anchor_age` by graphics, and explain any patterns you see.
## Q5. Lab results
`labevents.csv.gz` () contains all laboratory measurements for patients. The first 10 lines are
```{bash}
zcat < ~/mimic/hosp/labevents.csv.gz | head
```
`d_labitems.csv.gz` () is the dictionary of lab measurements.
```{bash}
zcat < ~/mimic/hosp/d_labitems.csv.gz | head
```
We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of `labevents.csv.gz` that only containing these items for the patients in `icustays_tble`. Further restrict to the last available measurement (by `storetime`) before the ICU stay. The final `labevents_tble` should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make `labevents_pq` folder available at the current working directory `hw3`, for example, by a symbolic link.
## Q6. Vitals from charted events
`chartevents.csv.gz` () contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The `itemid` variable indicates a single measurement type in the database. The `value` variable is the value measured for `itemid`. The first 10 lines of `chartevents.csv.gz` are
```{bash}
zcat < ~/mimic/icu/chartevents.csv.gz | head
```
`d_items.csv.gz` () is the dictionary for the `itemid` in `chartevents.csv.gz`.
```{bash}
zcat < ~/mimic/icu/d_items.csv.gz | head
```
We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of `chartevents.csv.gz` only containing these items for the patients in `icustays_tble`. Further restrict to the first vital measurement within the ICU stay. The final `chartevents_tble` should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make `chartevents_pq` folder available at the current working directory, for example, by a symbolic link.
## Q7. Putting things together
Let us create a tibble `mimic_icu_cohort` for all ICU stays, where rows are all ICU stays of adults (age at `intime` >= 18) and columns contain at least following variables
- all variables in `icustays_tble`
- all variables in `admissions_tble`
- all variables in `patients_tble`
- the last lab measurements before the ICU stay in `labevents_tble`
- the first vital measurements during the ICU stay in `chartevents_tble`
The final `mimic_icu_cohort` should have one row per ICU stay and columns for each variable.

## Q8. Exploratory data analysis (EDA)
Summarize the following information about the ICU stay cohort `mimic_icu_cohort` using appropriate numerics or graphs:
- Length of ICU stay `los` vs demographic variables (race, insurance, marital_status, gender, age at intime)
- Length of ICU stay `los` vs the last available lab measurements before ICU stay
- Length of ICU stay `los` vs the first vital measurements within the ICU stay
- Length of ICU stay `los` vs first ICU unit