--- title: "Biostat 203B Homework 4" subtitle: Due ~~Mar 20~~ Mar 22 @ 11:59PM output: # ioslides_presentation: default html_document: toc: true toc_depth: 4 --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Task In this assignment, you are to write a report analyzing the electronic health record (EHR) data MIMIC-III. You report will demostrate your knowledge of working with PostgreSQL database, data visualization, and commonly used analytical methods such as logistic regression and neural network. Your report should include at least following parts: 1. An informative title. For example, _30-Day Mortality Rate of Myocardia Infarction Patients Admitted to CCU_. 2. Introduction. Describe the MIMIC-III data set and what research hypothesis/goal you are to address using this data. 3. Data preparation. Create a study cohort from MIMIC-III corresponding to your research hypothesis/goal. See the examplary code below. Use a CONSORT flow diagram to summarize your steps to create the cohort. 4. Data visualization. Use visualization to summarize the cohort you created. 5. Analytics. Use at least two analytical approaches to address your research hypothesis/goal. For example, we can use (1) logistic regression and (2) neural network to build a predictive model for the 30-day mortality rate of patients admitted into CCU and compare their predictive performance. Summarize your results in graphs. 6. Conclusions. # Learning resources about analyzing EHR data - _Secondary Analysis of Electronic Health Records_: - _The Book of OHDSI_: . - The GitHub repository contains some code examples for working with the MIMIC-III data. Following sample code derives from . # Examplary code ## Connect to PostgresSQL database Load database libraries and the tidyverse frontend: ```{r} library(DBI) library(RPostgreSQL) library(tidyverse) library(lubridate) ``` Credentials for using PostgreSQL database. 
We are going to use username `postgres` with password `postgres` to access the `mimic` database in the schema `mimiciii`.

```{r}
# Load configuration settings
dbdriver <- 'PostgreSQL'
#host  <- '127.0.0.1'
#port  <- '5432'
user  <- 'postgres'
password <- 'postgres'
dbname <- 'mimic'
schema <- 'mimiciii'
# Connect to the database using the configuration settings
con <- dbConnect(RPostgreSQL::PostgreSQL(),
                 dbname = dbname,
                 #host = host,
                 #port = port,
                 user = user,
                 password = password)
# Set the default schema
dbExecute(con, paste("SET search_path TO ", schema, sep = " "))
con
```

List tables in the `mimic` database:

```{r}
dbListTables(con)
```

An example SQL query:

```{r}
sql_query <- "SELECT i.subject_id, i.hadm_id, i.los FROM icustays i;"
(data <- dbGetQuery(con, sql_query)) %>% as_tibble()
```

This document shows how R Markdown can be used to create a reproducible analysis using MIMIC-III (version 1.4). Let's calculate the median length of stay in the ICU and then include this value in our document.

```{r}
(median_los <- median(data$los, na.rm = TRUE))
(rounded_median_los <- round(median_los, digits = 2))
```

## Connect to a table in database

To connect to the `patients` table in the database:

```{r}
patients <- tbl(con, "patients")
patients %>% print(width = Inf)
```

We can see this connection is lazy (it does not load the whole table into computer memory):

```{r}
str(patients)
```

To load the entire table into memory we may use

```{r}
collect(patients)
```

But keep in mind that the point of using database software is that the data tables are potentially large, and we prefer to let the database do on-disk computations as much as possible. So in this assignment we will avoid loading whole tables into memory as much as we can.

## Query and subsetting

In this section, we demo how to create a cohort of patients who were directly admitted into CCU and were diagnosed with heart attack.

First we create a (query) table of patients who were directly admitted into CCU.
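As we build these query tables, it is worth confirming that the work stays on the database side; `dplyr::show_query()` prints the SQL that dbplyr generates for a lazy query. A small sketch assuming the open connection `con` from above (not evaluated here):

```{r, eval = FALSE}
# Inspect the SQL translation of a lazy query; nothing is loaded into R
tbl(con, "transfers") %>%
  filter(curr_careunit == "CCU") %>%
  show_query()
```

The printed SQL confirms the filter is executed by PostgreSQL rather than after pulling `transfers` into memory.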
```{r}
tbl(con, "transfers") %>%
  select(subject_id, hadm_id, prev_careunit, curr_careunit) %>%
  filter(is.na(prev_careunit) & curr_careunit == "CCU") %>%
  select(subject_id, hadm_id) %>%
  distinct() %>%
  print() -> ccu_admissions
```

Now we want to restrict to heart attack patients. To find all possible ICD-9 codes related to heart attack, we search for the string `myocardial infarction` in the `long_title` of table `d_icd_diagnoses`:

```{r}
tbl(con, "d_icd_diagnoses") %>%
  filter(str_detect(tolower(long_title), "myocardial infarction")) %>%
  print() -> mi_codes
```

The `diagnoses_icd` table stores the diagnoses of each admission. We use `semi_join()` to keep the rows in `diagnoses_icd` that match the ICD-9 codes related to heart attack:

```{r}
tbl(con, "diagnoses_icd") %>%
  semi_join(mi_codes, by = "icd9_code") %>%
  print() -> mi_admissions
```

MI may not be listed as the principal diagnosis; as explained in [the documentation for the `diagnoses_icd` table](https://mimic.physionet.org/mimictables/diagnoses_icd/), the `seq_num` field is a priority ranking for the diagnoses generated at the end of stay. In order to focus on patients for whom MI was central to their hospitalization, we will include records with MI in any of the first five diagnosis positions, according to the `seq_num` field. To avoid duplicate admissions, we use `group_by()` and `top_n()` to limit the query to the first MI diagnosis for each admission.

```{r}
mi_admissions %>%
  filter(seq_num <= 5) %>%
  group_by(subject_id, hadm_id) %>%
  # top_n(1, wt = seq_num) %>% # not working. bug? use following as workaround
  filter(min_rank(seq_num) <= 1) %>%
  ungroup() %>%
  select(subject_id, hadm_id, icd9_code, seq_num) %>%
  print() -> mi_admissions
```

Now we `inner_join` the table of admissions to CCU and the table of admissions that include an MI diagnosis.
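An `inner_join` keeps only rows whose (`subject_id`, `hadm_id`) pair appears in both tables. A toy illustration with made-up IDs and codes (pure R, using the dplyr already loaded; not MIMIC data):

```{r}
# toy tables with made-up IDs; only the keys present in both survive the join
a <- tibble(subject_id = c(1, 2, 3), hadm_id = c(10, 20, 30))
b <- tibble(subject_id = c(2, 3, 4), hadm_id = c(20, 30, 40),
            icd9_code = c("410", "411", "412"))
inner_join(a, b, by = c("subject_id", "hadm_id"))
# keeps the two rows shared by both tables (subject_id 2 and 3)
```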
```{r}
ccu_admissions %>%
  inner_join(mi_admissions, by = c("subject_id", "hadm_id")) %>%
  print() -> study_admissions
```

## Transform and augment query tables

Now we create a logical variable indicating whether MI is the principal diagnosis or not (according to `seq_num`).

```{r}
study_admissions %>%
  mutate(principal_dx = seq_num == 1) %>%
  select(-seq_num) %>%
  print() -> study_admissions
```

We want to add information about the severity of patients’ ailments. The `drgcodes` table contains, for DRG codes from the All Payers Registry (APR), severity and mortality indicators. We pull the DRG severity information and right-join it to our query table.

```{r}
tbl(con, "drgcodes") %>%
  filter(str_detect(drg_type, "APR")) %>%
  select(subject_id, hadm_id, drg_severity) %>%
  right_join(study_admissions, by = c("subject_id", "hadm_id")) %>%
  mutate(drg_severity = ifelse(is.na(drg_severity), 1, drg_severity)) %>%
  print() -> study_admissions
```

Pull the admission time `admittime`, discharge time `dischtime`, date of birth `dob`, and date of death `dod`. We are interested in the mortality rate 30 days after discharge, so we only keep patients who did not die in hospital.

```{r}
study_admissions %>%
  left_join(
    select(tbl(con, "admissions"),
           subject_id, hadm_id, admittime, dischtime, hospital_expire_flag
    ),
    by = c("subject_id", "hadm_id")
  ) %>%
  filter(hospital_expire_flag == 0) %>% # patients who did not die in hospital
  select(-hospital_expire_flag) %>%
  left_join(
    select(tbl(con, "patients"), subject_id, dob, dod),
    by = "subject_id"
  ) %>%
  print(width = Inf) -> study_admissions
```

Next we add an `age` (at admission) variable to the table. [The documentation for the `patients` table](https://mimic.physionet.org/mimictables/patients/) explains that patients of 90 years and older had their ages artificially inflated, so we remove these patients from the analysis.
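The birthday adjustment in the next chunk (subtract one year when the admission date falls before that year's birthday) can be checked in plain R with the lubridate functions already loaded. The dates below are hypothetical:

```{r}
# hypothetical dates: admission falls before the birthday in the admission year
dob       <- as.Date("2100-06-15")
admittime <- as.Date("2180-03-01")
age_raw <- year(admittime) - year(dob)              # 80 by calendar-year difference
adjust  <- (month(admittime) < month(dob)) |
  (month(admittime) == month(dob) & day(admittime) < day(dob))
age_raw - as.integer(adjust)                        # 79: birthday not yet reached
```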
```{r}
study_admissions %>%
  mutate(tt_death = date_part("day", dod) - date_part("day", dischtime)) %>%
  mutate(mortality = tt_death <= 30) %>%
  mutate(age = date_part("year", admittime) - date_part("year", dob)) %>%
  filter(age < 90) %>%
  mutate(age = age - ifelse(
    date_part("month", admittime) < date_part("month", dob) |
      (
        date_part("month", admittime) == date_part("month", dob) &
          date_part("day", admittime) < date_part("day", dob)
      ),
    1, 0
  )) %>%
  select(-admittime, -dischtime, -dob, -dod, -tt_death) %>%
  select(subject_id, hadm_id, age, mortality, everything()) %>%
  print() -> study_admissions
```

Many mortality indicators are missing, because neither the hospital database nor the social security database has a record of these patients’ deaths. We could convert these to `FALSE` values, but it may be helpful to retain in the analytic table this information on whether deaths were recorded at all, e.g. for validation or sensitivity testing.

Finally, let's merge some demographic information (ethnicity, gender) into our `study_admissions` table.
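As an aside on the missing mortality indicators above, one way to retain that information is an explicit recorded-death flag added before any recoding. A sketch on toy data (the `mortality` column mirrors the query table; IDs are made up):

```{r}
# toy mortality column with missing entries (NA = death never recorded)
toy <- tibble(subject_id = 1:4, mortality = c(TRUE, NA, FALSE, NA))
toy %>% mutate(death_recorded = !is.na(mortality))
```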
```{r}
tbl(con, "admissions") %>%
  select(subject_id, ethnicity) %>%
  distinct() %>%
  print() -> study_subjects
```

```{r}
tbl(con, "patients") %>%
  select(subject_id, gender) %>%
  distinct() %>%
  full_join(study_subjects, by = "subject_id") %>%
  print() -> study_subjects
```

```{r}
study_subjects %>%
  semi_join(study_admissions, by = "subject_id") %>%
  print() -> study_subjects
```

Let's resolve some diversity and inconsistency in the `ethnicity` field:

```{r}
unknown_ethnicity <- c(
  "OTHER",
  "UNABLE TO OBTAIN",
  "UNKNOWN/NOT SPECIFIED",
  "MULTI RACE ETHNICITY",
  "PATIENT DECLINED TO ANSWER",
  "UNKNOWN"
)

study_subjects %>%
  collect() %>%
  mutate(ethnic_group = case_when(
    str_detect(ethnicity, "^ASIAN") ~ "ASIAN",
    str_detect(ethnicity, "^BLACK") ~ "BLACK",
    str_detect(ethnicity, "^HISPANIC") ~ "HISPANIC",
    str_detect(ethnicity, "^WHITE") ~ "WHITE",
    ethnicity %in% unknown_ethnicity ~ NA_character_,
    TRUE ~ NA_character_
  )) %>%
  select(subject_id, gender, ethnic_group) %>%
  print() -> study_subjects
```

Some patients are coded as belonging to more than one ethnic group. To resolve these inconsistencies, we define a helper function to pick the modal value from a vector of values in R, which can be used by the `summarize()` function to choose one ethnic group for each patient.
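The building blocks of such a helper are base R's `table()` and `which.max()`; on a toy vector:

```{r}
x <- c("WHITE", "BLACK", "WHITE", NA)
(y <- table(x, useNA = "no"))  # counts per value, NAs dropped
names(y)[which.max(y)]         # "WHITE", the modal value
```

The helper defined below adds two refinements on top of this: it returns `NA` when all values are missing, and also when the maximum is tied (since `which.max()` silently picks the first maximizer).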
```{r}
most <- function(x) {
  if (all(is.na(x))) return(NA_character_)
  y <- table(x, useNA = "no")
  if (length(which(y == max(y))) > 1) return(NA_character_)
  return(names(y)[which.max(y)])
}

study_subjects %>%
  group_by(subject_id) %>%
  summarize(ethnic_group = most(ethnic_group)) %>%
  ungroup() %>%
  mutate(ethnic_group = ifelse(is.na(ethnic_group), "UNKNOWN", ethnic_group)) %>%
  print() -> subject_ethnic_groups
```

```{r}
study_subjects %>%
  select(subject_id, gender) %>%
  left_join(subject_ethnic_groups, by = "subject_id") %>%
  print() -> study_subjects
```

Now we add the demographic information `gender` and `ethnic_group` into our `study_admissions` table:

```{r}
study_admissions %>%
  left_join(study_subjects, by = "subject_id", copy = TRUE) %>%
  print() -> study_admissions
```

## Close the connection to a database

Close the connection:

```{r}
dbDisconnect(con)
```

## CONSORT Flow Diagrams

CONSORT flow diagrams can be used to plot the flow of data selection of a patient cohort. For more details, see: [The CONSORT Flow Diagram](http://www.consort-statement.org/consort-statement/flow-diagram). The following code shows an example.

```{r plot}
library(shape)
library(diagram)

# set margins and multiplot
par(mfrow = c(1, 1))
par(mar = c(0, 0, 0, 0))

# initialise a plot device
openplotmat()

# position of boxes
# 1st column indicates x axis position between 0 and 1
# 2nd column indicates y axis position between 0 and 1
# automatically assigns vertical position
num_of_boxes <- 6
auto_coords = coordinates(num_of_boxes)
vert_pos = rev(auto_coords[,1])
box_pos <- matrix(nrow = num_of_boxes, ncol = 2, data = 0)
box_pos[1,] = c(0.20, vert_pos[1]) # 1st box
box_pos[2,] = c(0.70, vert_pos[2]) # 2nd box
box_pos[3,] = c(0.70, vert_pos[3]) # 3rd box
box_pos[4,] = c(0.70, vert_pos[4]) # etc...
box_pos[5,] = c(0.70, vert_pos[5])
box_pos[6,] = c(0.20, vert_pos[6])

# content of boxes
box_content <- matrix(nrow = num_of_boxes, ncol = 1, data = 0)
box_content[1] = "All patients in MIMIC-III \n n = 58,976" # 1st box
box_content[2] = "Exclude patients of age < 18 \n n = 8,180" # 2nd box
box_content[3] = "Exclude patients with no ICU admissions \n n = 1,071" # 3rd box
box_content[4] = "Exclude patients with diabetes \n n = 1,324" # etc...
box_content[5] = "Exclude patients with sepsis \n n = 4,804"
box_content[6] = "Study cohort \n n = 43,597"

# adjust the size of boxes to fit content
box_x <- c(0.20, 0.25, 0.25, 0.25, 0.25, 0.20)
box_y <- c(0.07, 0.07, 0.07, 0.07, 0.07, 0.07)

# Draw the arrows
straightarrow(from = c(box_pos[1,1], box_pos[2,2]), to = box_pos[2,], lwd = 1)
straightarrow(from = c(box_pos[1,1], box_pos[3,2]), to = box_pos[3,], lwd = 1)
straightarrow(from = c(box_pos[1,1], box_pos[4,2]), to = box_pos[4,], lwd = 1)
straightarrow(from = c(box_pos[1,1], box_pos[5,2]), to = box_pos[5,], lwd = 1)
straightarrow(from = box_pos[1,], to = box_pos[6,], lwd = 1)

# Draw the boxes
for (i in 1:num_of_boxes) {
  textrect(mid = box_pos[i,], radx = box_x[i], rady = box_y[i],
           lab = box_content[i], shadow.col = "grey")
}
```

## Install PostgreSQL and the `RPostgreSQL` package on CentOS

This is a note to myself (Dr. Hua Zhou). The postgres in yum is an old version. We want to install the most recent postgres and then build the `RPostgreSQL` package against it.

1. Follow instructions in to install PostgreSQL 11 on CentOS 7.

2. Issue the following command to install `RPostgreSQL` in R:

```{bash, eval = F}
sudo R -e 'Sys.setenv(PG_INCDIR="/usr/pgsql-11/include/"); Sys.setenv(PG_LIBDIR="/usr/pgsql-11/lib/"); install.packages("RPostgreSQL")'
```
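After installation, a quick sanity check from within R confirms that the package loads and the compiled driver links against the PostgreSQL 11 client libraries (a hedged sketch; it errors at load time if linking failed):

```{r, eval = FALSE}
library(RPostgreSQL)            # fails here if the client libraries were not found
packageVersion("RPostgreSQL")   # prints the installed package version
drv <- dbDriver("PostgreSQL")   # instantiating the driver exercises the compiled code
```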