--- title: "Biostat 203B Homework 4" subtitle: Due ~~Mar 20~~ Mar 22 @ 11:59PM output: # ioslides_presentation: default html_document: toc: true toc_depth: 4 --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Task In this assignment, you are to write a report analyzing the electronic health record (EHR) data MIMIC-III. You report will demostrate your knowledge of working with PostgreSQL database, data visualization, and commonly used analytical methods such as logistic regression and neural network. Your report should include at least following parts: 1. An informative title. For example, _30-Day Mortality Rate of Myocardia Infarction Patients Admitted to CCU_. 2. Introduction. Describe the MIMIC-III data set and what research hypothesis/goal you are to address using this data. 3. Data preparation. Create a study cohort from MIMIC-III corresponding to your research hypothesis/goal. See the examplary code below. Use a CONSORT flow diagram to summarize your steps to create the cohort. 4. Data visualization. Use visualization to summarize the cohort you created. 5. Analytics. Use at least two analytical approaches to address your research hypothesis/goal. For example, we can use (1) logistic regression and (2) neural network to build a predictive model for the 30-day mortality rate of patients admitted into CCU and compare their predictive performance. Summarize your results in graphs. 6. Conclusions. # Learning resources about analyzing EHR data - _Secondary Analysis of Electronic Health Records_: - _The Book of OHDSI_: . - The GitHub repository contains some code examples for working with the MIMIC-III data. Following sample code derives from . # Examplary code ## Connect to PostgresSQL database Load database libraries and the tidyverse frontend: ```{r} library(DBI) library(RPostgreSQL) library(tidyverse) library(lubridate) ``` Credentials for using PostgreSQL database. 
We are going to use username `postgres` with password `postgres` to access the `mimic` database in the schema `mimiciii`.

```{r}
# Load configuration settings
dbdriver <- 'PostgreSQL'
#host  <- '127.0.0.1'
#port  <- '5432'
user  <- 'postgres'
password <- 'postgres'
dbname <- 'mimic'
schema <- 'mimiciii'
# Connect to the database using the configuration settings
con <- dbConnect(RPostgreSQL::PostgreSQL(),
                 dbname = dbname,
                 #host = host,
                 #port = port,
                 user = user,
                 password = password)
# Set the default schema
dbExecute(con, paste("SET search_path TO ", schema, sep = " "))
con
```

List tables in the `mimic` database:

```{r}
dbListTables(con)
```

An example SQL query:

```{r}
sql_query <- "SELECT i.subject_id, i.hadm_id, i.los FROM icustays i;"
(data <- dbGetQuery(con, sql_query)) %>% as_tibble()
```

This document shows how R Markdown can be used to create a reproducible analysis using MIMIC-III (version 1.4). Let's calculate the median length of stay in the ICU and then include this value in our document.

```{r}
(median_los <- median(data$los, na.rm = TRUE))
(rounded_median_los <- round(median_los, digits = 2))
```

## Connect to a table in database

To connect to the `patients` table in the database:

```{r}
patients <- tbl(con, "patients")
patients %>% print(width = Inf)
```

We can see this connection is lazy (it does not load the whole table into computer memory):

```{r}
str(patients)
```

To load the entire table into memory we may use

```{r}
collect(patients)
```

But keep in mind that the point of using database software is that the data tables are potentially large, and we prefer to let the database do on-disk computations as much as possible. So in this assignment we will avoid loading whole tables into memory as much as we can.

## Query and subsetting

In this section, we demo how to create a cohort of patients who were directly admitted into CCU and were diagnosed with heart attack.

First we create a (query) table of patients who were directly admitted into CCU.
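As we build these query tables, it is worth confirming that the work stays on the database side; `dplyr::show_query()` prints the SQL that dbplyr generates for a lazy query. A small sketch assuming the open connection `con` from above (not evaluated here):

```{r, eval = FALSE}
# Inspect the SQL translation of a lazy query; nothing is loaded into R
tbl(con, "transfers") %>%
  filter(curr_careunit == "CCU") %>%
  show_query()
```

The printed SQL confirms the filter is executed by PostgreSQL rather than after pulling `transfers` into memory.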
```{r}
tbl(con, "transfers") %>%
  select(subject_id, hadm_id, prev_careunit, curr_careunit) %>%
  filter(is.na(prev_careunit) & curr_careunit == "CCU") %>%
  select(subject_id, hadm_id) %>%
  distinct() %>%
  print() -> ccu_admissions
```

Now we want to restrict to heart attack patients. To find all possible ICD-9 codes related to heart attack, we search for the string `myocardial infarction` in the `long_title` of table `d_icd_diagnoses`:

```{r}
tbl(con, "d_icd_diagnoses") %>%
  filter(str_detect(tolower(long_title), "myocardial infarction")) %>%
  print() -> mi_codes
```

The `diagnoses_icd` table stores the diagnoses of each admission. We use `semi_join()` to keep the rows in `diagnoses_icd` that match the ICD-9 codes related to heart attack:

```{r}
tbl(con, "diagnoses_icd") %>%
  semi_join(mi_codes, by = "icd9_code") %>%
  print() -> mi_admissions
```

MI may not be listed as the principal diagnosis; as explained in [the documentation for the `diagnoses_icd` table](https://mimic.physionet.org/mimictables/diagnoses_icd/), the `seq_num` field is a priority ranking for the diagnoses generated at the end of stay. In order to focus on patients for whom MI was central to their hospitalization, we will include records with MI in any of the first five diagnosis positions, according to the `seq_num` field. To avoid duplicate admissions, we use `group_by()` and `top_n()` to limit the query to the first MI diagnosis for each admission.

```{r}
mi_admissions %>%
  filter(seq_num <= 5) %>%
  group_by(subject_id, hadm_id) %>%
  # top_n(1, wt = seq_num) %>% # not working. bug? use following as workaround
  filter(min_rank(seq_num) <= 1) %>%
  ungroup() %>%
  select(subject_id, hadm_id, icd9_code, seq_num) %>%
  print() -> mi_admissions
```

Now we `inner_join` the table of admissions to CCU and the table of admissions that include an MI diagnosis.
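An `inner_join` keeps only rows whose (`subject_id`, `hadm_id`) pair appears in both tables. A toy illustration with made-up IDs and codes (pure R, using the dplyr already loaded; not MIMIC data):

```{r}
# toy tables with made-up IDs; only the keys present in both survive the join
a <- tibble(subject_id = c(1, 2, 3), hadm_id = c(10, 20, 30))
b <- tibble(subject_id = c(2, 3, 4), hadm_id = c(20, 30, 40),
            icd9_code = c("410", "411", "412"))
inner_join(a, b, by = c("subject_id", "hadm_id"))
# keeps the two rows shared by both tables (subject_id 2 and 3)
```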
```{r}
ccu_admissions %>%
  inner_join(mi_admissions, by = c("subject_id", "hadm_id")) %>%
  print() -> study_admissions
```

## Transform and augment query tables

Now we create a logical variable indicating whether MI is the principal diagnosis or not (according to `seq_num`).

```{r}
study_admissions %>%
  mutate(principal_dx = seq_num == 1) %>%
  select(-seq_num) %>%
  print() -> study_admissions
```

We want to add information about the severity of patients’ ailments. The `drgcodes` table contains, for DRG codes from the All Payers Registry (APR), severity and mortality indicators. We pull the DRG severity information and right-join it to our query table.

```{r}
tbl(con, "drgcodes") %>%
  filter(str_detect(drg_type, "APR")) %>%
  select(subject_id, hadm_id, drg_severity) %>%
  right_join(study_admissions, by = c("subject_id", "hadm_id")) %>%
  mutate(drg_severity = ifelse(is.na(drg_severity), 1, drg_severity)) %>%
  print() -> study_admissions
```

Pull the admission time `admittime`, discharge time `dischtime`, date of birth `dob`, and date of death `dod`. We are interested in the mortality rate 30 days after discharge, so we only keep patients who did not die in hospital.

```{r}
study_admissions %>%
  left_join(
    select(tbl(con, "admissions"),
           subject_id, hadm_id, admittime, dischtime, hospital_expire_flag
    ),
    by = c("subject_id", "hadm_id")
  ) %>%
  filter(hospital_expire_flag == 0) %>% # patients who did not die in hospital
  select(-hospital_expire_flag) %>%
  left_join(
    select(tbl(con, "patients"), subject_id, dob, dod),
    by = "subject_id"
  ) %>%
  print(width = Inf) -> study_admissions
```

Next we add an `age` (at admission) variable to the table. [The documentation for the `patients` table](https://mimic.physionet.org/mimictables/patients/) explains that patients of 90 years and older had their ages artificially inflated, so we remove these patients from the analysis.
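The birthday adjustment in the next chunk (subtract one year when the admission date falls before that year's birthday) can be checked in plain R with the lubridate functions already loaded. The dates below are hypothetical:

```{r}
# hypothetical dates: admission falls before the birthday in the admission year
dob       <- as.Date("2100-06-15")
admittime <- as.Date("2180-03-01")
age_raw <- year(admittime) - year(dob)              # 80 by calendar-year difference
adjust  <- (month(admittime) < month(dob)) |
  (month(admittime) == month(dob) & day(admittime) < day(dob))
age_raw - as.integer(adjust)                        # 79: birthday not yet reached
```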
```{r}
study_admissions %>%
  mutate(tt_death = date_part("day", dod) - date_part("day", dischtime)) %>%
  mutate(mortality = tt_death <= 30) %>%
  mutate(age = date_part("year", admittime) - date_part("year", dob)) %>%
  filter(age < 90) %>%
  mutate(age = age - ifelse(
    date_part("month", admittime) < date_part("month", dob) |
      (
        date_part("month", admittime) == date_part("month", dob) &
          date_part("day", admittime) < date_part("day", dob)
      ),
    1, 0
  )) %>%
  select(-admittime, -dischtime, -dob, -dod, -tt_death) %>%
  select(subject_id, hadm_id, age, mortality, everything()) %>%
  print() -> study_admissions
```

Many mortality indicators are missing, because neither the hospital database nor the social security database has a record of these patients’ deaths. We could convert these to `FALSE` values, but it may be helpful to retain in the analytic table this information on whether deaths were recorded at all, e.g. for validation or sensitivity testing.

Finally, let's merge some demographic information (ethnicity, gender) into our `study_admissions` table.
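As an aside on the missing mortality indicators above, one way to retain that information is an explicit recorded-death flag added before any recoding. A sketch on toy data (the `mortality` column mirrors the query table; IDs are made up):

```{r}
# toy mortality column with missing entries (NA = death never recorded)
toy <- tibble(subject_id = 1:4, mortality = c(TRUE, NA, FALSE, NA))
toy %>% mutate(death_recorded = !is.na(mortality))
```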
```{r}
tbl(con, "admissions") %>%
  select(subject_id, ethnicity) %>%
  distinct() %>%
  print() -> study_subjects
```

```{r}
tbl(con, "patients") %>%
  select(subject_id, gender) %>%
  distinct() %>%
  full_join(study_subjects, by = "subject_id") %>%
  print() -> study_subjects
```

```{r}
study_subjects %>%
  semi_join(study_admissions, by = "subject_id") %>%
  print() -> study_subjects
```

Let's resolve some diversity and inconsistency in the `ethnicity` field:

```{r}
unknown_ethnicity <- c(
  "OTHER",
  "UNABLE TO OBTAIN",
  "UNKNOWN/NOT SPECIFIED",
  "MULTI RACE ETHNICITY",
  "PATIENT DECLINED TO ANSWER",
  "UNKNOWN"
)

study_subjects %>%
  collect() %>%
  mutate(ethnic_group = case_when(
    str_detect(ethnicity, "^ASIAN") ~ "ASIAN",
    str_detect(ethnicity, "^BLACK") ~ "BLACK",
    str_detect(ethnicity, "^HISPANIC") ~ "HISPANIC",
    str_detect(ethnicity, "^WHITE") ~ "WHITE",
    ethnicity %in% unknown_ethnicity ~ NA_character_,
    TRUE ~ NA_character_
  )) %>%
  select(subject_id, gender, ethnic_group) %>%
  print() -> study_subjects
```

Some patients are coded as belonging to more than one ethnic group. To resolve these inconsistencies, we define a helper function to pick the modal value from a vector of values in R, which can be used by the `summarize()` function to choose one ethnic group for each patient.
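The building blocks of such a helper are base R's `table()` and `which.max()`; on a toy vector:

```{r}
x <- c("WHITE", "BLACK", "WHITE", NA)
(y <- table(x, useNA = "no"))  # counts per value, NAs dropped
names(y)[which.max(y)]         # "WHITE", the modal value
```

The helper defined below adds two refinements on top of this: it returns `NA` when all values are missing, and also when the maximum is tied (since `which.max()` silently picks the first maximizer).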
```{r}
most <- function(x) {
  if (all(is.na(x))) return(NA_character_)
  y <- table(x, useNA = "no")
  if (length(which(y == max(y))) > 1) return(NA_character_)
  return(names(y)[which.max(y)])
}

study_subjects %>%
  group_by(subject_id) %>%
  summarize(ethnic_group = most(ethnic_group)) %>%
  ungroup() %>%
  mutate(ethnic_group = ifelse(is.na(ethnic_group), "UNKNOWN", ethnic_group)) %>%
  print() -> subject_ethnic_groups
```

```{r}
study_subjects %>%
  select(subject_id, gender) %>%
  left_join(subject_ethnic_groups, by = "subject_id") %>%
  print() -> study_subjects
```

Now we add the demographic information `gender` and `ethnic_group` into our `study_admissions` table:

```{r}
study_admissions %>%
  left_join(study_subjects, by = "subject_id", copy = TRUE) %>%
  print() -> study_admissions
```

## Close the connection to a database

Close the connection:

```{r}
dbDisconnect(con)
```

## CONSORT Flow Diagrams

CONSORT flow diagrams can be used to plot the flow of data selection of a patient cohort. For more details, see: [The CONSORT Flow Diagram](http://www.consort-statement.org/consort-statement/flow-diagram). The following code shows an example.

```{r plot}
library(shape)
library(diagram)

# set margins and multiplot
par(mfrow = c(1, 1))
par(mar = c(0, 0, 0, 0))

# initialise a plot device
openplotmat()

# position of boxes
# 1st column indicates x axis position between 0 and 1
# 2nd column indicates y axis position between 0 and 1
# automatically assigns vertical position
num_of_boxes <- 6
auto_coords = coordinates(num_of_boxes)
vert_pos = rev(auto_coords[,1])
box_pos <- matrix(nrow = num_of_boxes, ncol = 2, data = 0)
box_pos[1,] = c(0.20, vert_pos[1]) # 1st box
box_pos[2,] = c(0.70, vert_pos[2]) # 2nd box
box_pos[3,] = c(0.70, vert_pos[3]) # 3rd box
box_pos[4,] = c(0.70, vert_pos[4]) # etc...
box_pos[5,] = c(0.70, vert_pos[5])
box_pos[6,] = c(0.20, vert_pos[6])

# content of boxes
box_content <- matrix(nrow = num_of_boxes, ncol = 1, data = 0)
box_content[1] = "All patients in MIMIC-III \n n = 58,976" # 1st box
box_content[2] = "Exclude patients of age < 18 \n n = 8,180" # 2nd box
box_content[3] = "Exclude patients with no ICU admissions \n n = 1,071" # 3rd box
box_content[4] = "Exclude patients with diabetes \n n = 1,324" # etc...
box_content[5] = "Exclude patients with sepsis \n n = 4,804"
box_content[6] = "Study cohort \n n = 43,597"

# adjust the size of boxes to fit content
box_x <- c(0.20, 0.25, 0.25, 0.25, 0.25, 0.20)
box_y <- c(0.07, 0.07, 0.07, 0.07, 0.07, 0.07)

# Draw the arrows
straightarrow(from = c(box_pos[1,1], box_pos[2,2]), to = box_pos[2,], lwd = 1)
straightarrow(from = c(box_pos[1,1], box_pos[3,2]), to = box_pos[3,], lwd = 1)
straightarrow(from = c(box_pos[1,1], box_pos[4,2]), to = box_pos[4,], lwd = 1)
straightarrow(from = c(box_pos[1,1], box_pos[5,2]), to = box_pos[5,], lwd = 1)
straightarrow(from = box_pos[1,], to = box_pos[6,], lwd = 1)

# Draw the boxes
for (i in 1:num_of_boxes) {
  textrect(mid = box_pos[i,], radx = box_x[i], rady = box_y[i],
           lab = box_content[i], shadow.col = "grey")
}
```

## Install PostgreSQL and the `RPostgreSQL` package on CentOS

This is a note to myself (Dr. Hua Zhou). The postgres in yum is an old version. We want to install the most recent postgres and then build the `RPostgreSQL` package against it.

1. Follow instructions in to install PostgreSQL 11 on CentOS 7.

2. Issue the following command to install `RPostgreSQL` in R:

```{bash, eval = F}
sudo R -e 'Sys.setenv(PG_INCDIR="/usr/pgsql-11/include/"); Sys.setenv(PG_LIBDIR="/usr/pgsql-11/lib/"); install.packages("RPostgreSQL")'
```
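After installation, a quick sanity check from within R confirms that the package loads and the compiled driver links against the PostgreSQL 11 client libraries (a hedged sketch; it errors at load time if linking failed):

```{r, eval = FALSE}
library(RPostgreSQL)            # fails here if the client libraries were not found
packageVersion("RPostgreSQL")   # prints the installed package version
drv <- dbDriver("PostgreSQL")   # instantiating the driver exercises the compiled code
```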