--- title: "Working with Databases - Part II" author: "Dr. Hua Zhou @ UCLA" date: "Feb 20, 2020" subtitle: Biostat 203B output: html_document: toc: true #toc_depth: 2 #ioslides_presentation: # smaller: true --- ```{r setup, include=FALSE} knitr::opts_chunk$set(fig.width = 5, fig.height = 3.5, fig.align = 'center', cache = FALSE, cache.lazy = FALSE) ``` ## Introduction In this lecture we will learn: - Import data from bigish csv files (MIMIC-III). - Tidy data (TODO). - Deposit data into an SQLite database. - Query SQLite database. - Transform in database and plot in R. ## Machine information ```{r} sessionInfo() ``` ```{r} library(tidyverse) ``` ## CSV file The `/home/203bdata/mimic-iii` folder on the teaching server contains the MIMIC-III data. On my Mac, it's at `/Users/huazhou/Google Drive/mimic/mimic-iii`. ```{bash, eval = (Sys.info()[["sysname"]] == "Linux")} # On teaching server ls -l /home/203bdata/mimic-iii ``` ```{bash, eval = (Sys.info()[["sysname"]] == "Darwin")} # On my MacBook ls -l /Users/huazhou/Google\ Drive/mimic/mimic-iii ``` ```{r} datafolder <- ifelse(Sys.info()[["sysname"]] == "Linux", "/home/203bdata/mimic-iii", "/Users/huazhou/Google\ Drive/mimic/mimic-iii") list.files(datafolder, full.names = TRUE, patter = "*.csv") ``` ## Read csv file: `read.csv()` vs `read_csv()` (tidyverse) vs data.table First let's compare different approaches for reading CSV files. - `read.csv()` function in base R takes about 1 minute to read in the 500MB csv file `DATETIMEEVENTS.csv`: ```{r} csvfile <- str_c(datafolder, "/DATETIMEEVENTS.csv") system.time(read.csv(csvfile)) ``` - `data.table` is an R package for reading large data sets. It's much faster to read the same file (about 8 seconds). The result is of type `data.table` and `data.frame`. ```{r} library(data.table) system.time({dtevents <- fread(csvfile)}) class(dtevents) dtevents ``` - `read_csv()` function in tidyverse takes about 6-7 seconds: ```{r} system.time({dtevents <- read_csv(csvfile)}) class(dtevents) dtevents %>% print(width = Inf) ``` ## Read all CSVs and deposit to a SQLite database Create a SQLite database file called `mimiciii.sqlite`. ```{r} library("DBI") library("RSQLite") con = dbConnect(RSQLite::SQLite(), "mimiciii.sqlite") str(con) ``` Because of excellent efficiency of `data.table` packages, we use it to read all CSVs and deposit into `mimiciii.sqlite`. **Do not run this chunk by yourself on teaching server.** This takes a couple minutes. ```{r, eval = (!file.exists("mimiciii.sqlite") | file.size("mimiciii.sqlite") == 0)} datafolder csv_names <- list.files(datafolder) csv_names <- csv_names[str_detect(csv_names, "csv")] csv_names system.time({ for (table_name in csv_names) { csv_name <- str_c(datafolder, "/", table_name) tmp <- fread(csv_name) table_name <- str_remove(table_name, ".csv") dbWriteTable(con, table_name, tmp, overwrite = TRUE) }}) ``` ```{r} # list tables in database dbListTables(con) # disconnect from database dbDisconnect(con) ``` The resultant database file `mimiciii.sqlite` has size 10.4GB. 
```{bash}
ls -l mimiciii.sqlite
```

## Read data from database

Connect to the database `mimiciii.sqlite` and list the tables:

```{r}
con <- dbConnect(RSQLite::SQLite(), dbname = "./mimiciii.sqlite")
dbListTables(con)
```

Read the table `CHARTEVENTS`:

```{r}
chartevents <- tbl(con, "CHARTEVENTS")
class(chartevents)
chartevents %>% print(width = Inf)
```

## Use dplyr with SQLite

Retrieve chart events with `SUBJECT_ID` between 23 and 165:

```{r}
chartevents_subset <- chartevents %>%
  select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
  filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165)
chartevents_subset
```

## SQL query

```{r}
class(chartevents_subset)
show_query(chartevents_subset)
```

## SQL grouping

```{r}
chartevents %>%
  group_by(SUBJECT_ID) %>%
  summarise(n = n())
```

## SQL query

```{r}
chartevents %>%
  group_by(SUBJECT_ID) %>%
  summarise(n = n()) %>%
  show_query()
```

## SQL translation

The dbplyr package (the dplyr backend for databases) has a function, `translate_sql()`, that lets you experiment with how R functions are translated to SQL:

```{r}
library("dbplyr")
translate_sql(x == 1 & (y < 2 | z > 3))
translate_sql(x ^ 2 < 10)
translate_sql(x %% 2 == 10)
translate_sql(paste(x, y))
translate_sql(mean(x))
translate_sql(mean(x, na.rm = TRUE))
```

## Timings

Let's compare the timings of dplyr (in-memory tibble) and dbplyr (on-disk database).

- dplyr using a tibble:

```{r}
csvfile <- str_c(datafolder, "/DATETIMEEVENTS.csv")
dtevents_tibble <- read_csv(csvfile)
system.time(
  dtevents_tibble %>%
    select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
    # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
    group_by(SUBJECT_ID) %>%
    summarize(n = n())
)
```

- dplyr using SQLite:

```{r}
dtevents_sql <- tbl(con, "DATETIMEEVENTS")
system.time(
  dtevents_sql %>%
    select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
    # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
    group_by(SUBJECT_ID) %>%
    summarize(n = n())
)
```

SQLite (0.08 seconds) was much faster than the tibble (0.25 seconds). But SQLite is disk-based, while the tibble is in memory. Why the discrepancy?

## Laziness

dplyr/dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.

- When building a query, we don't want the entire table; often we want just enough to check whether the query is working.
- Since we would rather run one complex query than many simple queries, laziness allows verbs to be strung together.
- Therefore, by default dplyr
    - won't connect to and query the database until absolutely necessary (e.g., to show output),
    - and, unless explicitly told otherwise, will only query a handful of rows to give a sense of what the result will look like.

```{r}
dtevents_sql %>%
  select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
  # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
  group_by(SUBJECT_ID) %>%
  summarize(n = n())
```

## Full query

To force a full query and return the complete result, use the `collect()` function.

```{r}
system.time(
  dtevents_sql %>%
    select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
    # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
    group_by(SUBJECT_ID) %>%
    summarize(n = n()) %>%
    collect()
)
```
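The same full query can also be sent to the database as raw SQL through DBI, bypassing dbplyr entirely. A minimal sketch, not evaluated here; the `LIMIT 10` is an illustrative addition to keep the printed output small:

```{r, eval = FALSE}
# Sketch (not run): the grouped count above, written as raw SQL.
# Drop LIMIT 10 to retrieve the full result.
dbGetQuery(con, "
  SELECT SUBJECT_ID, COUNT(*) AS n
  FROM DATETIMEEVENTS
  GROUP BY SUBJECT_ID
  LIMIT 10
")
```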
## Plotting

### Bar plot

Suppose we want to plot the distribution of the number of chart events per `SUBJECT_ID`.

```{r}
chartevents %>%
  count(SUBJECT_ID) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = n))
```

Apparently the current version of ggplot2 (v3.2.1) works with `tbl_dbi`/`tbl_sql`/`tbl_lazy` objects now. Older versions of ggplot2, e.g., v2.2.1, would throw an error. This is because ggplot2 needs to compute the count per bin by going through all the rows, but here `chartevents` is just a pointer to the SQLite table. With those versions we had to use the **transform in database, plot in R** strategy:

```{r}
chartevents %>%
  count(SUBJECT_ID) %>%
  collect() %>%
  ggplot() +
  geom_histogram(mapping = aes(x = n))
```

### Other plots

For examples of making histograms and raster plots, read the tutorial.

## Close connection to database

```{r}
dbDisconnect(con)
```
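## Postscript: indexing

Queries that repeatedly filter or group on `SUBJECT_ID`, like those above, scan the whole table each time; an index on that column can speed them up. A minimal sketch, not evaluated here; the index name is an arbitrary choice, and the connection is reopened since we just closed it:

```{r, eval = FALSE}
# Sketch (not run): index SUBJECT_ID in CHARTEVENTS so that
# repeated filter/group queries on that column can run faster.
con <- dbConnect(RSQLite::SQLite(), "mimiciii.sqlite")
dbExecute(con,
          "CREATE INDEX IF NOT EXISTS chartevents_subject_idx
           ON CHARTEVENTS (SUBJECT_ID)")
dbDisconnect(con)
```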