---
title: "Working with Databases - Part II"
author: "Dr. Hua Zhou @ UCLA"
date: "Feb 19, 2019"
subtitle: Biostat M280
output:
  html_document:
    toc: true
    #toc_depth: 2
  #ioslides_presentation:
  #  smaller: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.width = 5, fig.height = 3.5, fig.align = 'center',
  cache = FALSE, cache.lazy = FALSE)
```

## Introduction

In this lecture we follow a [tutorial](http://www2.stat.duke.edu/~cr173/Sta523_Fa17/bigish_data.html) by Dr. Colin Rundel to handle a bigish data set. We will learn:

- Import data from a bigish csv file.

- Tidy data. 

- Deposit data into an SQLite database.

- Query SQLite database. 

- Transform in database and plot in R.

## Machine information

```{r}
sessionInfo()
```

## CSV file

The `/home/m280data/nyc_parking/NYParkingViolations.csv` file on teaching server contains information about parking tickets in NYC. You can create a symbolic link to the data file by Bash command
```{bash, eval = (Sys.info()[["sysname"]] == "Linux")}
ln -sf /home/m280data/nyc_parking/NYParkingViolations.csv NYParkingViolations.csv
```

The CSV file is about 1.7GB
```{bash}
ls -l NYParkingViolations.csv
```

How many rows?
```{bash}
wc -l < NYParkingViolations.csv
```

First few lines:
```{bash}
head -5 NYParkingViolations.csv
```

## Read csv file

- `read.csv()` function in base R takes nearly 3-4 minutes to read in the 1.7GB csv file:
    ```{r}
    system.time(read.csv("NYParkingViolations.csv"))
    ```

- data.table is an R package for reading large data sets:
    ```{r}
    library(data.table)
    system.time({nyc = fread("NYParkingViolations.csv")})
    class(nyc)
    nyc = as.data.frame(nyc)
    class(nyc)    
    ```

- `read_csv()` function in tidyverse is 4-5 times faster than base R:
    ```{r}
    library("tidyverse")
    system.time({nyc = read_csv("NYParkingViolations.csv")})
    nyc
    ```

## Fix column names

```{r}
names(nyc) <- str_replace_all(names(nyc), " ", "_")
nyc
```

## Simplifying

We make a smaller data set with selected variables:
```{r}
# magrittr for %<>%
library("magrittr")
(nyc %<>% 
  select(Registration_State:Issuing_Agency, 
         Violation_Location, Violation_Precinct, Violation_Time,
         Number:Intersecting_Street, Vehicle_Color))
```

## Fix dates

Hadley Wickham's lubridate package facilitates handling date and time. Read [Chapter 16](http://r4ds.had.co.nz/dates-and-times.html) of _R for Data Science_ for more information.

Make variable `Issue_Date` of date-time type:
```{r}
library("lubridate")
class(nyc$Issue_Date)
nyc %<>% mutate(Issue_Date = mdy(Issue_Date))
class(nyc$Issue_Date)
nyc
```

Only keep data in years 2013-4.
```{r}
range(nyc$Issue_Date)
nyc$Issue_Date %>% year() %>% table()
filter(nyc, Issue_Date >= mdy("1/1/2013"), Issue_Date <= mdy("12/31/2014"))
```

## Putting it all together

```{r}
system.time({
  nyc <- read_csv("NYParkingViolations.csv") %>%
    setNames(str_replace_all(names(.)," ", "_")) %>%
    select(Registration_State:Issuing_Agency, 
         Violation_Location, Violation_Precinct, Violation_Time,
         Number:Intersecting_Street, Vehicle_Color) %>%
    mutate(Issue_Date = mdy(Issue_Date)) %>% 
    mutate(Issue_Day = day(Issue_Date),
           Issue_Month = month(Issue_Date),
           Issue_Year = year(Issue_Date),
           Issue_WDay = wday(Issue_Date, label=TRUE)) %>%
    filter(Issue_Year %in% 2013:2014)  
})
nyc
```

## Ticket frequency

```{r}
nyc %>% 
  group_by(Issue_Date) %>% 
  summarize(n = n()) %>%
  ggplot(aes(x = Issue_Date, y = n)) + 
    geom_line() + 
    xlim(mdy("7/1/2013"), mdy("6/30/2014"))
```

## Create an SQLite database

Import the `nyc` data set into the SQLite database:
```{r}
library("DBI")
library("RSQLite")
library("tidyverse")
```

```{r, eval=FALSE}
if (Sys.info()[["sysname"]] == "Linux") {
  db <- dbConnect(RSQLite::SQLite(), 
                  dbname = "/home/m280data/nyc_parking/NYParkingViolations.sqlite")
} else if (Sys.info()[["sysname"]] == "Darwin") {
  db <- dbConnect(RSQLite::SQLite(), 
                  dbname = "./NYParkingViolations.sqlite")
}
dbWriteTable(db, "nyc", nyc, overwrite = TRUE)
dbListTables(db)
dbDisconnect(db)
```
You can create a symbolic link to the database file by Bash command
```{bash, eval = (Sys.info()[["sysname"]] == "Linux")}
ln -sf /home/m280data/nyc_parking/NYParkingViolations.sqlite NYParkingViolations.sqlite
```

Size of the database file
```{bash}
ls -l NYParkingViolations.sqlite
```

## Read data from database

Connect to table in database:
```{r}
db <- dbConnect(RSQLite::SQLite(), dbname = "./NYParkingViolations.sqlite")
dbListTables(db)
nyc_sql <- dplyr::tbl(db, "nyc")
str(nyc_sql)
nyc_sql %>% print(width = Inf)
```

## Use dplyr with SQLite

```{r}
addr <- nyc_sql %>%
  select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
  filter(Violation_Precinct >= 1, Violation_Precinct <= 34)
addr
```

## SQL query

```{r}
class(addr)
show_query(addr)
```

## Mutate and summarise

```{r}
addr %>% mutate(address = paste(Number, Street))
```
```{r}
addr %>% summarize(mean = mean(Violation_Precinct, na.rm = TRUE))
```

## SQL grouping

```{r}
addr %>% 
  group_by(Issuing_Agency, Violation_Precinct) %>% 
  summarize(n = n())
```

## SQL query

```{r}
addr %>% 
  group_by(Issuing_Agency, Violation_Precinct) %>% 
  summarize(n = n()) %>% 
  show_query()
```

## SQL translation

dbplyr package (a dplyr backend for databases) has a function, `translate_sql`, that lets you experiment with how R functions are translated to SQL:
```{r}
library("dbplyr")
translate_sql(x == 1 & (y < 2 | z > 3))
translate_sql(x ^ 2 < 10)
translate_sql(x %% 2 == 10)
translate_sql(paste(x, y))
translate_sql(mean(x))
translate_sql(mean(x, na.rm = TRUE))
```

## Timings

dplyr using tibble:
```{r}
system.time(
  nyc %>%
    select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
    filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
    group_by(Issuing_Agency, Violation_Precinct) %>%
    summarize(n = n())
)
```
dplyr using SQLite:
```{r}
system.time(
  nyc_sql %>%
    select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
    filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
    group_by(Issuing_Agency, Violation_Precinct) %>%
    summarize(n = n())
)
```

`nyc_sql` was 30x times faster than `nyc`, but the former is disk based while the latter is in memory. Why is the discrepancy?

## Laziness

dplyr/dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.

- When building a query, we don’t want the entire table, often we want just enough to check if our query is working.

- Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.

- Therefore, by default dplyr

    - won’t connect and query the database until absolutely necessary (e.g. show output),

    - and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like
    
```{r}
nyc_sql %>%
  select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
  filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
  group_by(Issuing_Agency, Violation_Precinct) %>%
  summarize(n = n())
```

## Full query

To force a full query and return a complete table it is necessary to use the `collect` function.
```{r}
system.time(
  nyc_sql %>%
    select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
    filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
    group_by(Issuing_Agency, Violation_Precinct) %>%
    summarize(n = n()) %>%
    collect()
)
```

## Plotting

### Bar plot

Suppose we want the bar plot of number of tickets on each weekday. We encounter error (???) with the usual ggplot2 command:
```{r, error = TRUE}
nyc_sql %>%
  ggplot() + 
    geom_bar(aes(x = Issue_WDay))
```
Apparently current version of ggplot2 (v3.1.0) works for `tbl_dbi/tbl_sql/tbl_lazy` now. 

Older version of ggplot2, e.g., v2.2.1, will output error. This is becasue ggplot2 needed to compute the count per category by going through all the rows. But here `nyc_sql` is just a pointer to the SQLite table. We have to use the **transform in database, plot in R** strategy.
```{r}
nyc_sql %>%
  group_by(Issue_WDay) %>%
  count() %>%
  collect() %>%
  ggplot() + 
    geom_col(aes(x = Issue_WDay, y = n))
```

### Other plots

For example of making histogram and raster plot, read tutorial  <https://db.rstudio.com/best-practices/visualization/>.

## Close connection to database

```{r}
dbDisconnect(db)
```