---
title: 'R: Working with Databases'
author: "Christina Maimone"
date: '`r Sys.Date()`'
output:
  html_document:
    toc: yes
    toc_depth: '3'
editor_options: 
  chunk_output_type: console
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Overall Note

R is generally better suited to selecting data from databases than to creating database tables or entering data into a database, but there are functions for all of these operations.

# Connection

The `DBI` package has the core functionality of connecting R to database servers.  There are then packages that implement the core functionality of `DBI` for each specific implementation of SQL.  A package for PostgreSQL is `RPostgres`. 

```{r, eval=FALSE}
if(!'RPostgres' %in% installed.packages()){
  install.packages("RPostgres")
}
```

```{r}
library(RPostgres)
```

We connect with a function call like the following.

Note: this code was generated on my local machine connected to a local copy of the database.  Your connection details will be different.  Note I also have permissions to modify this database.

```{r}
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
```

We will need a connection like this for any of the methods of connecting below.  You can have multiple connection objects to different databases -- just call the variables something different.
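For example, you could hold a connection to `dvdrental` and to a second database (here a hypothetical `testdb`) at the same time:

```{r, eval=FALSE}
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
con_test <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="testdb")  # hypothetical second database

dbListTables(con)       # lists tables in dvdrental
dbListTables(con_test)  # lists tables in testdb
```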

Note that the above example doesn't have a username or password because the database is local and doesn't require one.  If you're using a remote database, you don't want to hard-code your credentials into your code.  There are ways to set environment variables or, for PostgreSQL specifically, use a .pgpass file to store this information.  But RStudio also gives you the option to pop up a box to enter the information:

```{r, eval=FALSE}
con2 <- dbConnect(RPostgres::Postgres(), 
                  host="localhost", 
                  dbname="dvdrental",
                  user=rstudioapi::askForPassword("Database username"),
                  password=rstudioapi::askForPassword("Database password"))
```
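Another option is to read the credentials from environment variables (set, for example, in a `.Renviron` file) so they never appear in your code.  The variable names here (`DB_USER`, `DB_PASSWORD`) are just a convention -- use whatever names you set:

```{r, eval=FALSE}
con3 <- dbConnect(RPostgres::Postgres(),
                  host="localhost",
                  dbname="dvdrental",
                  user=Sys.getenv("DB_USER"),
                  password=Sys.getenv("DB_PASSWORD"))
```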


# Using DBI

We can use the basic functions in the DBI library:

## Get Database Information
Note that the following `db`-prefixed functions are exported from the `DBI` and `RPostgres` namespaces.

```{r}
dbListTables(con)
dbListFields(con, "actor")
```


## Execute Queries

```{r}
actor_subset <- dbGetQuery(con, "select * from actor where actor_id > 50")
head(actor_subset)
```

Note that we don't need a `;` at the end of the query.

If we want an entire table, there's a function for that:

```{r}
actor <- dbReadTable(con, "actor")
head(actor)
```

If you want part of your query to be determined by a variable -- especially if it's a variable supplied or defined as input or by a user (not you) -- you should guard against SQL injection (someone trying to attack your database by tricking your code into running malicious SQL statements) by using a parameterized query:

```{r}
# YES
myquery <- dbSendQuery(con, "select * from actor where actor_id = $1")
dbBind(myquery, list(4))
dbFetch(myquery)
```

The `$1` stands in for a value you'll substitute in.  For multiple, you can use `$2`, etc.  (The `$1` notation is for PostgreSQL -- other types of databases use `?` or other symbols.)
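For example, a sketch with two placeholders, bound in order (the last name value here is just an illustration):

```{r, eval=FALSE}
myquery2 <- dbSendQuery(con, "select * from actor where actor_id > $1 and last_name = $2")
dbBind(myquery2, list(50, "Smith"))  # $1 = 50, $2 = 'Smith'
dbFetch(myquery2)
dbClearResult(myquery2)
```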

When you're done with the results from a prepared query, clear the result:

```{r}
dbClearResult(myquery)
```


```{r, eval=FALSE}
# NO! - at least not if the variable can be manipulated by a user
dbSendQuery(con, paste0("select * from actor where actor_id=", myvar))
```

For more, see http://db.rstudio.com/best-practices/run-queries-safely/.

Parameterizing statements also lets you reuse them with different values.
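For example, the same prepared statement can be bound repeatedly with different values before it's cleared:

```{r, eval=FALSE}
myquery <- dbSendQuery(con, "select * from actor where actor_id = $1")
for (id in c(4, 10, 20)) {
  dbBind(myquery, list(id))   # substitute a new value for $1
  print(dbFetch(myquery))
}
dbClearResult(myquery)
```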

You may also need to use `dbClearResult()` after other calls to `dbSendQuery()` that return a result (select statements, table creations, inserts, updates) -- but not after `dbGetQuery()`.  Otherwise you may get warning messages like:

```{asis}
Warning message:
In result_create(conn@ptr, statement) :
  Closing open result set, cancelling previous query
```

These warnings are harmless, but they can get annoying.

## Modifying a Database

If you're not a superuser on the `dvdrental` database, try connecting to a database you can modify instead. The basic function is `dbSendQuery`, for any command you want to execute where you aren't retrieving results.

Note that by default, statements take effect immediately - they are not in a transaction that you need to commit.  To use transactions, see below.

```{r, eval=FALSE}
res <- dbSendQuery(con, statement="update actor set first_name='Jenn' where actor_id=4")
print(res)  # contains info on result of update
dbClearResult(res)  # prevent warning messages
```

To create a table, you can give it a data frame:

```{r, eval=FALSE}
mytbl <- data.frame(number=1:10, letter=LETTERS[1:10])
dbWriteTable(con, "mynewtable", mytbl)
```

or you could specify the table with SQL and execute it with `dbSendQuery`, but this can get cumbersome.
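A sketch of the SQL route, using `dbExecute` (DBI's convenience wrapper for statements that don't return rows, so no `dbClearResult` is needed):

```{r, eval=FALSE}
dbExecute(con, "create table mynewtable (number integer, letter varchar(1))")
dbExecute(con, "insert into mynewtable values (1, 'A'), (2, 'B')")
```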

To remove a table:

```{r, eval=FALSE}
dbRemoveTable(con, "mynewtable")
```


## Transactions

There are also methods for managing transactions if you need them: `dbBegin`, `dbRollback`, `dbCommit`. Transactions are key when you need to be sure that a sequence of SQL commands (e.g. `UPDATE`, `CREATE`, `DROP`, `DELETE`, etc.) executes correctly before the changes are made permanent (i.e. "committed").


```{r, eval=FALSE}
dbBegin(con)
dbWriteTable(con, "mynewtable", mytbl)
dbRollback(con)
dbGetQuery(con, "SELECT * FROM mynewtable")
```

The above will produce an error:

```
Error in result_create(conn@ptr, statement) : 
  Failed to prepare query: ERROR:  relation "mynewtable" does not exist
LINE 1: SELECT * FROM mynewtable
```

because the transaction was rolled back, not committed.
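If instead you want the changes to stick, end the transaction with `dbCommit`:

```{r, eval=FALSE}
dbBegin(con)
dbWriteTable(con, "mynewtable", mytbl)
dbCommit(con)
dbGetQuery(con, "SELECT * FROM mynewtable")  # the table now exists
```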

## Close Connection

Connections will get closed when you quit R, but it's good practice to explicitly close them.

```{r}
dbDisconnect(con)
```




# Use `dplyr`

For more complete info, see the [RStudio databases site](http://db.rstudio.com/dplyr/).

```{r, eval=FALSE}
needToInstall <- c("tidyverse")
needToInstall <- needToInstall[which(!needToInstall %in% installed.packages())]
if(length(needToInstall) > 0){
  sapply(needToInstall, install.packages)
}
```


```{r, message=FALSE, warning=FALSE}
library(tidyverse)
```

First, connect like normal:

```{r, echo=TRUE}
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
```

Get a reference to a table:

```{r, message=FALSE, warning=FALSE}
actortbl <- tbl(con, "actor")
```

If we look at this object, it doesn't have data in it:

```{r}
str(actortbl)
```

It just has connection information.  `dplyr` will try to perform operations within the database where it can, instead of pulling all of the data into R.

Yet you can print the object and see observations:

```{r, echo=TRUE}
actortbl
```

It retrieves them as needed, and also gives you a nice display in notebooks (a special kind of RMarkdown file) -- output looks a little different in the console and in RMarkdown files like this.

You can use `dplyr` verbs to work with the table objects from the database, as if they were data frames (or tibbles).

```{r, echo=TRUE}
actortbl %>% 
  select(actor_id, first_name, last_name) %>% 
  filter(actor_id > 150)
```

The above generates and executes the SQL needed to get the result.  It turns `filter` into a select statement with the appropriate where clause.

```{r, echo=TRUE}
rentaltbl <- tbl(con, "rental")
rentaltbl %>% 
  group_by(customer_id) %>% 
  summarize(count=n())
```

What does the above correspond to as a SQL query?  `select customer_id, count(*) from rental group by customer_id;`

```{r, echo=TRUE}
rentaltbl %>% 
  group_by(customer_id) %>% 
  summarize(count=n()) %>% 
  show_query()
```

You can use `collect` to pull down all of the data (tell `dplyr` to stop being lazy).

```{r, echo=TRUE}
# First, without collecting
df1 <- rentaltbl %>% 
  group_by(customer_id) %>% 
  summarize(count=n()) 
df1
```

Looks OK, except:

```{r, eval=FALSE}
df1[1,]
```

Gives you:

`Error in df1[1, ] : incorrect number of dimensions`

It's the wrong dimensions because `df1` isn't actually a data.frame:

```{r}
str(df1)
```

It is telling us we need to collect the data first to actually pull it into R.

```{r, echo=TRUE}
# Then with collecting
df2 <- rentaltbl %>% 
  group_by(customer_id) %>% 
  summarize(count=n()) %>% 
  collect()
df2
```

```{r}
df2[1,]
```

You can also use `dplyr`'s commands to join:

```{r, echo=TRUE}
custtbl <- tbl(con, "customer")
addrtbl <- tbl(con, "address")
custtbl %>% 
  inner_join(addrtbl, by="address_id") %>%
  filter(postal_code == '52137') %>%
  select(first_name, last_name, postal_code)
```


You could create a table with `copy_to` (if you have the correct permissions):

```{r, echo=TRUE, eval=FALSE}
mytbl <- data.frame(number=1:10, letter=LETTERS[1:10])
copy_to(con, mytbl, "mynewtable")
```

By default, it creates a **temporary** table.  But this is a setting you can change, and you can also specify what columns to index on the table.  
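For example (using `copy_to`'s `temporary` and `indexes` arguments):

```{r, eval=FALSE}
copy_to(con, mytbl, "mynewtable",
        temporary = FALSE,         # create a permanent table instead
        indexes = list("number"))  # index the number column
```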


Disconnect like we normally do:

```{r, eval=FALSE}
dbDisconnect(con)
```



# R Markdown 

R Markdown lets you execute SQL queries directly.  You first set up a `DBI` connection like above, and then, instead of having R chunks of code, you can have SQL chunks of code:

````r
`r ''````{r}
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
```
````

````sql
`r ''````{sql, connection=con}
select * from actor where actor_id > 75;
```
````

````r
`r ''````{r}
dbDisconnect(con)
```
````


Here is the above, actually executed in RMarkdown:

```{r, eval=FALSE}
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
```

```{sql, connection=con}
select * from actor where actor_id > 75;
```

```{r}
dbDisconnect(con)
```
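The SQL engine also supports an `output.var` chunk option, which stores the query result in an R data frame (here named `actors`) instead of printing it:

````sql
`r ''````{sql, connection=con, output.var="actors"}
select * from actor where actor_id > 75;
```
````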


For more details, see [knitr Language Engines: SQL](http://rmarkdown.rstudio.com/authoring_knitr_engines.html#sql).

# PL/R

Database administrators can install functionality in a PostgreSQL database to allow you to write R functions directly in the database, and then call them with normal SQL queries.  This is done with [PL/R](https://github.com/postgres-plr/plr).  Enabling this functionality on systems can be risky, because R potentially gives users access to files on the database server.  Database admins are usually conservative in allowing PL/R on the system, but it can be very useful in production systems.  You can use it to generate reports, compute statistical methods, and even create plots.

We aren't covering PL/R (or even writing SQL functions more generally), but it's good to know this functionality exists if you're ever working with a large production system.