--- title: 'R: Working with Databases' author: "Christina Maimone" date: '`r Sys.Date()`' output: html_document: toc: yes toc_depth: '3' editor_options: chunk_output_type: console --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Overall Note R is generally better suited to selecting data from databases than for creating database tables or entering data into a database. But there are functions to do all operations. # Connection The `DBI` package has the core functionality of connecting R to database servers. There are then packages that implement the core functionality of `DBI` for each specific implementation of SQL. A package for PostgreSQL is `RPostgres`. ```{r, eval=FALSE} if(!'RPostgres' %in% installed.packages()){ install.packages("RPostgres") } ``` ```{r} library(RPostgres) ``` We connect with a function call like the following. Note: this code was generated on my local machine connected to a local copy of the database. Your connection details will be different. Note I also have permissions to modify this database. ```{r} con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental") ``` We will need a connection like this for any of the methods of connecting below. You can have multiple connection objects to different databases -- just call the variables something different. Note that the above example doesn't have a username or password because the database is local and doesn't require one. If you're using a remote database, you don't want to hard core your credentials into your code. There are ways to set environment variables or, for PostgreSQL specifically, use a .pgpass file to store this information. But RStudio also gives you the option to pop up a box to enter the information: ```{r, eval=FALSE} con2 <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental", user=rstudioapi::askForPassword("Database username"), password=rstudioapi::askForPassword("Database password")) ``` # Using DBI We can use the basic functions in the DBI library: ## Get Database Information Note that the following `db-` prefixed functions are exported from the DBI and RPostgreSQL namespaces. ```{r} dbListTables(con) dbListFields(con, "actor") ``` ## Execute Queries ```{r} actor_subset <- dbGetQuery(con, "select * from actor where actor_id > 50") head(actor_subset) ``` Note that we don't need a `;` at the end of the query. If we want an entire table, there's a function for that: ```{r} actor <- dbReadTable(con, "actor") head(actor) ``` If you want part of your query to be determined by a variable -- especially if it's a variable supplied or defined as input or by a user (not you) -- you should guard against SQL injection (someone trying to attack your database by tricking your code into running malicious SQL statements) by using a parameterized query: ```{r} # YES myquery <- dbSendQuery(con, "select * from actor where actor_id = $1") dbBind(myquery, list(4)) dbFetch(myquery) ``` The `$1` stands in for a value you'll substitute in. For multiple, you can use `$2`, etc. (The `$1` notation is for PostgreSQL -- other types of databases use `?` or other symbols.) When you're done with the results from a prepared query, clear the result: ```{r} dbClearResult(myquery) ``` ```{r, eval=FALSE} # NO! - at least not if the variable can be manipulated by a user dbSendQuery(con, paste0("select * from actor where actor_id=", myvar)) ``` For more, see http://db.rstudio.com/best-practices/run-queries-safely/. Parameterizing statements also lets you reuse them with different values. You may also need to use `dbClearResult()` after other calls to `dbSendQuery()` that return a result (select statements, table creations, inserts, updates) -- but not to `dbGetQuery()`. Otherwise you may get some warning messages like: ```{asis} Warning message: In result_create(conn@ptr, statement) : Closing open result set, cancelling previous query ``` Which are ok, but could get annoying. ## Modifying a Database If you're not a superuser on the `dvdrental` database, just try connecting to a database you can modify. Then the basic function is `dbSendQuery` for any command you want to execute where you aren't retrieving results. Note that by default, statements take effect immediately - they are not in a transaction that you need to commit. To use transactions, see below. ```{r, eval=FALSE} res <- dbSendQuery(con, statement="update actor set first_name='Jenn' where actor_id=4") print(res) # contains info on result of update dbClearResult(res) # prevent warning messages ``` To create a table, you can give it a data frame ```{r, eval=FALSE} mytbl <-data.frame(number=1:10 , letter=LETTERS[1:10]) dbWriteTable(con, "mynewtable", mytbl) ``` or you could specify the table with SQL, and execute with `dbSendQuery` but this can get cumbersome. To remove a table ```{r, eval=FALSE} dbRemoveTable(con, "mynewtable") ``` ## Transactions There are also methods for managing transactions if you need: `dbBegin`, `dbRollback`, `dbCommit`. Transactions are key for when you need to be sure that a sequence of SQL commands (e.g. `UPDATE`, `CREATE`, `DROP`, `DELETE`, etc.) execute correctly before they're made permanent (i.e. "committed"). ```{r, eval=FALSE} dbBegin(con) dbWriteTable(con, "mynewtable", mytbl) dbRollback(con) dbGetQuery(con, "SELECT * FROM mynewtable") ``` The above will produce error: ``` Error in result_create(conn@ptr, statement) : Failed to prepare query: ERROR: relation "mynewtable" does not exist LINE 1: SELECT * FROM mynewtable ``` because the transaction was rolled back, not committed. ## Close Connection Connections will get closed when you quit R, but it's good practice to explicitly close them. ```{r} dbDisconnect(con) ``` # Use `dplyr` For more complete info, see the [RStudio databases site](http://db.rstudio.com/dplyr/). ```{r, eval=FALSE} needToInstall <- c("tidyverse") needToInstall <- needToInstall[which(!needToInstall %in% installed.packages())] if(length(needToInstall) > 0){ sapply(needToInstall, install.packages) } ``` ```{r, message=FALSE, warning=FALSE} library(tidyverse) ``` First, connect like normal ```{r, echo=TRUE} con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental") ``` Get a reference to a table: ```{r, message=FALSE, warning=FALSE} actortbl <- tbl(con, "actor") ``` If we look at this object, it doesn't have data in it: ```{r} str(actortbl) ``` It just has connection information. `dplyr` will try to perform operations within the database where it can, instead of pulling all of the data into R. Yet you can print the object and see observations: ```{r, echo=TRUE} actortbl ``` It retrieves them as needed, and also gives you a nice display in notebooks (a special kind of RMarkdown file) -- output looks a litle different in the console and in RMarkdown files like this. You can use `dplyr` verbs to work with the table objects from the database, as if they were data frames (or tibbles). ```{r, echo=TRUE} actortbl %>% select(actor_id, first_name, last_name) %>% filter(actor_id > 150) ``` The above generates and executes the SQL needed to get the result. It turns `filter` into a select statement with the appropriate where clause. ```{r, echo=TRUE} rentaltbl <- tbl(con, "rental") rentaltbl %>% group_by(customer_id) %>% summarize(count=n()) ``` What does the above correspond to as a SQL query? `select customer_id, count(*) from rental group by customer_id;` ```{r, echo=TRUE} rentaltbl %>% group_by(customer_id) %>% summarize(count=n()) %>% show_query() ``` You can use `collect` to pull down all of the data (tell `dplyr` to stop being lazy). ```{r, echo=TRUE} # First, without collecting df1 <- rentaltbl %>% group_by(customer_id) %>% summarize(count=n()) df1 ``` Looks OK, except: ```{r, eval=FALSE} df1[1,] ``` Gives you: `Error in df1[1, ] : incorrect number of dimensions` It's the wrong dimensions because `df1` isn't actually a data.frame: ```{r} str(df1) ``` It is telling us we need to collect the data first to actually pull it into R. ```{r, echo=TRUE} # Then with collecting df2 <- rentaltbl %>% group_by(customer_id) %>% summarize(count=n()) %>% collect() df2 ``` ```{r} df2[1,] ``` You can also use `dplyr`'s commands to join: ```{r, echo=TRUE} custtbl <- tbl(con, "customer") addrtbl <- tbl(con, "address") custtbl %>% inner_join(addrtbl, by="address_id") %>% filter(postal_code == '52137') %>% select(first_name, last_name, postal_code) ``` You could create a table with `copy_to` (if you have the correct permissions) ```{r, scho=TRUE, eval=FALSE} mytbl <-data.frame(number=1:10 , letter=LETTERS[1:10]) copy_to(con, mytbl, "mynewtable") ``` By default, it creates a **temporary** table. But this is a setting you can change, and you can also specify what columns to index on the table. Disconnect like we normally do ```{r, eval=FALSE} dbDisconnect(con) ``` # R Markdown R Markdown lets you execute SQL queries directly. You first set up a `DBI` connection like above, and then, instead of having R chunks of code, you can have SQL chunks of code: ````r `r ''````{r} library(RPostgres) con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental") ``` ```` ````sql `r ''````{sql, connection=con} select * from actor where actor_id > 75; ``` ```` ````r `r ''````{r} dbDisconnect(con) ``` ```` Here is the above, actually executed in RMarkdown: ```{r, eval=FALSE} library(RPostgres) con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental") ``` ```{sql, connection=con} select * from actor where actor_id > 75; ``` ```{r} dbDisconnect(con) ``` For more details, see [knitr Language Engines: SQL](http://rmarkdown.rstudio.com/authoring_knitr_engines.html#sql). # PL/R Database administrators can install functionality in a PostgreSQL database to allow you to write R functions directly in the database, and then call them with normal SQL queries. This is done with [PL/R](https://github.com/postgres-plr/plr). Enabling this functionality on systems can be risky, because R potentially gives users access to files on the database server. Database admins are usually conservative in allowing PL/R on the system, but it can be very useful in production systems. You can use to to generate reports, compute statistical methods, and even create plots. We aren't covering PL/R (or even writing SQL functions more generally), but it's good to know this functionality exists if you're ever working with a large production system.