dbplyr
library(tidyverse)
set.seed(1234)
theme_set(theme_minimal())
So far we’ve only worked with data stored locally in memory as data frames. However, there are also situations where you want to work with data stored in an external database. Databases are generally stored remotely on-disk, as opposed to in memory. If your data is already stored in a database, or if you have too much data to fit it all into memory simultaneously, you need a way to access it. Fortunately for you, dplyr offers support for on-disk databases through dbplyr.
Structured Query Language (SQL) is a means of communicating with a relational database management system. There are different types of SQL databases that offer varying functionality.
Databases can also be stored across many platforms. Some types of databases (such as SQLite) can be stored as a single file and loaded in-memory like a data frame. However, for large or extremely complex databases, a local computer is insufficient. Instead, the database is stored in the cloud on a distributed computing platform. Examples include the UChicago Research Computing Center (RCC), Amazon Web Services, and Google Cloud Platform. Note that hosting platforms are not typically free (though you can request an account with RCC as a student).
First you need to install dbplyr:
install.packages("dbplyr")
Depending on the type of database, you also need to install the appropriate database interface (DBI) package. The DBI package provides the necessary interface between the database and dplyr. Five commonly used backends are:

- RMySQL connects to MySQL and MariaDB.
- RPostgreSQL connects to Postgres and Redshift.
- RSQLite embeds a SQLite database.
- odbc connects to many commercial databases via the open database connectivity protocol.
- bigrquery connects to Google's BigQuery.

Let's create a local SQLite database using the flights data.
library(dplyr)
my_db <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
The first argument to DBI::dbConnect() is the database backend. SQLite only requires one other argument: the path to the database. Here, we use :memory: to create a temporary in-memory database.
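Other backends generally need more connection detail. As a rough sketch (not part of this walkthrough), connecting to a remote PostgreSQL database with the RPostgreSQL backend mentioned above might look something like this, where the host, database name, and credentials are all placeholders you would replace with your own:

# A hypothetical remote PostgreSQL connection; every value below is a placeholder
con <- DBI::dbConnect(
  RPostgreSQL::PostgreSQL(),
  host     = "db.example.com",   # placeholder server address
  dbname   = "mydatabase",       # placeholder database name
  user     = "username",
  password = "password"
)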
To add data to the database, use copy_to(). Let's stock the database with nycflights13::flights:
library(nycflights13)
copy_to(my_db,
flights,
temporary = FALSE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum"
)
)
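A quick way to confirm the copy worked (an extra check I'm adding, not part of the original walkthrough) is to list the tables now present in the database:

# List the tables stored in the SQLite database; "flights" should appear,
# possibly alongside internal sqlite_stat* tables created for the indexes
DBI::dbListTables(my_db)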
Now that we've copied the data, we can use tbl() to reference a specific table inside the database:
flights_db <- tbl(my_db, "flights")
flights_db
## # Source: table<flights> [?? x 19]
## # Database: sqlite 3.19.3 []
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dbl>
select(flights_db, year:day, dep_delay, arr_delay)
## # Source: lazy query [?? x 5]
## # Database: sqlite 3.19.3 []
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # ... with more rows
filter(flights_db, dep_delay > 240)
## # Source: lazy query [?? x 19]
## # Database: sqlite 3.19.3 []
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 848 1835 853 1001
## 2 2013 1 1 1815 1325 290 2120
## 3 2013 1 1 1842 1422 260 1958
## 4 2013 1 1 2115 1700 255 2330
## 5 2013 1 1 2205 1720 285 46
## 6 2013 1 1 2343 1724 379 314
## 7 2013 1 2 1332 904 268 1616
## 8 2013 1 2 1412 838 334 1710
## 9 2013 1 2 1607 1030 337 2003
## 10 2013 1 2 2131 1512 379 2340
## # ... with more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dbl>
arrange(flights_db, year, month, day)
## # Source: table<flights> [?? x 19]
## # Database: sqlite 3.19.3 []
## # Ordered by: year, month, day
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dbl>
mutate(flights_db, speed = air_time / distance)
## # Source: lazy query [?? x 20]
## # Database: sqlite 3.19.3 []
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with more rows, and 13 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dbl>, speed <dbl>
summarise(flights_db, delay = mean(dep_time))
## # Source: lazy query [?? x 1]
## # Database: sqlite 3.19.3 []
## delay
## <dbl>
## 1 1349.11
The commands are generally the same as you would use in dplyr. The only difference is that dplyr converts your R commands into SQL syntax:
select(flights_db, year:day, dep_delay, arr_delay) %>%
show_query()
## <SQL>
## SELECT `year` AS `year`, `month` AS `month`, `day` AS `day`, `dep_delay` AS `dep_delay`, `arr_delay` AS `arr_delay`
## FROM `flights`
dbplyr is also lazy, delaying all work until you explicitly request the results:
c1 <- filter(flights_db, year == 2013, month == 1, day == 1)
c2 <- select(c1, year, month, day, carrier, dep_delay, air_time, distance)
c3 <- mutate(c2, speed = distance / air_time * 60)
c4 <- arrange(c3, year, month, day, carrier)
Nothing has actually gone to the database yet.
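You can verify this by asking dbplyr for the SQL it has accumulated so far; show_query() renders the query without executing it (output omitted here):

# All four verbs above are compiled into a single SQL query
show_query(c4)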
c4
## # Source: lazy query [?? x 8]
## # Database: sqlite 3.19.3 []
## # Ordered by: year, month, day, carrier
## year month day carrier dep_delay air_time distance speed
## <int> <int> <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 9E 0 189 1029 326.6667
## 2 2013 1 1 9E -9 57 228 240.0000
## 3 2013 1 1 9E -3 68 301 265.5882
## 4 2013 1 1 9E -6 57 209 220.0000
## 5 2013 1 1 9E -8 66 264 240.0000
## 6 2013 1 1 9E 0 40 184 276.0000
## 7 2013 1 1 9E 6 146 740 304.1096
## 8 2013 1 1 9E 0 139 665 287.0504
## 9 2013 1 1 9E -8 150 765 306.0000
## 10 2013 1 1 9E -6 41 187 273.6585
## # ... with more rows
Now we finally communicate with the database, but it only retrieves the first 10 rows (notice the ?? in lazy query [?? x 8]). This is a built-in feature to avoid accidentally downloading an extremely large data frame that our machine cannot handle. To obtain the full results, use collect():
collect(c4)
## # A tibble: 842 x 8
## year month day carrier dep_delay air_time distance speed
## <int> <int> <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 9E 0 189 1029 326.6667
## 2 2013 1 1 9E -9 57 228 240.0000
## 3 2013 1 1 9E -3 68 301 265.5882
## 4 2013 1 1 9E -6 57 209 220.0000
## 5 2013 1 1 9E -8 66 264 240.0000
## 6 2013 1 1 9E 0 40 184 276.0000
## 7 2013 1 1 9E 6 146 740 304.1096
## 8 2013 1 1 9E 0 139 665 287.0504
## 9 2013 1 1 9E -8 150 765 306.0000
## 10 2013 1 1 9E -6 41 187 273.6585
## # ... with 832 more rows
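One housekeeping step that is easy to forget (and not shown in the original walkthrough): once you are finished with a local database connection, it is good practice to close it.

# Close the SQLite connection when you no longer need it
DBI::dbDisconnect(my_db)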
Google BigQuery is a distributed cloud platform for data warehousing and analytics. It can scan terabytes of data in seconds and petabytes in minutes. Its pricing scales with your demand on their resources, and a query could cost as little as a few pennies, though heavier computation may cost more.
dplyr
Google BigQuery hosts several public (and free) datasets. One is the NYC Taxi and Limousine Trips dataset, which contains trip records from all trips completed in yellow and green taxis in NYC from 2009 to 2015. Records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The dataset itself is hundreds of gigabytes and could never be loaded onto a desktop machine. But fortunately we can harness the power of the cloud.
To connect to the database, we use the bigrquery library and bigrquery::bigquery():
library(bigrquery)
taxi <- DBI::dbConnect(bigrquery::bigquery(),
project = "nyc-tlc",
dataset = "yellow",
billing = getOption("bigquery_id"))
taxi
## <BigQueryConnection>
## Dataset: nyc-tlc:yellow
- project: the project that is hosting the data
- dataset: the specific database to be accessed
- billing: your unique ID to access the data (and be charged if you run too many queries or use too much computing power)

You need to create an account in order to use BigQuery, even if you only want to access the free datasets. I stored my billing ID in .Rprofile using options().1
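A minimal sketch of what that .Rprofile entry might look like; the option name bigquery_id matches the getOption("bigquery_id") call above, while the project ID itself is a placeholder you would replace with your own:

# In ~/.Rprofile: store your BigQuery billing project ID as an R option
# so scripts can retrieve it later with getOption("bigquery_id")
options(bigquery_id = "your-billing-project-id")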
First, let's determine how many trips were taken each month in yellow cabs in 2014. The SQL syntax is:
SELECT
LEFT(STRING(pickup_datetime), 7) month,
COUNT(*) trips
FROM
[nyc-tlc:yellow.trips]
WHERE
YEAR(pickup_datetime) = 2014
GROUP BY
1
ORDER BY
1
In dbplyr, we use:
system.time({
trips_by_month <- taxi %>%
tbl("trips") %>%
filter(year(pickup_datetime) == 2014) %>%
mutate(month = month(pickup_datetime)) %>%
count(month) %>%
arrange(month) %>%
collect()
})
## user system elapsed
## 0.270 0.013 3.657
trips_by_month
## # A tibble: 12 x 2
## month n
## <int> <int>
## 1 1 13782492
## 2 2 13063791
## 3 3 15428127
## 4 4 14618759
## 5 5 14774041
## 6 6 13813029
## 7 7 13106365
## 8 8 12688877
## 9 9 13374016
## 10 10 14232487
## 11 11 13218216
## 12 12 13014161
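As a quick sanity check, the monthly counts can be plotted directly (a sketch I'm adding; the column n is produced by count() above):

# Plot yellow cab trip counts by month for 2014
ggplot(trips_by_month, aes(x = month, y = n)) +
  geom_line() +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Yellow cab trips per month, 2014",
       x = "Month",
       y = "Number of trips")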
What about the average speed per hour of day in yellow cabs?
system.time({
speed_per_hour <- taxi %>%
tbl("trips") %>%
mutate(hour = hour(pickup_datetime),
trip_duration = (dropoff_datetime - pickup_datetime) /
3600000000) %>%
mutate(speed = trip_distance / trip_duration) %>%
filter(fare_amount / trip_distance >= 2,
fare_amount / trip_distance <= 10) %>%
group_by(hour) %>%
summarize(speed = mean(speed)) %>%
arrange(hour) %>%
collect()
})
## user system elapsed
## 0.359 0.015 1.402
ggplot(speed_per_hour, aes(hour, speed)) +
geom_line() +
labs(title = "Average Speed of NYC Yellow Taxis",
x = "Hour of day",
y = "Average speed, in MPH")
Finally, what is the average speed by day of the week?
system.time({
speed_per_day <- taxi %>%
tbl("trips") %>%
mutate(hour = hour(pickup_datetime),
day = dayofweek(pickup_datetime),
trip_duration = (dropoff_datetime - pickup_datetime) /
3600000000) %>%
mutate(speed = trip_distance / trip_duration) %>%
filter(fare_amount / trip_distance >= 2,
fare_amount / trip_distance <= 10,
hour >= 8,
hour <= 18) %>%
group_by(day) %>%
summarize(speed = mean(speed)) %>%
arrange(day) %>%
collect()
})
## user system elapsed
## 0.384 0.015 1.098
speed_per_day
## # A tibble: 7 x 2
## day speed
## <int> <dbl>
## 1 1 14.29703
## 2 2 12.21557
## 3 3 11.11933
## 4 4 10.93281
## 5 5 10.97011
## 6 6 11.24917
## 7 7 13.09473
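The same kind of plot works for the day-of-week results. Note this sketch (which I'm adding) assumes BigQuery's DAYOFWEEK convention, where 1 is Sunday and 7 is Saturday:

# Map BigQuery's day-of-week codes (1 = Sunday ... 7 = Saturday) to labels
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

ggplot(speed_per_day, aes(x = day, y = speed)) +
  geom_line() +
  scale_x_continuous(breaks = 1:7, labels = day_labels) +
  labs(title = "Average Speed of NYC Yellow Taxis",
       subtitle = "Trips between 8 AM and 6 PM",
       x = "Day of week",
       y = "Average speed, in MPH")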
devtools::session_info()
## setting value
## version R version 3.4.3 (2017-11-30)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2018-04-24
##
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
## backports 1.1.2 2017-12-13 CRAN (R 3.4.3)
## base * 3.4.3 2017-12-07 local
## bindr 0.1.1 2018-03-13 CRAN (R 3.4.3)
## bindrcpp 0.2.2.9000 2018-04-08 Github (krlmlr/bindrcpp@bd5ae73)
## bit 1.1-12 2014-04-09 CRAN (R 3.4.0)
## bit64 0.9-7 2017-05-08 CRAN (R 3.4.0)
## blob 1.1.1 2018-03-25 CRAN (R 3.4.4)
## broom 0.4.4 2018-03-29 CRAN (R 3.4.3)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0)
## cli 1.0.0 2017-11-05 CRAN (R 3.4.2)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
## compiler 3.4.3 2017-12-07 local
## crayon 1.3.4 2017-10-03 Github (gaborcsardi/crayon@b5221ab)
## datasets * 3.4.3 2017-12-07 local
## DBI 0.8 2018-03-02 CRAN (R 3.4.3)
## devtools 1.13.5 2018-02-18 CRAN (R 3.4.3)
## digest 0.6.15 2018-01-28 CRAN (R 3.4.3)
## dplyr * 0.7.4.9003 2018-04-08 Github (tidyverse/dplyr@b7aaa95)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
## forcats * 0.3.0 2018-02-19 CRAN (R 3.4.3)
## foreign 0.8-69 2017-06-22 CRAN (R 3.4.3)
## ggplot2 * 2.2.1.9000 2018-04-24 Github (tidyverse/ggplot2@3c9c504)
## glue 1.2.0 2017-10-29 CRAN (R 3.4.2)
## graphics * 3.4.3 2017-12-07 local
## grDevices * 3.4.3 2017-12-07 local
## grid 3.4.3 2017-12-07 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
## haven 1.1.1 2018-01-18 CRAN (R 3.4.3)
## hms 0.4.2 2018-03-10 CRAN (R 3.4.3)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.4.1)
## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.20 2018-02-20 CRAN (R 3.4.3)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.3)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
## lubridate 1.7.4 2018-04-11 CRAN (R 3.4.3)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.3 2017-12-07 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0)
## modelr 0.1.1 2017-08-10 local
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
## nlme 3.1-137 2018-04-07 CRAN (R 3.4.4)
## parallel 3.4.3 2017-12-07 local
## pillar 1.2.1 2018-02-27 CRAN (R 3.4.3)
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.0)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
## psych 1.8.3.3 2018-03-30 CRAN (R 3.4.4)
## purrr * 0.2.4 2017-10-18 CRAN (R 3.4.2)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## Rcpp 0.12.16 2018-03-13 CRAN (R 3.4.4)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.0)
## readxl 1.0.0 2017-04-18 CRAN (R 3.4.0)
## reshape2 1.4.3 2017-12-11 CRAN (R 3.4.3)
## rlang 0.2.0.9001 2018-04-24 Github (r-lib/rlang@82b2727)
## rmarkdown 1.9 2018-03-01 CRAN (R 3.4.3)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
## RSQLite 2.1.0 2018-03-29 CRAN (R 3.4.4)
## rstudioapi 0.7 2017-09-07 CRAN (R 3.4.1)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.0)
## scales 0.5.0.9000 2018-04-24 Github (hadley/scales@d767915)
## stats * 3.4.3 2017-12-07 local
## stringi 1.1.7 2018-03-12 CRAN (R 3.4.3)
## stringr * 1.3.0 2018-02-19 CRAN (R 3.4.3)
## tibble * 1.4.2 2018-01-22 CRAN (R 3.4.3)
## tidyr * 0.8.0 2018-01-29 CRAN (R 3.4.3)
## tidyselect 0.2.4 2018-02-26 CRAN (R 3.4.3)
## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.4.2)
## tools 3.4.3 2017-12-07 local
## utils * 3.4.3 2017-12-07 local
## withr 2.1.2 2018-04-24 Github (jimhester/withr@79d7b0d)
## xml2 1.2.0 2018-01-24 CRAN (R 3.4.3)
## yaml 2.1.18 2018-03-08 CRAN (R 3.4.4)
1. This may not make sense to you. You will learn more about storing credentials next week in our unit on accessing data from the web.
This work is licensed under the CC BY-NC 4.0 Creative Commons License.