---
title: Tidyverse and Some SQLite
author: "Author: Thomas Girke"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: true
    toc_depth: 3
    fig_caption: yes
    code_folding: show
    number_sections: true
fontsize: 14pt
bibliography: bibtex.bib
weight: 18
type: docs
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")),
warning=FALSE, message=FALSE)
```
```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
suppressPackageStartupMessages({
library(limma)
library(ggplot2) })
```
Source code downloads:
[ [Slides](https://girke.bioinformatics.ucr.edu/GEN242/slides/slides_22/) ]
[ [.Rmd](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/dplyr/dplyr.Rmd) ]
[ [.R](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/dplyr/dplyr.R) ]
## Overview
Modern object classes and methods for handling `data.frame`-like structures
are provided by the `dplyr` (`tidyr`) and `data.table` packages. A related example is Bioconductor's
`DataTable` object class [@noauthor_undated-kc]. This tutorial provides a short introduction to the usage and
functionality of `dplyr` and related packages.
### Related documentation
More detailed tutorials on this topic can be found here:
* [dplyr: A Grammar of Data Manipulation](https://rdrr.io/cran/dplyr/)
* [Introduction to `dplyr`](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html)
* [Tutorial on `dplyr`](http://genomicsclass.github.io/book/pages/dplyr_tutorial.html)
* [Cheatsheet for Joins from Jenny Bryan](http://stat545.com/bit001_dplyr-cheatsheet.html)
* [Tibbles](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html)
* [Intro to `data.table` package](https://www.r-bloggers.com/intro-to-the-data-table-package/)
* [Big data with `dplyr` and `data.table`](https://www.r-bloggers.com/working-with-large-datasets-with-dplyr-and-data-table/)
* [Fast lookups with `dplyr` and `data.table`](https://www.r-bloggers.com/fast-data-lookups-in-r-dplyr-vs-data-table/)
### Installation
The `dplyr` (`tidyr`) environment has evolved into an ecosystem of packages. To simplify
package management, one can install and load the entire collection via the
`tidyverse` package. For more details on `tidyverse` see
[here](http://tidyverse.org/).
```{r tidyverse_install, eval=FALSE}
install.packages("tidyverse")
```
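After installation, the core packages of the collection can be loaded with a single `library` call, as also done at the beginning of the next section.
```{r tidyverse_load, eval=FALSE}
library(tidyverse) # Loads the core packages, including dplyr, tidyr, readr, tibble and ggplot2
```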
### Construct a `tibble`
```{r data_frame_tbl1, eval=TRUE}
library(tidyverse)
as_tibble(iris) # coerce data.frame to tibble tbl
```
### Reading and writing tabular files
While the base R read/write utilities can be used for `data.frames`, the import/export
functions of the `readr` package combine good time performance with a minimum of typing.
For very large files, the `fread` function from the `data.table` package achieves the best
time performance.
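To verify such performance differences on a given system, the import functions can be timed as sketched below. Note that `big.txt` is only a placeholder for a large tab delimited file; the relative timings depend on file size and hardware.
```{r import_timing, eval=FALSE}
system.time(df_base <- read.delim("big.txt")) # Base R
system.time(df_readr <- readr::read_tsv("big.txt")) # readr
system.time(df_dt <- data.table::fread("big.txt")) # data.table
```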
#### Import with `readr`
Import functions provided by `readr` include:
* `read_csv()`: comma separated (CSV) files
* `read_tsv()`: tab separated files
* `read_delim()`: general delimited files
* `read_fwf()`: fixed width files
* `read_table()`: tabular files where columns are separated by white-space
* `read_log()`: web log files
Create a sample tab delimited file for import
```{r tabular_sample, eval=TRUE}
write_tsv(iris, "iris.txt") # Creates sample file
```
Import with `read_tsv`
```{r tabular_import1, eval=TRUE}
iris_df <- read_tsv("iris.txt") # Import with read_tsv from readr package
iris_df
```
To import Google Sheets directly into R, see [here](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/rbasics/rbasics/#reading-and-writing-external-data).
#### Fast table import with `fread`
The `fread` function from the `data.table` package provides the best time performance for reading large
tabular files into R.
```{r tabular_import2, eval=TRUE}
library(data.table)
iris_df <- as_tibble(fread("iris.txt")) # Import with fread and conversion to tibble
iris_df
```
Note: to skip lines starting with comment signs, one can pass `fread` a shell
command that preprocesses the file. The following example illustrates this option.
```{r tabular_import_ignore, eval=FALSE}
fread("grep -v '^#' iris.txt")
```
#### Export with `readr`
Export functions provided by `readr` include:
* `write_delim()`: general delimited files
* `write_csv()`: comma separated (CSV) files
* `write_excel_csv()`: excel style CSV files
* `write_tsv()`: tab separated files
For instance, the `write_tsv` function writes a `data.frame` or `tibble` to a tab delimited file with much nicer
default settings than the base R `write.table` function.
```{r tabular_export_readr, eval=FALSE}
write_tsv(iris_df, "iris.txt")
```
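For comparison, a base R sketch producing similar output requires setting several arguments explicitly.
```{r tabular_export_base, eval=FALSE}
write.table(iris_df, "iris.txt", sep="\t", quote=FALSE, row.names=FALSE) # Base R equivalent with explicit settings
```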
### Column and row binds
The equivalents to base R's `rbind` and `cbind` are `bind_rows` and `bind_cols`, respectively.
```{r dplyr_bind, eval=TRUE}
bind_cols(iris_df, iris_df)
bind_rows(iris_df, iris_df)
```
### Extract column as vector
The subsetting operators `[[` and `$` can be used to extract individual columns from a `tibble` as a vector.
```{r plyr_get_cols, eval=TRUE}
iris_df[[5]][1:12]
iris_df$Species[1:12]
```
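Alternatively, `dplyr` provides the `pull` function for this purpose. A minimal sketch:
```{r plyr_pull, eval=FALSE}
pull(iris_df, Species)[1:12] # Extract the Species column as a vector
```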
### Important `dplyr` functions
1. `filter()` and `slice()`
2. `arrange()`
3. `select()` and `rename()`
4. `distinct()`
5. `mutate()` and `transmute()`
6. `summarise()`
7. `sample_n()` and `sample_frac()` (see the sampling sketch after this list)
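Most of these functions are demonstrated in the subsections below. The two sampling functions are not revisited there, so a minimal usage sketch is given here.
```{r plyr_sampling, eval=FALSE}
sample_n(iris_df, 4) # Randomly sample a fixed number of rows
sample_frac(iris_df, 0.1) # Randomly sample a fraction of the rows
```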
### Slice and filter functions
#### Filter function
```{r plyr_filter, eval=TRUE}
filter(iris_df, Sepal.Length > 7.5, Species=="virginica")
```
#### Base R code equivalent
```{r plyr_filter_base, eval=TRUE}
iris_df[iris_df[, "Sepal.Length"] > 7.5 & iris_df[, "Species"]=="virginica", ]
```
#### Including boolean operators
```{r plyr_filter_boolean, eval=TRUE}
filter(iris_df, Sepal.Length > 7.5 | Sepal.Length < 5.5, Species=="virginica")
```
#### Subset rows by position
`dplyr` approach
```{r plyr_subset, eval=TRUE}
slice(iris_df, 1:2)
```
Base R code equivalent
```{r plyr_subset_base, eval=TRUE}
iris_df[1:2,]
```
#### Subset rows by names
Since `tibbles` do not contain row names, row-wise subsetting via the `[,]` operator cannot be used.
However, the corresponding behavior can be achieved by passing to `slice` a row position index
obtained with base R matching utilities such as `match`.
Create a suitable test `tibble`
```{r plyr_sample_set2, eval=TRUE}
df1 <- bind_cols(tibble(ids1=paste0("g", 1:10)), as_tibble(matrix(1:40, 10, 4, dimnames=list(1:10, paste0("CA", 1:4)))))
df1
```
`dplyr` approach
```{r plyr_subset_names, eval=TRUE}
slice(df1, match(c("g10", "g4", "g4"), ids1))
```
Base R equivalent
```{r plyr_subset_names_base, eval=TRUE}
df1_old <- as.data.frame(df1)
rownames(df1_old) <- df1_old[,1]
df1_old[c("g10", "g4", "g4"),]
```
### Sorting with `arrange`
Row-wise ordering based on specific columns
`dplyr` approach
```{r plyr_order1, eval=TRUE}
arrange(iris_df, Species, Sepal.Length, Sepal.Width)
```
To sort in descending order, use the `desc()` function
```{r plyr_order2, eval=TRUE}
arrange(iris_df, desc(Species), Sepal.Length, Sepal.Width)
```
Base R code equivalent
```{r plyr_order_base, eval=TRUE}
iris_df[order(iris_df$Species, iris_df$Sepal.Length, iris_df$Sepal.Width), ]
iris_df[order(iris_df$Species, decreasing=TRUE), ]
```
### Select columns with `select`
Select specific columns
```{r plyr_col_select1, eval=TRUE}
select(iris_df, Species, Petal.Length, Sepal.Length)
```
Select range of columns by name
```{r plyr_col_select2, eval=TRUE}
select(iris_df, Sepal.Length : Petal.Width)
```
Drop specific columns (here range)
```{r plyr_col_drop, eval=TRUE}
select(iris_df, -(Sepal.Length : Petal.Width))
```
### Change column order with `relocate`
`dplyr` approach
For details and examples see `?relocate`
```{r plyr_col_order, eval=TRUE}
relocate(iris_df, Species)
```
Base R code approach
```{r baser_col_order, eval=FALSE}
iris[,c(5, 1:4)]
```
### Renaming columns with `rename`
`dplyr` approach
```{r plyr_col_rename, eval=TRUE}
rename(iris_df, new_col_name = Species)
```
Base R code approach
```{r baser_col_rename, eval=FALSE}
colnames(iris_df)[colnames(iris_df)=="Species"] <- "new_col_name"
```
### Obtain unique rows with `distinct`
`dplyr` approach
```{r plyr_unique, eval=TRUE}
distinct(iris_df, Species, .keep_all=TRUE)
```
Base R code approach
```{r baser_unique, eval=TRUE}
iris_df[!duplicated(iris_df$Species),]
```
### Add columns
#### `mutate`
The `mutate` function allows one to append new columns computed from existing ones.
```{r plyr_mutate, eval=TRUE}
mutate(iris_df, Ratio = Sepal.Length / Sepal.Width, Sum = Sepal.Length + Sepal.Width)
```
#### `transmute`
The `transmute` function does the same as `mutate`, but drops the existing columns.
```{r plyr_transmute, eval=TRUE}
transmute(iris_df, Ratio = Sepal.Length / Sepal.Width, Sum = Sepal.Length + Sepal.Width)
```
#### `bind_cols`
The `bind_cols` function is the equivalent of `cbind` in base R. To add rows, use the corresponding
`bind_rows` function.
```{r plyr_bind_cols, eval=TRUE}
bind_cols(iris_df, iris_df)
```
### Summarize data
Summary calculation on single column
```{r plyr_summarize1, eval=TRUE}
summarize(iris_df, mean(Petal.Length))
```
Summary calculation on many columns
```{r plyr_summarize2, eval=TRUE}
summarize_all(iris_df[,1:4], mean)
```
Summarize by grouping column
```{r plyr_summarize, eval=TRUE}
summarize(group_by(iris_df, Species), mean(Petal.Length))
```
Aggregate summaries
```{r plyr_summarize3, eval=TRUE}
summarize_all(group_by(iris_df, Species), mean)
```
Note: `group_by` performs the grouping (looping over groups) for the user, similar to `aggregate` or `tapply` in base R.
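For comparison, a base R sketch of the same grouped column means using `aggregate` could look as follows.
```{r baser_aggregate, eval=FALSE}
aggregate(iris[, 1:4], by=list(Species=iris$Species), FUN=mean) # Base R equivalent of the grouped column means above
```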
### Merging tibbles
The `dplyr` package provides several join functions for merging `tibbles` by a common key column
similar to the `merge` function in base R. These `*_join` functions include:
* `inner_join()`: returns join only for rows matching among both `tibbles`
* `full_join()`: returns join for all (matching and non-matching) rows of two `tibbles`
* `left_join()`: returns join for all rows in first `tibble`
* `right_join()`: returns join for all rows in second `tibble`
* `anti_join()`: returns for first `tibble` only those rows that have no match in the second one
Sample `tibbles` to illustrate the `*_join` functions.
```{r plyr_join_sample, eval=TRUE}
df1 <- bind_cols(tibble(ids1=paste0("g", 1:10)), as_tibble(matrix(1:40, 10, 4, dimnames=list(1:10, paste0("CA", 1:4)))))
df1
df2 <- bind_cols(tibble(ids2=paste0("g", c(2,5,11,12))), as_tibble(matrix(1:16, 4, 4, dimnames=list(1:4, paste0("CB", 1:4)))))
df2
```
#### Inner join
```{r plyr_inner_join, eval=TRUE}
inner_join(df1, df2, by=c("ids1"="ids2"))
```
#### Left join
```{r plyr_left_join, eval=TRUE}
left_join(df1, df2, by=c("ids1"="ids2"))
```
#### Right join
```{r plyr_right_join, eval=TRUE}
right_join(df1, df2, by=c("ids1"="ids2"))
```
#### Full join
```{r plyr_full_join, eval=TRUE}
full_join(df1, df2, by=c("ids1"="ids2"))
```
#### Anti join
```{r plyr_anti_join, eval=TRUE}
anti_join(df1, df2, by=c("ids1"="ids2"))
```
For additional join options, users may want to consult the `*_join` help pages.
### Chaining
To simplify chaining of several operations, `dplyr` uses the `%>%`
operator from `magrittr`, where `x %>% f(y)` turns into `f(x, y)`. This way one can pipe
together multiple operations by writing them from left-to-right or
top-to-bottom, which makes the code easier to type and to read. Since R-4.1.0, a native
pipe operator `|>` is available that works largely the same as `%>%`.
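The sketch below shows the same filtering step written with both pipe operators.
```{r pipe_comparison, eval=FALSE}
iris_df %>% filter(Species=="setosa") # magrittr pipe
iris_df |> filter(Species=="setosa") # Native pipe (requires R >= 4.1.0)
```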
#### Example 1
Series of data manipulations and export
```{r plyr_chaining1, eval=TRUE}
read_tsv("iris.txt") %>% # Import with read_tbv from readr package
as_tibble() %>% # Declare to use tibble
select(Sepal.Length:Species) %>% # Select columns
filter(Species=="setosa") %>% # Filter rows by some value
arrange(Sepal.Length) %>% # Sort by some column
mutate(Subtract=Petal.Length - Petal.Width) # Calculate and append
# write_tsv("iris.txt") # Export to file, omitted here to show result
```
#### Example 2
Series of summary calculations for grouped data (`group_by`)
```{r plyr_chaining2, eval=TRUE}
iris_df %>% # Declare tibble to use
group_by(Species) %>% # Group by species
summarize(Mean_Sepal.Length=mean(Sepal.Length),
Max_Sepal.Length=max(Sepal.Length),
Min_Sepal.Length=min(Sepal.Length),
SD_Sepal.Length=sd(Sepal.Length),
Total=n())
```
#### Example 3
Combining `dplyr` chaining with `ggplot`
```{r plyr_chaining3, eval=TRUE}
iris_df %>%
group_by(Species) %>%
summarize_all(mean) %>%
reshape2::melt(id.vars=c("Species"), variable.name = "Samples", value.name="Values") %>%
ggplot(aes(Samples, Values, fill = Species)) +
geom_bar(position="dodge", stat="identity")
```
## SQLite Databases
`SQLite` is a lightweight relational database solution. The `RSQLite` package provides an easy-to-use interface to create, manage and query `SQLite` databases directly from R. Basic instructions
for using `SQLite` from the command-line are available [here](https://www.sqlite.org/cli.html). A short introduction to `RSQLite` is available [here](https://github.com/rstats-db/RSQLite/blob/master/vignettes/RSQLite.Rmd).
### Loading data into SQLite databases
The following loads two `data.frames` derived from the `iris` data set (here `mydf1` and `mydf2`)
into an SQLite database (here `test.db`).
```{r load_sqlite, eval=TRUE}
library(RSQLite)
unlink("test.db") # Delete any existing test.db
mydb <- dbConnect(SQLite(), "test.db") # Creates database file test.db
mydf1 <- data.frame(ids=paste0("id", seq_along(iris[,1])), iris)
mydf2 <- mydf1[sample(seq_along(mydf1[,1]), 10),]
dbWriteTable(mydb, "mydf1", mydf1)
dbWriteTable(mydb, "mydf2", mydf2)
```
### List names of tables in database
```{r list_tables, eval=TRUE}
dbListTables(mydb)
```
### Import table into `data.frame`
```{r import_sqlite_tables, eval=TRUE}
dbGetQuery(mydb, 'SELECT * FROM mydf2')
```
### Query database
```{r query_sqlite_tables, eval=TRUE}
dbGetQuery(mydb, 'SELECT * FROM mydf1 WHERE "Sepal.Length" < 4.6')
```
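To avoid pasting values into query strings, the same query can be written in parameterized form. The following sketch assumes that the installed `DBI`/`RSQLite` versions support the `params` argument of `dbGetQuery`.
```{r query_sqlite_params, eval=FALSE}
dbGetQuery(mydb, 'SELECT * FROM mydf1 WHERE "Sepal.Length" < :x', params=list(x=4.6)) # Parameterized query
```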
### Join tables
The two tables can be joined on the shared `ids` column as follows.
```{r join_sqlite_tables, eval=TRUE}
dbGetQuery(mydb, 'SELECT * FROM mydf1, mydf2 WHERE mydf1.ids = mydf2.ids')
```
## Session Info
```{r sessionInfo}
sessionInfo()
```
## References