---
title: "Introduction to R"
author: "Author: Thomas Girke"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`"
output:
html_document:
toc: true
toc_float:
collapsed: true
smooth_scroll: true
toc_depth: 3
fig_caption: yes
code_folding: show
number_sections: true
fontsize: 14pt
bibliography: bibtex.bib
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")),
warning=FALSE, message=FALSE)
```
```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE}
suppressPackageStartupMessages({
library(limma)
library(ggplot2) })
```
# Overview
## What is R?
[R](http://cran.at.r-project.org) is a powerful statistical environment and
programming language for the analysis and visualization of data. The
associated [Bioconductor](http://bioconductor.org/) and CRAN package
repositories provide many additional R packages for statistical data analysis
for a wide array of research areas. The R software is free and runs on all
common operating systems.
## Why Using R?
* Complete statistical environment and programming language
* Efficient functions and data structures for data analysis
* Powerful graphics
* Access to fast growing number of analysis packages
* Most widely used language in bioinformatics
* Is standard for data mining and biostatistical analysis
* Technical advantages: free, open-source, available for all OSs
## Books and Documentation
* simpleR - Using R for Introductory Statistics (John Verzani, 2004) - [URL](http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf)
* Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Gentleman et al., 2005) - [URL](http://www.bioconductor.org/help/publications/books/bioinformatics-and-computational-biology-solutions/)
* More on this see "Finding Help" section in UCR Manual - [URL](http://manuals.bioinformatics.ucr.edu/home/R\_BioCondManual\#TOC-Finding-Help)
## R Working Environments
R Projects and Interfaces
Some R working environments with support for syntax highlighting and utilities to send code
to the R console:
* [RStudio](https://www.rstudio.com/products/rstudio/features): excellent choice for beginners ([Cheat Sheet](http://www.rstudio.com/wp-content/uploads/2016/01/rstudio-IDE-cheatsheet.pdf))
* Basic R code editors provided by Rguis
* [gedit](https://wiki.gnome.org/Apps/Gedit), [Rgedit](http://rgedit.sourceforge.net/), [RKWard](https://rkward.kde.org/), [Eclipse](http://www.walware.de/goto/statet), [Tinn-R](http://www.sciviews.org/Tinn-R/), [Notepad++](https://notepad-plus-plus.org/), [NppToR](http://sourceforge.net/projects/npptor/)
* [Vim-R-Tmux](http://manuals.bioinformatics.ucr.edu/home/programming-in-r/vim-r): R working environment based on vim and tmux
* [Emacs](http://www.xemacs.org/Download/index.html) ([ESS add-on package](http://ess.r-project.org/))
### Example: RStudio
New integrated development environment (IDE) for [R](http://www.rstudio.com/ide/download/). Highly functional for both beginners and
advanced.
RStudio IDE
Some userful shortcuts: `Ctrl+Enter` (send code), `Ctrl+Shift+C` (comment/uncomment), `Ctrl+1/2` (switch window focus)
### Example: Nvim-R-Tmux
Terminal-based Working Environment for R: [Nvim-R-Tmux](http://girke.bioinformatics.ucr.edu/GEN242/mydoc_tutorial_02.html#nvim-r-tmux-essentials).
Nvim-R-Tmux IDE for R
# R Package Repositories
* CRAN (>11,000 packages) general data analysis - [URL](http://cran.at.r-project.org/)
* Bioconductor (>1,100 packages) bioscience data analysis - [URL](http://www.bioconductor.org/)
* Omegahat (>90 packages) programming interfaces - [URL](https://github.com/omegahat?tab=repositories)
# Installation of R Packages
1. Install R for your operating system from [CRAN](http://cran.at.r-project.org/).
2. Install RStudio from [RStudio](http://www.rstudio.com/ide/download).
3. Install CRAN Packages from R console like this:
```{r install_cran, eval=FALSE}
install.packages(c("pkg1", "pkg2"))
install.packages("pkg.zip", repos=NULL)
```
4. Install Bioconductor packages as follows:
```{r install_bioc, eval=FALSE}
source("http://www.bioconductor.org/biocLite.R")
library(BiocInstaller)
BiocVersion()
biocLite()
biocLite(c("pkg1", "pkg2"))
```
5. For more details consult the [Bioc Install page](http://www.bioconductor.org/install/)
and [BiocInstaller](http://www.bioconductor.org/packages/release/bioc/html/BiocInstaller.html) package.
# Getting Around
## Startup and Closing Behavior
* __Starting R__:
The R GUI versions, including RStudio, under Windows and Mac OS X can be
opened by double-clicking their icons. Alternatively, one can start it by
typing `R` in a terminal (default under Linux).
* __Startup/Closing Behavior__:
The R environment is controlled by hidden files in the startup directory:
`.RData`, `.Rhistory` and `.Rprofile` (optional).
* __Closing R__:
```{r closing_r, eval=FALSE}
q()
```
```
Save workspace image? [y/n/c]:
```
* __Note__:
When responding with `y`, then the entire R workspace will be written to
the `.RData` file which can become very large. Often it is sufficient to just
save an analysis protocol in an R source file. This way one can quickly
regenerate all data sets and objects.
## Navigating directories
Create an object with the assignment operator `<-` or `=`
```{r r_assignment, eval=FALSE}
object <- ...
```
List objects in current R session
```{r r_ls, eval=FALSE}
ls()
```
Return content of current working directory
```{r r_dirshow, eval=FALSE}
dir()
```
Return path of current working directory
```{r r_dirpath, eval=FALSE}
getwd()
```
Change current working directory
```{r r_setwd, eval=FALSE}
setwd("/home/user")
```
# Basic Syntax
General R command syntax
```{r r_syntax, eval=FALSE}
object <- function_name(arguments)
object <- object[arguments]
```
Finding help
```{r r_find_help, eval=FALSE}
?function_name
```
Load a library/package
```{r r_package_load, eval=FALSE}
library("my_library")
```
List functions defined by a library
```{r r_package_functions, eval=FALSE}
library(help="my_library")
```
Load library manual (PDF or HTML file)
```{r r_load_vignette, eval=FALSE}
vignette("my_library")
```
Execute an R script from within R
```{r r_execute_script, eval=FALSE}
source("my_script.R")
```
Execute an R script from command-line (the first of the three options is preferred)
```{sh sh_execute_script, eval=FALSE}
$ Rscript my_script.R
$ R CMD BATCH my_script.R
$ R --slave < my_script.R
```
# Data Types
## Numeric data
Example: `1, 2, 3, ...`
```{r r_numeric_data, eval=TRUE}
x <- c(1, 2, 3)
x
is.numeric(x)
as.character(x)
```
## Character data
Example: `"a", "b", "c", ...`
```{r r_character_data, eval=TRUE}
x <- c("1", "2", "3")
x
is.character(x)
as.numeric(x)
```
## Complex data
Example: mix of both
```{r r_complex_data, eval=TRUE}
c(1, "b", 3)
```
## Logical data
Example: `TRUE` of `FALSE`
```{r r_logical_data, eval=TRUE}
x <- 1:10 < 5
x
!x
which(x) # Returns index for the 'TRUE' values in logical vector
```
# Data Objects
## Object types
## Vectors (1D)
Definition: `numeric` or `character`
```{r r_vector_object, eval=TRUE}
myVec <- 1:10; names(myVec) <- letters[1:10]
myVec[1:5]
myVec[c(2,4,6,8)]
myVec[c("b", "d", "f")]
```
## Factors (1D)
Definition: vectors with grouping information
```{r r_factor_object, eval=TRUE}
factor(c("dog", "cat", "mouse", "dog", "dog", "cat"))
```
## Matrices (2D)
Definition: two dimensional structures with data of same type
```{r r_matrix_object, eval=TRUE}
myMA <- matrix(1:30, 3, 10, byrow = TRUE)
class(myMA)
myMA[1:2,]
myMA[1, , drop=FALSE]
```
## Data Frames (2D)
Definition: two dimensional objects with data of variable types
```{r r_dataframe_object, eval=TRUE}
myDF <- data.frame(Col1=1:10, Col2=10:1)
myDF[1:2, ]
```
## Arrays
Definition: data structure with one, two or more dimensions
## Lists
Definition: containers for any object type
```{r r_list_object, eval=TRUE}
myL <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))
myL
myL[[4]][1:2]
```
## Functions
Definition: piece of code
```{r r_function_object, eval=FALSE}
myfct <- function(arg1, arg2, ...) {
function_body
}
```
## Subsetting of data objects
__(1.) Subsetting by positive or negative index/position numbers__
```{r r_subset_by_index, eval=TRUE}
myVec <- 1:26; names(myVec) <- LETTERS
myVec[1:4]
```
__(2.) Subsetting by same length logical vectors__
```{r r_subset_by_logical, eval=TRUE}
myLog <- myVec > 10
myVec[myLog]
```
__(3.) Subsetting by field names__
```{r r_subset_by_names, eval=TRUE}
myVec[c("B", "K", "M")]
```
__(4.) Subset with `$` sign__: references a single column or list component by its name
```{r r_subset_by_dollar, eval=TRUE}
iris$Species[1:8]
```
# Important Utilities
## Combining Objects
The `c` function combines vectors and lists
```{r r_combine_vectors, eval=TRUE}
c(1, 2, 3)
x <- 1:3; y <- 101:103
c(x, y)
iris$Species[1:8]
```
The `cbind` and `rbind` functions can be used to append columns and rows, respecively.
```{r r_cbind_rbind, eval=TRUE}
ma <- cbind(x, y)
ma
rbind(ma, ma)
```
## Accessing Dimensions of Objects
Length and dimension information of objects
```{r r_length_dim, eval=TRUE}
length(iris$Species)
dim(iris)
```
## Accessing Name Slots of Objects
Accessing row and column names of 2D objects
```{r col_row_names, eval=TRUE}
rownames(iris)[1:8]
colnames(iris)
```
Return name field of vectors and lists
```{r name_slots, eval=TRUE}
names(myVec)
names(myL)
```
## Sorting Objects
The function `sort` returns a vector in ascending or descending order
```{r sort_objects, eval=TRUE}
sort(10:1)
```
The function `order` returns a sorting index for sorting an object
```{r order_objects, eval=TRUE}
sortindex <- order(iris[,1], decreasing = FALSE)
sortindex[1:12]
iris[sortindex,][1:2,]
sortindex <- order(-iris[,1]) # Same as decreasing=TRUE
```
Sorting multiple columns
```{r order_columns, eval=TRUE}
iris[order(iris$Sepal.Length, iris$Sepal.Width),][1:2,]
```
# Operators and Calculations
## Comparison Operators
Comparison operators: `==`, `!=`, `<`, `>`, `<=`, `>=`
```{r comparison_operators, eval=TRUE}
1==1
```
Logical operators: AND: `&`, OR: `|`, NOT: `!`
```{r logical_operators, eval=TRUE}
x <- 1:10; y <- 10:1
x > y & x > 5
```
## Basic Calculations
To look up math functions, see Function Index [here](http://cran.at.r-project.org/doc/manuals/R-intro.html#Function-and-variable-index)
```{r logical_calculations, eval=TRUE}
x + y
sum(x)
mean(x)
apply(iris[1:6,1:3], 1, mean)
```
# Reading and Writing External Data
## Import of tabular data
Import of a tab-delimited tabular file
```{r read_delim, eval=FALSE}
myDF <- read.delim("myData.xls", sep="\t")
```
Import of Excel file. Note: working with tab- or comma-delimited files is more flexible and preferred.
```{r read_excel, eval=FALSE}
library(gdata)
myDF <- read.xls"myData.xls")
```
Import of Google Sheets. The following example imports a sample Google Sheet from [here](https://docs.google.com/spreadsheets/d/1U-32UcwZP1k3saKeaH1mbvEAOfZRdNHNkWK2GI1rpPM/edit#gid=472150521).
Detailed instructions for interacting from R with Google Sheets with the required `googlesheets` package are [here](https://github.com/jennybc/googlesheets).
```{r read_gs, eval=FALSE}
library("googlesheets"); library("dplyr"); library(knitr)
gs_auth() # Creates authorizaton token (.httr-oauth) in current directory if not present
sheetid <-"1U-32UcwZP1k3saKeaH1mbvEAOfZRdNHNkWK2GI1rpPM"
gap <- gs_key(sheetid)
mysheet <- gs_read(gap, skip=4)
myDF <- as.data.frame(mysheet)
myDF
```
## Export of tabular data
```{r write_table, eval=FALSE}
write.table(myDF, file="myfile.xls", sep="\t", quote=FALSE, col.names=NA)
```
## Line-wise import
```{r readlines, eval=FALSE}
myDF <- readLines("myData.txt")
```
## Line-wise export
```{r writelines, eval=FALSE}
writeLines(month.name, "myData.txt")
```
## Copy and paste into R
On Windows/Linux systems
```{r paste_windows, eval=FALSE}
read.delim("clipboard")
```
On Mac OS X systems
```{r paste_osx, eval=FALSE}
read.delim(pipe("pbpaste"))
```
## Copy and paste from R
On Windows/Linux systems
```{r copy_windows, eval=FALSE}
write.table(iris, "clipboard", sep="\t", col.names=NA, quote=F)
```
On Mac OS X systems
```{r copy_osx, eval=FALSE}
zz <- pipe('pbcopy', 'w')
write.table(iris, zz, sep="\t", col.names=NA, quote=F)
close(zz)
```
## Homework 3A
Homework 3A: [Object Subsetting Routines and Import/Export](http://girke.bioinformatics.ucr.edu/GEN242/mydoc_homework_03.html)
# Useful R Functions
## Unique entries
Make vector entries unique with `unique`
```{r unique, eval=TRUE}
length(iris$Sepal.Length)
length(unique(iris$Sepal.Length))
```
## Count occurrences
Count occurrences of entries with `table`
```{r table, eval=TRUE}
table(iris$Species)
```
## Aggregate data
Compute aggregate statistics with `aggregate`
```{r aggregate, eval=TRUE}
aggregate(iris[,1:4], by=list(iris$Species), FUN=mean, na.rm=TRUE)
```
## Intersect data
Compute intersect between two vectors with `%in%`
```{r intersect, eval=TRUE}
month.name %in% c("May", "July")
```
## Merge data frames
Join two data frames by common field entries with `merge` (here row names `by.x=0`). To obtain only the common rows, change `all=TRUE` to `all=FALSE`. To merge on specific columns, refer to them by their position numbers or their column names.
```{r merge, eval=TRUE}
frame1 <- iris[sample(1:length(iris[,1]), 30), ]
frame1[1:2,]
dim(frame1)
my_result <- merge(frame1, iris, by.x = 0, by.y = 0, all = TRUE)
dim(my_result)
```
# dplyr Environment
Modern object classes and methods for handling `data.frame` like structures
are provided by the `dplyr` and `data.table` packages. The following gives a
short introduction to the usage and functionalities of the `dplyr` package.
More detailed tutorials on this topic can be found here:
* [dplyr: A Grammar of Data Manipulation](https://rdrr.io/cran/dplyr/)
* [Introduction to `dplyr`](https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html)
* [Tutorial on `dplyr`](http://genomicsclass.github.io/book/pages/dplyr_tutorial.html)
* [Cheatsheet for Joins from Jenny Bryan](http://stat545.com/bit001_dplyr-cheatsheet.html)
* [Tibbles](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html)
* [Intro to `data.table` package](https://www.r-bloggers.com/intro-to-the-data-table-package/)
* [Big data with `dplyr` and `data.table`](https://www.r-bloggers.com/working-with-large-datasets-with-dplyr-and-data-table/)
* [Fast lookups with `dplyr` and `data.table`](https://www.r-bloggers.com/fast-data-lookups-in-r-dplyr-vs-data-table/)
## Installation
The `dplyr` environment has evolved into an ecosystem of packages. To simplify
package management, one can install and load the entire collection via the
`tidyverse` package. For more details on `tidyverse` see
[here](http://tidyverse.org/).
```{r tidyverse_install, eval=FALSE}
install.packages("tidyverse")
```
## Construct a `data frame` (`tibble`)
```{r data_frame_tbl1, eval=TRUE}
library(tidyverse)
as_data_frame(iris) # coerce data.frame to data frame tbl
```
Alternative functions producing the same result include `as_tibble` and `tbl_df`:
```{r data_frame_tbl2, eval=FALSE}
as_tibble(iris) # newer function provided by tibble package
tbl_df(iris) # this alternative exists for historical reasons
```
## Reading and writing tabular files
While the base R read/write utilities can be used for `data frames`, best time
performance with the least amount of typing is achieved with the export/import
functions from the `readr` package. For very large files the `fread` function from
the `data.table` package achieves the best time performance.
### Import with `readr`
Import functions provided by `readr` include:
* `read_csv()`: comma separated (CSV) files
* `read_tsv()`: tab separated files
* `read_delim()`: general delimited files
* `read_fwf()`: fixed width files
* `read_table()`: tabular files where colums are separated by white-space.
* `read_log()`: web log files
Create a sample tab delimited file for import
```{r tabular_sample, eval=TRUE}
write_tsv(iris, "iris.txt") # Creates sample file
```
Import with `read_tsv`
```{r tabular_import1, eval=TRUE}
iris_df <- read_tsv("iris.txt") # Import with read_tbv from readr package
iris_df
```
To import Google Sheets directly into R, see [here](http://girke.bioinformatics.ucr.edu/GEN242/mydoc_Rbasics_10.html).
### Fast table import with `fread`
The `fread` function from the `data.table` package provides the best time performance for reading large
tabular files into R.
```{r tabular_import2, eval=TRUE}
library(data.table)
iris_df <- as_data_frame(fread("iris.txt")) # Import with fread and conversion to tibble
iris_df
```
Note: to ignore lines starting with comment signs, one can pass on to `fread` a shell
command for preprocessing the file. The following example illustrates this option.
```{r tabular_import_ignore, eval=FALSE}
fread("grep -v '^#' iris.txt")
```
### Export with `readr`
Export function provided by `readr` inlcude
* `write_delim()`: general delimited files
* `write_csv()`: comma separated (CSV) files
* `write_excel_csv()`: excel style CSV files
* `write_tsv()`: tab separated files
For instance, the `write_tsv` function writes a `data frame` to a tab delimited file with much nicer
default settings than the base R `write.table` function.
```{r tabular_export_readr, eval=FALSE}
write_tsv(iris_df, "iris.txt")
```
## Column and row binds
The equivalents to base R's `rbind` and `cbind` are `bind_rows` and `bind_cols`, respectively.
```{r dplyr_bind, eval=TRUE}
bind_cols(iris_df, iris_df)
bind_rows(iris_df, iris_df)
```
## Extract column as vector
The subsetting operators `[[` and `$`can be used to extract from a `data frame` single columns as vector.
```{r plyr_get_cols, eval=TRUE}
iris_df[[5]][1:12]
iris_df$Species[1:12]
```
## Important `dplyr` functions
1. `filter()` and `slice()`
2. `arrange()`
3. `select()` and `rename()`
4. `distinct()`
5. `mutate()` and `transmute()`
6. `summarise()`
7. `sample_n()` and `sample_frac()`
## Slice and filter functions
### Filter function
```{r plyr_filter, eval=TRUE}
filter(iris_df, Sepal.Length > 7.5, Species=="virginica")
```
### Base R code equivalent
```{r plyr_filter_base, eval=TRUE}
iris_df[iris_df[, "Sepal.Length"] > 7.5 & iris_df[, "Species"]=="virginica", ]
```
### Including boolean operators
```{r plyr_filter_boolean, eval=TRUE}
filter(iris_df, Sepal.Length > 7.5 | Sepal.Length < 5.5, Species=="virginica")
```
### Subset rows by position
`dplyr` approach
```{r plyr_subset, eval=TRUE}
slice(iris_df, 1:2)
```
Base R code equivalent
```{r plyr_subset_base, eval=TRUE}
iris_df[1:2,]
```
### Subset rows by names
Since `data frames` do not contain row names, row wise subsetting via the `[,]` operator cannot be used.
However, the corresponding behavior can be achieved by passing to `select` a row position index
obtained by basic R intersect utilities such as `match`.
Create a suitable test `data frame`
```{r plyr_sample_set2, eval=TRUE}
df1 <- bind_cols(data_frame(ids1=paste0("g", 1:10)), as_data_frame(matrix(1:40, 10, 4, dimnames=list(1:10, paste0("CA", 1:4)))))
df1
```
`dplyr` approach
```{r plyr_subset_names, eval=TRUE}
slice(df1, match(c("g10", "g4", "g4"), df1$ids1))
```
Base R equivalent
```{r plyr_subset_names_base, eval=TRUE}
df1_old <- as.data.frame(df1)
rownames(df1_old) <- df1_old[,1]
df1_old[c("g10", "g4", "g4"),]
```
## Sorting with `arrange`
Row-wise ordering based on specific columns
`dplyr` approach
```{r plyr_order1, eval=TRUE}
arrange(iris_df, Species, Sepal.Length, Sepal.Width)
```
For ordering descendingly use `desc()` function
```{r plyr_order2, eval=TRUE}
arrange(iris_df, desc(Species), Sepal.Length, Sepal.Width)
```
Base R code equivalent
```{r plyr_order_base, eval=TRUE}
iris_df[order(iris_df$Species, iris_df$Sepal.Length, iris_df$Sepal.Width), ]
iris_df[order(iris_df$Species, decreasing=TRUE), ]
```
## Select columns with `select`
Select specific columns
```{r plyr_col_select1, eval=TRUE}
select(iris_df, Species, Petal.Length, Sepal.Length)
```
Select range of columns by name
```{r plyr_col_select2, eval=TRUE}
select(iris_df, Sepal.Length : Petal.Width)
```
Drop specific columns (here range)
```{r plyr_col_drop, eval=TRUE}
select(iris_df, -(Sepal.Length : Petal.Width))
```
## Renaming columns with `rename`
`dplyr` approach
```{r plyr_col_rename, eval=TRUE}
rename(iris_df, new_col_name = Species)
```
Base R code approach
```{r baser_col_rename, eval=FALSE}
colnames(iris_df)[colnames(iris_df)=="Species"] <- "new_col_names"
```
## Obtain unique rows with `distinct`
`dplyr` approach
```{r plyr_unique, eval=TRUE}
distinct(iris_df, Species, .keep_all=TRUE)
```
Base R code approach
```{r baser_unique, eval=TRUE}
iris_df[!duplicated(iris_df$Species),]
```
## Add columns
### `mutate`
The `mutate` function allows to append columns to existing ones.
```{r plyr_mutate, eval=TRUE}
mutate(iris_df, Ratio = Sepal.Length / Sepal.Width, Sum = Sepal.Length + Sepal.Width)
```
### `transmute`
The `transmute` function does the same as `mutate` but drops existing columns
```{r plyr_transmute, eval=TRUE}
transmute(iris_df, Ratio = Sepal.Length / Sepal.Width, Sum = Sepal.Length + Sepal.Width)
```
### `bind_cols`
The `bind_cols` function is the equivalent of `cbind` in base R. To add rows, use the corresponding
`bind_rows` function.
```{r plyr_bind_cols, eval=TRUE}
bind_cols(iris_df, iris_df)
```
## Summarize data
Summary calculation on single column
```{r plyr_summarize1, eval=TRUE}
summarize(iris_df, mean(Petal.Length))
```
Summary calculation on many columns
```{r plyr_summarize2, eval=TRUE}
summarize_all(iris_df[,1:4], mean)
```
Summarize by grouping column
```{r plyr_summarize, eval=TRUE}
summarize(group_by(iris_df, Species), mean(Petal.Length))
```
Aggregate summaries
```{r plyr_summarize3, eval=TRUE}
summarize_all(group_by(iris_df, Species), mean)
```
Note: `group_by` does the looping for the user similar to `aggregate` or `tapply`.
## Merging data frames
The `dplyr` package provides several join functions for merging `data frames` by a common key column
similar to the `merge` function in base R. These `*_join` functions include:
* `inner_join()`: returns join only for rows matching among both `data tables`
* `full_join()`: returns join for all (matching and non-matching) rows of two `data tables`
* `left_join()`: returns join for all rows in first `data table`
* `right_join()`: returns join for all rows in second `data table`
* `anti_join()`: returns for first `data table` only those rows that have no match in the second one
Sample `data frames` to illustrate `*.join` functions.
```{r plyr_join_sample, eval=TRUE}
df1 <- bind_cols(data_frame(ids1=paste0("g", 1:10)), as_data_frame(matrix(1:40, 10, 4, dimnames=list(1:10, paste0("CA", 1:4)))))
df1
df2 <- bind_cols(data_frame(ids2=paste0("g", c(2,5,11,12))), as_data_frame(matrix(1:16, 4, 4, dimnames=list(1:4, paste0("CB", 1:4)))))
df2
```
### Inner join
```{r plyr_inner_join, eval=TRUE}
inner_join(df1, df2, by=c("ids1"="ids2"))
```
### Left join
```{r plyr_left_join, eval=TRUE}
left_join(df1, df2, by=c("ids1"="ids2"))
```
### Right join
```{r plyr_right_join, eval=TRUE}
right_join(df1, df2, by=c("ids1"="ids2"))
```
### Full join
```{r plyr_full_join, eval=TRUE}
full_join(df1, df2, by=c("ids1"="ids2"))
```
### Anti join
```{r plyr_anti_join, eval=TRUE}
anti_join(df1, df2, by=c("ids1"="ids2"))
```
For additional join options users want to cosult the `*_join` help pages.
## Chaining
To simplify chaining of serveral operations, `dplyr` provides the `%>%`
operator. where `x %>% f(y)` turns into `f(x, y)`. This way one can pipe
together multiple operations by writing them from left-to-right or
top-to-bottom. This makes for easy to type and readable code.
### Example 1
Series of data manipulations and export
```{r plyr_chaining1, eval=TRUE}
iris_df %>% # Declare data frame to use
select(Sepal.Length:Species) %>% # Select columns
filter(Species=="setosa") %>% # Filter rows by some value
arrange(Sepal.Length) %>% # Sort by some column
mutate(Subtract=Petal.Length - Petal.Width) # Calculate and append
# write_tsv("iris.txt") # Export to file, omitted here to show result
```
### Example 2
Series of summary calculations for grouped data (`group_by`)
```{r plyr_chaining2, eval=TRUE}
iris_df %>% # Declare data frame to use
group_by(Species) %>% # Group by species
summarize(Mean_Sepal.Length=mean(Sepal.Length),
Max_Sepal.Length=max(Sepal.Length),
Min_Sepal.Length=min(Sepal.Length),
SD_Sepal.Length=sd(Sepal.Length),
Total=n())
```
### Example 3
Combining `dplyr` chaining with `ggplot`
```{r plyr_chaining3, eval=TRUE}
iris_df %>%
group_by(Species) %>%
summarize_all(mean) %>%
reshape2::melt(id.vars=c("Species"), variable.name = "Samples", value.name="Values") %>%
ggplot(aes(Samples, Values, fill = Species)) +
geom_bar(position="dodge", stat="identity")
```
# SQLite Databases
`SQLite` is a lightweight relational database solution. The `RSQLite` package provides an easy to use interface to create, manage and query `SQLite` databases directly from R. Basic instructions
for using `SQLite` from the command-line are available [here](https://www.sqlite.org/cli.html). A short introduction to `RSQLite` is available [here](https://github.com/rstats-db/RSQLite/blob/master/vignettes/RSQLite.Rmd).
## Loading data into SQLite databases
The following loads two `data.frames` derived from the `iris` data set (here `mydf1` and `mydf2`)
into an SQLite database (here `test.db`).
```{r load_sqlite, eval=TRUE}
library(RSQLite)
mydb <- dbConnect(SQLite(), "test.db") # Creates database file test.db
mydf1 <- data.frame(ids=paste0("id", seq_along(iris[,1])), iris)
mydf2 <- mydf1[sample(seq_along(mydf1[,1]), 10),]
dbWriteTable(mydb, "mydf1", mydf1)
dbWriteTable(mydb, "mydf2", mydf2)
```
## List names of tables in database
```{r list_tables, eval=TRUE}
dbListTables(mydb)
```
## Import table into `data.frame`
```{r import_sqlite_tables, eval=TRUE}
dbGetQuery(mydb, 'SELECT * FROM mydf2')
```
## Query database
```{r query_sqlite_tables, eval=TRUE}
dbGetQuery(mydb, 'SELECT * FROM mydf1 WHERE "Sepal.Length" < 4.6')
```
## Join tables
The two tables can be joined on the shared `ids` column as follows.
```{r join_sqlite_tables, eval=TRUE}
dbGetQuery(mydb, 'SELECT * FROM mydf1, mydf2 WHERE mydf1.ids = mydf2.ids')
```
# Graphics in R
## Advantages
- Powerful environment for visualizing scientific data
- Integrated graphics and statistics infrastructure
- Publication quality graphics
- Fully programmable
- Highly reproducible
- Full [LaTeX](http://www.latex-project.org/) and Markdown support via `knitr` and `R markdown`
- Vast number of R packages with graphics utilities
## Documentation for R Graphics
__General__
- Graphics Task Page - [URL](http://cran.r-project.org/web/views/Graphics.html)
- R Graph Gallery - [URL](http://addictedtor.free.fr/graphiques/allgraph.php)
- R Graphical Manual - [URL](http://cged.genes.nig.ac.jp/RGM2/index.php)
- Paul Murrell's book R (Grid) Graphics - [URL](http://www.stat.auckland.ac.nz/~paul/RGraphics/rgraphics.html)
__Interactive graphics__
- rggobi` (GGobi) - [URL](http://www.ggobi.org/)
- `iplots` - [URL](http://www.rosuda.org/iplots/)
- Open GL (`rgl`) - [URL](http://rgl.neoscientists.org/gallery.shtml)
## Graphics Environments
__Viewing and saving graphics in R__
- On-screen graphics
- postscript, pdf, svg
- jpeg, png, wmf, tiff, ...
__Four major graphic environments__
(a) Low-level infrastructure
- R Base Graphics (low- and high-level)
- `grid`: [Manual](http://www.stat.auckland.ac.nz/~paul/grid/grid.html)
(b) High-level infrastructure
\begin{itemize}
- `lattice`: [Manual](http://lmdvr.r-forge.r-project.org), [Intro](http://www.his.sunderland.ac.uk/~cs0her/Statistics/UsingLatticeGraphicsInR.htm), [Book](http://www.amazon.com/Lattice-Multivariate-Data-Visualization-Use/dp/0387759689)
- `ggplot2`: [Manual](http://had.co.nz/ggplot2/), [Intro](http://www.ling.upenn.edu/~joseff/rstudy/summer2010_ggplot2_intro.html), [Book](http://had.co.nz/ggplot2/book/)
## Base Graphics: Overview
__Important high-level plotting functions__
- `plot`: generic x-y plotting
- `barplot`: bar plots
- `boxplot`: box-and-whisker plot
- `hist`: histograms
- `pie`: pie charts
- `dotchart`: cleveland dot plots
- `image, heatmap, contour, persp`: functions to generate image-like plots
- `qqnorm, qqline, qqplot`: distribution comparison plots
- `pairs, coplot`: display of multivariant data
__Help on graphics functions__
- `?myfct`
- `?plot`
- `?par`
### Preferred Object Types
- Matrices and data frames
- Vectors
- Named vectors
## Scatter Plots
### Basic Scatter Plot
Sample data set for subsequent plots
```{r sample_data, eval=TRUE}
set.seed(1410)
y <- matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))
```
Plot data
```{r basic_scatter_plot, eval=TRUE}
plot(y[,1], y[,2])
```
### All pairs
```{r pairs_scatter_plot, eval=TRUE}
pairs(y)
```
### With labels
```{r labels_scatter_plot, eval=TRUE}
plot(y[,1], y[,2], pch=20, col="red", main="Symbols and Labels")
text(y[,1]+0.03, y[,2], rownames(y))
```
## More examples
__Print instead of symbols the row names__
```{r row_scatter_plot, eval=TRUE}
plot(y[,1], y[,2], type="n", main="Plot of Labels")
text(y[,1], y[,2], rownames(y))
```
__Usage of important plotting parameters__
```{r plot_usage, eval=FALSE}
grid(5, 5, lwd = 2)
op <- par(mar=c(8,8,8,8), bg="lightblue")
plot(y[,1], y[,2], type="p", col="red", cex.lab=1.2, cex.axis=1.2,
cex.main=1.2, cex.sub=1, lwd=4, pch=20, xlab="x label",
ylab="y label", main="My Main", sub="My Sub")
par(op)
```
__Important arguments_
- `mar`: specifies the margin sizes around the plotting area in order: `c(bottom, left, top, right)`
- `col`: color of symbols
- `pch`: type of symbols, samples: `example(points)`
- `lwd`: size of symbols
- `cex.*`: control font sizes
- For details see `?par`
### Add regression line
```{r plot_regression, eval=TRUE}
plot(y[,1], y[,2])
myline <- lm(y[,2]~y[,1]); abline(myline, lwd=2)
summary(myline)
```
### Log scale
Same plot as above, but on log scale
```{r plot_regression_log, eval=TRUE}
plot(y[,1], y[,2], log="xy")
```
### Add a mathematical expression
```{r plot_regression_math, eval=TRUE}
plot(y[,1], y[,2]); text(y[1,1], y[1,2], expression(sum(frac(1,sqrt(x^2*pi)))), cex=1.3)
```
## Homework 3B
Homework 3B: [Scatter Plots](http://girke.bioinformatics.ucr.edu/GEN242/mydoc_homework_03.html)
## Line Plots
### Single data set
```{r plot_line_single, eval=TRUE}
plot(y[,1], type="l", lwd=2, col="blue")
```
### Many Data Sets
Plots line graph for all columns in data frame `y`. The `split.screen` function is used in this example in a for loop to overlay several line graphs in the same plot.
```{r plot_line_many, eval=TRUE}
split.screen(c(1,1))
plot(y[,1], ylim=c(0,1), xlab="Measurement", ylab="Intensity", type="l", lwd=2, col=1)
for(i in 2:length(y[1,])) {
screen(1, new=FALSE)
plot(y[,i], ylim=c(0,1), type="l", lwd=2, col=i, xaxt="n", yaxt="n", ylab="", xlab="", main="", bty="n")
}
close.screen(all=TRUE)
```
## Bar Plots
### Basics
```{r plot_bar_simple, eval=TRUE}
barplot(y[1:4,], ylim=c(0, max(y[1:4,])+0.3), beside=TRUE, legend=letters[1:4])
text(labels=round(as.vector(as.matrix(y[1:4,])),2), x=seq(1.5, 13, by=1) + sort(rep(c(0,1,2), 4)), y=as.vector(as.matrix(y[1:4,]))+0.04)
```
### Error Bars
```{r plot_bar_error, eval=TRUE}
bar <- barplot(m <- rowMeans(y) * 10, ylim=c(0, 10))
stdev <- sd(t(y))
arrows(bar, m, bar, m + stdev, length=0.15, angle = 90)
```
## Histograms
```{r plot_hist, eval=TRUE}
hist(y, freq=TRUE, breaks=10)
```
## Density Plots
```{r plot_dens, eval=TRUE}
plot(density(y), col="red")
```
## Pie Charts
```{r plot_pie, eval=TRUE}
pie(y[,1], col=rainbow(length(y[,1]), start=0.1, end=0.8), clockwise=TRUE)
legend("topright", legend=row.names(y), cex=1.3, bty="n", pch=15, pt.cex=1.8,
col=rainbow(length(y[,1]), start=0.1, end=0.8), ncol=1)
```
## Color Selection Utilities
Default color palette and how to change it
```{r color_palette, eval=TRUE}
palette()
palette(rainbow(5, start=0.1, end=0.2))
palette()
palette("default")
```
The `gray` function allows to select any type of gray shades by providing values from 0 to 1
```{r color_grey, eval=TRUE}
gray(seq(0.1, 1, by= 0.2))
```
Color gradients with `colorpanel` function from `gplots` library`
```{r color_gradient, eval=TRUE}
library(gplots)
colorpanel(5, "darkblue", "yellow", "white")
```
Much more on colors in R see Earl Glynn's color chart [here](http://research.stowers-institute.org/efg/R/Color/Chart/)
## Saving Graphics to File
After the `pdf()` command all graphs are redirected to file `test.pdf`. Works for all common formats similarly: jpeg, png, ps, tiff, ...
```{r save_graphics, eval=FALSE}
pdf("test.pdf")
plot(1:10, 1:10)
dev.off()
```
Generates Scalable Vector Graphics (SVG) files that can be edited in vector graphics programs, such as InkScape.
```{r save_graphics_svg, eval=FALSE}
library("RSvgDevice")
devSVG("test.svg")
plot(1:10, 1:10)
dev.off()
```
## Homework 3C
Homework 3C: [Bar Plots](http://girke.bioinformatics.ucr.edu/GEN242/mydoc_homework_03.html)
# Analysis Routine
## Overview
The following exercise introduces a variety of useful data analysis utilities in R.
## Analysis Routine: Data Import
- __Step 1__: To get started with this exercise, direct your R session to a dedicated workshop directory and download into this directory the following sample tables. Then import the files into Excel and save them as tab delimited text files.
- [MolecularWeight_tair7.xls](http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/MolecularWeight_tair7.xls)
- [TargetP_analysis_tair7.xls](http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/TargetP_analysis_tair7.xls)
__Import the tables into R__
Import molecular weight table
```{r import_data1, eval=FALSE}
my_mw <- read.delim(file="MolecularWeight_tair7.xls", header=T, sep="\t")
my_mw[1:2,]
```
Import subcelluar targeting table
```{r import_data2, eval=FALSE}
my_target <- read.delim(file="TargetP_analysis_tair7.xls", header=T, sep="\t")
my_target[1:2,]
```
Online import of molecular weight table
```{r import_data1b, eval=TRUE}
my_mw <- read.delim(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/MolecularWeight_tair7.xls", header=T, sep="\t")
my_mw[1:2,]
```
Online import of subcelluar targeting table
```{r import_data2b, eval=TRUE}
my_target <- read.delim(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/TargetP_analysis_tair7.xls", header=T, sep="\t")
my_target[1:2,]
```
## Merging Data Frames
- __Step 2__: Assign uniform gene ID column titles
```{r col_names_uni, eval=TRUE}
colnames(my_target)[1] <- "ID"
colnames(my_mw)[1] <- "ID"
```
- __Step 3__: Merge the two tables based on common ID field
```{r merge_tables, eval=TRUE}
my_mw_target <- merge(my_mw, my_target, by.x="ID", by.y="ID", all.x=T)
```
- __Step 4__: Shorten one table before the merge and then remove the non-matching rows (NAs) in the merged file
```{r merge_tables_shorten, eval=TRUE}
my_mw_target2a <- merge(my_mw, my_target[1:40,], by.x="ID", by.y="ID", all.x=T) # To remove non-matching rows, use the argument setting 'all=F'.
my_mw_target2 <- na.omit(my_mw_target2a) # Removes rows containing "NAs" (non-matching rows).
```
- __Homework 3D__: How can the merge function in the previous step be executed so that only the common rows among the two data frames are returned? Prove that both methods - the two step version with `na.omit` and your method - return identical results.
- __Homework 3E__: Replace all `NAs` in the data frame `my_mw_target2a` with zeros.
```{r homework3D_solutions, eval=FALSE, echo=FALSE}
my_mw_target_tmp <- merge(my_mw, my_target[1:40,], by.x="ID", by.y="ID", all=FALSE)
all(my_mw_target2 == my_mw_target_tmp)
my_mw_target2a <- as.matrix(my_mw_target2a)
my_mw_target2a[is.na(my_mw_target2a)] <- 0
my_mw_target2a <- as.data.frame(my_mw_target2a)
```
## Filtering Data
- __Step 5__: Retrieve all records with a value of greater than 100,000 in 'MW' column and 'C' value in 'Loc' column (targeted to chloroplast).
```{r filter_tables1, eval=TRUE}
query <- my_mw_target[my_mw_target[, 2] > 100000 & my_mw_target[, 4] == "C", ]
query[1:4, ]
dim(query)
```
- __Homework 3F__: How many protein entries in the `my`_mw`_target` data frame have a MW of greater then 4,000 and less then 5,000. Subset the data frame accordingly and sort it by MW to check that your result is correct.
## String Substitutions
- __Step 6__: Use a regular expression in a substitute function to generate a separate ID column that lacks the gene model extensions.
<