---
title: "A very short, sketchy, introduction to Bioconductor"
author: Abhijit Dasgupta
date: BIOF 339
---
```{r setup, include=F, child = here::here('slides/templates/setup.Rmd')}
```
layout: true
---
## Bioconductor
Bioconductor provides tools for the analysis and comprehension of high-throughput genomic and biological data, using R.
+ 1823 packages
+ Covers the bioinformatic pipeline
+ Analysis [`GenomicRanges`, `Biostrings`, `GenomicAlignments`, `SummarizedExperiment`]
+ Annotation (species/platform specific, system) [`biomaRt`, `org.Hs.eg.db`, `GO.db`, `KEGG.db`]
+ Experiments [`TENxPBMCData`, `airway`, `ALL`]
+ Workflows [`rnaseqGene`, `TCGAWorkflow`]
---
## Bioconductor
.center[Bioconductor v. 3.10 packages]
.pull-left[
![](../img/software.png)
![](../img/experiments.png)
]
.pull-right[
![](../img/annotations.png)
![](../img/workflow.png)
]
---
## Installing Bioconductor packages
Bioconductor is a separate repository and system which uses R. So the process is a bit different than `install.packages`. The following works for R version 3.5 and higher.
```{r 11-Bioconductor-1, eval=F}
install.packages("BiocManager")
BiocManager::install(c('Biobase','limma','hgu95av2.db','Biostrings')) #<<
```
There are several packages that are often installed for each Bioconductor package, and some have functions
that have the same name as one's you've used. So
+ Use `package::function` format for calling functions from non-Bioconductor packages
---
## Bioconductor basics
```{r 11-Bioconductor-2}
library(Biostrings)
dna <- DNAStringSet(c("AACAT", "GGCGCCT"))
reverseComplement(dna)
```
--
```{r 11-Bioconductor-3}
library(Biostrings)
data("phiX174Phage")
phiX174Phage
```
---
## Bioconductor basics
```{r 11-Bioconductor-4}
letterFrequency(phiX174Phage, 'GC', as.prob=TRUE)
```
---
## Bioconductor data structures
+ So far we've seen the `data.frame` or `tibble` be the unit of data storage
+ In Bioconductor, data are stored in **containers** which can contain many elements of data for an experiment
- Actual quantitative results of experiments
- Phenotype data
- Genotype meta-data
- Results of analysis
+ In Bioconductor workflows, the same container is updated with new elements, which can then be accessed using accessor functions
---
## An ExpressionSet
```{r 11-Bioconductor-5}
library(Biobase)
data('sample.ExpressionSet')
str(sample.ExpressionSet)
```
---
## An ExpressionSet
These objects are based on a different R structure. Instead of extracting elements using `$`, this structure uses **slots** which are accessed using `@`
```{r 11-Bioconductor-6}
slotNames(sample.ExpressionSet)
```
```{r 11-Bioconductor-7}
sample.ExpressionSet@phenoData
```
---
## An ExpressionSet
We almost never use `@`. Instead we use *accessor functions* to extract data
```{r 11-Bioconductor-8}
pData(sample.ExpressionSet) # Phenotype data
```
---
## An ExpressionSet
We almost never use `@`. Instead we use *accessor functions* to extract data
```{r 11-Bioconductor-9}
head(exprs(sample.ExpressionSet)) # Expression data
```
---
## SummarizedExperiment
This is a more common structure related to modern experiments with different technologies
.center[![](img/se.png)]
---
## An Example
```{r 11-Bioconductor-10}
# BiocManager::install('SummarizedExperiment')
library(SummarizedExperiment)
data(airway, package="airway")
se <- airway
se
```
---
## An Example
Count data from the scRNA-seq experiment
```{r 11-Bioconductor-11}
assay(se)
```
---
## An Example
Genomic ranges for each transcript
```{r 11-Bioconductor-12}
rowRanges(se)
```
---
## An Example
Phenotype data
```{r 11-Bioconductor-13}
colData(se)
```
---
## An Example
Experimental meta-data
```{r 11-Bioconductor-14}
metadata(se)
```
---
## An Example
Data subsetting
.pull-left[
```{r 11-Bioconductor-15}
se[1:5, 1:3]
```
]
.pull-right[
```{r 11-Bioconductor-16}
se[,se$cell=='N61311']
```
]
---
class: center, middle
# Annotation
---
## biomaRt
The `biomaRt` package allows access to many public annotation databases
```{r 11-Bioconductor-17}
# BiocManager::install('biomaRt')
library(biomaRt)
ensemblDB <- useMart('ensembl')
searchDatasets(mart=ensemblDB, pattern='Human')
```
--
```{r 11-Bioconductor-18}
ensemblHuman <- useMart('ensembl', dataset='hsapiens_gene_ensembl', host='useast.ensembl.org')
```
---
## Identifying attributes
```{r 11-Bioconductor-19}
searchAttributes(mart=ensemblHuman, pattern='affy')
```
---
## Identifying attributes
```{r 11-Bioconductor-20}
searchAttributes(mart=ensemblHuman, pattern='hgnc')
```
---
## Annotating probsets
We first grab some probesets from the `sample.ExpressionSet` Affy experiment
```{r 11-Bioconductor-21}
affyIDs <- rownames(featureData(sample.ExpressionSet))
```
Now let's find annotation
```{r 11-Bioconductor-22}
getBM(attributes = c('affy_hg_u95av2','entrezgene_id'),
filters = 'affy_hg_u95av2',
values = affyIDs[200:203],
mart = ensemblHuman)
```
---
class:middle, center
# A RNA-seq pipeline
---
## The workflow
+ Exploratory data analysis
+ Differential expression analysis with [DESeq2](https://bioconductor.org/packages/DESeq2)
+ Visualization
+ We will start after reads have been aligned to a reference genome and reads overlapping known genes have been counted
#### The experiment
+ In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours.
+ For each of the four cell lines, we have a treated and an untreated sample.
---
## Get data
```{r 11-Bioconductor-23}
# BiocManager::install('airway')
library(airway)
data(airway)
se <- airway
head(assay(airway))
```
---
## Create a DESeqDataSet
```{r 11-Bioconductor-24}
# BiocManager::install('DESeq2')
library("DESeq2")
dds <- DESeqDataSet(se, design = ~ cell + dex)
dds
```
---
## Run differential expression pipeline
```{r 11-Bioconductor-25, message = T}
dds <- DESeq(dds)
```
---
## Run differential expression pipeline
```{r 11-Bioconductor-26}
(res <- results(dds))
```
---
## Summarizing results
```{r 11-Bioconductor-27}
library(tidyverse)
as.data.frame(res) %>%
rownames_to_column(var = 'ID') %>%
filter(padj < 0.1) %>%
arrange(desc(abs(log2FoldChange))) %>% head()
```
---
## A visualization
```{r 11-Bioconductor-28, fig.height=3, fig.width=6}
topGene <- rownames(res)[which.min(res$padj)]
dat <- plotCounts(dds, gene=topGene, intgroup=c("dex"), returnData=TRUE)
ggplot(dat, aes(x = dex, y = count, fill=dex))+
geom_dotplot(binaxis='y', stackdir='center')+
scale_y_log10()
```
---
class:middle,center
# A Seurat pipeline
---
## Grab the data and convert for Seurat
```{r 11-Bioconductor-29, echo=F}
library(Seurat)
pb2 = Seurat::Read10X('~/filtered_gene_bc_matrices/hg19/')
pbmc2 <- CreateSeuratObject(counts = pb2, project = "pbmc3k", min.cells = 3, min.features = 200)
r2 <- rownames(pbmc2@assays$RNA@counts)
```
```{r 11-Bioconductor-30}
# BiocManager::install('TENxPBMCData')
library(TENxPBMCData)
library(Seurat)
pbmc_3k <- TENxPBMCData(dataset='pbmc3k')
colnames(pbmc_3k) <- paste0('Cell', seq_len(ncol(pbmc_3k)))
pbmc <- CreateSeuratObject( counts=assay(pbmc_3k, 'counts'),
project='pbmc3k',
min.cells=3,
min.features = 200)
pbmc
```
---
## A bit of QC
```{r 11-Bioconductor-31}
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
rownames(pbmc@assays$RNA@counts) <- r2
rownames(pbmc[['RNA']]@meta.features) <- r2
rownames(pbmc@assays$RNA@data) <- r2 # Change to gene names from Ensembl
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") # percentage in mitochondria genome
head(pbmc@meta.data)
```
> See how analyses results are added to the same container. The idea is to keep all the experimental information
together. This was a philosophic choice maide by the Bioconductor developers, inspired by the MIAME requirements and how
data are stored on genomics databases like GEO
---
## Visualization
```{r 11-Bioconductor-32}
# Visualize QC metrics as a violin plot
(plt <- VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3))
```
---
## Normalization
```{r 11-Bioconductor-33}
pbmc <- NormalizeData(pbmc)
```
> Note, we're saving in the same object
---
## Feature selection
```{r 11-Bioconductor-34, fig.width=12, fig.height=4}
rownames(pbmc[['RNA']]@meta.features) <- r2
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
CombinePlots(plots = list(plot1, plot2))
```
---
## PCA
```{r 11-Bioconductor-35}
pbmc <- ScaleData(pbmc)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
```
> Important to note that each step just adds elements to the Seurat object
---
## PCA
```{r 11-Bioconductor-36}
DimPlot(pbmc, reduction = "pca")
```
---
## t-SNE
```{r 11-Bioconductor-37}
pbmc <- RunTSNE(pbmc, dims=1:10)
DimPlot(pbmc, reduction='tsne')
```
---
class:middle,center
# Heatmaps
---
## Heatmaps
There are several ways of doing heatmaps in R:
+ [http://sebastianraschka.com/Articles/heatmaps_in_r.html](http://sebastianraschka.com/Articles/heatmaps_in_r.html)
+ [https://plot.ly/r/heatmaps/](https://plot.ly/r/heatmaps/)
+ [http://moderndata.plot.ly/interactive-heat-maps-for-r/](http://moderndata.plot.ly/interactive-heat-maps-for-r/)
+ [http://www.siliconcreek.net/r/simple-heatmap-in-r-with-ggplot2](http://www.siliconcreek.net/r/simple-heatmap-in-r-with-ggplot2)
+ [https://rud.is/b/2016/02/14/making-faceted-heatmaps-with-ggplot2/](https://rud.is/b/2016/02/14/making-faceted-heatmaps-with-ggplot2/)
---
## Some example data
```{r 11-Bioconductor-38}
library(Biobase)
data(sample.ExpressionSet)
exdat <- sample.ExpressionSet
library(limma)
design1 <- model.matrix(~type, data=pData(exdat))
lm1 <- lmFit(exprs(exdat), design1)
lm1 <- eBayes(lm1) # compute linear model for each probeset
geneID <- rownames(topTable(lm1, coef=2, num=100, adjust='none',p.value=0.05))
exdat2 <- exdat[geneID,] # Keep features with p-values < 0.05
exdat2
```
---
```{r 11-Bioconductor-39, echo=T, fig.width=6, fig.height=6, eval=T}
# BiocManager::install('Heatplus')
library(Heatplus)
reg1 <- regHeatmap(exprs(exdat2), legend=2, col=heat.colors,
breaks=-3:3)
plot(reg1)
```
---
```{r 11-Bioconductor-40, echo=T, fig.width=6, fig.height=6, eval=T}
library(Heatplus)
reg1 <- regHeatmap(exprs(exdat2), legend=2, col=heat.colors,
breaks=-3:3)
plot(reg1)
```
---
```{r 11-Bioconductor-41, echo=T, eval=T, fig.width=6, fig.height = 6}
corrdist <- function(x) as.dist(1-cor(t(x)))
hclust.avl <- function(x) hclust(x, method='average')
reg2 <- regHeatmap(exprs(exdat2), legend=2, col=heat.colors,
breaks=-3:3,
dendrogram = list(clustfun=hclust.avl, distfun=corrdist))
plot(reg2)
```
---
```{r 11-Bioconductor-42, echo=T, eval=T}
ann1 <- annHeatmap(exprs(exdat2), ann=pData(exdat2), col = heat.colors)
plot(ann1)
```
---
```{r 11-Bioconductor-43, echo=T, eval=T, fig.width=7, fig.height=6}
ann2 <- annHeatmap(exprs(exdat2), ann=pData(exdat2), col = heat.colors,
cluster = list(cuth=7500,
label=c('Control-like','Case-like')))
plot(ann2)
```
---
```{r 11-Bioconductor-44, out.height=500, out.width=500, eval=F}
# devtools::install_github("rstudio/d3heatmap")
d3heatmap(exprs(exdat2), distfun = corrdist,
hclustfun = hclust.avl, scale='row')
```
[Link](../heatmap.html)
Put your mouse over each point :)
---
# Resources
+ [Organizing Single Cell Analysis with Bioconductor](https://osca.bioconductor.org/)
+ [Bioconductor Courses](https://bioconductor.org/help/course-materials/)
+ [2019 Bioconductor workshops](http://biocworkshops2019.bioconductor.org.s3-website-us-east-1.amazonaws.com/)