--- title: "A very short, sketchy, introduction to Bioconductor" author: Abhijit Dasgupta date: BIOF 339 --- ```{r setup, include=F, child = here::here('slides/templates/setup.Rmd')} ``` layout: true
BIOF 339
--- ## Bioconductor Bioconductor provides tools for the analysis and comprehension of high-throughput genomic and biological data, using R. + 1823 packages + Covers the bioinformatic pipeline + Analysis [`GenomicRanges`, `Biostrings`, `GenomicAlignments`, `SummarizedExperiment`] + Annotation (species/platform specific, system) [`biomaRt`, `org.Hs.eg.db`, `GO.db`, `KEGG.db`] + Experiments [`TENxPBMCData`, `airway`, `ALL`] + Workflows [`rnaseqGene`, `TCGAWorkflow`] --- ## Bioconductor .center[Bioconductor v. 3.10 packages] .pull-left[ ![](../img/software.png) ![](../img/experiments.png) ] .pull-right[ ![](../img/annotations.png) ![](../img/workflow.png) ] --- ## Installing Bioconductor packages Bioconductor is a separate repository and system which uses R. So the process is a bit different than `install.packages`. The following works for R version 3.5 and higher. ```{r 11-Bioconductor-1, eval=F} install.packages("BiocManager") BiocManager::install(c('Biobase','limma','hgu95av2.db','Biostrings')) #<< ``` There are several packages that are often installed for each Bioconductor package, and some have functions that have the same name as one's you've used. So + Use `package::function` format for calling functions from non-Bioconductor packages --- ## Bioconductor basics ```{r 11-Bioconductor-2} library(Biostrings) dna <- DNAStringSet(c("AACAT", "GGCGCCT")) reverseComplement(dna) ``` -- ```{r 11-Bioconductor-3} library(Biostrings) data("phiX174Phage") phiX174Phage ``` --- ## Bioconductor basics ```{r 11-Bioconductor-4} letterFrequency(phiX174Phage, 'GC', as.prob=TRUE) ``` --- ## Bioconductor data structures + So far we've seen the `data.frame` or `tibble` be the unit of data storage + In Bioconductor, data are stored in **containers** which can contain many elements of data for an experiment - Actual quantitative results of experiments - Phenotype data - Genotype meta-data - Results of analysis + In Bioconductor workflows, the same container is updated with new elements, which can then be accessed using accessor functions --- ## An ExpressionSet ```{r 11-Bioconductor-5} library(Biobase) data('sample.ExpressionSet') str(sample.ExpressionSet) ``` --- ## An ExpressionSet These objects are based on a different R structure. Instead of extracting elements using `$`, this structure uses **slots** which are accessed using `@` ```{r 11-Bioconductor-6} slotNames(sample.ExpressionSet) ``` ```{r 11-Bioconductor-7} sample.ExpressionSet@phenoData ``` --- ## An ExpressionSet We almost never use `@`. Instead we use *accessor functions* to extract data ```{r 11-Bioconductor-8} pData(sample.ExpressionSet) # Phenotype data ``` --- ## An ExpressionSet We almost never use `@`. Instead we use *accessor functions* to extract data ```{r 11-Bioconductor-9} head(exprs(sample.ExpressionSet)) # Expression data ``` --- ## SummarizedExperiment This is a more common structure related to modern experiments with different technologies .center[![](img/se.png)] --- ## An Example ```{r 11-Bioconductor-10} # BiocManager::install('SummarizedExperiment') library(SummarizedExperiment) data(airway, package="airway") se <- airway se ``` --- ## An Example Count data from the scRNA-seq experiment ```{r 11-Bioconductor-11} assay(se) ``` --- ## An Example Genomic ranges for each transcript ```{r 11-Bioconductor-12} rowRanges(se) ``` --- ## An Example Phenotype data ```{r 11-Bioconductor-13} colData(se) ``` --- ## An Example Experimental meta-data ```{r 11-Bioconductor-14} metadata(se) ``` --- ## An Example Data subsetting .pull-left[ ```{r 11-Bioconductor-15} se[1:5, 1:3] ``` ] .pull-right[ ```{r 11-Bioconductor-16} se[,se$cell=='N61311'] ``` ] --- class: center, middle # Annotation --- ## biomaRt The `biomaRt` package allows access to many public annotation databases ```{r 11-Bioconductor-17} # BiocManager::install('biomaRt') library(biomaRt) ensemblDB <- useMart('ensembl') searchDatasets(mart=ensemblDB, pattern='Human') ``` -- ```{r 11-Bioconductor-18} ensemblHuman <- useMart('ensembl', dataset='hsapiens_gene_ensembl', host='useast.ensembl.org') ``` --- ## Identifying attributes ```{r 11-Bioconductor-19} searchAttributes(mart=ensemblHuman, pattern='affy') ``` --- ## Identifying attributes ```{r 11-Bioconductor-20} searchAttributes(mart=ensemblHuman, pattern='hgnc') ``` --- ## Annotating probsets We first grab some probesets from the `sample.ExpressionSet` Affy experiment ```{r 11-Bioconductor-21} affyIDs <- rownames(featureData(sample.ExpressionSet)) ``` Now let's find annotation ```{r 11-Bioconductor-22} getBM(attributes = c('affy_hg_u95av2','entrezgene_id'), filters = 'affy_hg_u95av2', values = affyIDs[200:203], mart = ensemblHuman) ``` --- class:middle, center # A RNA-seq pipeline --- ## The workflow + Exploratory data analysis + Differential expression analysis with [DESeq2](https://bioconductor.org/packages/DESeq2) + Visualization + We will start after reads have been aligned to a reference genome and reads overlapping known genes have been counted #### The experiment + In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. + For each of the four cell lines, we have a treated and an untreated sample. --- ## Get data ```{r 11-Bioconductor-23} # BiocManager::install('airway') library(airway) data(airway) se <- airway head(assay(airway)) ``` --- ## Create a DESeqDataSet ```{r 11-Bioconductor-24} # BiocManager::install('DESeq2') library("DESeq2") dds <- DESeqDataSet(se, design = ~ cell + dex) dds ``` --- ## Run differential expression pipeline ```{r 11-Bioconductor-25, message = T} dds <- DESeq(dds) ``` --- ## Run differential expression pipeline ```{r 11-Bioconductor-26} (res <- results(dds)) ``` --- ## Summarizing results ```{r 11-Bioconductor-27} library(tidyverse) as.data.frame(res) %>% rownames_to_column(var = 'ID') %>% filter(padj < 0.1) %>% arrange(desc(abs(log2FoldChange))) %>% head() ``` --- ## A visualization ```{r 11-Bioconductor-28, fig.height=3, fig.width=6} topGene <- rownames(res)[which.min(res$padj)] dat <- plotCounts(dds, gene=topGene, intgroup=c("dex"), returnData=TRUE) ggplot(dat, aes(x = dex, y = count, fill=dex))+ geom_dotplot(binaxis='y', stackdir='center')+ scale_y_log10() ``` --- class:middle,center # A Seurat pipeline --- ## Grab the data and convert for Seurat ```{r 11-Bioconductor-29, echo=F} library(Seurat) pb2 = Seurat::Read10X('~/filtered_gene_bc_matrices/hg19/') pbmc2 <- CreateSeuratObject(counts = pb2, project = "pbmc3k", min.cells = 3, min.features = 200) r2 <- rownames(pbmc2@assays$RNA@counts) ``` ```{r 11-Bioconductor-30} # BiocManager::install('TENxPBMCData') library(TENxPBMCData) library(Seurat) pbmc_3k <- TENxPBMCData(dataset='pbmc3k') colnames(pbmc_3k) <- paste0('Cell', seq_len(ncol(pbmc_3k))) pbmc <- CreateSeuratObject( counts=assay(pbmc_3k, 'counts'), project='pbmc3k', min.cells=3, min.features = 200) pbmc ``` --- ## A bit of QC ```{r 11-Bioconductor-31} # The [[ operator can add columns to object metadata. This is a great place to stash QC stats rownames(pbmc@assays$RNA@counts) <- r2 rownames(pbmc[['RNA']]@meta.features) <- r2 rownames(pbmc@assays$RNA@data) <- r2 # Change to gene names from Ensembl pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") # percentage in mitochondria genome head(pbmc@meta.data) ``` > See how analyses results are added to the same container. The idea is to keep all the experimental information together. This was a philosophic choice maide by the Bioconductor developers, inspired by the MIAME requirements and how data are stored on genomics databases like GEO --- ## Visualization ```{r 11-Bioconductor-32} # Visualize QC metrics as a violin plot (plt <- VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)) ``` --- ## Normalization ```{r 11-Bioconductor-33} pbmc <- NormalizeData(pbmc) ``` > Note, we're saving in the same object --- ## Feature selection ```{r 11-Bioconductor-34, fig.width=12, fig.height=4} rownames(pbmc[['RNA']]@meta.features) <- r2 pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000) # Identify the 10 most highly variable genes top10 <- head(VariableFeatures(pbmc), 10) # plot variable features with and without labels plot1 <- VariableFeaturePlot(pbmc) plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) CombinePlots(plots = list(plot1, plot2)) ``` --- ## PCA ```{r 11-Bioconductor-35} pbmc <- ScaleData(pbmc) pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc)) print(pbmc[["pca"]], dims = 1:5, nfeatures = 5) ``` > Important to note that each step just adds elements to the Seurat object --- ## PCA ```{r 11-Bioconductor-36} DimPlot(pbmc, reduction = "pca") ``` --- ## t-SNE ```{r 11-Bioconductor-37} pbmc <- RunTSNE(pbmc, dims=1:10) DimPlot(pbmc, reduction='tsne') ``` --- class:middle,center # Heatmaps --- ## Heatmaps There are several ways of doing heatmaps in R: + [http://sebastianraschka.com/Articles/heatmaps_in_r.html](http://sebastianraschka.com/Articles/heatmaps_in_r.html) + [https://plot.ly/r/heatmaps/](https://plot.ly/r/heatmaps/) + [http://moderndata.plot.ly/interactive-heat-maps-for-r/](http://moderndata.plot.ly/interactive-heat-maps-for-r/) + [http://www.siliconcreek.net/r/simple-heatmap-in-r-with-ggplot2](http://www.siliconcreek.net/r/simple-heatmap-in-r-with-ggplot2) + [https://rud.is/b/2016/02/14/making-faceted-heatmaps-with-ggplot2/](https://rud.is/b/2016/02/14/making-faceted-heatmaps-with-ggplot2/) --- ## Some example data ```{r 11-Bioconductor-38} library(Biobase) data(sample.ExpressionSet) exdat <- sample.ExpressionSet library(limma) design1 <- model.matrix(~type, data=pData(exdat)) lm1 <- lmFit(exprs(exdat), design1) lm1 <- eBayes(lm1) # compute linear model for each probeset geneID <- rownames(topTable(lm1, coef=2, num=100, adjust='none',p.value=0.05)) exdat2 <- exdat[geneID,] # Keep features with p-values < 0.05 exdat2 ``` --- ```{r 11-Bioconductor-39, echo=T, fig.width=6, fig.height=6, eval=T} # BiocManager::install('Heatplus') library(Heatplus) reg1 <- regHeatmap(exprs(exdat2), legend=2, col=heat.colors, breaks=-3:3) plot(reg1) ``` --- ```{r 11-Bioconductor-40, echo=T, fig.width=6, fig.height=6, eval=T} library(Heatplus) reg1 <- regHeatmap(exprs(exdat2), legend=2, col=heat.colors, breaks=-3:3) plot(reg1) ``` --- ```{r 11-Bioconductor-41, echo=T, eval=T, fig.width=6, fig.height = 6} corrdist <- function(x) as.dist(1-cor(t(x))) hclust.avl <- function(x) hclust(x, method='average') reg2 <- regHeatmap(exprs(exdat2), legend=2, col=heat.colors, breaks=-3:3, dendrogram = list(clustfun=hclust.avl, distfun=corrdist)) plot(reg2) ``` --- ```{r 11-Bioconductor-42, echo=T, eval=T} ann1 <- annHeatmap(exprs(exdat2), ann=pData(exdat2), col = heat.colors) plot(ann1) ``` --- ```{r 11-Bioconductor-43, echo=T, eval=T, fig.width=7, fig.height=6} ann2 <- annHeatmap(exprs(exdat2), ann=pData(exdat2), col = heat.colors, cluster = list(cuth=7500, label=c('Control-like','Case-like'))) plot(ann2) ``` --- ```{r 11-Bioconductor-44, out.height=500, out.width=500, eval=F} # devtools::install_github("rstudio/d3heatmap") d3heatmap(exprs(exdat2), distfun = corrdist, hclustfun = hclust.avl, scale='row') ``` [Link](../heatmap.html) Put your mouse over each point :) --- # Resources + [Organizing Single Cell Analysis with Bioconductor](https://osca.bioconductor.org/) + [Bioconductor Courses](https://bioconductor.org/help/course-materials/) + [2019 Bioconductor workshops](http://biocworkshops2019.bioconductor.org.s3-website-us-east-1.amazonaws.com/)