---
title: "scRNA-Seq Embedding Methods"
author: "Author: Daniela Cassol, Le Zhang, Thomas Girke"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`"
output:
html_document:
toc: true
toc_float:
collapsed: true
smooth_scroll: true
toc_depth: 3
fig_caption: yes
code_folding: show
number_sections: true
fontsize: 14pt
bibliography: bibtex.bib
weight: 10
type: docs
---
Source code downloads:
[ [.Rmd](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/scrnaseq/scRNAseq.Rmd) ]
[ [.R](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/scrnaseq/scRNAseq.R) ]
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
This tutorial introduces the usage of several software implementations of
embedding algorithms for high-dimensional gene expression data [@Duo2018-oo] that are often
used for single cell RNA-Seq (scRNA-Seq) data. Many of them are available as R packages on CRAN,
Bioconductor and/or GitHub. Examples include PCA, MDS,
[SC3](http://bioconductor.org/packages/release/bioc/html/SC3.html)
[@Kiselev2017-ye],
[isomap](https://bioconductor.org/packages/release/bioc/html/RDRToolbox.html),
[t-SNE](https://cran.r-project.org/web/packages/Rtsne/)
[@donaldson2010package], [FIt-SNE](https://github.com/KlugerLab/FIt-SNE)
[@Linderman2019-qh], and
[UMAP](https://cran.r-project.org/web/packages/umap/index.html)
[@McInnes2018-tc]. In addition, some packages such as Bioconductor's
[scater](https://bioconductor.org/packages/release/bioc/vignettes/scater/inst/doc/overview.html)
package provide in a single environment access to a wide range of embedding methods that can be
conveniently and uniformly applied to Bioconductor's S4 object class called
[`SingleCellExperiment`](https://bioconductor.org/packages/3.12/bioc/html/SingleCellExperiment.html)
for handling scRNA-Seq data [@Senabouth2019-cr; @Amezquita2020-vu]. The
performance of the different embedding methods for scRNA-Seq data has been
intensively tested by several studies, including Sun et al. [-@Sun2019-po;
-@Sun2020-ct].
For illustration purposes, the following example code first applies four widely
used embedding methods to a bulk RNA-Seq data set [@Howard2013-fq], and then to a much more
complex scRNA-Seq data set [@Aztekin2019-sw] obtained from the
[`scRNAseq`](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html)
package.
## Bulk RNA-Seq data
### Generate `SummarizedExperiment` and `SingleCellExperiment`
The following loads the bulk RNA-Seq data from Howard _et al._ [-@Howard2013-fq]
into `SummarizedExperiment` and `SingleCellExperiment` objects. This is done
by first creating a `SummarizedExperiment` object and then coercing it to a
`SingleCellExperiment` object, as well as intializing the `SingleCellExperiment`
directly.
#### Create `SummarizedExperiment` and coerce to `SingleCellExperiment`
The required `targetsPE.txt` and `countDFeByg.xls` files can be downloaded
from [here](https://github.com/tgirke/GEN242/tree/main/content/en/tutorials/scrnaseq/results).
```{r create_se_sce1a, eval=TRUE, message=FALSE, warning=FALSE}
library(SummarizedExperiment); library(SingleCellExperiment)
targetspath <- "results/targetsPE.txt"
countpath <- "results/countDFeByg.xls"
targets <- read.delim(targetspath, comment.char = "#")
rownames(targets) <- targets$SampleName
countDF <- read.delim(countpath, row.names=1, check.names=FALSE)
(se <- SummarizedExperiment(assays=list(counts=countDF), colData=targets))
(sce <- as(se, "SingleCellExperiment"))
```
#### Create `SingleCellExperiment` directly
```{r create_se_sce1b, eval=TRUE}
sce2 <- SingleCellExperiment(assays=list(counts=countDF), colData=targets)
```
### Prepare data for plotting with embedding methods
The data are preprocessed (_e.g._normalized) to plot them with the `run`
embedding functions from the
[`scran`](https://bioconductor.org/packages/3.12/bioc/vignettes/scran/inst/doc/scran.html)
and [`scater`](https://bioconductor.org/packages/release/bioc/vignettes/scater/inst/doc/overview.html) packages.
```{r preprocess1, eval=TRUE, message=FALSE, warning=FALSE}
library(scran); library(scater)
sce <- logNormCounts(sce)
colLabels(sce) <- factor(colData(sce)$Factor) # This uses replicate info from above targets file as pseudo-clusters
```
### Embed with different methods and plot results
Note, the embedding results are sequentially appended to the
SingleCellExperiment object, meaning one can use the plot function whenever
necessary.
#### (a) tSNE
```{r run_tsne1, eval=TRUE}
sce <- runTSNE(sce)
reducedDimNames(sce)
plotTSNE(sce, colour_by="label", text_by="label")
```
#### (b) MDS
```{r run_mds1, eval=TRUE}
sce <- runMDS(sce)
reducedDimNames(sce)
plotMDS(sce, colour_by="label", text_by="label")
```
#### (c) UMAP
```{r run_umap1, eval=TRUE}
sce <- runUMAP(sce)
reducedDimNames(sce)
plotUMAP(sce, colour_by="label", text_by="label")
```
#### (d) PCA
PCA plot for first two components.
```{r run_pca1a, eval=TRUE, message=FALSE, warning=FALSE}
sce <- runPCA(sce) # gives a warning due to small size of data set but it still works
reducedDimNames(sce)
plotPCA(sce, colour_by="label", text_by="label")
```
Multiple components can be plotted in a series of pairwise plots. When more
than two components are plotted, the diagonal boxes in the scatter plot matrix
show the density for each component.
```{r run_pca1b, eval=TRUE, message=FALSE, warning=FALSE}
sce <- runPCA(sce, ncomponents=20) # gives a warning due to small size of data set but it still works
reducedDimNames(sce)
plotPCA(sce, colour_by="label", text_by="label", ncomponents = 4)
```
## scRNA-Seq data
### Load scRNA-Seq data
The `scRNAseq` package is used to load the scRNA-Seq data set from Xenopus tail
directly into a SingleCellExperiment object [@Aztekin2019-sw].
```{r create_sce2, eval=FALSE, message=FALSE, warning=FALSE}
library(scRNAseq)
sce <- AztekinTailData()
```
### Prepare data for plotting with embedding methods
Similarly as above, the data are preprocessed (_e.g._normalized) to plot them with the `run`
embedding functions from the
[`scran`](https://bioconductor.org/packages/3.12/bioc/vignettes/scran/inst/doc/scran.html)
package. In addition, the data is clustered with the `quickCluster` function.
```{r preprocess2, eval=FALSE}
library(scran); library(scater)
sce <- logNormCounts(sce)
clusters <- quickCluster(sce)
# sce <- computeSumFactors(sce, clusters=clusters)
colLabels(sce) <- factor(clusters)
table(colLabels(sce))
```
To acclerate the testing performance of the following code, the size of the expression matrix
is reduced to cell types with values $\ge10^4$.
```{r filter2, eval=FALSE}
filter <- colSums(assays(sce)$counts) >= 10^4
sce <- sce[, filter]
```
To color items in the downstream dot plots by cell type instead of the above clustering result,
one can use the cell type info under `colData()`. Note, this step is not evaluated here.
```{r collor_by_celltype2, eval=TRUE}
# colLabels(sce) <- colData(sce)$cluster
```
### Embed with different methods and plot results
As under the bulk RNA-Seq section, the embedding results are sequentially
appended to the `SingleCellExperiment` object, meaning one can use the plot
function whenever necessary.
#### (a) tSNE
```{r run_tsne2, eval=FALSE}
sce <- runTSNE(sce)
reducedDimNames(sce)
plotTSNE(sce, colour_by="label", text_by="label")
```
![](../results/sctsne.png)
tSNE embedding of scRNA-Seq data
#### (b) MDS
```{r run_mds2, eval=FALSE}
sce <- runMDS(sce)
reducedDimNames(sce)
plotMDS(sce, colour_by="label", text_by="label")
```
![](../results/scmds.png)
MDS embedding of scRNA-Seq data
## (c) UMAP
```{r run_umap2, eval=FALSE}
sce <- runUMAP(sce) # Note, the UMAP embedding is already stored in downloaded SingleCellExperiment object by authers. So one can just use this one or recompute it.
reducedDimNames(sce)
plotUMAP(sce, colour_by="label", text_by="label")
```
![](../results/scumap.png)
UMAP embedding of scRNA-Seq data
## (d) PCA
PCA result plotted for first two components.
```{r run_pca2, eval=FALSE}
sce <- runPCA(sce)
reducedDimNames(sce)
plotPCA(sce, colour_by="label", text_by="label")
```
![](../results/scpca.png)
PCA embedding of scRNA-Seq data
Multiple components can be plotted in a series of pairwise plots. When more
than two components are plotted, the diagonal boxes in the scatter plot matrix
show the density for each component.
```{r run_pca2b, eval=FALSE, message=FALSE, warning=FALSE}
sce <- runPCA(sce, ncomponents=20)
reducedDimNames(sce)
plotPCA(sce, colour_by="label", text_by="label", ncomponents = 4)
```
![](../results/scpca_multi.png)
PCA embedding of scRNA-Seq data for multiple components
## Version Information
```{r sessionInfo}
sessionInfo()
```
## References