---
description: Assignment of cell identities based on gene expression
patterns using reference data.
subtitle: Seurat Toolkit
title: Celltype prediction
---
> **Note**
>
> Code chunks run R commands unless otherwise specified.
Celltype prediction can either be performed on indiviudal cells where
each cell gets a predicted celltype label, or on the level of clusters.
All methods are based on similarity to other datasets, single cell or
sorted bulk RNAseq, or uses known marker genes for each cell type.\
Ideally celltype predictions should be run on each sample separately and
not using the integrated data. In this case we will select one sample
from the Covid data, `ctrl_13` and predict celltype by cell on that
sample.\
Some methods will predict a celltype to each cell based on what it is
most similar to, even if that celltype is not included in the reference.
Other methods include an uncertainty so that cells with low similarity
scores will be unclassified.\
There are multiple different methods to predict celltypes, here we will
just cover a few of those.
We will use a reference PBMC dataset from the `scPred` package which is
provided as a Seurat object with counts. Unfortunately scPred has not
been updated to run with Seurat v5, so we will not use it here. We will
test classification based on Seurat `labelTransfer`, the `SingleR`
method `scPred` and using `Azimuth`. Finally we will use gene set
enrichment predict celltype based on the DEGs of each cluster.
## Read data
First, lets load required libraries
``` {r}
#| label: libraries
suppressPackageStartupMessages({
library(Seurat)
library(dplyr)
library(patchwork)
library(ggplot2)
library(pheatmap)
library(scPred)
library(celldex)
library(SingleR)
library(remotes)
remotes::install_github(
"https://github.com/satijalab/seurat-data@4dc08e022f51c324bc7bf785b1b5771d2742701d",
upgrade = FALSE,
dependencies = FALSE
)
library(SeuratData)
remotes::install_github(
"https://github.com/immunogenomics/presto@7636b3d0465c468c35853f82f1717d3a64b3c8f6",
upgrade = FALSE,
dependencies = FALSE
)
remotes::install_github(
"https://github.com/mojaveazure/seurat-disk@877d4e18ab38c686f5db54f8cd290274ccdbe295",
upgrade = FALSE,
dependencies = FALSE)
remotes::install_github(
"https://github.com/satijalab/azimuth@243ee5db80fcbffa3452c944254a325a3da2ef9e",
upgrade = FALSE,
dependencies = FALSE
)
library(Azimuth)
})
```
Let's read in the saved Covid-19 data object from the clustering step.
``` {r}
#| label: fetch-data
# download pre-computed data if missing or long compute
fetch_data <- TRUE
# url for source and intermediate data
path_data <- "https://nextcloud.dc.scilifelab.se/public.php/webdav"
curl_upass <- "-u zbC5fr2LbEZ9rSE:scRNAseq2025"
path_file <- "data/covid/results/seurat_covid_qc_dr_int_cl.rds"
if (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE)
if (fetch_data && !file.exists(path_file)) download.file(url = file.path(path_data, "covid/results_seurat/seurat_covid_qc_dr_int_cl.rds"), destfile = path_file, method = "curl", extra = curl_upass)
alldata <- readRDS(path_file)
```
Subset one patient.
``` {r}
#| label: subset
ctrl <- alldata[, alldata$orig.ident == "ctrl_13"]
```
## Reference data
Load the reference dataset with annotated labels that is provided by the
`scPred` package, it is a subsampled set of cells from human PBMCs.
``` {r}
#| label: fetch-ref
reference <- scPred::pbmc_1
reference
```
Rerun analysis pipeline. Run normalization, feature selection and
dimensionality reduction
Here, we will run all the steps that we did in previous labs in one go
using the `magittr` package with the pipe-operator `%>%`.
``` {r}
#| label: process-ref
reference <- reference %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA(verbose = F) %>%
RunUMAP(dims = 1:30)
```
``` {r}
#| label: plot-ref
#| fig-height: 5
#| fig-width: 6
DimPlot(reference, group.by = "cell_type", label = TRUE, repel = TRUE) + NoAxes()
```
Run all steps of the analysis for the **ctrl** sample as well. Use the
clustering from the integration lab with resolution 0.5.
``` {r}
#| label: process-data
# Set the identity as louvain with resolution 0.5
ctrl <- SetIdent(ctrl, value = "RNA_snn_res.0.5")
ctrl <- ctrl %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA(verbose = F) %>%
RunUMAP(dims = 1:30)
```
``` {r}
#| label: plot-data
#| fig-height: 5
#| fig-width: 6
DimPlot(ctrl, label = TRUE, repel = TRUE) + NoAxes()
```
## Label transfer
First we will run label transfer using a similar method as in the
integration exercise. But, instead of CCA, which is the default for the
`FindTransferAnchors()` function, we will use `pcaproject`, ie; the
query dataset is projected onto the PCA of the reference dataset. Then,
the labels of the reference data are predicted.
``` {r}
#| label: transfer
transfer.anchors <- FindTransferAnchors(
reference = reference, query = ctrl,
dims = 1:30
)
predictions <- TransferData(
anchorset = transfer.anchors, refdata = reference$cell_type,
dims = 1:30
)
ctrl <- AddMetaData(object = ctrl, metadata = predictions)
```
``` {r}
#| label: plot-transfer
#| fig-height: 5
#| fig-width: 6
DimPlot(ctrl, group.by = "predicted.id", label = T, repel = T) + NoAxes()
```
Now plot how many cells of each celltypes can be found in each cluster.
``` {r}
#| label: plot-proportions
#| fig-height: 5
#| fig-width: 6
ggplot(ctrl@meta.data, aes(x = RNA_snn_res.0.5, fill = predicted.id)) +
geom_bar() +
theme_classic()
```
## SinlgeR
SingleR is performs unbiased cell type recognition from single-cell RNA
sequencing data, by leveraging reference transcriptomic datasets of pure
cell types to infer the cell of origin of each single cell
independently.
We first have to convert the Seurat object to a SingleCellExperiment
object.
``` {r}
#| label: sce
sce = as.SingleCellExperiment(ctrl)
```
There are multiple datasets included in the `celldex` package that can
be used for celltype prediction, here we will test two different ones,
the `DatabaseImmuneCellExpressionData` and the
`HumanPrimaryCellAtlasData`. In addition we will use the same reference
dataset that we used for label transfer above but using SingleR instead.
### Immune cell reference
``` {r}
#| label: singler-immune
immune = celldex::DatabaseImmuneCellExpressionData()
singler.immune <- SingleR(test = sce, ref = immune, assay.type.test=1,
labels = immune$label.main)
head(singler.immune)
```
### HPCA reference
``` {r}
#| label: singler-hpca
hpca <- celldex::HumanPrimaryCellAtlasData()
singler.hpca <- SingleR(test = sce, ref = hpca, assay.type.test=1,
labels = hpca$label.main)
head(singler.hpca)
```
### With own reference data
``` {r}
#| label: singler-ref
sce.ref = as.SingleCellExperiment(reference)
singler.ref <- SingleR(test=sce, ref=sce.ref, labels=sce.ref$cell_type, de.method="wilcox")
head(singler.ref)
```
Compare results:
``` {r}
#| label: plot-singler
#| fig-height: 7
#| fig-width: 8
ctrl$singler.immune = singler.immune$pruned.labels
ctrl$singler.hpca = singler.hpca$pruned.labels
ctrl$singler.ref = singler.ref$pruned.labels
DimPlot(ctrl, group.by = c("singler.hpca", "singler.immune", "singler.ref"), ncol = 2)
```
## Azimuth
There are multiple online resources with large curated datasets with
methods to integrate and do label transfer. One such resource is
[Azimuth](https://azimuth.hubmapconsortium.org/) another one is
[Disco](https://www.immunesinglecell.org/).
Here we will use the PBMC reference provided with Azimuth. Which in
principle runs label transfer of your dataset onto a large curated
reference. The first time you run the command, the `pbmcref` dataset
will be downloaded to your local computer.
``` {r}
#| label: azimuth
options(future.globals.maxSize = 1e9)
# will install the pbmcref dataset first time you run it.
ctrl <- RunAzimuth(ctrl, reference = "pbmcref")
```
This dataset has predictions at 3 different levels of annotation wiht
`l1` being the more broad celltypes and `l3` more detailed annotation.
``` {r}
#| label: plot-azimuth
DimPlot(ctrl, group.by = "predicted.celltype.l1", label = T, repel = T) + NoAxes()
DimPlot(ctrl, group.by = "predicted.celltype.l2", label = T, repel = T) + NoAxes()
DimPlot(ctrl, group.by = "predicted.celltype.l3", label = T, repel = T) + NoAxes()
```
## Compare results
Now we will compare the output of the two methods using the convenient
function in scPred `crossTab()` that prints the overlap between two
metadata slots.
``` {r}
#| label: crosstab
crossTab(ctrl, "predicted.id", "singler.hpca")
```
We can also plot all the different predictions side by side
``` {r}
#| label: plot-all
#| fig-height: 10
#| fig-width: 16
wrap_plots(
DimPlot(ctrl, label = T, group.by = "predicted.id") + NoAxes() + ggtitle("LabelTransfer"),
DimPlot(ctrl, label = T, group.by = "singler.hpca") + NoAxes() + ggtitle("SingleR HPCA"),
DimPlot(ctrl, label = T, group.by = "singler.ref") + NoAxes() + ggtitle("SingleR Ref"),
DimPlot(ctrl, label = T, group.by = "predicted.celltype.l1") + NoAxes() + ggtitle("Azimuth l1"),
ncol = 2
)
```
## GSEA with celltype markers
Another option, where celltype can be classified on cluster level is to
use gene set enrichment among the DEGs with known markers for different
celltypes. Similar to how we did functional enrichment for the DEGs in
the differential expression exercise. There are some resources for
celltype gene sets that can be used. Such as
[CellMarker](http://bio-bigdata.hrbmu.edu.cn/CellMarker/),
[PanglaoDB](https://panglaodb.se/) or celltype gene sets at
[MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp). We can also
look at overlap between DEGs in a reference dataset and the dataset you
are analyzing.
### DEG overlap
First, lets extract top DEGs for our Covid-19 dataset and the reference
dataset. When we run differential expression for our dataset, we want to
report as many genes as possible, hence we set the cutoffs quite
lenient.
``` {r}
#| label: dge
# run differential expression in our dataset, using clustering at resolution 0.5
# first we need to join the layers
alldata@active.assay = "RNA"
alldata <- JoinLayers(object = alldata, layers = c("data","counts"))
# set the clustering you want to use as the identity class.
alldata <- SetIdent(alldata, value = "RNA_snn_res.0.5")
DGE_table <- FindAllMarkers(
alldata,
logfc.threshold = 0,
test.use = "wilcox",
min.pct = 0.1,
min.diff.pct = 0,
only.pos = TRUE,
max.cells.per.ident = 100,
return.thresh = 1,
assay = "RNA"
)
# split into a list
DGE_list <- split(DGE_table, DGE_table$cluster)
unlist(lapply(DGE_list, nrow))
```
``` {r}
#| label: dge-ref
# Compute differential gene expression in reference dataset (that has cell annotation)
reference <- SetIdent(reference, value = "cell_type")
reference_markers <- FindAllMarkers(
reference,
min.pct = .1,
min.diff.pct = .2,
only.pos = T,
max.cells.per.ident = 20,
return.thresh = 1
)
# Identify the top cell marker genes in reference dataset
# select top 50 with hihgest foldchange among top 100 signifcant genes.
reference_markers <- reference_markers[order(reference_markers$avg_log2FC, decreasing = T), ]
reference_markers %>%
group_by(cluster) %>%
top_n(-100, p_val) %>%
top_n(50, avg_log2FC) -> top50_cell_selection
# Transform the markers into a list
ref_list <- split(top50_cell_selection$gene, top50_cell_selection$cluster)
unlist(lapply(ref_list, length))
```
Now we can run GSEA for the DEGs from our dataset and check for
enrichment of top DEGs in the reference dataset.
``` {r}
#| label: gsea
suppressPackageStartupMessages(library(fgsea))
# run fgsea for each of the clusters in the list
res <- lapply(DGE_list, function(x) {
gene_rank <- setNames(x$avg_log2FC, x$gene)
fgseaRes <- fgsea(pathways = ref_list, stats = gene_rank, nperm = 10000)
return(fgseaRes)
})
names(res) <- names(DGE_list)
# You can filter and resort the table based on ES, NES or pvalue
res <- lapply(res, function(x) {
x[x$pval < 0.1, ]
})
res <- lapply(res, function(x) {
x[x$size > 2, ]
})
res <- lapply(res, function(x) {
x[order(x$NES, decreasing = T), ]
})
res
```
Selecting top significant overlap per cluster, we can now rename the
clusters according to the predicted labels. OBS! Be aware that if you
have some clusters that have non-significant p-values for all the gene
sets, the cluster label will not be very reliable. Also, the gene sets
you are using may not cover all the celltypes you have in your dataset
and hence predictions may just be the most similar celltype. Also, some
of the clusters have very similar p-values to multiple celltypes, for
instance the ncMono and cMono celltypes are equally good for some
clusters.
``` {r}
#| label: plot-gsea
#| fig-height: 5
#| fig-width: 11
new.cluster.ids <- unlist(lapply(res, function(x) {
as.data.frame(x)[1, 1]
}))
annot = new.cluster.ids[as.character(alldata@active.ident)]
names(annot) = colnames(alldata)
alldata$ref_gsea <- annot
wrap_plots(
DimPlot(alldata, label = T, group.by = "RNA_snn_res.0.5") + NoAxes(),
DimPlot(alldata, label = T, group.by = "ref_gsea") + NoAxes(),
ncol = 2
)
```
Compare the results with the other celltype prediction methods in the
**ctrl_13** sample.
``` {r}
#| label: plot-gsea-sub
#| fig-height: 5
#| fig-width: 16
ctrl$ref_gsea <- alldata$ref_gsea[alldata$orig.ident == "ctrl_13"]
wrap_plots(
DimPlot(ctrl, label = T, group.by = "ref_gsea") + NoAxes() + ggtitle("GSEA"),
DimPlot(ctrl, label = T, group.by = "predicted.id") + NoAxes() + ggtitle("LabelTransfer"),
ncol = 2
)
```
### With annotated gene sets
We have downloaded the celltype gene lists from
http://bio-bigdata.hrbmu.edu.cn/CellMarker/CellMarker_download.html and
converted the excel file to a csv for you. Read in the gene lists and do
some filtering.
``` {r}
#| label: fetch-markers
path_file <- file.path("data/cell_marker_human.csv")
if (!file.exists(path_file)) download.file(file.path(path_data, "misc/cell_marker_human.csv"), destfile = path_file, method = "curl", extra = curl_upass)
```
``` {r}
#| label: prep-markers
# Load the human marker table
markers <- read.delim("data/cell_marker_human.csv", sep = ";")
markers <- markers[markers$species == "Human", ]
markers <- markers[markers$cancer_type == "Normal", ]
# Filter by tissue (to reduce computational time and have tissue-specific classification)
sort(unique(markers$tissue_type))
grep("blood", unique(markers$tissue_type), value = T)
markers <- markers[markers$tissue_type %in% c(
"Blood", "Venous blood",
"Serum", "Plasma",
"Spleen", "Bone marrow", "Lymph node"
), ]
# remove strange characters etc.
celltype_list <- lapply(unique(markers$cell_name), function(x) {
x <- paste(markers$Symbol[markers$cell_name == x], sep = ",")
x <- gsub("[[]|[]]| |-", ",", x)
x <- unlist(strsplit(x, split = ","))
x <- unique(x[!x %in% c("", "NA", "family")])
x <- casefold(x, upper = T)
})
names(celltype_list) <- unique(markers$cell_name)
celltype_list <- celltype_list[unlist(lapply(celltype_list, length)) < 100]
celltype_list <- celltype_list[unlist(lapply(celltype_list, length)) > 5]
```
``` {r}
#| label: gsea-marker
# run fgsea for each of the clusters in the list
res <- lapply(DGE_list, function(x) {
gene_rank <- setNames(x$avg_log2FC, x$gene)
fgseaRes <- fgsea(pathways = celltype_list, stats = gene_rank, nperm = 10000, scoreType = "pos")
return(fgseaRes)
})
names(res) <- names(DGE_list)
# You can filter and resort the table based on ES, NES or pvalue
res <- lapply(res, function(x) {
x[x$pval < 0.01, ]
})
res <- lapply(res, function(x) {
x[x$size > 5, ]
})
res <- lapply(res, function(x) {
x[order(x$NES, decreasing = T), ]
})
# show top 3 for each cluster.
lapply(res, head, 3)
```
Let's plot the results.
``` {r}
#| label: plot-gsea-marker
#| fig-height: 5
#| fig-width: 11
new.cluster.ids <- unlist(lapply(res, function(x) {
as.data.frame(x)[1, 1]
}))
annot = new.cluster.ids[as.character(alldata@active.ident)]
names(annot) = colnames(alldata)
alldata$cellmarker_gsea <- annot
wrap_plots(
DimPlot(alldata, label = T, group.by = "ref_gsea") + NoAxes(),
DimPlot(alldata, label = T, group.by = "cellmarker_gsea") + NoAxes(),
ncol = 2
)
```
> **Discuss**
>
> Do you think that the methods overlap well? Where do you see the most
> inconsistencies?
In this case we do not have any ground truth, and we cannot say which
method performs best. You should keep in mind, that any celltype
classification method is just a prediction, and you still need to use
your common sense and knowledge of the biological system to judge if the
results make sense.
Finally, lets save the data with predictions.
``` {r}
#| label: save
saveRDS(ctrl, "data/covid/results/seurat_covid_qc_dr_int_cl_ct-ctrl13.rds")
```
## Session info
```{=html}
```
```{=html}
```
Click here
```{=html}
```
``` {r}
#| label: session
sessionInfo()
```
```{=html}
```