--- description: Assignment of cell identities based on gene expression patterns using reference data. subtitle: Seurat Toolkit title: Celltype prediction ---
> **Note** > > Code chunks run R commands unless otherwise specified.
Celltype prediction can either be performed on indiviudal cells where each cell gets a predicted celltype label, or on the level of clusters. All methods are based on similarity to other datasets, single cell or sorted bulk RNAseq, or uses known marker genes for each cell type.\ We will select one sample from the Covid data, `ctrl_13` and predict celltype by cell on that sample.\ Some methods will predict a celltype to each cell based on what it is most similar to, even if that celltype is not included in the reference. Other methods include an uncertainty so that cells with low similarity scores will be unclassified.\ There are multiple different methods to predict celltypes, here we will just cover a few of those. We will use a reference PBMC dataset from the `scPred` package which is provided as a Seurat object with counts. And we will test classification based on the `scPred` and `scMap` methods. Finally we will use gene set enrichment predict celltype based on the DEGs of each cluster. ## Read data First, lets load required libraries ``` {r} suppressPackageStartupMessages({ library(Seurat) library(dplyr) library(patchwork) library(ggplot2) library(pheatmap) # remotes::install_github("powellgenomicslab/scPred") library(scPred) }) ``` Let's read in the saved Covid-19 data object from the clustering step. ``` {r} # download pre-computed data if missing or long compute fetch_data <- TRUE # url for source and intermediate data path_data <- "https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq" path_file <- "data/covid/results/seurat_covid_qc_dr_int_cl.rds" if (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE) if (fetch_data && !file.exists(path_file)) download.file(url = file.path(path_data, "covid/results/seurat_covid_qc_dr_int_cl.rds"), destfile = path_file) alldata <- readRDS(path_file) ``` Subset one patient. ``` {r} ctrl <- alldata[, alldata$orig.ident == "ctrl_13"] # set active assay to RNA and remove the CCA assay ctrl@active.assay <- "RNA" ctrl[["CCA"]] <- NULL ctrl ``` ## Reference data Load the reference dataset with annotated labels. ``` {r} reference <- scPred::pbmc_1 reference ``` Rerun analysis pipeline. Run normalization, feature selection and dimensionality reduction Here, we will run all the steps that we did in previous labs in one go using the `magittr` package with the pipe-operator `%>%`. ``` {r} reference <- reference %>% NormalizeData() %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA(verbose = F) %>% RunUMAP(dims = 1:30) ``` ``` {r} #| fig-height: 5 #| fig-width: 6 DimPlot(reference, group.by = "cell_type", label = TRUE, repel = TRUE) + NoAxes() ``` Run all steps of the analysis for the **ctrl** sample as well. Use the clustering from the integration lab with resolution 0.5. ``` {r} # Set the identity as louvain with resolution 0.3 ctrl <- SetIdent(ctrl, value = "CCA_snn_res.0.5") ctrl <- ctrl %>% NormalizeData() %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA(verbose = F) %>% RunUMAP(dims = 1:30) ``` ``` {r} #| fig-height: 5 #| fig-width: 6 DimPlot(ctrl, label = TRUE, repel = TRUE) + NoAxes() ``` ## Label transfer First we will run label transfer using a similar method as in the integration exercise. But, instead of CCA, which is the default for the `FindTransferAnchors()` function, we will use `pcaproject`, ie; the query dataset is projected onto the PCA of the reference dataset. Then, the labels of the reference data are predicted. ``` {r} transfer.anchors <- FindTransferAnchors( reference = reference, query = ctrl, dims = 1:30 ) predictions <- TransferData( anchorset = transfer.anchors, refdata = reference$cell_type, dims = 1:30 ) ctrl <- AddMetaData(object = ctrl, metadata = predictions) ``` ``` {r} #| fig-height: 5 #| fig-width: 6 DimPlot(ctrl, group.by = "predicted.id", label = T, repel = T) + NoAxes() ``` Now plot how many cells of each celltypes can be found in each cluster. ``` {r} #| fig-height: 5 #| fig-width: 6 ggplot(ctrl@meta.data, aes(x = CCA_snn_res.0.5, fill = predicted.id)) + geom_bar() + theme_classic() ``` ## scPred scPred will train a classifier based on all principal components. First, `getFeatureSpace()` will create a scPred object stored in the `@misc` slot where it extracts the PCs that best separates the different celltypes. Then `trainModel()` will do the actual training for each celltype. ``` {r} reference <- getFeatureSpace(reference, "cell_type") reference <- trainModel(reference) ``` We can then print how well the training worked for the different celltypes by printing the number of PCs used, the ROC value and Sensitivity/Specificity. Which celltypes do you think are harder to classify based on this dataset? ``` {r} get_scpred(reference) ``` You can optimize parameters for each dataset by chaining parameters and testing different types of models, see more at: . But for now, we will continue with this model. Now, let's predict celltypes on our data, where scPred will align the two datasets with Harmony and then perform classification. ``` {r} ctrl <- scPredict(ctrl, reference) ``` ``` {r} #| fig-height: 5 #| fig-width: 6 DimPlot(ctrl, group.by = "scpred_prediction", label = T, repel = T) + NoAxes() ``` Now plot how many cells of each celltypes can be found in each cluster. ``` {r} #| fig-height: 5 #| fig-width: 6 ggplot(ctrl@meta.data, aes(x = CCA_snn_res.0.5, fill = scpred_prediction)) + geom_bar() + theme_classic() ``` ## Azimuth There are multiple online resources with large curated datasets with methods to integrate and do label transfer. One such resource is [Azimuth](https://azimuth.hubmapconsortium.org/) another one is [Disco](https://www.immunesinglecell.org/). Azimuth is also possible to install and run locally, but in this case we have used the online app and ran the predictions for you. So you just have to download the prediction file. This is how to save the count matrix: ``` {r} #| eval: false # No need to run now. C = ctrl@assays$RNA@counts saveRDS(C, file = "data/covid/results/ctrl13_count_matrix.rds") ``` Instead load the results and visualize: ``` {r} #| fig-height: 7 #| fig-width: 8 path_data <- "https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq" path_file <- "data/covid/results/azimuth_pred.tsv" if (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE) if (fetch_data && !file.exists(path_file)) download.file(url = file.path(path_data, "covid/results/azimuth_pred.tsv"), destfile = path_file) azimuth_pred <- read.table(path_file, sep = "\t", header = T) # add predictions to the seurat object ctrl$azimuth = azimuth_pred$predicted.celltype.l2[match(colnames(ctrl), azimuth_pred$cell)] DimPlot(ctrl, group.by = "azimuth", label = T, repel = T) + NoAxes() ``` ## Compare results Now we will compare the output of the two methods using the convenient function in scPred `crossTab()` that prints the overlap between two metadata slots. ``` {r} crossTab(ctrl, "predicted.id", "scpred_prediction") ``` We can also plot all the different predictions side by side ``` {r} #| fig-height: 5 #| fig-width: 16 wrap_plots( DimPlot(ctrl, label = T, group.by = "predicted.id") + NoAxes(), DimPlot(ctrl, label = T, group.by = "scpred_prediction") + NoAxes(), DimPlot(ctrl, label = T, group.by = "azimuth") + NoAxes(), ncol = 3 ) ``` ## GSEA with celltype markers Another option, where celltype can be classified on cluster level is to use gene set enrichment among the DEGs with known markers for different celltypes. Similar to how we did functional enrichment for the DEGs in the differential expression exercise. There are some resources for celltype gene sets that can be used. Such as [CellMarker](http://bio-bigdata.hrbmu.edu.cn/CellMarker/), [PanglaoDB](https://panglaodb.se/) or celltype gene sets at [MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp). We can also look at overlap between DEGs in a reference dataset and the dataset you are analyzing. ### DEG overlap First, lets extract top DEGs for our Covid-19 dataset and the reference dataset. When we run differential expression for our dataset, we want to report as many genes as possible, hence we set the cutoffs quite lenient. ``` {r} # run differential expression in our dataset, using clustering at resolution 0.5 alldata <- SetIdent(alldata, value = "CCA_snn_res.0.5") DGE_table <- FindAllMarkers( alldata, logfc.threshold = 0, test.use = "wilcox", min.pct = 0.1, min.diff.pct = 0, only.pos = TRUE, max.cells.per.ident = 20, return.thresh = 1, assay = "RNA" ) # split into a list DGE_list <- split(DGE_table, DGE_table$cluster) unlist(lapply(DGE_list, nrow)) ``` ``` {r} # Compute differential gene expression in reference dataset (that has cell annotation) reference <- SetIdent(reference, value = "cell_type") reference_markers <- FindAllMarkers( reference, min.pct = .1, min.diff.pct = .2, only.pos = T, max.cells.per.ident = 20, return.thresh = 1 ) # Identify the top cell marker genes in reference dataset # select top 50 with hihgest foldchange among top 100 signifcant genes. reference_markers <- reference_markers[order(reference_markers$avg_log2FC, decreasing = T), ] reference_markers %>% group_by(cluster) %>% top_n(-100, p_val) %>% top_n(50, avg_log2FC) -> top50_cell_selection # Transform the markers into a list ref_list <- split(top50_cell_selection$gene, top50_cell_selection$cluster) unlist(lapply(ref_list, length)) ``` Now we can run GSEA for the DEGs from our dataset and check for enrichment of top DEGs in the reference dataset. ``` {r} suppressPackageStartupMessages(library(fgsea)) # run fgsea for each of the clusters in the list res <- lapply(DGE_list, function(x) { gene_rank <- setNames(x$avg_log2FC, x$gene) fgseaRes <- fgsea(pathways = ref_list, stats = gene_rank, nperm = 10000) return(fgseaRes) }) names(res) <- names(DGE_list) # You can filter and resort the table based on ES, NES or pvalue res <- lapply(res, function(x) { x[x$pval < 0.1, ] }) res <- lapply(res, function(x) { x[x$size > 2, ] }) res <- lapply(res, function(x) { x[order(x$NES, decreasing = T), ] }) res ``` Selecting top significant overlap per cluster, we can now rename the clusters according to the predicted labels. OBS! Be aware that if you have some clusters that have non-significant p-values for all the gene sets, the cluster label will not be very reliable. Also, the gene sets you are using may not cover all the celltypes you have in your dataset and hence predictions may just be the most similar celltype. Also, some of the clusters have very similar p-values to multiple celltypes, for instance the ncMono and cMono celltypes are equally good for some clusters. ``` {r} #| fig-height: 5 #| fig-width: 11 new.cluster.ids <- unlist(lapply(res, function(x) { as.data.frame(x)[1, 1] })) alldata$ref_gsea <- new.cluster.ids[as.character(alldata@active.ident)] wrap_plots( DimPlot(alldata, label = T, group.by = "CCA_snn_res.0.5") + NoAxes(), DimPlot(alldata, label = T, group.by = "ref_gsea") + NoAxes(), ncol = 2 ) ``` Compare the results with the other celltype prediction methods in the **ctrl_13** sample. ``` {r} #| fig-height: 5 #| fig-width: 16 ctrl$ref_gsea <- alldata$ref_gsea[alldata$orig.ident == "ctrl_13"] wrap_plots( DimPlot(ctrl, label = T, group.by = "ref_gsea") + NoAxes() + ggtitle("GSEA"), DimPlot(ctrl, label = T, group.by = "predicted.id") + NoAxes() + ggtitle("LabelTransfer"), DimPlot(ctrl, label = T, group.by = "scpred_prediction") + NoAxes() + ggtitle("scPred"), ncol = 3 ) ``` ### With annotated gene sets We have downloaded the celltype gene lists from http://bio-bigdata.hrbmu.edu.cn/CellMarker/CellMarker_download.html and converted the excel file to a csv for you. Read in the gene lists and do some filtering. ``` {r} path_file <- file.path("data/cell_marker_human.csv") if (!file.exists(path_file)) download.file(file.path(path_data, "cell_marker_human.csv"), destfile = path_file) ``` ``` {r} # Load the human marker table markers <- read.delim("data/cell_marker_human.csv", sep = ";") markers <- markers[markers$species == "Human", ] markers <- markers[markers$cancer_type == "Normal", ] # Filter by tissue (to reduce computational time and have tissue-specific classification) sort(unique(markers$tissue_type)) grep("blood", unique(markers$tissue_type), value = T) markers <- markers[markers$tissue_type %in% c( "Blood", "Venous blood", "Serum", "Plasma", "Spleen", "Bone marrow", "Lymph node" ), ] # remove strange characters etc. celltype_list <- lapply(unique(markers$cell_name), function(x) { x <- paste(markers$Symbol[markers$cell_name == x], sep = ",") x <- gsub("[[]|[]]| |-", ",", x) x <- unlist(strsplit(x, split = ",")) x <- unique(x[!x %in% c("", "NA", "family")]) x <- casefold(x, upper = T) }) names(celltype_list) <- unique(markers$cell_name) celltype_list <- celltype_list[unlist(lapply(celltype_list, length)) < 100] celltype_list <- celltype_list[unlist(lapply(celltype_list, length)) > 5] ``` ``` {r} # run fgsea for each of the clusters in the list res <- lapply(DGE_list, function(x) { gene_rank <- setNames(x$avg_log2FC, x$gene) fgseaRes <- fgsea(pathways = celltype_list, stats = gene_rank, nperm = 10000, scoreType = "pos") return(fgseaRes) }) names(res) <- names(DGE_list) # You can filter and resort the table based on ES, NES or pvalue res <- lapply(res, function(x) { x[x$pval < 0.01, ] }) res <- lapply(res, function(x) { x[x$size > 5, ] }) res <- lapply(res, function(x) { x[order(x$NES, decreasing = T), ] }) # show top 3 for each cluster. lapply(res, head, 3) ``` Let's plot the results. ``` {r} #| fig-height: 5 #| fig-width: 11 new.cluster.ids <- unlist(lapply(res, function(x) { as.data.frame(x)[1, 1] })) alldata$cellmarker_gsea <- new.cluster.ids[as.character(alldata@active.ident)] wrap_plots( DimPlot(alldata, label = T, group.by = "ref_gsea") + NoAxes(), DimPlot(alldata, label = T, group.by = "cellmarker_gsea") + NoAxes(), ncol = 2 ) ```
> **Discuss** > > Do you think that the methods overlap well? Where do you see the most > inconsistencies?
In this case we do not have any ground truth, and we cannot say which method performs best. You should keep in mind, that any celltype classification method is just a prediction, and you still need to use your common sense and knowledge of the biological system to judge if the results make sense. Finally, lets save the data with predictions. ``` {r} saveRDS(ctrl, "data/covid/results/seurat_covid_qc_dr_int_cl_ct-ctrl13.rds") ``` ## Session info ```{=html}
``` ```{=html} ``` Click here ```{=html} ``` ``` {r} sessionInfo() ``` ```{=html}
```