--- description: Reconstructing developmental or differentiation pathways from individual cell gene expression profiles to understand cellular transitions and relationships. subtitle: Seurat Toolkit title: Trajectory inference using Slingshot ---
> **Note** > > Code chunks run R commands unless otherwise specified.
## Loading libraries ``` {r} suppressPackageStartupMessages({ library(Seurat) library(plotly) options(rgl.printRglwidget = TRUE) library(Matrix) library(sparseMatrixStats) library(slingshot) library(tradeSeq) library(patchwork) }) # Define some color palette pal <- c(scales::hue_pal()(8), RColorBrewer::brewer.pal(9, "Set1"), RColorBrewer::brewer.pal(8, "Set2")) set.seed(1) pal <- rep(sample(pal, length(pal)), 200) ``` Nice function to easily draw a graph: ``` {r} # Add graph to the base R graphics plot draw_graph <- function(layout, graph, lwd = 0.2, col = "grey") { res <- rep(x = 1:(length(graph@p) - 1), times = (graph@p[-1] - graph@p[-length(graph@p)])) segments( x0 = layout[graph@i + 1, 1], x1 = layout[res, 1], y0 = layout[graph@i + 1, 2], y1 = layout[res, 2], lwd = lwd, col = col ) } ``` ## Preparing data If you have been using the **Seurat**, **Bioconductor** or **Scanpy** toolkits with your own data, you need to reach to the point where you have: - A dimensionality reduction on which to run the trajectory (for example: PCA, ICA, MNN, harmony, Diffusion Maps, UMAP) - The cell clustering information (for example: from Louvain, K-means) - A KNN/SNN graph (this is useful to inspect and sanity-check your trajectories) We will be using a subset of a bone marrow dataset (originally containing about 100K cells) for this exercise on trajectory inference. The bone marrow is the source of adult immune cells, and contains virtually all differentiation stages of cell from the **immune** system which later circulate in the blood to all other organs. ![](../figs/hematopoiesis.png) You can download the data: ``` {r} # download pre-computed data if missing or long compute fetch_data <- TRUE # url for source and intermediate data path_data <- "https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq" path_file <- "data/trajectory/trajectory_seurat_filtered.rds" if (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE) if (!file.exists(path_file)) download.file(url = file.path(path_data, "trajectory/trajectory_seurat_filtered.rds"), destfile = path_file) ``` We already have pre-computed and subsetted the dataset (with 6688 cells and 3585 genes) following the analysis steps in this course. We then saved the objects, so you can use common tools to open and start to work with them (either in R or Python). In addition there was some manual filtering done to remove clusters that are disconnected and cells that are hard to cluster, which can be seen in this [script](https://github.com/NBISweden/workshop-scRNAseq/blob/master/scripts/data_processing/slingshot_preprocessing.Rmd) ## Reading data ``` {r } obj <- readRDS("data/trajectory/trajectory_seurat_filtered.rds") # Calculate cluster centroids (for plotting the labels later) mm <- sparse.model.matrix(~ 0 + factor(obj$clusters_use)) colnames(mm) <- levels(factor(obj$clusters_use)) centroids2d <- as.matrix(t(t(obj@reductions$umap@cell.embeddings) %*% mm) / Matrix::colSums(mm)) ``` Let's visualize which clusters we have in our dataset: ``` {r} #| fig-height: 8 #| fig-width: 8 vars <- c("batches", "dataset", "clusters_use", "Phase") pl <- list() for (i in vars) { pl[[i]] <- DimPlot(obj, group.by = i, label = T) + theme_void() + NoLegend() } wrap_plots(pl) ``` You can check, for example, the number of cells in each cluster: ``` {r} table(obj$clusters) ``` ## Exploring the data It is crucial that you have some understanding of the dataset being analyzed. What are the clusters you see in your data and most importantly **How are the clusters related to each other?**. Well, let's explore the data a bit. With the help of this table, write down which cluster numbers in your dataset express these key markers. Marker Cell Type --------- ---------------------------- Cd34 HSC progenitor Ms4a1 B cell lineage Cd3e T cell lineage Ltf Granulocyte lineage Cst3 Monocyte lineage Mcpt8 Mast Cell lineage Alas2 RBC lineage Siglech Dendritic cell lineage C1qc Macrophage cell lineage Pf4 Megakaryocyte cell lineage ``` {r} #| fig-height: 9 #| fig-width: 12 vars <- c("Cd34", "Ms4a1", "Cd3e", "Ltf", "Cst3", "Mcpt8", "Alas2", "Siglech", "C1qc", "Pf4") pl <- list() pl <- list(DimPlot(obj, group.by = "clusters_use", label = T) + theme_void() + NoLegend()) for (i in vars) { pl[[i]] <- FeaturePlot(obj, features = i, order = T) + theme_void() + NoLegend() } wrap_plots(pl) ``` Another way to better explore your data is to look in higher dimensions, to really get a sense for what is right or wrong. As mentioned in the dimensionality reduction exercises, here we ran UMAP with **3** dimensions.
> **Important** > > The UMAP needs to be computed to results in exactly 3 dimensions
Since the steps below are identical to both `Seurat` and `Scran` pipelines, we will extract the matrices from both, so it is clear what is being used where and to remove long lines of code used to get those matrices. We will use them all. Plot in 3D with `Plotly`: ``` {r} df <- data.frame(obj@reductions$umap3d@cell.embeddings, variable = factor(obj$clusters_use)) colnames(df)[1:3] <- c("UMAP_1", "UMAP_2", "UMAP_3") p_State <- plot_ly(df, x = ~UMAP_1, y = ~UMAP_2, z = ~UMAP_3, color = ~variable, colors = pal, size = .5) %>% add_markers() p_State ``` ``` {r} #| eval: false # to save interactive plot and open in a new tab try(htmlwidgets::saveWidget(p_State, selfcontained = T, "data/trajectory/umap_3d_clustering_plotly.html"), silent = T) utils::browseURL("data/trajectory/umap_3d_clustering_plotly.html") ``` We can now compute the lineages on these dataset. ``` {r} #| fig-height: 6 #| fig-width: 6 # Define lineage ends ENDS <- c("17", "27", "25", "16", "26", "53", "49") set.seed(1) lineages <- as.SlingshotDataSet(getLineages( data = obj@reductions$umap3d@cell.embeddings, clusterLabels = obj$clusters_use, dist.method = "mnn", # It can be: "simple", "scaled.full", "scaled.diag", "slingshot" or "mnn" end.clus = ENDS, # You can also define the ENDS! start.clus = "34" )) # define where to START the trajectories # IF NEEDED, ONE CAN ALSO MANULALLY EDIT THE LINEAGES, FOR EXAMPLE: # sel <- sapply( lineages@lineages, function(x){rev(x)[1]} ) %in% ENDS # lineages@lineages <- lineages@lineages[ sel ] # names(lineages@lineages) <- paste0("Lineage",1:length(lineages@lineages)) # lineages # Change the reduction to our "fixed" UMAP2d (FOR VISUALISATION ONLY) lineages@reducedDim <- obj@reductions$umap@cell.embeddings { plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16) lines(lineages, lwd = 1, col = "black", cex = 2) text(centroids2d, labels = rownames(centroids2d), cex = 0.8, font = 2, col = "white") } ``` Much better! ## Defining Principal Curves Once the clusters are connected, Slingshot allows you to transform them to a smooth trajectory using principal curves. This is an algorithm that iteratively changes an initial curve to better match the data points. It was developed for linear data. To apply it to single-cell data, slingshot adds two enhancements: - It will run principal curves for each 'lineage', which is a set of clusters that go from a defined start cluster to some end cluster - Lineages with a same set of clusters will be constrained so that their principal curves remain bundled around the overlapping clusters Since the function `getCurves()` takes some time to run, we can speed up the convergence of the curve fitting process by reducing the amount of cells to use in each lineage. Ideally you could all cells, but here we had set `approx_points` to 300 to speed up. Feel free to adjust that for your dataset. ``` {r} #| fig-height: 6 #| fig-width: 6 # Define curves curves <- as.SlingshotDataSet(getCurves( data = lineages, thresh = 1e-1, stretch = 1e-1, allow.breaks = F, approx_points = 100 )) curves # Plots { plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], pch = 16) lines(curves, lwd = 2, col = "black") text(centroids2d, labels = rownames(centroids2d), cex = 1, font = 2) } ``` With those results in hands, we can now compute the differentiation **pseudotime**. ``` {r} #| fig-height: 6 #| fig-width: 6 pseudotime <- slingPseudotime(curves, na = FALSE) cellWeights <- slingCurveWeights(curves) x <- rowMeans(pseudotime) x <- x / max(x) o <- order(x) { plot(obj@reductions$umap@cell.embeddings[o, ], main = paste0("pseudotime"), pch = 16, cex = 0.4, axes = F, xlab = "", ylab = "", col = colorRampPalette(c("grey70", "orange3", "firebrick", "purple4"))(99)[x[o] * 98 + 1] ) points(centroids2d, cex = 2.5, pch = 16, col = "#FFFFFF99") text(centroids2d, labels = rownames(centroids2d), cex = 1, font = 2) } ```
> **Tip** > > The pseudotime represents the distance of every cell to the starting > cluster!
## Finding differentially expressed genes The main way to interpret a trajectory is to find genes that change along the trajectory. There are many ways to define differential expression along a trajectory: - Expression changes along a particular path (i.e. change with pseudotime) - Expression differences between branches - Expression changes at branch points - Expression changes somewhere along the trajectory - ... `tradeSeq` is a recently proposed algorithm to find trajectory differentially expressed genes. It works by smoothing the gene expression along the trajectory by fitting a smoother using generalized additive models (GAMs), and testing whether certain coefficients are statistically different between points in the trajectory. ``` {r} BiocParallel::register(BiocParallel::MulticoreParam()) ``` The fitting of GAMs can take quite a while, so **for demonstration purposes we first do a very stringent filtering** of the genes.
> **Tip** > > In an ideal experiment, you would use all the genes, or at least those > defined as being variable.
``` {r} sel_cells <- split(colnames(obj@assays$RNA@data), obj$clusters_use) sel_cells <- unlist(lapply(sel_cells, function(x) { set.seed(1) return(sample(x, 20)) })) gv <- as.data.frame(na.omit(scran::modelGeneVar(obj@assays$RNA@data[, sel_cells]))) gv <- gv[order(gv$bio, decreasing = T), ] sel_genes <- sort(rownames(gv)[1:500]) ``` Fitting the model:
> **Caution** > > This is a slow compute intensive step, we will not run this now and > instead use a pre-computed file in the step below.
``` {r} path_file <- "data/trajectory/seurat_scegam.rds" # fetch_data is defined at the top of this document if (!fetch_data) { sceGAM <- fitGAM( counts = drop0(obj@assays$RNA@data[sel_genes, sel_cells]), pseudotime = pseudotime[sel_cells, ], cellWeights = cellWeights[sel_cells, ], nknots = 5, verbose = T, parallel = T, sce = TRUE, BPPARAM = BiocParallel::MulticoreParam() ) saveRDS(sceGAM, path_file) } ``` Download the precomputed file. ``` {r} path_file <- "data/trajectory/seurat_scegam.rds" # fetch_data is defined at the top of this document if (fetch_data) { if (!file.exists(path_file)) download.file(url = file.path(path_data, "trajectory/results/seurat_scegam.rds"), destfile = path_file) } ``` ``` {r} # read data sceGAM <- readRDS(path_file) ``` ``` {r} #| fig-height: 5 #| fig-width: 7 plotGeneCount(curves, clusters = obj$clusters_use, models = sceGAM) lineages ``` ``` {r} #| fig-height: 6 #| fig-width: 6 lc <- sapply(lineages@lineages, function(x) { rev(x)[1] }) names(lc) <- gsub("Lineage", "L", names(lc)) { plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], pch = 16) lines(curves, lwd = 2, col = "black") points(centroids2d[lc, ], col = "black", pch = 16, cex = 4) text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = "white") } ``` ### Genes that change with pseudotime We can first look at general trends of gene expression across pseudotime. ``` {r} set.seed(8) res <- na.omit(associationTest(sceGAM, contrastType = "consecutive")) res <- res[res$pvalue < 1e-3, ] res <- res[res$waldStat > mean(res$waldStat), ] res <- res[order(res$waldStat, decreasing = T), ] res[1:10, ] ``` We can plot their expression: ``` {r} #| fig-height: 12 #| fig-width: 12 par(mfrow = c(4, 4), mar = c(.1, .1, 2, 1)) { plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = "", ylab = "") lines(curves, lwd = 2, col = "black") points(centroids2d[lc, ], col = "black", pch = 15, cex = 3, xpd = T) text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = "white", xpd = T) } vars <- rownames(res[1:15, ]) vars <- na.omit(vars[vars != "NA"]) for (i in vars) { x <- drop0(obj@assays$RNA@data)[i, ] x <- (x - min(x)) / (max(x) - min(x)) o <- order(x) plot(obj@reductions$umap@cell.embeddings[o, ], main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = "", ylab = "", col = colorRampPalette(c("lightgray", "grey60", "navy"))(99)[x[o] * 98 + 1] ) } ``` ### Genes that change between two pseudotime points We can define custom pseudotime values of interest if we're interested in genes that change between particular point in pseudotime. By default, we can look at differences between start and end: ``` {r} res <- na.omit(startVsEndTest(sceGAM, pseudotimeValues = c(0, 1))) res <- res[res$pvalue < 1e-3, ] res <- res[res$waldStat > mean(res$waldStat), ] res <- res[order(res$waldStat, decreasing = T), ] res[1:10, 1:6] ``` You can see now that there are several more columns, one for each lineage. This table represents the differential expression within each lineage, to identify which genes go up or down. Let's check lineage 1: ``` {r} #| fig-height: 12 #| fig-width: 12 # Get the top UP and Down regulated in lineage 1 res_lin1 <- sort(setNames(res$logFClineage1, rownames(res))) vars <- names(c(rev(res_lin1)[1:7], res_lin1[1:8])) vars <- na.omit(vars[vars != "NA"]) par(mfrow = c(4, 4), mar = c(.1, .1, 2, 1)) { plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = "", ylab = "") lines(curves, lwd = 2, col = "black") points(centroids2d[lc, ], col = "black", pch = 15, cex = 3, xpd = T) text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = "white", xpd = T) } for (i in vars) { x <- drop0(obj@assays$RNA@data)[i, ] x <- (x - min(x)) / (max(x) - min(x)) o <- order(x) plot(obj@reductions$umap@cell.embeddings[o, ], main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = "", ylab = "", col = colorRampPalette(c("lightgray", "grey60", "navy"))(99)[x[o] * 98 + 1] ) } ``` ### Genes that are different between lineages More interesting are genes that are different between two branches. We may have seen some of these genes already pop up in previous analyses of pseudotime. There are several ways to define "different between branches", and each have their own functions: - Different at the end points, using `diffEndTest()` - Different at the branching point, using `earlyDETest()` - Different somewhere in pseudotime the branching point, using `patternTest()` Note that the last function requires that the pseudotimes between two lineages are aligned. ``` {r} res <- na.omit(diffEndTest(sceGAM)) res <- res[res$pvalue < 1e-3, ] res <- res[res$waldStat > mean(res$waldStat), ] res <- res[order(res$waldStat, decreasing = T), ] res[1:10, ] ``` You can see now that there are even more columns, one for the pairwise comparison between each lineage. Let's check lineage 1 vs lineage 2: ``` {r} #| fig-height: 12 #| fig-width: 12 # Get the top UP and Down regulated in lineage 1 vs 2 res_lin1_2 <- sort(setNames(res$logFC1_2, rownames(res))) vars <- names(c(rev(res_lin1_2)[1:7], res_lin1_2[1:8])) vars <- na.omit(vars[vars != "NA"]) par(mfrow = c(4, 4), mar = c(.1, .1, 2, 1)) { plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = "", ylab = "") lines(curves, lwd = 2, col = "black") points(centroids2d[lc, ], col = "black", pch = 15, cex = 3, xpd = T) text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = "white", xpd = T) } for (i in vars) { x <- drop0(obj@assays$RNA@data)[i, ] x <- (x - min(x)) / (max(x) - min(x)) o <- order(x) plot(obj@reductions$umap@cell.embeddings[o, ], main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = "", ylab = "", col = colorRampPalette(c("lightgray", "grey60", "navy"))(99)[x[o] * 98 + 1] ) } ``` Check out this [vignette](https://statomics.github.io/tradeSeq/articles/tradeSeq.html) for a more in-depth overview of tradeSeq and many other differential expression tests. ## Generating batch-corrected data for differential gene expression Before computing differential gene expression, sometimes it is a good idea to make sure our dataset is somewhat homogeneous (without very strong batch effects). In this dataset, we actually used data from 4 different technologies (Drop-seq, SmartSeq2 and 10X) and therefore massive differences in read counts can be observed: If you want to know more about how to control for this issue, please have a look at [batch_corrected_counts.Rmd](https://github.com/NBISweden/workshop-scRNAseq/blob/master/scripts/data_processing/batch_corrected_counts.Rmd) ## References Cannoodt, Robrecht, Wouter Saelens, and Yvan Saeys. 2016. "Computational Methods for Trajectory Inference from Single-Cell Transcriptomics." *European Journal of Immunology* 46 (11): 2496--2506. [doi](https://doi.org/10.1002/eji.201646347). Saelens, Wouter, Robrecht Cannoodt, Helena Todorov, and Yvan Saeys. 2019. "A Comparison of Single-Cell Trajectory Inference Methods." *Nature Biotechnology* 37 (5): 547--54. [doi](https://doi.org/10.1038/s41587-019-0071-9). ## Session info ```{=html}
``` ```{=html} ``` Click here ```{=html} ``` ``` {r} sessionInfo() ``` ```{=html}
```