# Trajectory inference and RNA velocity { #RNA_velocity } ## scVelo in R There are two major tools for estimating RNA velocity: - [velocyto](http://velocyto.org) is based on a steady-state model. It supports both Python and R. - [scVelo](https://scvelo.readthedocs.io) supports a full dynamical model and various of utility functions. It only supports Python. Here, we will introduce `reticulate` to use the `scvelo` Python package in R (sounds cool, right? Hope you will find it handy!) In order or use `scVelo`, we first need to install it. Please follow [Chapter 2.2](env-pre.html) to install Python and `scVelo==0.2.2`. Generally, the extraction of unspliced RNAs from raw reads is a bit tricky, and not widely supported. Please see [Chapter 7.1](preprocess.html) for three tested options for mroe details. In this tutorial, we will download and use the built-in pancreas dataset from `scvelo` that has already been preprocssed to count the unspliced RNAs to demonstrate the useage of RNA velocity analysis. The dataset will be automatically downloaded to `./data/Pancreas/endocrinogenesis_day15.h5ad` (52.5MB). **Note**, this tutorial is translated from its Python version of scvelo at https://scvelo.readthedocs.io/Pancreas/ ### Getting started First, let's use load `reticulate`: ```{r} # install.packages("reticulate") library(reticulate) ``` Then use reticulate to import Python packages `scvelo` and `matplotlib`: ```{r echo=TRUE, message=TRUE} use_condaenv("sgcell") plt <- import("matplotlib.pyplot", as = "plt") scv <- import ("scvelo") scv$logging$print_version() scv$settings$presenter_view = TRUE scv$settings$verbosity = 3 scv$settings$set_figure_params("scvelo") ``` Endocrine development in pancreas lineage has four major fates: $\alpha$, $\beta$, $\delta$, $\epsilon$. Dataset is obtained and subseted from [Bastidas-Ponce et al. (2018)](https://journals.biologists.com/dev/article/146/12/dev173849/19483/Comprehensive-single-cell-mRNA-profiling-reveals-a) by the scVelo team. ```{r} adata <- scv$datasets$pancreas() adata ``` Note, to run velocity analysis on your own data, read your file with `adata = scv$read(file_path)`. If you want to save model and parameters after processing, run the following command, `adata$write(file_path, compression = 'gzip')`. Now, let's have a look of the proportions of unspliced/spliced mRNA reads, UMAP embedding and cluster annotations can be printed and visualized using built-in functions. ```{r} # create a folder named "images" dir.create('images') scv$utils$show_proportions(adata) scv$pl$proportions(adata, figsize=c(10, 3), show=TRUE) # plt$savefig('images/RNA-velo-fig1.png') ``` ```{r} scv$pl$scatter(adata, legend_loc = "best", size = 50, title = "Pancreas Celltype", show=TRUE) # plt$savefig('images/RNA-velo-fig2.png') ``` ### Data Preprocessing Preprocessing contains: - gene selection by detection (detected with a minimum number of counts) - high variability (dispersion) - normalizing every cell by its initial size and logarithmizing X First and second order moments are also computed (mean, uncentered variance for deterministic, stochastic mode respectively) among nearest neighbors in PCA space. ```{r} scv$pp$filter_and_normalize(adata, min_shared_counts = as.integer(20), n_top_genes = as.integer(2000)) scv$pp$moments(adata, n_pcs = as.integer(30), n_neighbors = as.integer(30)) ``` ### Diffusion pseudotime Pseudotime, a part of standardized scRNA-seq analysis pipeline, is also implemented in this package, and can be compared with the `latent time` introduced in dynamical mode. ```{r} adata$uns$data$iroot <- which.min(adata$obsm['X_umap'][, 1]) scv$tl$diffmap(adata) scv$tl$dpt(adata) scv$pl$scatter(adata, color = 'dpt_pseudotime', title = 'pseudotime', color_map = 'gnuplot', colorbar = TRUE, rescale_color = c(0,1), perc=c(2, 98), show=TRUE) # plt$savefig('images/RNA-velo-fig3.png') ``` ### Compute velocity and velocity graph scVelo has incorporated 3 modes for velocity estimation: - Deterministic - Stochastic - Dynamical For deterministic and stochastic mode, the gene-specific velocities are obtained by fitting linear regression ratio (constant transcriptional state) between unspliced/spliced mRNA abundances. Under linear assumptions, how the `observed abundances deviate from the steady state` regression line is velocity. ```{r} scv$tl$velocity(adata, mode = "stochastic") ``` To calculate velocity graph, we need to run `velocity_graph()`. Velocity graph is the cosine correlation of potential cell transitions with velocity vector in high dimensional space. It summarizes the possible cell transition states and has dimension of ${n}_{obs} * {n}_{obs}$. ```{r} scv$tl$velocity_graph(adata, sqrt_transform = TRUE) ``` ### Diffusion pseudotime with velocity As mentioned above, the scvelo package support computing Diffusion Pseudotime. Before, we used cell transition matrix from spliced RNAs only. Now, let's change the transition matrix to our computed velocity and calculate the diffusion pseudotime again. As you will see that the pseudotime is reversed. This is thanks to the use of velocity which is based on the intrinsic dynamics. ```{r} scv$tl$velocity_pseudotime(adata) scv$pl$scatter(adata, color = 'velocity_pseudotime', cmap = 'gnuplot', show=TRUE) # plt$savefig('images/RNA-velo-fig4.png') ``` ```{r} scv$pl$scatter(adata, x = "velocity_pseudotime", y = c('Actn4', 'Ppp3ca', 'Cpe', 'Nnat'), fontsize = 10, size = 10, legend_loc = 'best', color = 'clusters', figsize=c(12, 10), show=TRUE) # plt$savefig('images/RNA-velo-fig5.png') ``` Also, a few example genes along the pseduo-time. ### Visualise velocity-based trajectory Velocities are projected onto the specified embedding `basis` and can be visualized in one of the three ways: - On single cell level - On grid level - Streamlines, which is most commonly used ```{r} scv$pl$velocity_embedding_stream(adata, basis = "umap", color = "clusters", legend_loc = "best", dpi = 150, show=TRUE) # plt$savefig('images/RNA-velo-fig6.png') ``` ```{r} scv$pl$velocity_embedding(adata, basis = "umap", arrow_length = 3, arrow_size = 2, dpi = 150, show=TRUE) # plt$savefig('images/RNA-velo-fig7.png') ``` ### Interprete Velocity We could also examine the phase portraits of interested genes to understand how inferred directions are supported by particular genes. ```{r} scv$pl$velocity(adata, c("Cpe", "Gnao1", "Ins2", "Adk"), ncols = 2, show=TRUE) # plt$savefig('images/RNA-velo-fig8.png') ``` Positive velocity indicates that a gene is up-regulated, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated. ### Velocity in cycling progenitors The cell cycle detected by RNA velocity, is biologically affirmed by cell cycle scores (standardized scores of mean expression levels of phase marker genes). ```{r} scv$tl$score_genes_cell_cycle(adata) scv$pl$scatter(adata, color_gradients = c("S_score", "G2M_score"), smooth = TRUE, perc = c(5, 95), legend_loc="upper center", show=TRUE) # plt$savefig('images/RNA-velo-fig9.png') ``` ### Dynamical Mode and related analysis Dynamical mode does not necessarily rely on linear assumptions, instead it consider gene-specific rates of transcription, splicing and degradation rates as well as transient cell-states. To use dynamical mode, we just need to call `colorize("revocer_dynamics()", "#316A9E")` before computing velocity. ```{} scv$tl$recover_dynamics(adata) scv$tl$velocity(adata, mode = "dynamical") scv$tl$velocity_graph(adata) ``` Latent time of the underlying cellular processes can be recovered via dynamical mode. Based on cells' transcriptional dynamics, latent time approximates the real time experienced by cells as they differentiate. Latent time could distinguish temporal position more compared with pseudotime. ```{} scv$tl$latent_time(adata) scv$pl$scatter(adata, color = "latent_time", color_map = "gnuplot", size = 80, basis = "umap") scv$pl$scatter(adata, x = "latent_time", y = c('Actn4', 'Ppp3ca', 'Cpe', 'Nnat'), fontsize = 10, size = 10, legend_loc = 'best', color = 'clusters') ``` We could also pull out the top genes driving the dynamic RNA velocities. ```{} topgenes <- adata$var["fit_likelihood"] topgenes_vals <- topgenes[, 1] names(topgenes_vals) <- rownames(topgenes) topgenes_vals <- sort(topgenes_vals, decreasing = TRUE) head(topgenes_vals) scv$pl$scatter(adata, basis = names(topgenes_vals)[1:5], ncols = 5, frameon = FALSE) scv$pl$heatmap(adata, var_names = names(topgenes_vals), tkey = 'latent_time', n_convolve = as.integer(100), col_color = 'clusters') ``` ## Last notes Through this tutorial, the RNA velocity method has demostrated its unique power to reveal very detailed cell transitions, impressively the cell cycle trajecotry. However, when you try it on your own data, sometimes, you may experience difficulty to inteprete the results. These difficulties are shared among groups on campus and across international community. At the moment, we are actively work on computational methods, including a few angles: 1. selection of genes with heterogenous transcriptional kinetics (differential momuntume genes) with our [BRIE2](https://www.biorxiv.org/content/10.1101/2020.11.04.368019v2) model 2. projection of the high dimentional RNA velocity into a low dimentional space for denoise and more robust analysis with our [veloAE](https://www.biorxiv.org/content/10.1101/2021.03.19.436127v1) model 3. redesign of the scvelo dynamical model to a more generic mathathic function and biological informed regularisation (method is coming soon).