{
  "cells": [
    {
      "cell_type": "raw",
      "metadata": {},
      "source": [
        "---\n",
        "description: Reconstructing developmental or differentiation pathways\n",
        "  from individual cell gene expression profiles to understand cellular\n",
        "  transitions and relationships.\n",
        "subtitle:  Scanpy Toolkit\n",
        "title:  Trajectory inference using PAGA\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Note**\n",
        ">\n",
        "> Code chunks run Python commands unless it starts with `%%bash`, in\n",
        "> which case, those chunks run shell commands.\n",
        "\n",
        "</div>\n",
        "\n",
        "Partly following this PAGA\n",
        "[tutorial](https://scanpy-tutorials.readthedocs.io/en/latest/paga-paul15.html)\n",
        "with some modifications.\n",
        "\n",
        "## Loading libraries"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: libraries\n",
        "import sys\n",
        "import petsc4py\n",
        "\n",
        "# Tell PETSc to ignore harmless system signals and let Python handle them\n",
        "petsc4py.init(sys.argv + ['-no_signal_handler'])\n",
        "\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import matplotlib.pyplot as pl\n",
        "from matplotlib import rcParams\n",
        "import scanpy as sc\n",
        "\n",
        "import scipy\n",
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
        "import warnings\n",
        "import cellrank as cr\n",
        "import scvelo as scv\n",
        "\n",
        "warnings.simplefilter(action=\"ignore\", category=Warning)\n",
        "\n",
        "# verbosity: errors (0), warnings (1), info (2), hints (3)\n",
        "sc.settings.verbosity = 3\n",
        "sc.settings.set_figure_params(dpi=100, frameon=False, figsize=(5, 5), facecolor='white', color_map = 'viridis_r') "
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Preparing data\n",
        "\n",
        "In order to speed up the computations during the exercises, we will be\n",
        "using a subset of a bone marrow dataset (originally containing about\n",
        "100K cells). The bone marrow is the source of adult immune cells, and\n",
        "contains virtually all differentiation stages of cell from the immune\n",
        "system which later circulate in the blood to all other organs.\n",
        "\n",
        "![](../figs/hematopoiesis.png)\n",
        "\n",
        "If you have been using the **Seurat**, **Bioconductor** or **Scanpy**\n",
        "toolkits with your own data, you need to reach to the point where can\n",
        "find get:\n",
        "\n",
        "-   A dimensionality reduction where to perform the trajectory (for\n",
        "    example: PCA, ICA, MNN, harmony, Diffusion Maps, UMAP)\n",
        "-   The cell clustering information (for example: from Louvain, k-means)\n",
        "-   A KNN/SNN graph (this is useful to inspect and sanity-check your\n",
        "    trajectories)\n",
        "\n",
        "In this case, all the data has been preprocessed with Seurat with\n",
        "standard pipelines. In addition there was some manual filtering done to\n",
        "remove clusters that are disconnected and cells that are hard to\n",
        "cluster, which can be seen in this\n",
        "[script](https://github.com/NBISweden/workshop-scRNAseq/blob/main/scripts/data_processing/slingshot_preprocessing.Rmd)\n",
        "\n",
        "The file trajectory_scanpy_filtered.h5ad was converted from the Seurat\n",
        "object using the SeuratDisk package. For more information on how it was\n",
        "done, have a look at the script:\n",
        "[convert_to_h5ad.R](https://github.com/NBISweden/workshop-scRNAseq/blob/main/scripts/data_processing/convert_to_h5ad.R)\n",
        "in the github repo.\n",
        "\n",
        "You can download the data with the commands:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: fetch-data\n",
        "import os\n",
        "import subprocess\n",
        "\n",
        "# download pre-computed data if missing or long compute\n",
        "fetch_data = True\n",
        "\n",
        "# url for source and intermediate data\n",
        "path_data = \"https://nextcloud.dc.scilifelab.se/public.php/webdav\"\n",
        "curl_upass = \"zbC5fr2LbEZ9rSE:scRNAseq2025\"\n",
        "\n",
        "path_results = \"data/trajectory\"\n",
        "if not os.path.exists(path_results):\n",
        "    os.makedirs(path_results, exist_ok=True)\n",
        "\n",
        "path_file = \"data/trajectory/trajectory_seurat_filtered.h5ad\"\n",
        "if not os.path.exists(path_file):\n",
        "    file_url = os.path.join(path_data, \"trajectory/trajectory_seurat_filtered.h5ad\")\n",
        "    subprocess.call([\"curl\", \"-u\", curl_upass, \"-o\", path_file, file_url ])    "
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Reading data\n",
        "\n",
        "We already have pre-computed and subsetted the dataset (with 6688 cells\n",
        "and 3585 genes) following the analysis steps in this course. We then\n",
        "saved the objects, so you can use common tools to open and start to work\n",
        "with them (either in R or Python)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: read-data\n",
        "adata = sc.read_h5ad(\"data/trajectory/trajectory_seurat_filtered.h5ad\")\n",
        "adata.var"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: check-data\n",
        "# check what you have in the X matrix, should be lognormalized counts.\n",
        "print(adata.X[:10,:10])"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Explore the data\n",
        "\n",
        "There is a umap and clusters provided with the object, first plot some\n",
        "information from the previous analysis onto the umap."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: plot-umap\n",
        "sc.pl.umap(adata, color = ['clusters','dataset','batches','Phase'],legend_loc = 'on data', legend_fontsize = 'xx-small', ncols = 2)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "It is crucial that you performing analysis of a dataset understands what\n",
        "is going on, what are the clusters you see in your data and most\n",
        "importantly How are the clusters related to each other?. Well, let's\n",
        "explore the data a bit. With the help of this table, write down which\n",
        "cluster numbers in your dataset express these key markers.\n",
        "\n",
        "  Marker    Cell Type\n",
        "  --------- ----------------------------\n",
        "  Cd34      HSC progenitor\n",
        "  Ms4a1     B cell lineage\n",
        "  Cd3e      T cell lineage\n",
        "  Ltf       Granulocyte lineage\n",
        "  Cst3      Monocyte lineage\n",
        "  Cpa3      Mast Cell/Basophil lineage\n",
        "  Alas2     RBC lineage\n",
        "  Siglech   Dendritic cell lineage\n",
        "  C1qc      Macrophage cell lineage\n",
        "  Pf4       Megakaryocyte cell lineage"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: plot-markers\n",
        "markers = [\"Cd34\",\"Alas2\",\"Pf4\",\"Cpa3\",\"Ltf\",\"Cst3\", \"Siglech\", \"C1qc\", \"Ms4a1\", \"Cd3e\", ]\n",
        "sc.pl.umap(adata, color = markers, use_raw = False, ncols = 4)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Rerun analysis in Scanpy\n",
        "\n",
        "Redo clustering and umap using the basic Scanpy pipeline. Use the\n",
        "provided \"X_harmony_Phase\" dimensionality reduction as the starting\n",
        "point."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: process\n",
        "\n",
        "#calculate highly variable genes to get mean expression and dispersion for every gene - we will need this later\n",
        "sc.pp.highly_variable_genes(adata, n_top_genes=2000)\n",
        "\n",
        "# first, store the old umap with a new name so it is not overwritten\n",
        "adata.obsm['X_umap_old'] = adata.obsm['X_umap']\n",
        "\n",
        "sc.pp.neighbors(adata, n_pcs = 30, n_neighbors = 20, use_rep=\"X_harmony_Phase\", random_state=159)\n",
        "sc.tl.umap(adata, min_dist=0.4, spread=3, random_state=331)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: cluster\n",
        "sc.pl.umap(adata, color = ['clusters'],legend_loc = 'on data', legend_fontsize = 'xx-small', edges = True)\n",
        "\n",
        "sc.pl.umap(adata, color = markers, use_raw = False, ncols = 4)\n",
        "\n",
        "# Redo clustering as well\n",
        "sc.tl.leiden(adata, key_added = \"leiden_1.0\", resolution = 1.0, random_state=137) # default resolution is 1.0\n",
        "sc.tl.leiden(adata, key_added = \"leiden_1.2\", resolution = 1.2, random_state=137) # default resolution is 1.0\n",
        "sc.tl.leiden(adata, key_added = \"leiden_1.4\", resolution = 1.4, random_state=137) # default resolution is 1.0\n",
        "\n",
        "#sc.tl.louvain(adata, key_added = \"leiden_1.0\") # default resolution in 1.0\n",
        "sc.pl.umap(adata, color = ['leiden_1.0', 'leiden_1.2', 'leiden_1.4','clusters'],legend_loc = 'on data', legend_fontsize = 'xx-small', ncols =2)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: annotate\n",
        "#Rename clusters with really clear markers, the rest are left unlabelled.\n",
        "\n",
        "annot = pd.DataFrame(adata.obs['leiden_1.4'].astype('string'))\n",
        "annot[annot['leiden_1.4'] == '10'] = '10_megakaryo' #Pf4\n",
        "annot[annot['leiden_1.4'] == '16'] = '16_macro'  #C1qc\n",
        "annot[annot['leiden_1.4'] == '12'] = '12_eryth' #Alas2\n",
        "annot[annot['leiden_1.4'] == '17'] = '17_dend' #Siglech\n",
        "annot[annot['leiden_1.4'] == '13'] = '13_mast_baso' #Cpa3\n",
        "annot[annot['leiden_1.4'] == '5'] = '5_mono' #Cts3\n",
        "annot[annot['leiden_1.4'] == '2'] = '2_gran' #Ltf\n",
        "annot[annot['leiden_1.4'] == '14'] = '14_TC' #Cd3e\n",
        "annot[annot['leiden_1.4'] == '15'] = '15_BC' #Ms4a1\n",
        "annot[annot['leiden_1.4'] == '0'] = '0_progen'  # Cd34\n",
        "annot[annot['leiden_1.4'] == '9'] = '9_progen' \n",
        "annot[annot['leiden_1.4'] == '7'] = '7_progen'\n",
        "\n",
        "adata.obs['annot']=annot['leiden_1.4'].astype('category')\n",
        "\n",
        "sc.pl.umap(adata, color = 'annot',legend_loc = 'on data', legend_fontsize = 'xx-small', ncols =2)\n",
        "\n",
        "annot.value_counts()\n",
        "#type(annot)\n",
        "\n",
        "# astype('category')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: plot-annot\n",
        "# plot onto the Seurat embedding:\n",
        "sc.pl.embedding(adata, basis='X_umap_old', color = 'annot',legend_loc = 'on data', legend_fontsize = 'xx-small', ncols =2)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Run PAGA\n",
        "\n",
        "Use the clusters from leiden clustering with `leiden_1.4` and run PAGA.\n",
        "PAGA is very good at revealing global topologies of the data that might\n",
        "be distorted in a UMAP. First we create the graph and initialize the\n",
        "positions using the UMAP."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: paga\n",
        "# use the umap to initialize the graph layout.\n",
        "sc.tl.draw_graph(adata, init_pos='X_umap')\n",
        "sc.pl.draw_graph(adata, color='annot', legend_loc='on data', legend_fontsize = 'xx-small')\n",
        "sc.tl.paga(adata, groups='annot')\n",
        "sc.pl.paga(adata, color='annot', edge_width_scale = 0.3)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see, we have edges between many clusters that we know are are\n",
        "unrelated, so we may need to clean up the data a bit more.\n",
        "\n",
        "## Filtering graph edges\n",
        "\n",
        "First, lets explore the graph a bit. We plot the umap with the graph\n",
        "edges on top. The edges signify neighborhood between cells in the\n",
        "KNN-graph."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: plot-graph\n",
        "sc.pl.umap(adata, edges=True, color = 'annot', legend_loc= 'on data', legend_fontsize= 'xx-small')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We have many edges in the graph between unrelated clusters, so lets try\n",
        "recalculating the graph with fewer neighbors. You can experiment with\n",
        "calculating different numbers of neighbors to see how it affects the\n",
        "graph and the connection between cells and nodes. Choosing too few\n",
        "neighbors however (\\<=5) will lead to problems with the downstream\n",
        "analysis for this dataset, as the graph becomes too sparse."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: redo-graph\n",
        "sc.pp.neighbors(adata, n_neighbors=8,  use_rep = 'X_harmony_Phase', n_pcs = 30, random_state=159)\n",
        "sc.pl.umap(adata, edges=True, color = 'annot', legend_loc= 'on data', legend_fontsize= 'xx-small')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Rerun PAGA again on the data"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: draw-graph\n",
        "sc.tl.draw_graph(adata, init_pos='X_umap')\n",
        "sc.pl.draw_graph(adata, color='annot', legend_loc='on data', legend_fontsize = 'xx-small')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: paga2\n",
        "sc.tl.paga(adata, groups='annot')\n",
        "sc.pl.paga(adata, color='annot', edge_width_scale = 0.3)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Embedding using PAGA-initialization\n",
        "\n",
        "We can now redraw the graph using another starting position from the\n",
        "paga layout. The following is just as well possible for a UMAP."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: draw-graph-paga\n",
        "sc.tl.draw_graph(adata, init_pos='paga')\n",
        "sc.pl.draw_graph(adata, color=['annot'], legend_loc='on data', legend_fontsize=  'xx-small')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now compare the two graphs. On the left, each cell is embedded on\n",
        "our PAGA graph, whereas on the right, each cluster of cells is\n",
        "represented as a node in the graph and each edge represents the\n",
        "connectivity between two clusters."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: paga-compare\n",
        "sc.pl.paga_compare(\n",
        "    adata, threshold=0.03, title='', right_margin=0.2, size=10, edge_width_scale=0.5,\n",
        "    legend_fontsize=12, fontsize=12, frameon=False, edges=True)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Discuss**\n",
        ">\n",
        "> Does this graph fit the biological expectations given what you know of\n",
        "> hematopoesis. Please have a look at the figure in Section 2 and\n",
        "> compare to the paths you now have.\n",
        "\n",
        "</div>\n",
        "\n",
        "## Detecting gene changes along trajectories\n",
        "\n",
        "We can reconstruct gene changes along PAGA paths for a given set of\n",
        "genes. For this we calculate the diffusion pseudotime, which orders\n",
        "cells on a pseudotemporal axis based on their expression profile.\n",
        "\n",
        "Choose a root cell for diffusion pseudotime. We have 3 progenitor\n",
        "clusters, but cluster 2 seems the most clear based on the Cd34 Marker,\n",
        "so we chose that cluster as starting timepoint for our pseudotime."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: pseudotime\n",
        "adata.uns['iroot'] = np.flatnonzero(adata.obs['annot']  == '0_progen')[0]\n",
        "\n",
        "sc.tl.dpt(adata)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Use the full raw data for visualization."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: plot-pt\n",
        "sc.pl.draw_graph(adata, color=['annot', 'dpt_pseudotime'], legend_loc='on data', legend_fontsize= 'x-small')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Discuss**\n",
        ">\n",
        "> The pseudotime represents the distance of every cell to the starting\n",
        "> cluster. Have a look at the pseudotime plot, how well do you think it\n",
        "> represents actual developmental time? What does it represent?\n",
        "\n",
        "</div>\n",
        "\n",
        "## Using Cellrank to investigate gene expression changes along specific trajectories\n",
        "\n",
        "Now that we have calculated pseudotime, we are interested in what genes\n",
        "might change during a certain lineages differentiation trajectory. To\n",
        "analyse this, we use Cellrank - a Python library that computes\n",
        "probablistic cell fate mappings. It basically models the probabilities\n",
        "for each cell to end up in a specific terminal state (in our case, fully\n",
        "differentiated cell types).\n",
        "\n",
        "Cellrank can use different metrics to calculate a transition matrix,\n",
        "such as RNA velocity or pseudotime. As we have calculated pseudotime for\n",
        "our dataset above, we will use that for the calculation of our\n",
        "transition matrix."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#Set up a cellrank kernel \n",
        "pt_kernel = cr.kernels.PseudotimeKernel(adata, time_key='dpt_pseudotime')\n",
        "pt_kernel.compute_transition_matrix()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In order to calculate terminal fates and lineage commitment, cellrank\n",
        "uses estimators like GPCCA. We set up an estimator from our\n",
        "PseudotimeKernel (`pt_kernel`) and compute macrostates in our dataset\n",
        "(groups of cells with similar transition dynamics)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#Set up an estimator to analyse the kernel\n",
        "pt_est =  cr.estimators.GPCCA(pt_kernel)\n",
        "\n",
        "#Compute macrostates\n",
        "pt_est.fit(cluster_key='annot', n_states=[10,15], n_cells=50)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now that we have created an estimator and computed macrostates, we can\n",
        "either automatically compute our initial and terminal cell states using\n",
        ", or we can set them manually. In this case, we will set the initial\n",
        "state manually to `0_progen`, but let cellrank predict the terminal\n",
        "states."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#Predict initial and terminal states\n",
        "pt_est.set_initial_states({'0_progen': adata.obs_names[adata.obs['annot'] == '0_progen']}, \n",
        "                         cluster_key='annot', n_cells=30)\n",
        "\n",
        "pt_est.predict_terminal_states(n_states=10, method='top_n', allow_overlap=True)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's plot the initial and terminal states on our single cell graph to\n",
        "see how well they correspond to what we would expect."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Create figure with 1 row, 2 columns\n",
        "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))\n",
        "\n",
        "# First plot: macrostates (left subplot)\n",
        "scv.pl.scatter(\n",
        "    adata,\n",
        "    color='init_states_fwd',\n",
        "    basis='X_draw_graph_fa',\n",
        "    legend_loc='right',\n",
        "    frameon=False,\n",
        "    size=10,\n",
        "    linewidth=0,\n",
        "    ax=ax1,  # Pass the axis\n",
        "    show=False  # Don't show yet\n",
        ")\n",
        "\n",
        "# Second plot: terminal states (right subplot)  \n",
        "scv.pl.scatter(\n",
        "    adata,\n",
        "    color='term_states_fwd', \n",
        "    basis='X_draw_graph_fa',\n",
        "    legend_loc='right',\n",
        "    frameon=False,\n",
        "    size=10,\n",
        "    linewidth=0,\n",
        "    ax=ax2,  # Pass the axis\n",
        "    show=False  # Don't show yet\n",
        ")\n",
        "\n",
        "# Add titles\n",
        "ax1.set_title('Macrostates (Forward)')\n",
        "ax2.set_title('Terminal States (Forward)')\n",
        "\n",
        "plt.tight_layout()\n",
        "plt.show()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Discuss**\n",
        ">\n",
        "> Are the initial and terminal states predicted as you expected? What\n",
        "> deviates from your expectation?\n",
        ">\n",
        "> The results are influenced by a multitude of parameters, such as the\n",
        "> PAGA graph connections (mediated by the n_neighbors parameter in\n",
        "> sc.pp.neighbors) and the `n_states` parameter in `pt_est.fit`. Feel\n",
        "> free to play with these parameters and see how the results change.\n",
        "\n",
        "</div>\n",
        "\n",
        "Now that we have calculated our initial and terminal states, we will\n",
        "calculate fate probabilities. Cellrank does this by using Markov Chains,\n",
        "which calculates the probability for each cell to end up at a given\n",
        "terminal state. Later on we will be able to use this information to\n",
        "investigate which genes drive differentiation from a progenitor towards\n",
        "a certain fully differentiated cell type."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#Compute and plot cell fate probabilities\n",
        "pt_est.compute_fate_probabilities(use_petsc=False, solver=\"direct\")\n",
        "pt_est.plot_fate_probabilities(legend_loc='right', basis='X_draw_graph_fa')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now that we have computed the fate probabilities of every cell, we can\n",
        "compute **driver genes** - Cellrank does this by correlating fate\n",
        "probabilities with gene expression, under the assumption that genes that\n",
        "have a strong correlation with the fate probabilities of a certain\n",
        "lineage might drive differentiatin processes towards the lineages\n",
        "terminal state.\n",
        "\n",
        "Below we will calculate driver genes for the erythrocyte lineage\n",
        "(`12_eryth`) and the megakaryocyte lineage (`10_megakaryo`)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#Calculate lineage drivers\n",
        "drivers_ery = pt_est.compute_lineage_drivers(lineages=['12_eryth'], cluster_key='annot')\n",
        "#Sort\n",
        "drivers_ery_sort = drivers_ery.sort_values(by='12_eryth_corr', ascending=False)\n",
        "\n",
        "drivers_mega = pt_est.compute_lineage_drivers(lineages=['10_megakaryo'], cluster_key='annot')\n",
        "drivers_mega_sort = drivers_mega.sort_values(by='10_megakaryo_corr', ascending=False)\n",
        "\n",
        "#Get top 10 eruythrocyte driver genes\n",
        "drivers_ery_sort.head(n=10)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we plot the top 10 previously calculated driver genes for the\n",
        "erythrocyte and megakaryocyte lineages."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pl.draw_graph(adata, color=drivers_ery_sort.index.tolist()[1:10], legend_loc='on data', legend_fontsize= 'x-small', use_raw=False, ncols=3)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#Also plot the lineage drivers for the granulocyte lineage\n",
        "sc.pl.draw_graph(adata, color=drivers_mega_sort.index.tolist()[1:10], legend_loc='on data', legend_fontsize= 'x-small', use_raw=False, ncols=3)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Discuss**\n",
        ">\n",
        "> Discuss amongst yourselves which genes appear to be good markers for\n",
        "> the respective lineages and why!\n",
        "\n",
        "</div>\n",
        "\n",
        "Now we want to look at the expression trends for some of the genes we\n",
        "just identified as potential drivers over pseudotime. Cellrank uses\n",
        "Generalized Additive Models (GAMs) to fit smooth expression curves, so\n",
        "we start by creating such models for our data."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#Fit GAMs\n",
        "model = cr.models.GAM(adata, n_knots=6)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we plot some of the interesting genes of the erythrocyte (Car1,\n",
        "Atpif1, Vamp5) and megakaryocyte (Cavin2, Pbx1, Ctla2a) lineages. We\n",
        "only plot the genes for their respective lineage, in order to compare\n",
        "their expression at different pseudotime points during the lineages\n",
        "developmental trajectory."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "cr.pl.gene_trends(adata, \n",
        "                  model = model,\n",
        "                  genes = ['Car1', 'Atpif1', 'Vamp5'],\n",
        "                  same_plot=True,\n",
        "                  time_key = 'dpt_pseudotime',\n",
        "                  lineages = ['12_eryth'],\n",
        "                  hide_cells=True,\n",
        "                  ncols=3,\n",
        "                  n_jobs=1,\n",
        "                  backend='threading')\n",
        "\n",
        "cr.pl.gene_trends(adata, \n",
        "                  model = model,\n",
        "                  genes = ['Cavin2', 'Pbx1', 'Rab27b'],\n",
        "                  same_plot=True,\n",
        "                  time_key = 'dpt_pseudotime',\n",
        "                  lineages = ['10_megakaryo'],\n",
        "                  hide_cells=True,\n",
        "                  ncols=3,\n",
        "                  n_jobs=1,\n",
        "                  backend='threading')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "What would this look like if we were to try and compare the gene\n",
        "expression between the two lineages?"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#Plot gene trends for erythrocyte and megakaryo lineages\n",
        "cr.pl.gene_trends(adata, \n",
        "                  model = model,\n",
        "                  genes = ['Cavin2', 'Pbx1', 'Car1'],\n",
        "                  same_plot=True,\n",
        "                  time_key = 'dpt_pseudotime',\n",
        "                  lineages = ['10_megakaryo', '12_eryth'],\n",
        "                  hide_cells=True,\n",
        "                  ncols=3,\n",
        "                  n_jobs=1,\n",
        "                  backend='threading')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Discuss**\n",
        ">\n",
        "> Why do the different lineages in the plots above have different\n",
        "> maximum pseudotime values? In how far do you think the gene expression\n",
        "> values are comparable across the two lineages?\n",
        "\n",
        "</div>\n",
        "\n",
        "We can also visualize the expression of several genes over pseudotime in\n",
        "a single trajectory. Let's do this for the 30 top driver genes of the\n",
        "erythrocyte and megakaryocyte lineages. If you have any genes that you\n",
        "are interested in, plot these using the code below."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#Plot expression cascades of several genes over pseudotime\n",
        "cr.pl.heatmap(\n",
        "    adata,\n",
        "    model=model,\n",
        "    lineages='12_eryth',\n",
        "    cluster_key='annot',\n",
        "    show_fate_probabilities=True,\n",
        "    genes=drivers_ery_sort.index.tolist()[1:30],\n",
        "    time_key='dpt_pseudotime',\n",
        "    figsize=(12,10),\n",
        "    show_all_genes=True\n",
        ")\n",
        "\n",
        "cr.pl.heatmap(\n",
        "    adata,\n",
        "    model=model,\n",
        "    lineages='10_megakaryo',\n",
        "    cluster_key='annot',\n",
        "    show_fate_probabilities=True,\n",
        "    genes=drivers_mega_sort.index.tolist()[1:30],\n",
        "    time_key='dpt_pseudotime',\n",
        "    figsize=(12,10),\n",
        "    show_all_genes=True\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now that we have visualized expression trends for several genes\n",
        "associated with the our lineages, we could be interested in identifying\n",
        "**other genes that follow a specific expression pattern** along the\n",
        "trajectory. Let's say we are interested in genes from the erythrocyte\n",
        "lineage that increase with pseudotime (like Car1) and genes whose\n",
        "expression peaks and then falls off again before the maximum pseudotime\n",
        "value is reached (like Vamp5).\n",
        "\n",
        "Cellrank provides a way for us to identify genes that have a similar\n",
        "expression along the trajectory, using the `cr.pl.cluster_trends`\n",
        "function. You can modify the number of clusters by modifying the\n",
        "resolution parameter until you find the clustering to be appropriate."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#Then we cluster trends based on the erythrocyte driver genes\n",
        "cr.pl.cluster_trends(\n",
        "    adata,\n",
        "    model=model,  # use the model from before\n",
        "    lineage=\"12_eryth\",\n",
        "    genes=drivers_ery_sort.index.tolist(),\n",
        "    time_key=\"dpt_pseudotime\",\n",
        "    n_jobs=8,\n",
        "    random_state=0,\n",
        "    clustering_kwargs={\"resolution\": 0.4, \"random_state\": 0},\n",
        "    neighbors_kwargs={\"random_state\": 0},\n",
        "    recompute=True\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The results of the clustering trend are saved as an anndata object in\n",
        "`adata.uns['lineage_12_eryth_trend']`. We will now use this to extract\n",
        "genes that follow a specific trend and plot them."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#Save trends as new anndata object\n",
        "ery_trends = adata.uns['lineage_12_eryth_trend'].copy()\n",
        "\n",
        "#Add means and dispersions from original data to the object\n",
        "cols = [\"means\", \"dispersions\"]\n",
        "ery_trends.obs = ery_trends.obs.merge(\n",
        "    right=adata.var[cols], how=\"left\", left_index=True, right_index=True\n",
        ")\n",
        "\n",
        "#Extract genes that follow the expression patterns of the respective clusters\n",
        "low_genes = list(\n",
        "    ery_trends[ery_trends.obs[\"clusters\"] == \"4\"]\n",
        "    .obs.sort_values(\"means\", ascending=False)\n",
        "    .index\n",
        ")\n",
        "\n",
        "mid_genes = list(\n",
        "    ery_trends[ery_trends.obs[\"clusters\"] == \"6\"]\n",
        "    .obs.sort_values(\"means\", ascending=False)\n",
        "    .index\n",
        ")\n",
        "\n",
        "high_genes = list(\n",
        "    ery_trends[ery_trends.obs[\"clusters\"] == \"1\"]\n",
        "    .obs.sort_values(\"means\", ascending=False)\n",
        "    .index\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#There are a lot of ribosomal genes in our gene lists - these are probably not as interesting as other genes, so we remove them from our lists right now\n",
        "low_genes_filt = [str(x) for x in low_genes if not x.lower().startswith(('rpl', 'rps'))]\n",
        "mid_genes_filt = [str(x) for x in mid_genes if not x.lower().startswith(('rpl', 'rps'))]\n",
        "high_genes_filt = [str(x) for x in high_genes if not x.lower().startswith(('rpl', 'rps'))]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#Plot the selected genes\n",
        "cr.pl.gene_trends(\n",
        "    adata,\n",
        "    model=model,\n",
        "    lineages=\"12_eryth\",\n",
        "    cell_color=\"annot\",\n",
        "    genes=high_genes_filt[0:3]+mid_genes_filt[0:3]+low_genes_filt[0:3],\n",
        "    same_plot=True,\n",
        "    ncols=3,\n",
        "    time_key=\"dpt_pseudotime\",\n",
        "    hide_cells=True\n",
        "    )"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You can visaulize any number of genes in this manner in order to see\n",
        "which ones are interesting for your specific question.\n",
        "\n",
        "<div>\n",
        "\n",
        "> **Discuss**\n",
        ">\n",
        "> As you can see, we can manipulate the trajectory quite a bit by\n",
        "> selecting different number of neighbors, components etc. to fit with\n",
        "> our assumptions on the development of these celltypes.\n",
        ">\n",
        "> Please explore further how you can tweak the trajectory. For instance,\n",
        "> can you create a PAGA trajectory using the orignial UMAP from Seurat\n",
        "> instead? Hint, you first need to compute the neighbors on the UMAP.\n",
        "\n",
        "</div>\n",
        "\n",
        "## Session info\n",
        "\n",
        "```{=html}\n",
        "<details>\n",
        "```\n",
        "```{=html}\n",
        "<summary>\n",
        "```\n",
        "Click here\n",
        "```{=html}\n",
        "</summary>\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: session\n",
        "sc.logging.print_versions()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "```{=html}\n",
        "</details>\n",
        "```"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}