{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Single cell data analysis using Scanpy\n",
    "\n",
    "* __Notebook version__: `v0.0.5`\n",
    "* __Created by:__ `Imperial BRC Genomics Facility`\n",
    "* __Maintained by:__ `Imperial BRC Genomics Facility`\n",
    "* __Docker image:__ `imperialgenomicsfacility/scanpy-notebook-image:release-v0.0.4`\n",
    "* __Github repository:__ [imperial-genomics-facility/scanpy-notebook-image](https://github.com/imperial-genomics-facility/scanpy-notebook-image)\n",
    "* __Created on:__ `2021-May-04 12:32`\n",
    "* __Contact us:__ [Imperial BRC Genomics Facility](https://www.imperial.ac.uk/medicine/research-and-impact/facilities/genomics-facility/contact-us/)\n",
    "* __License:__ [Apache License 2.0](https://github.com/imperial-genomics-facility/scanpy-notebook-image/blob/master/LICENSE)\n",
    "\n",
    "\n",
    "## Table of contents\n",
    "  * [Introduction](#Introduction)\n",
    "  * [Tools required](#Tools-required)\n",
    "  * [Loading required libraries](#Loading-required-libraries)\n",
    "  * [Reading data from Cellranger output](#Reading-data-from-Cellranger-output)\n",
    "  * [Data processing and visualization](#Data-processing-and-visualization)\n",
    "    * [Checking highly variable genes](#Checking-highly-variable-genes)\n",
    "    * [Doublet detection using Scrublet](#Doublet-detection-using-Scrublet)\n",
    "      * [Plot doublet score histograms for observed transcriptomes and simulated doublets](#Plot-doublet-score-histograms-for-observed-transcriptomes-and-simulated-doublets)\n",
    "    * [Quality control](#Quality-control)\n",
    "      * [Computing metrics for cell QC](#Computing-metrics-for-cell-QC)\n",
    "      * [Ploting predicted doublets](#Ploting-predicted-doublets)\n",
    "      * [Plotting MT gene fractions](#Ploting-MT-gene-fractions)\n",
    "      * [Ploting predicted dublets](#Ploting-predicted-dublets)\n",
    "      * [Count depth distribution](#Count-depth-distribution)\n",
    "      * [Gene count distribution](#Gene-count-distribution)\n",
    "      * [Counting cells per gene](#Counting-cells-per-gene)\n",
    "      * [Ploting count depth vs MT fraction](#Plotting-count-depth-vs-MT-fraction)\n",
    "      * [Checking thresholds and filtering data](#Checking-thresholds-and-filtering-data)\n",
    "    * [Normalization](#Normalization)\n",
    "    * [Highly variable genes](#Highly-variable-genes)\n",
    "    * [Regressing out technical effects](#Regressing-out-technical-effects)\n",
    "    * [Principal component analysis](#Principal-component-analysis)\n",
    "    * [Neighborhood graph](#Neighborhood-graph)\n",
    "      * [Clustering the neighborhood graph](#Clustering-the-neighborhood-graph)\n",
    "      * [Embed the neighborhood graph using UMAP](#Embed-the-neighborhood-graph-using-UMAP)\n",
    "        * [Plotting 3D UMAP](#Plotting-3D-UMAP)\n",
    "        * [Plotting 2D UMAP](#Plotting-2D-UMAP)\n",
    "      * [Embed doublets using UMAP](#Embed-doublets-using-UMAP)\n",
    "      * [Embed the neighborhood graph using tSNE](#Embed-the-neighborhood-graph-using-tSNE)\n",
    "      * [Embed the neighborhood graph using Diffusion Map and Graph](#Embed-the-neighborhood-graph-using-Diffusion-Map-and-Graph)\n",
    "      * [Embed the neighborhood graph using PHATE](#Embed-the-neighborhood-graph-using-PHATE)\n",
    "      * [Embed the neighborhood graph using PhenoGraph](#Embed-the-neighborhood-graph-using-PhenoGraph)\n",
    "    * [Finding marker genes](#Finding-marker-genes)\n",
    "      * [Stacked violin plot of ranked genes](#Stacked-violin-plot-of-ranked-genes)\n",
    "      * [Dot plot of ranked genes](#Dot-plot-of-ranked-genes)\n",
    "      * [Matrix plot of ranked genes](#Matrix-plot-of-ranked-genes)\n",
    "      * [Heatmap plot of ranked genes](#Heatmap-plot-of-ranked-genes)\n",
    "      * [Tracks plot of ranked genes](#Tracks-plot-of-ranked-genes)\n",
    "    * [Cell annotation](#Cell-annotation)\n",
    "    * [Cell cycle scoring](#Cell-cycle-scoring)\n",
    "    * [Trajectory analysis](#Trajectory-analysis)\n",
    "      * [Partition-based graph abstraction](#Partition-based-graph-abstraction)\n",
    "  * [Explore cells in UCSC Cell Browser](#Explore-cells-in-UCSC-Cell-Browser)\n",
    "  * [References](#References)\n",
    "  * [Acknowledgement](#Acknowledgement)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "This notebook for running single cell data analysis (for a single sample) using Scanpy package. Most of the codes and documentation used in this notebook has been copied from the following sources:\n",
    "\n",
    "* [Scanpy - Preprocessing and clustering 3k PBMCs](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html)\n",
    "* [Single-cell-tutorial](https://github.com/theislab/single-cell-tutorial)\n",
    "\n",
    "## Tools required\n",
    "* [Scanpy](https://scanpy-tutorials.readthedocs.io/en/latest)\n",
    "* [Plotly](https://plotly.com/python/)\n",
    "* [UCSC Cell Browser](https://pypi.org/project/cellbrowser/)\n",
    "* [Scrublet](https://github.com/swolock/scrublet)\n",
    "\n",
    "## Loading required libraries\n",
    "\n",
    "We need  to load all the required libraries to environment before we can run any of the analysis steps. Also, we are checking the version information for most of the major packages used for analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import os\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scanpy as sc\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "from copy import deepcopy\n",
    "import plotly.graph_objs as go\n",
    "from IPython.display import HTML\n",
    "from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot\n",
    "sc.settings.verbosity = 0\n",
    "sc.logging.print_header()\n",
    "sns.set_theme(context='notebook', style='darkgrid', palette='colorblind')\n",
    "plt.rcParams['figure.dpi'] = 150"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are setting the output file path to $/tmp/scanpy\\_output.h5ad$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results_file = '/tmp/scanpy_output.h5ad'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following steps are only required for downloading test data from 10X Genomics's website."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "rm -rf cache\n",
    "rm -rf /tmp/data\n",
    "mkdir -p /tmp/data\n",
    "wget -q -O /tmp/data/pbmc3k_filtered_gene_bc_matrices.tar.gz \\\n",
    "  /tmp/data http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz\n",
    "cd /tmp/data\n",
    "tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "## Reading data from Cellranger output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the Cellranger output to Scanpy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata = \\\n",
    "  sc.read_10x_mtx(\n",
    "    '/tmp/data/filtered_gene_bc_matrices/hg19/',\n",
    "    var_names='gene_symbols',\n",
    "    cache=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Converting the gene names to unique values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.var_names_make_unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Checking the data dimensions before checking QC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scanpy stores the count data is an annotated data matrix ($observations$ e.g. cell barcodes × $variables$ e.g gene names) called [AnnData](https://anndata.readthedocs.io) together with annotations of observations($obs$), variables ($var$) and unstructured annotations ($uns$)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "## Data processing and visualization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Checking highly variable genes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Computing fraction of counts assigned to each gene over all cells. The top genes with the highest mean fraction over all cells are\n",
    "plotted as boxplots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig,ax = plt.subplots(1,1,figsize=(4,5))\n",
    "sc.pl.highest_expr_genes(adata, n_top=20,ax=ax)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "### Doublet detection using Scrublet\n",
    "\n",
    "[<b>S</b>ingle-<b>C</b>ell <b>R</b>emover of Do<b>ublet</b>](https://github.com/swolock/scrublet) predicts cell doublets using a nearest-neighbor classifier of observed transcriptomes and simulated doublets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.external.pp.scrublet(adata)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Plot doublet score histograms for observed transcriptomes and simulated doublets\n",
    "\n",
    "Scrublet tries to identify dublets using a threshold doublet score (ideally set at the minimum of the simulated doublet histogram).\n",
    "\n",
    "For more information, check this [tutorial notebook](https://github.com/swolock/scrublet/blob/master/examples/scrublet_basics.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.external.pl.scrublet_score_distribution(adata);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "### Quality control"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Checking $obs$ section of the AnnData object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.obs.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Checking the $var$ section of the AnnData object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.var.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Computing metrics for cell QC"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Listing the Mitochondrial genes detected in the cell population"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mt_genes = 0\n",
    "mt_genes = [gene for gene in adata.var_names if gene.startswith('MT-')]\n",
    "mito_genes = adata.var_names.str.startswith('MT-')\n",
    "if len(mt_genes)==0:\n",
    "    print('Looking for mito genes with \"mt-\" prefix')\n",
    "    mt_genes = [gene for gene in adata.var_names if gene.startswith('mt-')]\n",
    "    mito_genes = adata.var_names.str.startswith('mt-')\n",
    "\n",
    "if len(mt_genes)==0:\n",
    "    print(\"No mitochondrial genes found\")\n",
    "else:\n",
    "    print(\"Mitochondrial genes: count: {0}, lists: {1}\".format(len(mt_genes),mt_genes))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Typical quality measures for assessing the quality of a cell includes the following components\n",
    "* Number of molecule counts (UMIs or $n\\_counts$ )\n",
    "* Number of expressed genes ($n\\_genes$)\n",
    "* Fraction of counts that are mitochondrial ($percent\\_mito$)\n",
    "\n",
    "We are calculating the above mentioned details using the following codes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.obs['mito_counts'] =  np.sum(adata[:, mito_genes].X, axis=1).A1\n",
    "adata.obs['percent_mito'] = \\\n",
    "  np.sum(adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1\n",
    "adata.obs['n_counts'] = adata.X.sum(axis=1).A1\n",
    "adata.obs['log_counts'] = np.log(adata.obs['n_counts'])\n",
    "adata.obs['n_genes'] = (adata.X > 0).sum(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Checking $obs$ section of the AnnData object again"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.obs.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sorting barcodes based on the $percent\\_mito$ column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.obs.sort_values('percent_mito',ascending=False).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A high fraction of mitochondrial reads being picked up can indicate cell stress, as there is a low proportion of nuclear mRNA in the cell. It should be noted that high mitochondrial RNA fractions can also be biological signals indicating elevated respiration. <p/>\n",
    "\n",
    "Cell barcodes with high count depth, few detected genes and high fraction of mitochondrial counts may indicate cells whose cytoplasmic mRNA has leaked out due to a broken membrane and only the mRNA located in the mitochondria has survived. <p/>\n",
    "\n",
    "Cells with high UMI counts and detected genes may represent doublets (it requires further checking).\n",
    "\n",
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Ploting predicted doublets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.obs['predicted_doublet_str'] = adata.obs['predicted_doublet'].map(lambda x: 'dublets' if x else 'NA').astype(str)\n",
    "plt.rcParams['figure.figsize']=(8,6)\n",
    "ax = sc.pl.scatter(adata, 'n_counts', 'n_genes', color='predicted_doublet_str',show=False,palette=sns.color_palette('colorblind'))\n",
    "ax.set_title('Predicted dublets', fontsize=12)\n",
    "ax.set_xlabel(\"Count depth\",fontsize=12)\n",
    "ax.set_ylabel(\"Number of genes\",fontsize=12)\n",
    "ax.tick_params(labelsize=12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Ploting MT gene fractions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(8,5)\n",
    "sc.pl.violin(\\\n",
    "  adata,\n",
    "  ['n_genes', 'n_counts', 'percent_mito'],\n",
    "  jitter=0.4,\n",
    "  log=True,\n",
    "  stripplot=True,\n",
    "  show=False,\n",
    "  multi_panel=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Violin plot (above) shows the computed quality measures of UMI counts, gene counts and fraction of mitochondrial counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(8,6)\n",
    "ax = sc.pl.scatter(adata, 'n_counts', 'n_genes', color='percent_mito',show=False,)\n",
    "ax.set_title('Fraction mitochondrial counts', fontsize=12)\n",
    "ax.set_xlabel(\"Count depth\",fontsize=12)\n",
    "ax.set_ylabel(\"Number of genes\",fontsize=12)\n",
    "ax.tick_params(labelsize=12)\n",
    "ax.axhline(700, 0,1, color='red')\n",
    "ax.axvline(1500, 0,1, color='red')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above scatter plot shows number of genes vs number of counts with $MT$ fraction information. We will be using a cutoff of 1500 counts and 700 genes (<span style=\"color:red\">red lines</span>) to filter out dying cells. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(8,6)\n",
    "ax = sc.pl.scatter(adata[adata.obs['n_counts']<10000], 'n_counts', 'n_genes', color='percent_mito',show=False)\n",
    "ax.set_title('Fraction mitochondrial counts', fontsize=12)\n",
    "ax.set_xlabel(\"Count depth\",fontsize=12)\n",
    "ax.set_ylabel(\"Number of genes\",fontsize=12)\n",
    "ax.tick_params(labelsize=12)\n",
    "ax.axhline(700, 0,1, color='red')\n",
    "ax.axvline(1500, 0,1, color='red')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A similar scatter plot, but this time we have restricted the counts to below _10K_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Count depth distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(8,6)\n",
    "count_data = adata.obs['n_counts'].copy()\n",
    "count_data.sort_values(inplace=True, ascending=False)\n",
    "order =  range(1, len(count_data)+1)\n",
    "ax = plt.semilogy(order, count_data, 'b-')\n",
    "plt.gca().axhline(1500, 0,1, color='red')\n",
    "plt.xlabel(\"Barcode rank\", fontsize=12)\n",
    "plt.ylabel(\"Count depth\", fontsize=12)\n",
    "plt.tick_params(labelsize=12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above plot is similar to _UMI counts_ vs _Barcodes_ plot of Cellranger report and it shows the count depth distribution from high to low count depths. This plot can be used to decide the threshold of count depth to filter out empty droplets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(8,6)\n",
    "ax = sns.histplot(adata.obs['n_counts'], kde=False)\n",
    "ax.set_xlabel(\"Count depth\",fontsize=12)\n",
    "ax.set_ylabel(\"Frequency\",fontsize=12)\n",
    "ax.axvline(1500, 0,1, color='red')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above histogram plot shows the distribution of count depth and the <span style=\"color:red\">red line</span> marks the count threshold 1500."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(8,6)\n",
    "if (adata.obs['n_counts'].max() - 10000)> 10000:\n",
    "    print('Checking counts above 10K')\n",
    "    ax = sns.histplot(adata.obs['n_counts'][adata.obs['n_counts']>10000], kde=False, bins=60)\n",
    "    ax.set_xlabel(\"Count depth\",fontsize=12)\n",
    "    ax.set_ylabel(\"Frequency\",fontsize=12)\n",
    "else:\n",
    "    print(\"Skip checking counts above 10K\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(8,6)\n",
    "if adata.obs['n_counts'].max() > 2000:\n",
    "  print('Zooming into first 2000 counts')\n",
    "  ax = sns.histplot(adata.obs['n_counts'][adata.obs['n_counts']<2000], kde=False, bins=60)\n",
    "  ax.set_xlabel(\"Count depth\",fontsize=12)\n",
    "  ax.set_ylabel(\"Frequency\",fontsize=12)\n",
    "  ax.axvline(1500, 0,1, color='red')\n",
    "else:\n",
    "  print(\"Failed to zoom into the counts below 2K\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Gene count distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.histplot(adata.obs['n_genes'], kde=False)\n",
    "ax.set_xlabel(\"Number of genes\",fontsize=12)\n",
    "ax.set_ylabel(\"Frequency\",fontsize=12)\n",
    "ax.tick_params(labelsize=12)\n",
    "ax.axvline(700, 0,1, color='red')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above histogram plot shows the distribution of gene counts and the <span style=\"color:red\">red line</span> marks the gene count threshold 700."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if adata.obs['n_genes'].max() > 1000:\n",
    "  print('Zooming into first 1000 gene counts')\n",
    "  ax = sns.histplot(adata.obs['n_genes'][adata.obs['n_genes']<1000], kde=False,bins=60)\n",
    "  ax.set_xlabel(\"Number of genes\",fontsize=12)\n",
    "  ax.set_ylabel(\"Frequency\",fontsize=12)\n",
    "  ax.tick_params(labelsize=12)\n",
    "  ax.axvline(700, 0,1, color='red')\n",
    "else:\n",
    "  print(\"Failed to zoom into the gene counts below 1K\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use a permissive filtering threshold of 1500 counts and 700 gene counts to filter out the dying cells or empty droplets with ambient RNA."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Counting cells per gene"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.var['cells_per_gene'] = np.sum(adata.X > 0, 0).T\n",
    "\n",
    "ax = sns.histplot(adata.var['cells_per_gene'][adata.var['cells_per_gene'] < 100], kde=False, bins=60)\n",
    "ax.set_xlabel(\"Number of cells\",fontsize=12)\n",
    "ax.set_ylabel(\"Frequency\",fontsize=12)\n",
    "ax.set_title('Cells per gene', fontsize=12)\n",
    "ax.tick_params(labelsize=12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Plotting count depth vs MT fraction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sc.pl.scatter(adata, x='n_counts', y='percent_mito',show=False)\n",
    "ax.set_title('Count depth vs Fraction mitochondrial counts', fontsize=12)\n",
    "ax.set_xlabel(\"Count depth\",fontsize=12)\n",
    "ax.set_ylabel(\"Fraction mitochondrial counts\",fontsize=12)\n",
    "ax.tick_params(labelsize=12)\n",
    "ax.axhline(0.2, 0,1, color='red')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The scatter plot showing the count depth vs MT fraction counts and the <span style=\"color:red\">red line</span> shows the default cutoff value for MT fraction 0.2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "### Checking thresholds and filtering data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to filter the cells which doesn't match our thresholds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Total number of cells: {0}'.format(adata.n_obs))\n",
    "\n",
    "min_counts_threshold = 1500\n",
    "max_counts_threshold = 40000\n",
    "min_gene_counts_threshold = 700\n",
    "max_mito_pct_threshold = 0.2\n",
    "\n",
    "sc.pp.filter_cells(adata, min_counts = min_counts_threshold)\n",
    "print('Number of cells after min count ({0}) filter: {1}'.format(min_counts_threshold,adata.n_obs))\n",
    "\n",
    "sc.pp.filter_cells(adata, max_counts = max_counts_threshold)\n",
    "print('Number of cells after max count ({0}) filter: {1}'.format(max_counts_threshold,adata.n_obs))\n",
    "\n",
    "sc.pp.filter_cells(adata, min_genes = min_gene_counts_threshold)\n",
    "print('Number of cells after gene ({0}) filter: {1}'.format(min_gene_counts_threshold,adata.n_obs))\n",
    "\n",
    "adata = adata[adata.obs['percent_mito'] < max_mito_pct_threshold]\n",
    "print('Number of cells after MT fraction ({0}) filter: {1}'.format(max_mito_pct_threshold,adata.n_obs))\n",
    "\n",
    "print('Total number of cells after filtering: {0}'.format(adata.n_obs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also, we need to filter out any genes that are detected in only less than 20 cells. This operation reduces the dimensions of the matrix by removing zero count genes which are not really informative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "min_cell_per_gene_threshold = 20\n",
    "\n",
    "print('Total number of genes: {0}'.format(adata.n_vars))\n",
    "\n",
    "sc.pp.filter_genes(adata, min_cells=min_cell_per_gene_threshold)\n",
    "print('Number of genes after cell filter: {0}'.format(adata.n_vars))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "### Normalization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pp.normalize_total(adata, target_sum=1e4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are using a simple total-count based normalization (library-size correct) to transform the data matrix $X$ to 10,000 reads per cell, so that counts become comparable among cells."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pp.log1p(adata)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then logarithmize the data matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.raw = adata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copying the normalized and logarithmized raw gene expression data to the `.raw` attribute of the AnnData object for later use."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "### Highly variable genes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Following codes blocks are used to identify the highly variable genes (HGV) to further reduce the dimensionality of the dataset and to include only the most informative genes. HGVs will be used for clustering, trajectory inference, and dimensionality reduction/visualization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(7,5)\n",
    "sc.pp.highly_variable_genes(adata, flavor='seurat',n_top_genes=2000)\n",
    "seurat_hgv = np.sum(adata.var['highly_variable'])\n",
    "print(\"Counts of HGVs: {0}\".format(seurat_hgv))\n",
    "sc.pl.highly_variable_genes(adata)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use a 'seurat' flavor based HGV detection step. Then, we run the following codes to do the actual filtering of data. The plots show how the data was normalized to select highly variable genes irrespective of the mean expression of the genes. This is achieved by using the index of dispersion which divides by mean expression, and subsequently binning the data by mean expression and selecting the most variable genes within each bin."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata = adata[:, adata.var.highly_variable]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "### Regressing out technical effects\n",
    "\n",
    "Normalization scales count data to make gene counts comparable between cells. But it still contain unwanted variability. One of the most prominent technical covariates in single-cell data is count depth. Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed can improve the performance of trajectory inference algorithms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scale each gene to unit variance. Clip values exceeding standard deviation 10."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pp.scale(adata, max_value=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "### Principal component analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and denoises the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.tl.pca(adata, svd_solver='arpack')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(8,5)\n",
    "sc.pl.pca(adata,color=['CST3'],size=40)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells.\n",
    "\n",
    "The first principle component captures variation in count depth between cells and its marginally informative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.pca_variance_ratio(adata, log=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us compute the neighborhood graph of cells using the PCA representation of the data matrix."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "### Neighborhood graph\n",
    "Computing the neighborhood graph of cells using the PCA representation of the data matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Clustering the neighborhood graph\n",
    "Scanpy documentation recommends the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag *et al.* (2018). Note that Leiden clustering (using `resolution=1`) directly clusters the neighborhood graph of cells, which we have already computed in the previous section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.tl.leiden(adata,resolution=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cluster_length = len(adata.obs['leiden'].value_counts().to_dict().keys())\n",
    "cluster_colors = sns.color_palette('colorblind',cluster_length,as_cmap=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.pca(adata,color='leiden',palette=cluster_colors,size=40)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above PCA plot does not show the expected clustering of the data in two dimensions.\n",
    "\n",
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Embed the neighborhood graph using UMAP\n",
    "\n",
    "Scanpy documentation suggests embedding the graph in 2 dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preservers trajectories.\n",
    "\n",
    "We are computing a 3D UMAP for a 3D plot then overwriting it with a 2 dimension UMAP.\n",
    "\n",
    "##### Plotting 3D UMAP\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "leiden_series = deepcopy(adata.obs['leiden'])\n",
    "cell_clusters = list(leiden_series.value_counts().to_dict().keys())\n",
    "colors = cluster_colors\n",
    "dict_map = dict(zip(cell_clusters,colors))\n",
    "color_map = leiden_series.map(dict_map).values\n",
    "labels = list(adata.obs.index)\n",
    "\n",
    "sc.tl.umap(\n",
    "  adata,\n",
    "  n_components=3)\n",
    "hovertext = \\\n",
    "  ['cluster: {0}, barcode: {1}'.\\\n",
    "     format(\n",
    "       grp,labels[index])\n",
    "         for index,grp in enumerate(leiden_series.values)]\n",
    "## plotting 3D UMAP as html file\n",
    "plot(\n",
    "  [go.Scatter3d(\n",
    "     x=adata.obsm['X_umap'][:, 0],\n",
    "     y=adata.obsm['X_umap'][:, 1],\n",
    "     z=adata.obsm['X_umap'][:, 2], \n",
    "     mode='markers',\n",
    "     marker=dict(color=color_map,\n",
    "                 size=5),\n",
    "     opacity=0.6,\n",
    "     text=labels,\n",
    "     hovertext=hovertext,\n",
    "  )],\n",
    "  filename='UMAP-3D-plot.html')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "##### Plotting 2D UMAP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.tl.umap(adata,n_components=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(8,5)\n",
    "sc.pl.umap(adata,color=['CST3'],size=40)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can replace the `['CST3']` in the previous cell with your preferred list of genes.\n",
    "\n",
    "e.g. `['LTB','IL32','CD3D']`\n",
    "\n",
    "or may be with a Python onliner code to extract gene names with a specific prefix\n",
    "\n",
    "e.g.\n",
    "\n",
    "```python\n",
    "sc.pl.umap(adata, color=[gene for gene in adata.var_names if gene.startswith('CD')], ncols=2)\n",
    "```\n",
    "\n",
    "Plot the scaled and corrected gene expression by `use_raw=False` and color them based on the leiden clusters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.umap(adata, color=['leiden'],use_raw=False,palette=cluster_colors,size=40)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Embed doublets using UMAP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.obs['predicted_doublet_str'] = adata.obs['predicted_doublet'].map(lambda x: 'dublets' if x else 'NA').astype(str)\n",
    "sc.pl.umap(adata, color=['predicted_doublet_str'],use_raw=False,palette=cluster_colors,size=40)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Embed the neighborhood graph using tSNE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.tl.tsne(adata,n_pcs=40)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.tsne(adata, color=['leiden'],palette=cluster_colors,size=40)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Embed the neighborhood graph using Diffusion Map and Graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.tl.diffmap(adata)\n",
    "sc.tl.draw_graph(adata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.diffmap(adata,color='leiden',size=40,palette=cluster_colors)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.draw_graph(adata,color='leiden',size=40,palette=cluster_colors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Diffusion Map__\n",
    "\n",
    "* Shows connections between regions of higher density\n",
    "* Very clear trajectories are suggested, but clusters are less clear\n",
    "* Each diffusion component extracts heterogeneity in a different part of the data\n",
    "\n",
    "__Graph__\n",
    "\n",
    "* Shows a central cluster and several outer clusters\n",
    "* Shows clear connections from the central cluster to outer clusters\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Embed the neighborhood graph using PHATE\n",
    "\n",
    "PHATE (Potential of Heat-diffusion for Affinity-based Trajectory Embedding) is a tool for visualizing high dimensional data. PHATE uses a novel conceptual framework for learning and visualizing the manifold to preserve both local and global distances. For more details, check [here](https://phate.readthedocs.io/en/stable/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.external.tl.phate(adata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(8,5)\n",
    "sc.external.pl.phate(adata,color='leiden',size=40,palette=cluster_colors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Embed the neighborhood graph using PhenoGraph\n",
    "\n",
    "PhenoGraph is a clustering method designed for high-dimensional single-cell data. It works by creating a graph (“network”) representing phenotypic similarities between cells and then identifying communities in this graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.external.tl.phenograph(adata,clustering_algo='leiden')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.obs['pheno_leiden'] = adata.obs['pheno_leiden'].astype(str)\n",
    "sc.pl.umap(adata,color =\"pheno_leiden\",palette=cluster_colors,size=40)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "### Finding marker genes\n",
    "\n",
    "Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the `.raw` attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(4,4)\n",
    "sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')\n",
    "sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False,ncols=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result of a Wilcoxon rank-sum (Mann-Whitney-U) test is very similar (Sonison & Robinson (2018))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(4,4)\n",
    "sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')\n",
    "sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False,ncols=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Show the 5 top ranked genes per cluster 0, 1, …, n in a dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "HTML(pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5).to_html())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Stacked violin plot of ranked genes\n",
    "Plot marker genes per cluster using stacked violin plots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.rank_genes_groups_stacked_violin(\n",
    "    adata, n_genes=5,groupby='leiden',swap_axes=False,figsize=(20,10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Dot plot of ranked genes\n",
    "The dotplot visualization provides a compact way of showing per group, the fraction of cells expressing a gene (dot size) and the mean expression of the gene in those cell (color scale)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.rank_genes_groups_dotplot(\n",
    "    adata, n_genes=5,groupby='leiden', dendrogram=True,figsize=(20,10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Matrix plot of ranked genes\n",
    "The matrixplot shows the mean expression of a gene in a group by category as a heatmap."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.rank_genes_groups_matrixplot(adata, n_genes=5, groupby='leiden', figsize=(20,10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Heatmap plot of ranked genes\n",
    "Heatmaps do not collapse cells as in matrix plots. Instead, each cells is shown in a row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.rank_genes_groups_heatmap(\n",
    "    adata, n_genes=5, show_gene_labels=True, groupby='leiden', figsize=(20,10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "#### Tracks plot of ranked genes\n",
    "The track plot shows the same information as the heatmap, but, instead of a color scale, the gene expression is represented by height."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.rank_genes_groups_tracksplot(adata, n_genes=5, cmap='bwr',figsize=(20,30))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "### Cell annotation\n",
    "\n",
    "Downloaded the cell type gene expression markers form [PanglaoDB](https://panglaodb.se/markers.html?cell_type=%27all_cells%27) and upload the file to the Jupyter server using the `Upload Files` button on the file explorer tab. Tpo generated cell annotation plot, replace the `None` value of the variable `cell_marker_list` with the marker filename. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cell_marker_list = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(10,10)\n",
    "if cell_marker_list is not None and \\\n",
    "   os.path.exists(cell_marker_list):\n",
    "  df = pd.read_csv(cell_marker_list,delimiter='\\t')\n",
    "  markers_df = df[(df['species'] == 'Hs') | (df['species'] == 'Mm Hs')]\n",
    "  cell_types = list(markers_df['cell type'].unique())\n",
    "  markers_dict = {}\n",
    "  for ctype in cell_types:\n",
    "    df = markers_df[markers_df['cell type'] == ctype]\n",
    "    markers_dict[ctype] = df['official gene symbol'].to_list()\n",
    "  cell_annotation = sc.tl.marker_gene_overlap(adata, markers_dict, key='rank_genes_groups',top_n_markers=20)\n",
    "  cell_annotation_norm = sc.tl.marker_gene_overlap(adata, markers_dict, key='rank_genes_groups', normalize='reference',top_n_markers=20)\n",
    "  plt.rcParams['figure.figsize']=(10,10)\n",
    "  sns.heatmap(cell_annotation_norm.loc[cell_annotation_norm.sum(axis=1) > 0.05,], cbar=False, annot=True)\n",
    "else:\n",
    "  print('No cell marker list found')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "### Cell cycle scoring\n",
    "\n",
    "Known sources of technical variation in the data have been investigated and corrected for (e.g. batch, count depth). A known source of biological variation that can explain the data is the cell cycle. Here, a gene list from [Macosko et al.](https://pubmed.ncbi.nlm.nih.gov/26000488/) is used to score the cell cycle effect in the data and classify cells by cell cycle phase."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%file /tmp/Macosko_cell_cycle_genes.txt\n",
    "IG1.S,S,G2.M,M,M.G1\n",
    "ACD,ABCC5,ANLN,AHI1, AGFG1\n",
    "ACYP1,ABHD10,AP3D1,AKIRIN2,AGPAT3\n",
    "ADAMTS1,ANKRD18A,ARHGAP19,ANKRD40,AKAP13\n",
    "ANKRD10,ASF1B,ARL4A,ANLN,AMD1\n",
    "APEX2,ATAD2,ARMC1,ANP32B,ANP32E\n",
    "ARGLU1,BBS2,ASXL1,ANP32E,ANTXR1\n",
    "ATAD2,BIVM,ATL2,ARHGAP19,BAG3\n",
    "BARD1,BLM,AURKB,ARL6IP1,BTBD3\n",
    "BRD7,BMI1,BCLAF1,ASXL1,CBX3\n",
    "C1orf63,BRCA1,BORA,ATF7IP,CDC42\n",
    "C7orf41,BRIP1,BRD8,AURKA,CDK7\n",
    "C14orf142,C5orf42,BUB3,BIRC2,CDKN3\n",
    "CAPN7,C11orf82,C2orf69,BIRC5,CEP70\n",
    "CASP2,CALD1,C14orf80,BUB1,CNIH4\n",
    "CASP8AP2,CALM2,CASP3,CADM1,CTR9\n",
    "CCNE1,CASP2,CBX5,CCDC88A,CWC15\n",
    "CCNE2,CCDC14,CCDC107,CCDC90B,DCP1A\n",
    "CDC6,CCDC84,CCNA2,CCNA2,DCTN6\n",
    "CDC25A,CCDC150,CCNF,CCNB2,DEXI\n",
    "CDCA7,CDC7,CDC16,CDC20,DKC1\n",
    "CDCA7L,CDC45,CDC25C,CDC25B,DNAJB6\n",
    "CEP57,CDCA5,CDCA2,CDC27,DSP\n",
    "CHAF1A,CDKN2AIP,CDCA3,CDC42EP1,DYNLL1\n",
    "CHAF1B,CENPM,CDCA8,CDCA3,EIF4E\n",
    "CLSPN,CENPQ,CDK1,CENPA,ELP3\n",
    "CREBZF,CERS6,CDKN1B,CENPE,FAM60A\n",
    "CTSD,CHML,CDKN2C,CENPF,FAM189B\n",
    "DIS3,COQ9,CDR2,CEP55,FOPNL\n",
    "DNAJC3,CPNE8,CENPL,CFLAR,FOXK2\n",
    "DONSON,CREBZF,CEP350,CIT,FXR1\n",
    "DSCC1,CRLS1,CFD,CKAP2,G3BP1\n",
    "DTL,DCAF16,CFLAR,CKAP5,GATA2\n",
    "E2F1,DEPDC7,CHEK2,CKS1B,GNB1\n",
    "EIF2A,DHFR,CKAP2,CKS2,GRPEL1\n",
    "ESD,DNA2,CKAP2L,CNOT10,GSPT1\n",
    "FAM105B,DNAJB4,CYTH2,CNTROB,GTF3C4\n",
    "FAM122A,DONSON,DCAF7,CTCF,HIF1A\n",
    "FLAD1,DSCC1,DHX8,CTNNA1,HMG20B\n",
    "GINS2,DYNC1LI2,DNAJB1,CTNND1,HMGCR\n",
    "GINS3,E2F8,ENTPD5,DEPDC1,HSD17B11\n",
    "GMNN,EIF4EBP2,ESPL1,DEPDC1B,HSPA8\n",
    "HELLS,ENOSF1,FADD,DIAPH3,ILF2\n",
    "HOXB4,ESCO2,FAM83D,DLGAP5,JMJD1C\n",
    "HRAS,EXO1,FAN1,DNAJA1,KDM5B\n",
    "HSF2,EZH2,FANCD2,DNAJB1,KIAA0586\n",
    "INSR,FAM178A,G2E3,DR1,KIF5B\n",
    "INTS8,FANCA,GABPB1,DZIP3,KPNB1\n",
    "IVNS1ABP,FANCI,GAS1,E2F5,KRAS\n",
    "KIAA1147,FEN1,GAS2L3,ECT2,LARP1\n",
    "KIAA1586,GCLM,H2AFX,FAM64A,LARP7\n",
    "LNPEP,GOLGA8A,HAUS8,FOXM1,LRIF1\n",
    "LUC7L3,GOLGA8B,HINT3,FYN,LYAR\n",
    "MCM2,H1F0,HIPK2,G2E3,MORF4L2\n",
    "MCM4,HELLS,HJURP,GADD45A,MRPL19\n",
    "MCM5,HIST1H2AC,HMGB2,GAS2L3,MRPS2\n",
    "MCM6,HIST1H4C,HN1,GOT1,MRPS18B\n",
    "MDM1,INTS7,HP1BP3,GRK6,MSL1\n",
    "MED31,KAT2A,HRSP12,GTSE1,MTPN\n",
    "MRI1,KAT2B,IFNAR1,HCFC1,NCOA3\n",
    "MSH2,KDELC1,IQGAP3,HMG20B,NFIA\n",
    "NASP,KIAA1598,KATNA1,HMGB3,NFIC\n",
    "NEAT1,LMO4,KCTD9,HMMR,NUCKS1\n",
    "NKTR,LYRM7,KDM4A,HN1,NUFIP2\n",
    "NPAT,MAN1A2,KIAA1524,HP1BP3,NUP37\n",
    "NUP43,MAP3K2,KIF5B,HPS4,ODF2\n",
    "ORC1,MASTL,KIF11,HS2ST1,OPN3\n",
    "OSBPL6,MBD4,KIF20B,HSPA8,PAK1IP1\n",
    "PANK2,MCM8,KIF22,HSPA13,PBK\n",
    "PCDH7,MLF1IP,KIF23,INADL,PCF11\n",
    "PCNA,MYCBP2,KIFC1,KIF2C,PLIN3\n",
    "PLCXD1,NAB1,KLF6,KIF5B,PPP2CA\n",
    "PMS1,NEAT1,KPNA2,KIF14,PPP2R2A\n",
    "PNN,NFE2L2,LBR,KIF20B,PPP6R3\n",
    "POLD3,NRD1,LIX1L,KLF9,PRC1\n",
    "RAB23,NSUN3,LMNB1,LBR,PSEN1\n",
    "RECQL4,NT5DC1,MAD2L1,LMNA,PTMS\n",
    "RMI2,NUP160,MALAT1,MCM4,PTTG1\n",
    "RNF113A,OGT,MELK,MDC1,RAD21\n",
    "RNPC3,ORC3,MGAT2,MIS18BP1,RAN\n",
    "SEC62,OSGIN2,MID1,MKI67,RHEB\n",
    "SKP2,PHIP,MIS18BP1,MLLT4,RPL13A\n",
    "SLBP,PHTF1,MND1,MZT1,SLC39A10\n",
    "SLC25A36,PHTF2,NCAPD3,NCAPD2,SNUPN\n",
    "SNHG10,PKMYT1,NCAPH,NCOA5,SRSF3\n",
    "SRSF7,POLA1,NCOA5,NEK2,STAG1\n",
    "SSR3,PRIM1,NDC80,NUF2,SYNCRIP\n",
    "TAF15,PTAR1,NEIL3,NUP35,TAF9\n",
    "TIPIN,RAD18,NFIC,NUP98,TCERG1\n",
    "TOPBP1,RAD51,NIPBL,NUSAP1,TLE3\n",
    "TRA2A,RAD51AP1,NMB,ODF2,TMEM138\n",
    "TTC14,RBBP8,NR3C1,ORAOV1,TOB2\n",
    "UBR7,REEP1,NUCKS1,PBK,TOP1\n",
    "UHRF1,RFC2,NUMA1,PCF11,TROAP\n",
    "UNG,RHOBTB3,NUSAP1,PLK1,TSC22D1\n",
    "USP53,RMI1,PIF1,POC1A,TULP4\n",
    "VPS72,RPA2,PKNOX1,POM121,UBE2D3\n",
    "WDR76,RRM1,POLQ,PPP1R10,VANGL1\n",
    "ZMYND19,RRM2,PPP1R2,PRPSAP1,VCL\n",
    "ZNF367,RSRC2,PSMD11,PRR11,WIPF2\n",
    "ZRANB2,SAP30BP,PSRC1,PSMG3,WWC1\n",
    ",SLC38A2,RANGAP1,PTP4A1,YY1\n",
    ",SP1,RCCD1,PTPN9,ZBTB7A\n",
    ",SRSF5,RDH11,PWP1,ZCCHC10\n",
    ",SVIP,RNF141,QRICH1,ZNF24\n",
    ",TOP2A,SAP30,RAD51C,ZNF281\n",
    ",TTC31,SKA3,RANGAP1,ZNF593\n",
    ",TTLL7,SMC4,RBM8A,\n",
    ",TYMS,STAT1,RCAN1,\n",
    ",UBE2T,STIL,RERE,\n",
    ",UBL3,STK17B,RNF126,\n",
    ",USP1,SUCLG2,RNF141,\n",
    ",ZBED5,TFAP2A,RNPS1,\n",
    ",ZWINT,TIMP1,RRP1,\n",
    ",,TMEM99,SEPHS1,\n",
    ",,TMPO,SETD8,\n",
    ",,TNPO2,SFPQ,\n",
    ",,TOP2A,SGOL2,\n",
    ",,TRAIP,SHCBP1,\n",
    ",,TRIM59,SMARCB1,\n",
    ",,TRMT2A,SMARCD1,\n",
    ",,TTF2,SPAG5,\n",
    ",,TUBA1A,SPTBN1,\n",
    ",,TUBB,SRF,\n",
    ",,TUBB2A,SRSF3,\n",
    ",,TUBB4B,SS18,\n",
    ",,TUBD1,SUV420H1,\n",
    ",,UACA,TACC3,\n",
    ",,UBE2C,THRAP3,\n",
    ",,VPS25,TLE3,\n",
    ",,VTA1,TMEM138,\n",
    ",,WSB1,TNPO1,\n",
    ",,ZNF587,TOMM34,\n",
    ",,ZNHIT2,TPX2,\n",
    ",,,TRIP13,\n",
    ",,,TSG101,\n",
    ",,,TSN,\n",
    ",,,TTK,\n",
    ",,,TUBB4B,\n",
    ",,,TXNDC9,\n",
    ",,,TXNRD1,\n",
    ",,,UBE2D3,\n",
    ",,,USP13,\n",
    ",,,USP16,\n",
    ",,,VANGL1,\n",
    ",,,WIBG,\n",
    ",,,WSB1,\n",
    ",,,YWHAH,\n",
    ",,,ZC3HC1,\n",
    ",,,ZFX,\n",
    ",,,ZMYM1,\n",
    ",,,ZNF207,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(8,5)\n",
    "cc_genes = pd.read_csv('/tmp/Macosko_cell_cycle_genes.txt', sep=',')\n",
    "s_genes = cc_genes['S'].dropna()\n",
    "g2m_genes = cc_genes['G2.M'].dropna()\n",
    "s_genes = [gene.upper() for gene in s_genes]\n",
    "g2m_genes = [gene.upper() for gene in g2m_genes]\n",
    "s_genes_ens = adata.var_names[np.in1d(adata.var_names, s_genes)]\n",
    "g2m_genes_ens = adata.var_names[np.in1d(adata.var_names, g2m_genes)]\n",
    "sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes_ens, g2m_genes=g2m_genes_ens)\n",
    "sc.pl.umap(adata, color='S_score', use_raw=False, size=40 , palette=cluster_colors)\n",
    "sc.pl.umap(adata, color='G2M_score', use_raw=False, size=40, palette=cluster_colors)\n",
    "sc.pl.umap(adata, color='phase', use_raw=False, size=40, palette=cluster_colors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "### Trajectory analysis\n",
    "\n",
    "Clustering of single cell data cannot sufficiently explain the cellular heterogeneity as it assumes that the data is composed of distinct groups (e.g. discrete cell types or states). The biological processes that drive the development of the cells are continuous processes. Trajectory inference methods interpret single cell data as a snapshot of the continuous process and findss paths through cellular space that minimize transcriptional changes between neighbouring cells.\n",
    "\n",
    "#### Partition-based graph abstraction\n",
    "\n",
    "[Partition-based graph abstraction (PAGA)](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1663-x) is a method to reconcile clustering and pseudotemporal ordering. It can be applied to an entire dataset and does not assume that there are continuous trajectories connecting all clusters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.tl.paga(adata, groups='leiden')\n",
    "adata.uns.get('paga')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(8,5)\n",
    "sc.pl.paga(adata, color='leiden',cmap=cluster_colors)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(8,5)\n",
    "sc.pl.umap(adata, color='leiden', palette=cluster_colors,size=40)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.tl.draw_graph(adata, init_pos='paga')\n",
    "sc.pl.draw_graph(adata, color='leiden',palette=cluster_colors,size=40)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.paga_compare(adata, basis='umap')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plotting same visualization on the UMAP layout"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams['figure.figsize']=(8,5)\n",
    "fig1, ax1 = plt.subplots()\n",
    "sc.pl.umap(adata, size=40, ax=ax1,show=False, palette=cluster_colors)\n",
    "sc.pl.paga(adata, pos=adata.uns['paga']['pos'],show=False,  node_size_scale=1, node_size_power=1, ax=ax1, text_kwds={'alpha':0.75})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "## Explore cells in UCSC Cell Browser\n",
    "\n",
    "The UCSC Cell Browser is an interactive browser for single cell data. You can visualize the PCA, t-SNA and UMAP plot of these cells using it. For more details, please check [Cellbrowser docs](https://cellbrowser.readthedocs.io/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.obsm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.external.exporting.cellbrowser(\n",
    "    adata,\n",
    "    data_name='PBMC-3K',\n",
    "    annot_keys=['leiden', 'percent_mito', 'n_genes', 'n_counts','predicted_doublet','doublet_score','pheno_leiden'],\n",
    "    data_dir='/tmp/ucsc-data',\n",
    "    cluster_field='leiden')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are using Binder for running this tutorial, please run next two cells (after removing the `#` from the `#!cbBuild -i...`) to access the UCSC Cell Browser."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "if 'BINDER_LAUNCH_HOST' in os.environ:\n",
    "  url = '<a href=\"{0}user/{1}/proxy/8080/\">Cellbrowser UI</a>'.\\\n",
    "          format(\n",
    "            os.environ.get('JUPYTERHUB_BASE_URL'),\n",
    "            os.environ.get('JUPYTERHUB_CLIENT_ID').replace('jupyterhub-user-','')\n",
    "          )\n",
    "else:\n",
    "  url = '<a href=\"http://localhost:8080/\">Cellbrowser UI</a>'\n",
    "\n",
    "HTML(url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!cbBuild -i /tmp/ucsc-data/cellbrowser.conf -o /tmp/ucsc-tmp -p 8080 2> /dev/null"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When you are done, feel free to stop the above cell by clicking on the stop button from the tab menu."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "## References\n",
    "* [Scanpy - Preprocessing and clustering 3k PBMCs](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html)\n",
    "* [single-cell-tutorial](https://github.com/theislab/single-cell-tutorial)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\"><a href=\"#Table-of-contents\">Go to TOC</a></div>\n",
    "\n",
    "## Acknowledgement\n",
    "The Imperial BRC Genomics Facility is supported by NIHR funding to the Imperial Biomedical Research Centre."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}