{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pegasus Plotting Tutorial\n",
    "\n",
    "Author: [Hui Ma](https://github.com/huimalinda), [Yiming Yang](https://github.com/yihming), [Rimte Rocher](https://github.com/rocher)<br />\n",
    "Date: 2022-03-09<br />\n",
    "Notebook Source: [plotting_tutorial.ipynb](https://raw.githubusercontent.com/lilab-bcb/pegasus-tutorials/main/notebooks/plotting_tutorial.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pegasus as pg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this plotting tutorial, we provide an analysis result of gene-count matrix dataset on Human Bone Marrow with 8 donors. You can get the data from https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_result.zarr.zip, or use [gsutil](https://cloud.google.com/storage/docs/gsutil) to download via its Google bucket URL (gs://terra-featured-workspace/Cumulus/MantonBM_result.zarr.zip):"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After downloading, load the file using Pegasus `read_input` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pg.read_input(\"MantonBM_result.zarr.zip\")\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the following sections, we'll cover Pegasus plotting functions using this dataset. Moreover, for gene plots, the canonical gene markers below will be used:\n",
    "\n",
    "* **B cells and Plasma cells**: CD38, JCHAIN, CD79A, MS4A1.\n",
    "* **T cells**: TRAC, CD3D, CCR7.\n",
    "* **Cytotoxic T cells**: CD8A, CD8B.\n",
    "* **NK cells**: NKG7.\n",
    "* **Monocytes**: CD14, FCGR3A.\n",
    "* **Erythroid cells**: GYPA.\n",
    "* **Hematopoietic Stem cells**: CD34, SELL.\n",
    "* **Dendritic cells**: HLA-DPA1, CD4."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "marker_genes = ['CD38', 'JCHAIN', 'FCGR3A', 'HLA-DPA1', 'CD14', 'CD79A', 'MS4A1', 'CD34', 'TRAC', 'CD3D', 'CD8A',\n",
    "                'CD8B', 'GYPA', 'NKG7', 'CD4', 'SELL', 'CCR7']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sections\n",
    "-  [QC Violin Plot](#qcviolin_plot)\n",
    "-  [Highly Variable Feature Plot](#hvf_plot)\n",
    "-  [Composition Plot](#composition_plot)\n",
    "-  [Scatter Plot](#scatter_plot)\n",
    "-  [Violin Plot](#violin_plot)\n",
    "-  [Heat Map](#heat_map)\n",
    "-  [Dot Plot](#dot_plot)\n",
    "-  [Dendrogram Plot](#dendrogram_plot)\n",
    "-  [Volcano Plot](#volcano_plot)\n",
    "-  [More Reference](#ref)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## QC Violin Plot\n",
    "<a id='qcviolin_plot'></a>\n",
    "The first step in preprocessing is to perform the quality control analysis, and remove cells and genes of low quality.\n",
    "\n",
    "<b>pg.qcviolin</b> shows the effect of quality control more intuitively by presenting the violin plot of cell distribution before and after filtration. \n",
    "\n",
    "<b>plot_type</b>=<font color='red'>'gene'</font> shows the number of expressed cells before and after filtration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.qcviolin(data, plot_type='gene', dpi=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Quality control stats on number of percentage of <b>mitochondrial genes</b>:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.qcviolin(data, plot_type='mito', dpi=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of UMIs before and after filtration is also an important aspect of quality control."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.qcviolin(data, plot_type='count', dpi=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Highly Variable Feature Plot\n",
    "<a id='hvf_plot'></a>\n",
    "\n",
    "<b>Highly Variable Genes (HVG)</b> are more likely to convey information discriminating different cell types and states. Thus, rather than considering all genes, people usually focus on selected HVGs for downstream analyses.\n",
    "\n",
    "Pegasus provides `hvfplot` function to generate a scatterplot of genes upon HVG selection. This plot only works for Pegasus-flavor HVGs (i.e. `flavor='pegasus'` in Pegasus `highly_variable_features` function).\n",
    "\n",
    "After selecting 2000 HVGs using the Pegasus selection method, the plot below is generated. Each point stands for one gene. Blue points are selected to be <b>HVG</b>s, which account for the majority of variation of the dataset. By default, it prints labels of 20 top HVGs. You can change this number in `top_n` parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.hvfplot(data, dpi=200)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Composition Plot\n",
    "<a id='composition_plot'></a>\n",
    "\n",
    "Composition plot is a bar plot showing the cell compositions (under different conditions) in each cluster. Below is to show the composition of different samples in each Louvain cluster:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = pg.compo_plot(data, 'louvain_labels', 'Channel', style = 'frequency')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Composition plot is useful to fast assess library quality and batch effects."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scatter Plot\n",
    "<a id='scatter_plot'></a>\n",
    "Scatter plot requires at least 2 parameters\n",
    "- data  – Gene-count matrix to show.\n",
    "- basis – Cell embedding to show. Can be either 'umap', 'tsne', ‘fle’, ‘net_umap’ or ‘net_fle’.\n",
    "\n",
    "For this demonstration, we select <b>annotation</b> and <b>channel</b> as data attributes, and <b>tsne</b> as basis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cell Visualization\n",
    "\n",
    "To visualize the data, users need to first run visualization functions to calcualte the corresponding cell embeddings. See [here](https://pegasus.readthedocs.io/en/stable/api/index.html#visualization-algorithms) for all visualization functions in Pegasus.\n",
    "\n",
    "**tSNE Plot.** Pegasus uses FIt-SNE algorithm for tSNE plot, which works better on large-scale datasets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.scatter(data, attrs=['anno','Channel'], basis='tsne')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can show both cluster-specific cell types (`anno`) and samples (`Channel`) in one function.\n",
    "\n",
    "Note here, the legend of left plot is not fully dislplayed. We need to adjust the horizon distance between two plots by changing  <b>\"wspace\"</b>\n",
    "\n",
    "<b>\"wspace\"</b> is the ratio of horizon distance to width of plots. For example ,the default value of wspace is 0.4, which means the gap is 0.4 times the width of plot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.scatter(data, attrs=['anno','Channel'], basis='tsne', wspace=1.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are also other optional parameters. For example, the commanly used ones\n",
    "- alpha (float or List[float], optional, default: `1.0`) – Alpha value for blending, from 0.0 (transparent) to 1.0 (opaque). If this is a list, the length must match attrs, which means we set a separate alpha value for each attribute.\n",
    "- dpi (float, optional, default: `300.0`) – The resolution of the figure in dots-per-inch.\n",
    "\n",
    "For the plots below, Left one has <b>alpha=1.0</b>, while right one has <b>alpha=0.1</b>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.scatter(data, attrs=['Channel','Channel'], basis='tsne',alpha=[1.0,0.1], wspace=0.6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plots with lower dpi are smaller and have lower resolution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.scatter(data, attrs=['Channel'], basis='tsne',dpi=50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**UMAP Plot.** Pegasus also provides 2 different kinds of UMAP plots: `umap`, or `net_umap`.\n",
    "\n",
    "The embedding methods can be changed by changing <b>basis</b>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.scatter(data,attrs=['anno', 'Channel'],basis='umap', wspace=1.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Cell Developmental Trajectory.** Pegasus provides Force-directed Layout (FLE) based plots to show pseudo-trajectories of cell development. To show it, you need to first run `diffmap`, then FLE-like visualization, `fle` or `net_fle`, and finally plot using `scatter` with the corresponding basis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.scatter(data,attrs='anno', basis='fle')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feature Plot\n",
    "\n",
    "Besides plotting cell attributes in `data.obs`, you can also specify gene names in `attrs` parameter of `scatter` function to generate Feature Plots, which is useful for inspecting specific genes in details.\n",
    "\n",
    "Below are feature plots of 4 genes, together with UMAP plot of cell type annotations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.scatter(data, attrs=['TRAC', 'CD79A', 'CD14', 'CD34', 'anno'], basis='umap', ncols=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the plots show, TRAC is highly expressed in T-cell clusters, CD79A in B-cell and Plasma-cell clusters, CD14 in CD14+ Monocytes, and CD34 in the Stem-cell cluster."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scatter Plot against Categories\n",
    "<a id='scatter_groups'></a>\n",
    "By using `scatter_groups` function, user can generate a set of scatter plots against a categorical cell attribute, for example `Channel`, one plot regarding a category.\n",
    "\n",
    "Below is an example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.scatter_groups(data, attr='louvain_labels', groupby='Channel', basis='tsne', \n",
    "                  nrows = 3, ncols = 3, legend_fontsize=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are 8 categories in `Channel` attribute, since the dataset contains 8 samples. As a result, 9 FIt-SNE plots are generated: first one shows cells from all samples, followed by 8 plots, one of which only shows cells from one sample.\n",
    "\n",
    "You can see that all the 8 sample-specific plots have similar cell distribution regarding clusters. This is due to the batch correction on the original data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Violin Plot\n",
    "<a id='violin_plot'></a>\n",
    "The violin plot gives us an idea of the distribution of gene expression values across clusters. The violin plot below shows the expression of the marker gene set we use in this tutorial against Louvain clusters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.violin(data, attrs=marker_genes, groupby='anno', dpi=80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that all the genes' violin plots are stacked together. And you can check that their expression levels across clusters are pretty consistent with our knowledge.\n",
    "\n",
    "Pegasus violin plot can also be split regarding a categorical cell attributes with 2 categories. For example, below is a plot showing expression of gene **XIST** and **RPS4Y1** with cells split into male and female donors:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.violin(data, attrs=['XIST', 'RPS4Y1'], groupby='anno', hue='gender', dpi=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's easy to see that in the plot above, *XIST* is highly expressed in female samples across all clusters, while *RPS4Y1* is highly expressed in male samples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Heat Map\n",
    "<a id='heat_map'></a>\n",
    "Besides Violin Plot, Pegasus also provides Heat Map.\n",
    "\n",
    "It is also useful to add a dendrogram to the graph to bring together similar clusters or genes, based on the expression of selected genes.\n",
    "\n",
    "By default, `heatmap` shows a Heat Map based on average expression of selected genes within each cluster, and cluster similar genes altogether. You can set `attrs_cluster=True` to merge similar clusters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.heatmap(data, attrs=marker_genes, attrs_cluster=True, groupby='anno', dpi=80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>switch_axes=<font color='green'>True</font></b> can switch x-axis (genes) and y-axis (clusters):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.heatmap(data, attrs=marker_genes, groupby='anno', groupby_cluster=True, switch_axes=True, dpi=80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition, you can also show Heat Map based on expression of individual cell by setting <b>on_average=<font color='green'>False</font></b>. Notice that this can take some time for a large dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.heatmap(data, attrs=marker_genes, groupby='anno', on_average=False, \n",
    "           attrs_cluster=True, groupby_cluster=False, groupby_dendrogram=False, dpi=80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By comparing with Violin plot, you can see the Heat maps are consistent with it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dot Plot\n",
    "<a id='dot_plot'></a>\n",
    "Dot plot is an alternative to Heat map. Besides the mean expression of a gene within a cluster (dot color), it also conveys the fraction of cells within the cluster expressing that gene (dot size):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.dotplot(data, genes=marker_genes, groupby='anno')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly as `heatmap`, <b>switch_axes=<font color='green'>True</font></b> switches the two axes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.dotplot(data, genes=marker_genes, groupby='anno',switch_axes=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dendrogram\n",
    "<a id='dendrogram_plot'></a>\n",
    "Pegasus provides dendrograms based on hierarchical clustering. There are 2 ways for clustering: one is regarding the expression of a selected set of genes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.dendrogram(data, genes=marker_genes, groupby='anno', dpi=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another way is to use a cell embedding for clustering. As this dataset is corrected using Harmony algorithm, with the corrected PCA embedding stored at `data.obsm['X_pca_harmony']`, we can simply set `rep` parameter in `dendrogram` to `'pca_harmony'` to use this embedding for clustering:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.dendrogram(data, rep='pca_harmony', groupby='anno', dpi=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have a different dendrogram."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Volcano Plot\n",
    "<a id='volcano_plot'></a>\n",
    "<b>Differential Expression (DE) Analysis</b> is to discover cluster-specific marker genes. For each cluster, it compares cells within the cluster with all the others, then finds genes significantly highly expressed (up-regulated) and lowly expressed (down-regulated) for the cluster.\n",
    "\n",
    "After running `de_analysis` function in Pegasus, we can use Volcano plot to see the DE result. Here we use Louvain cluster 1 for illustration. You can see that this cluster is annotated as \"Naive T cell\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.obs.loc[data.obs['louvain_labels']=='1', 'anno'].astype('str').value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is its Volcano plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.volcano(data, cluster_id = '1', dpi=200)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The plot uses the default thresholds: Log fold change at $\\pm 1.0$, and Q-value at $0.05$. Each point stands for a gene. Red ones are significant marker genes: those at right-hand side are up-regulated genes for Cluster 1, while those at left-hand side are down-regulated genes.\n",
    "\n",
    "In specific, **TRAC** gene, a marker gene of T cells, is the second to the rightmost up-regulated gene for Cluster 1. This is consistent with its cell type annotation.\n",
    "\n",
    "The clusters used in Volcano plot depend on how you run `de_analysis` function. For this dataset, we performed DE analysis on Louvain clusters, as a result, we have to use their labels in Volcano plot.\n",
    "\n",
    "By default, `volcano` uses t test results. User can set `de_test` parameter to either `'fisher'` or `'mwu'` to use other test result, as long as those tests have been performed in `de_analysis` ahead."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read more ...\n",
    "<a id='ref'></a>\n",
    "See [Pegasus_Tutorial](https://pegasus-tutorials.readthedocs.io/en/latest/_static/tutorials/pegasus_analysis.html) for tutorial on running downstream analysis using Pegasus.\n",
    "\n",
    "See [Plotting_Documentation](https://pegasus.readthedocs.io/en/stable/api/index.html#plotting) for details on parameters of Pegasus plotting functions."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}