{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Doublet Detection Tutorial\n",
    "\n",
    "Author: [Hui Ma](https://github.com/huimalinda), [Yiming Yang](https://github.com/yihming), [Rimte Rocher](https://github.com/rocherr)<br>\n",
    "Date: 2022-03-09<br>\n",
    "Notebook Source: [doublet_detection.ipynb](https://raw.githubusercontent.com/lilab-bcb/pegasus-tutorials/main/notebooks/doublet_detection.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pegasus as pg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset\n",
    "In this tutorial, we'll use the output result of [Pegasus Tutorial](https://pegasus-tutorials.readthedocs.io/en/latest/_static/tutorials/pegasus_analysis.html) to demonstrate how to detect and remove doublet cells in Pegasus. The dataset consists of human bone marrow single cells from 8 donors.\n",
    "\n",
    "The dataset is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_result.zarr.zip. You can also use `gsutil` to download it via its Google bucket URL (gs://terra-featured-workspaces/Cumulus/MantonBM_result.zarr.zip).\n",
    "\n",
    "Now load the count matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pg.read_input(\"MantonBM_result.zarr.zip\")\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sections\n",
    "-  [Detect Doublets](#mark)\n",
    "-  [Remove Doublets and Recluster](#recluster)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Detect Doublets\n",
    "<a id='mark'></a>\n",
    "In this step, infer doublets per channel. Set <b>clust_attr = 'anno'</b> to see the doublet density in each cluster and infer doublet cluster.\n",
    "\n",
    "The method used for detecting doublets can be found [here](https://github.com/klarman-cell-observatory/pegasus/raw/master/doublet_detection.pdf)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.infer_doublets(data, channel_attr = 'Channel', clust_attr = 'anno') "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, plot annotation and Scrublet-like doublet score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.scatter(data,attrs=['anno','doublet_score'], basis='umap', wspace=1.2) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also want to see the doublet percentage of each cluster to decide if there is a doublet cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.uns['pred_dbl_cluster']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All clusters have doublet percentage under 5%, so no need to mark any doublet clusters here. If any cluster has doublet percentage more than $50\\%$, we can consider to mark it as doublet cluster. \n",
    "\n",
    "For example, If we want to mark 'CD14+ Monocyte' and 'CD14+ Monocyte-2' as doublet clusters, use the following code:\n",
    "\n",
    "`pg.mark_doublets(data, dbl_clusts = 'anno:CD14+ Monocyte,CD14+ Monocyte-2')`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `mark_doublets` function will mark doublet cluster (if any), and write singlet/doublet assignment to the \"demux_type\" column attribute in `data.obs`. The \"demux_type\" attribute is also used for singlet/doublet assignment of cell hashing, nucleus hashing and genetics pooling data (see [documentation](https://cumulus.readthedocs.io/en/latest/demultiplexing.html)).\n",
    "\n",
    "For this demonstration dataset, among $35,465$ cells, $724$ doublets detected. Doublet rate is $2.04\\%$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.mark_doublets(data)\n",
    "data.obs['demux_type'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Doublets distribution can be better observed in UMAP plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.scatter(data, attrs = ['anno', 'demux_type'], legend_loc = ['on data', 'right margin'], \n",
    "           wspace = 0.1,alpha = [1.0, 0.8], palettes = 'demux_type:gainsboro,red')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Remove Doublets and Recluster\n",
    "<a id='recluster'></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.qc_metrics(data, select_singlets=True)\n",
    "pg.filter_data(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Start the reclustering process from re-selecting highly variable genes. Batch effect is observed, so we also want to use harmony algorithm to correct bach effect for reclustering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.highly_variable_features(data, batch='Channel')\n",
    "pg.pca(data)\n",
    "pca_key = pg.run_harmony(data)\n",
    "pg.neighbors(data,rep=pca_key)\n",
    "pg.louvain(data,rep=pca_key)\n",
    "pg.umap(data,rep=pca_key)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Re-annotate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.de_analysis(data, cluster='louvain_labels')\n",
    "celltype_dict = pg.infer_cell_types(data, markers = 'human_immune',de_test='mwu',output_file='BM_celltype_re_dict.txt')\n",
    "cluster_names = pg.infer_cluster_names(celltype_dict)\n",
    "pg.annotate(data, name='anno', based_on='louvain_labels', anno_dict=cluster_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Umap of annotation after re-clustering:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.scatter(data,attrs='anno',legend_loc='on data',basis='umap')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}