{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regress Out Tutorial\n",
    "\n",
    "Author: [Yiming Yang](https://github.com/yihming), [Rimte Rocher](https://github.com/rocherr)<br />\n",
    "Date: 2022-03-09<br />\n",
    "Notebook Source: [regress_out.ipynb](https://raw.githubusercontent.com/lilab-bcb/pegasus-tutorials/main/notebooks/regress_out.ipynb)\n",
    "\n",
    "This tutorial shows how to regress out cell cycle using Pegasus. To benchmark the work analogous to [Seurat's tutorial](https://satijalab.org/seurat/v3.2/cell_cycle_vignette.html) and [SCANPY's tutorial](https://nbviewer.jupyter.org/github/theislab/scanpy_usage/blob/master/180209_cell_cycle/cell_cycle.ipynb), we use the same dataset of murine hematopoietic progenitors from [[Nestorowa et al., Blood 2016]](https://doi.org/10.1182/blood-2016-05-716480), which can be downloaded [here](https://www.dropbox.com/s/3dby3bjsaf5arrw/cell_cycle_vignette_files.zip?dl=1)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cell-Cycle Scores\n",
    "\n",
    "First, load the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pegasus as pg\n",
    "\n",
    "data = pg.read_input(\"nestorawa_forcellcycle_expressionMatrix.txt\")\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to do some preprocessing work before calculating the cell-cycle scores:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.qc_metrics(data, min_genes=0, max_genes=1e5)\n",
    "pg.filter_data(data)\n",
    "pg.identify_robust_genes(data)\n",
    "pg.log_norm(data, norm_count=1e4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To be consistent with Seurat's tutorial, the filtration threshold on number of genes is set to be large enough to avoid removing any cell barcode, and a normalization of TP10k (transcripts per 10k) is performed.\n",
    "\n",
    "Now calculate cell-cycle scores using `calc_signature_score` (see [here](https://pegasus.readthedocs.io/en/stable/api/pegasus.calc_signature_score.html) for its documentation):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.calc_signature_score(data, 'cell_cycle_human')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By setting `cell_cycle_human`, Pegasus uses the cell cycle genes defined in [[Tirosh et al. 2015]](https://science.sciencemag.org/content/352/6282/189) for calculation. We can fetch these genes from Pegasus pre-stored file (https://raw.githubusercontent.com/klarman-cell-observatory/pegasus/master/pegasus/data_files/cell_cycle_human.gmt), as we'll need them in the following sections:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cell_cycle_genes = []\n",
    "with open(\"cell_cycle_human.gmt\", 'r') as f:\n",
    "    for line in f:\n",
    "        cell_cycle_genes += line.strip().split('\\t')[2:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Moreover, the predicted phases of cells are stored in `data.obs` field with key `predicted_phase`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.obs['predicted_phase'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And the corresponding scores are in `data.obs['G1/S']` and `data.obs['G2/M']`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cell Cycle Effects\n",
    "\n",
    "The presence of cell cycle effects can be shown in terms of Principal Components (PC), which are calculated from gene-count matrix with respect to cell cycle genes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_cc_genes = data[:, cell_cycle_genes].copy()\n",
    "pg.pca(data_cc_genes)\n",
    "data.obsm['X_pca'] = data_cc_genes.obsm['X_pca']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After achieving PCA result, we can generate a scatter plot of first two PCs with predicted phase being the colors for cells:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.scatter(data, attrs='predicted_phase', basis='pca', dpi=130)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this plot, we can see that cells are quite separated due to their phases when cell cycle genes are considered."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regress Out Cell Cycle Effects\n",
    "\n",
    "This section shows how to regress out cell cycle effects.\n",
    "\n",
    "Pegasus performs this regressing out on PCs. After getting the original PCA matrix, it uses the cell-cycle scores for this operation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pca_key = pg.regress_out(data, attrs=['G1/S', 'G2/M'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When `regress_out` is finished, the resulting new PCA embedding matrix is stored in `data.obsm['X_pca_regressed']`, with the representation key `'pca_regressed'` returned to variable `pca_key`.\n",
    "\n",
    "Now let's see how PC scatter plot looks on the regressed-out components:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.scatter(data, attrs=['predicted_phase'], basis=pca_key, dpi=130)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that after regressing out cell-cycle scores, no evident cell-cycle effects are observed in the first 2 PCs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Alternate Workflow\n",
    "\n",
    "The alternative in this section simply mimics Seurat's way of regressing out the difference between the G2/M and G1/S phase scores. As stated in Seurat tutorial, in this way, we'll have differences in cell cycle phases regressed out, while the differences between non-cycling cells and cycling cells are maintained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_alt = data.copy()\n",
    "\n",
    "data_alt.obs['CC_diff'] = data_alt.obs['G1/S'] - data_alt.obs['G2/M']\n",
    "pca_key = pg.regress_out(data_alt, attrs=['CC_diff'])\n",
    "pg.scatter(data_alt, attrs=['predicted_phase'], basis=pca_key, dpi=130)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "You can use the immediate result of `regress_out` function (i.e. `data.obsm['X_pca_regressed']`) for the downstream analysis (e.g. `pg.neighbors(data, rep='pca_regressed'); pg.louvain(data, rep='pca_regressed')`). Just remember to set the `basis` argument of the Pegasus functions whenever applicable.\n",
    "\n",
    "The major difference between Pegasus' method and conventional method is that Pegasus performs regressing out on PCs, rather than all the features. For example, in this case, Pegasus only considers $50$ PCs, instead of all the $24,158$ features. Therefore, Pegasus method can be more efficient on large-scale datasets which contains many more cell barcodes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Matching with Seurat and SCANPY Tutorials\n",
    "\n",
    "In sections above, we only show the scatter plots of first two of the regressed-out components. In order to match with the corresponding tutorials in Seurat and SCANPY, we further perform PCA on these components, and show the PC scatter plots in the newly computed PCs.\n",
    "\n",
    "First is the result after regressing out the cell cycle effects based on G2/M and G1/S phase scores:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "X = data.obsm['X_' + pca_key]\n",
    "pca = PCA(n_components=X.shape[1], random_state=0, svd_solver='full')\n",
    "X_pca_new = pca.fit_transform(X)\n",
    "data.obsm['X_pca_new'] = np.ascontiguousarray(X_pca_new)\n",
    "\n",
    "pg.scatter(data, attrs=['predicted_phase'], basis='pca_new', dpi=130)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So indeed, no evident cell-cycle effects are observed after regressing out.\n",
    "\n",
    "Then the result of the alternative workflow, i.e. regressing out the difference between the G2/M and G1/S phase scores:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = data_alt.obsm['X_' + pca_key]\n",
    "pca = PCA(n_components=X.shape[1], random_state=0, svd_solver='full')\n",
    "X_pca_new = pca.fit_transform(X)\n",
    "data_alt.obsm['X_pca_new'] = np.ascontiguousarray(X_pca_new)\n",
    "\n",
    "pg.scatter(data_alt, attrs=['predicted_phase'], basis='pca_new', dpi=130)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the plot above, cells in G2/M and G1/S phases have a better mixture, while their difference from those in G0 phase is maintained."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}