{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a6bde5eb",
   "metadata": {},
   "source": [
    "# Pegasus Pseudobulk Analysis\n",
    "\n",
    "Author: [Bo Li](https://github.com/bli25), [Yiming Yang](https://github.com/yihming)<br />\n",
    "Date: 2022-04-10 <br />\n",
    "Notebook Source: [pseudobulk.ipynb](https://raw.githubusercontent.com/lilab-bcb/pegasus-tutorials/main/notebooks/pseudobulk.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73510eeb",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pegasus as pg\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b4b6ddd",
   "metadata": {},
   "source": [
    "## Dataset\n",
    "\n",
    "In this tutorial, we'll use a gene-count matrix dataset on human bone marrow from 8 donors, create pseudobulk matrix regarding donors, and perform pseudobulk analysis on the data.\n",
    "\n",
    "The dataset is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip. You can also use [gsutil](https://cloud.google.com/storage/docs/gsutil) to download it via its Google bucket URL (gs://terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip).\n",
    "\n",
    "Now load the count matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fbe49dcb",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pg.read_input('MantonBM_nonmix_subset.zarr.zip')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ef8e9c0",
   "metadata": {},
   "source": [
    "## Generate pseudobulk matrix\n",
    "\n",
    "Pegasus provides `pseudobulk` function to generate pseudobulk matrix. The code below generate a pseudobulk count matrix regarding donors (`Channel`), and transfer the gender attribute to the resulting pseudobulks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ba9210f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "pseudo = pg.pseudobulk(data, 'Channel', 'gender')\n",
    "pseudo"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed8a8b16",
   "metadata": {},
   "source": [
    "For details on `pseudobulk` function, please see its [documentation](https://pegasus.readthedocs.io/en/stable/api/pegasus.pseudobulk.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93a287ad",
   "metadata": {},
   "source": [
    "## Differential Expression (DE) Analysis on Pseudobulk Matrix\n",
    "\n",
    "Pegasus has `deseq2` function to perform DE analysis on Pseudobulk data, which is a Python wrapper of [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) package in R (You need to first install the original R package). The code below analyzes based on a regression model considering the gender attribute, and estimates the contrast between `female` and `male`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "86cdfc84",
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.deseq2(pseudo, '~gender', ('gender', 'female', 'male'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c02f8d23",
   "metadata": {},
   "source": [
    "The DE result is a Pandas DataFrame object stored in `pseudo.varm[\"deseq2\"]`, which could be viewed as the following (ranked by log2 fold-change):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8dbd8cbf",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame(pseudo.varm[\"deseq2\"], index=pseudo.var_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "747e65dc",
   "metadata": {},
   "source": [
    "Notice that the fold-change calculated has `female` be the numerator, while `male` be the denominator.\n",
    "\n",
    "For details on `deseq2` function, please see its [documentation](https://pegasus.readthedocs.io/en/stable/api/pegasus.deseq2.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01e10459",
   "metadata": {},
   "source": [
    "### Get significant DE genes in human-readable format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a074fb23",
   "metadata": {},
   "outputs": [],
   "source": [
    "markers = pg.pseudo.markers(pseudo)\n",
    "print(markers)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb6b7c6d",
   "metadata": {},
   "source": [
    "### Write DE results to spreadsheet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c10148b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.pseudo.write_results_to_excel(markers, 'test.de.xlsx')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd875cde",
   "metadata": {},
   "source": [
    "### Generate volcano plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7641ac37",
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.pseudo.volcano(pseudo)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23e6be94",
   "metadata": {},
   "source": [
    "## Gene Set Enrichment Analysis (GSEA)\n",
    "\n",
    "Pegasus has `fgsea` function to perform GSEA analysis, which is a Python wrapper of [fgsea](http://bioconductor.org/packages/release/bioc/html/fgsea.html) package in R (You need to first install the original R package):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e62d4e47",
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.fgsea(pseudo, 'log2FoldChange', 'canonical_pathways', 'deseq2', fgsea_key = 'fgsea_deseq2')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d72f99f8",
   "metadata": {},
   "source": [
    "The code above runs GSEA analysis based on log2 fold-change values, and use one of the preset gene sets that Pegasus provides:\n",
    "* `canonical_pathways`: The [MsigDB](http://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) C2/CP gene set.\n",
    "* `hallmark`: The [MsigDB](http://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) H gene set.\n",
    "Notice that this function is applicable to DE results from both single-cell and pseudobulk data.\n",
    "\n",
    "For details on `fgsea` function, please see its [documentation](https://pegasus.readthedocs.io/en/stable/api/pegasus.fgsea.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29168c1b",
   "metadata": {},
   "source": [
    "### Generate GSEA plots\n",
    "\n",
    "Finally, generate the GSEA histograms by `plot_gsea` function. In general, there will be 2 panels:\n",
    "* Top panel shows the up-regulated pathways in red color.\n",
    "* Bottom panel shows the down-regulated pathways in green color."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90b62fa5",
   "metadata": {},
   "outputs": [],
   "source": [
    "pg.plot_gsea(pseudo, 'fgsea_deseq2')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c211c078",
   "metadata": {},
   "source": [
    "As shown in the plot above, for this pseudobulk data, there is no up-regulated (i.e. female significant) pathways, while some down-regulated (i.e. male significant) pathways are observed."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}