{
  "cells": [
    {
      "cell_type": "raw",
      "metadata": {},
      "source": [
        "---\n",
        "description: Assignment of cell identities based on gene expression\n",
        "  patterns using reference data.\n",
        "subtitle:  Scanpy Toolkit\n",
        "title:  Celltype prediction\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Note**\n",
        ">\n",
        "> Code chunks run Python commands unless it starts with `%%bash`, in\n",
        "> which case, those chunks run shell commands.\n",
        "\n",
        "</div>\n",
        "\n",
        "Celltype prediction can either be performed on indiviudal cells where\n",
        "each cell gets a predicted celltype label, or on the level of clusters.\n",
        "All methods are based on similarity to other datasets, single cell or\n",
        "sorted bulk RNAseq, or uses known marker genes for each cell type.\\\n",
        "We will select one sample from the Covid data, `ctrl_13` and predict\n",
        "celltype by cell on that sample.\\\n",
        "Some methods will predict a celltype to each cell based on what it is\n",
        "most similar to, even if that celltype is not included in the reference.\n",
        "Other methods include an uncertainty so that cells with low similarity\n",
        "scores will be unclassified.\\\n",
        "There are multiple different methods to predict celltypes, here we will\n",
        "just cover a few of those.\n",
        "\n",
        "Here we will use a reference PBMC dataset that we get from scanpy\n",
        "datasets and classify celltypes based on two methods:\n",
        "\n",
        "-   Using scanorama for integration just as in the integration lab, and\n",
        "    then do label transfer based on closest neighbors.\n",
        "-   Using ingest to project the data onto the reference data and\n",
        "    transfer labels.\n",
        "\n",
        "First, lets load required libraries"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "import scanpy as sc\n",
        "import matplotlib.pyplot as plt\n",
        "import warnings\n",
        "import os\n",
        "import urllib.request\n",
        "\n",
        "warnings.simplefilter(action=\"ignore\", category=Warning)\n",
        "\n",
        "# verbosity: errors (0), warnings (1), info (2), hints (3)\n",
        "sc.settings.verbosity = 2\n",
        "sc.settings.set_figure_params(dpi=80)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's read in the saved Covid-19 data object from the clustering step."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# download pre-computed data if missing or long compute\n",
        "fetch_data = True\n",
        "\n",
        "# url for source and intermediate data\n",
        "path_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n",
        "\n",
        "path_results = \"data/covid/results\"\n",
        "if not os.path.exists(path_results):\n",
        "    os.makedirs(path_results, exist_ok=True)\n",
        "\n",
        "# path_file = \"data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad\"\n",
        "path_file = \"data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad\"\n",
        "if fetch_data and not os.path.exists(path_file):\n",
        "    urllib.request.urlretrieve(os.path.join(\n",
        "        path_data, 'covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad'), path_file)\n",
        "\n",
        "adata = sc.read_h5ad(path_file)\n",
        "adata"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "adata.uns['log1p']['base']=None\n",
        "print(adata.shape)\n",
        "print(adata.raw.shape)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Subset one patient."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "adata = adata[adata.obs[\"sample\"] == \"ctrl_13\",:]\n",
        "print(adata.shape)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "adata.obs[\"louvain_0.6\"].value_counts()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see, we have only one cell from cluster 10 in this sample, so\n",
        "lets remove that cell for now."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "adata = adata[adata.obs[\"louvain_0.6\"] != \"10\",:]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pl.umap(\n",
        "    adata, color=[\"louvain_0.6\"], palette=sc.pl.palettes.default_20\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Reference data\n",
        "\n",
        "Load the reference data from `scanpy.datasets`. It is the annotated and\n",
        "processed pbmc3k dataset from 10x."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "adata_ref = sc.datasets.pbmc3k_processed() \n",
        "\n",
        "adata_ref.obs['sample']='pbmc3k'\n",
        "\n",
        "print(adata_ref.shape)\n",
        "adata_ref.obs"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pl.umap(adata_ref, color='louvain')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Make sure we have the same genes in both datset by taking the\n",
        "intersection"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "print(adata_ref.shape[1])\n",
        "print(adata.shape[1])\n",
        "var_names = adata_ref.var_names.intersection(adata.var_names)\n",
        "print(len(var_names))\n",
        "\n",
        "adata_ref = adata_ref[:, var_names]\n",
        "adata = adata[:, var_names]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First we need to rerun pca and umap with the same gene set for both\n",
        "datasets."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pp.pca(adata_ref)\n",
        "sc.pp.neighbors(adata_ref)\n",
        "sc.tl.umap(adata_ref)\n",
        "sc.pl.umap(adata_ref, color='louvain')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pp.pca(adata)\n",
        "sc.pp.neighbors(adata)\n",
        "sc.tl.umap(adata)\n",
        "sc.pl.umap(adata, color='louvain_0.6')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Integrate with scanorama"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import scanorama\n",
        "\n",
        "#subset the individual dataset to the same variable genes as in MNN-correct.\n",
        "alldata = dict()\n",
        "alldata['ctrl']=adata\n",
        "alldata['ref']=adata_ref\n",
        "\n",
        "#convert to list of AnnData objects\n",
        "adatas = list(alldata.values())\n",
        "\n",
        "# run scanorama.integrate\n",
        "scanorama.integrate_scanpy(adatas, dimred = 50)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# add in sample info\n",
        "adata_ref.obs['sample']='pbmc3k'\n",
        "\n",
        "# create a merged scanpy object and add in the scanorama \n",
        "adata_merged = alldata['ctrl'].concatenate(alldata['ref'], batch_key='sample', batch_categories=['ctrl','pbmc3k'])\n",
        "\n",
        "embedding = np.concatenate([ad.obsm['X_scanorama'] for ad in adatas], axis=0)\n",
        "adata_merged.obsm['Scanorama'] = embedding"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#run  umap.\n",
        "sc.pp.neighbors(adata_merged, n_pcs =50, use_rep = \"Scanorama\")\n",
        "sc.tl.umap(adata_merged)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pl.umap(adata_merged, color=[\"sample\",\"louvain\"])"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Label transfer\n",
        "\n",
        "Using the function in the Spatial tutorial at the scanpy website we will\n",
        "calculate normalized cosine distances between the two datasets and\n",
        "tranfer labels to the celltype with the highest scores."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "from sklearn.metrics.pairwise import cosine_distances\n",
        "\n",
        "distances = 1 - cosine_distances(\n",
        "    adata_merged[adata_merged.obs['sample'] == \"pbmc3k\"].obsm[\"Scanorama\"],\n",
        "    adata_merged[adata_merged.obs['sample'] == \"ctrl\"].obsm[\"Scanorama\"],\n",
        ")\n",
        "\n",
        "def label_transfer(dist, labels, index):\n",
        "    lab = pd.get_dummies(labels)\n",
        "    class_prob = lab.to_numpy().T @ dist\n",
        "    norm = np.linalg.norm(class_prob, 2, axis=0)\n",
        "    class_prob = class_prob / norm\n",
        "    class_prob = (class_prob.T - class_prob.min(1)) / class_prob.ptp(1)\n",
        "    # convert to df\n",
        "    cp_df = pd.DataFrame(\n",
        "        class_prob, columns=lab.columns\n",
        "    )\n",
        "    cp_df.index = index\n",
        "    # classify as max score\n",
        "    m = cp_df.idxmax(axis=1)\n",
        "    \n",
        "    return m\n",
        "\n",
        "class_def = label_transfer(distances, adata_ref.obs.louvain, adata.obs.index)\n",
        "\n",
        "# add to obs section of the original object\n",
        "adata.obs['predicted'] = class_def\n",
        "\n",
        "sc.pl.umap(adata, color=\"predicted\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# add to merged object.\n",
        "adata_merged.obs[\"predicted\"] = pd.concat(\n",
        "    [class_def, adata_ref.obs[\"louvain\"]], axis=0\n",
        ").tolist()\n",
        "\n",
        "sc.pl.umap(adata_merged, color=[\"sample\",\"louvain\",'predicted'])\n",
        "#plot only ctrl cells.\n",
        "sc.pl.umap(adata_merged[adata_merged.obs['sample']=='ctrl'], color='predicted')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now plot how many cells of each celltypes can be found in each cluster."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "tmp = pd.crosstab(adata.obs['louvain_0.6'],adata.obs['predicted'], normalize='index')\n",
        "tmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.8, 1),loc='upper right')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Ingest\n",
        "\n",
        "Another method for celltype prediction is Ingest, for more information,\n",
        "please look at\n",
        "https://scanpy-tutorials.readthedocs.io/en/latest/integrating-data-using-ingest.html"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.tl.ingest(adata, adata_ref, obs='louvain')\n",
        "sc.pl.umap(adata, color=['louvain','louvain_0.6'], wspace=0.5)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now plot how many cells of each celltypes can be found in each cluster."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "tmp = pd.crosstab(adata.obs['louvain_0.6'],adata.obs['louvain'], normalize='index')\n",
        "tmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.8, 1),loc='upper right')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Compare results\n",
        "\n",
        "The predictions from ingest is stored in the column 'louvain' while we\n",
        "named the label transfer with scanorama as 'predicted'"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pl.umap(adata, color=['louvain','predicted'], wspace=0.5)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see, the main celltypes are the same, but dendritic cells are\n",
        "mainly predicted to cluster 8 by ingest and the proportions of the\n",
        "different celltypes are different.\n",
        "\n",
        "The only way to make sure which method you trust is to look at what\n",
        "genes the different celltypes express and use your biological knowledge\n",
        "to make decisions.\n",
        "\n",
        "## Gene set analysis\n",
        "\n",
        "Another way of predicting celltypes is to use the differentially\n",
        "expressed genes per cluster and compare to lists of known cell marker\n",
        "genes. This requires a list of genes that you trust and that is relevant\n",
        "for the tissue you are working on.\n",
        "\n",
        "You can either run it with a marker list from the ontology or a list of\n",
        "your choice as in the example below."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "path_file = 'data/human_cell_markers.txt'\n",
        "if not os.path.exists(path_file):\n",
        "    urllib.request.urlretrieve(os.path.join(\n",
        "        path_data, 'human_cell_markers.txt'), path_file)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "df = pd.read_table(path_file)\n",
        "df\n",
        "\n",
        "print(df.shape)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Filter for number of genes per celltype\n",
        "df['nG'] = df.geneSymbol.str.split(\",\").str.len()\n",
        "\n",
        "df = df[df['nG'] > 5]\n",
        "df = df[df['nG'] < 100]\n",
        "d = df[df['cancerType'] == \"Normal\"]\n",
        "print(df.shape)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "df.index = df.cellName\n",
        "gene_dict = df.geneSymbol.str.split(\",\").to_dict()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# run differential expression per cluster\n",
        "sc.tl.rank_genes_groups(adata, 'louvain_0.6', method='wilcoxon', key_added = \"wilcoxon\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# do gene set overlap to the groups in the gene list and top 300 DEGs.\n",
        "import gseapy\n",
        "\n",
        "gsea_res = dict()\n",
        "pred = dict()\n",
        "\n",
        "for cl in adata.obs['louvain_0.6'].cat.categories.tolist():\n",
        "    print(cl)\n",
        "    glist = sc.get.rank_genes_groups_df(adata, group=cl, key='wilcoxon')[\n",
        "        'names'].squeeze().str.strip().tolist()\n",
        "    enr_res = gseapy.enrichr(gene_list=glist[:300],\n",
        "                             organism='Human',\n",
        "                             gene_sets=gene_dict,\n",
        "                             background=adata.raw.shape[1],\n",
        "                             cutoff=1)\n",
        "    if enr_res.results.shape[0] == 0:\n",
        "        pred[cl] = \"Unass\"\n",
        "    else:\n",
        "        enr_res.results.sort_values(\n",
        "            by=\"P-value\", axis=0, ascending=True, inplace=True)\n",
        "        print(enr_res.results.head(2))\n",
        "        gsea_res[cl] = enr_res\n",
        "        pred[cl] = enr_res.results[\"Term\"][0]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# prediction per cluster\n",
        "pred"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "prediction = [pred[x] for x in adata.obs['louvain_0.6']]\n",
        "adata.obs[\"GS_overlap_pred\"] = prediction\n",
        "\n",
        "sc.pl.umap(adata, color='GS_overlap_pred')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Discuss**\n",
        ">\n",
        "> As you can see, it agrees to some extent with the predictions from\n",
        "> label transfer and ingest, but there are clear differences, which do\n",
        "> you think looks better?\n",
        "\n",
        "</div>\n",
        "\n",
        "## Session info\n",
        "\n",
        "```{=html}\n",
        "<details>\n",
        "```\n",
        "```{=html}\n",
        "<summary>\n",
        "```\n",
        "Click here\n",
        "```{=html}\n",
        "</summary>\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.logging.print_versions()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "```{=html}\n",
        "</details>\n",
        "```"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}