{
  "cells": [
    {
      "cell_type": "raw",
      "metadata": {},
      "source": [
        "---\n",
        "description: Combining and harmonizing samples or datasets from\n",
        "  different batches such as experiments or conditions to enable\n",
        "  meaningful cross-sample comparisons.\n",
        "subtitle:  Scanpy Toolkit\n",
        "title:  Data Integration\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Note**\n",
        ">\n",
        "> Code chunks run Python commands unless it starts with `%%bash`, in\n",
        "> which case, those chunks run shell commands.\n",
        "\n",
        "</div>\n",
        "\n",
        "In this tutorial we will look at different ways of integrating multiple\n",
        "single cell RNA-seq datasets. We will explore a few different methods to\n",
        "correct for batch effects across datasets. Seurat uses the data\n",
        "integration method presented in Comprehensive Integration of Single Cell\n",
        "Data, while Scran and Scanpy use a mutual Nearest neighbour method\n",
        "(MNN). Below you can find a list of some methods for single data\n",
        "integration:\n",
        "\n",
        "  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
        "  Markdown          Language          Library           Ref\n",
        "  ----------------- ----------------- ----------------- -----------------------------------------------------------------------------------------------------------------------------------\n",
        "  CCA               R                 Seurat            [Cell](https://www.sciencedirect.com/science/article/pii/S0092867419305598?via%3Dihub)\n",
        "\n",
        "  MNN               R/Python          Scater/Scanpy     [Nat. Biotech.](https://www.nature.com/articles/nbt.4091)\n",
        "\n",
        "  Conos             R                 conos             [Nat.\n",
        "                                                        Methods](https://www.nature.com/articles/s41592-019-0466-z?error=cookies_not_supported&code=5680289b-6edb-40ad-9934-415dac4fdb2f)\n",
        "\n",
        "  Scanorama         Python            scanorama         [Nat. Biotech.](https://www.nature.com/articles/s41587-019-0113-3)\n",
        "  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
        "\n",
        "## Data preparation\n",
        "\n",
        "Let's first load necessary libraries and the data saved in the previous\n",
        "lab."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "import scanpy as sc\n",
        "import matplotlib.pyplot as plt\n",
        "import warnings\n",
        "import os\n",
        "import urllib.request\n",
        "\n",
        "warnings.simplefilter(action='ignore', category=Warning)\n",
        "\n",
        "# verbosity: errors (0), warnings (1), info (2), hints (3)\n",
        "sc.settings.verbosity = 3             \n",
        "\n",
        "sc.settings.set_figure_params(dpi=80)\n",
        "%matplotlib inline"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Create individual **adata** objects per batch."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# download pre-computed data if missing or long compute\n",
        "fetch_data = True\n",
        "\n",
        "# url for source and intermediate data\n",
        "path_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n",
        "\n",
        "path_results = \"data/covid/results\"\n",
        "if not os.path.exists(path_results):\n",
        "    os.makedirs(path_results, exist_ok=True)\n",
        "\n",
        "path_file = \"data/covid/results/scanpy_covid_qc_dr.h5ad\"\n",
        "if fetch_data and not os.path.exists(path_file):\n",
        "    urllib.request.urlretrieve(os.path.join(\n",
        "        path_data, 'covid/results/scanpy_covid_qc_dr.h5ad'), path_file)\n",
        "\n",
        "adata = sc.read_h5ad(path_file)\n",
        "adata"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "print(adata.X.shape)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As the stored AnnData object contains scaled data based on variable\n",
        "genes, we need to make a new object with the logtransformed normalized\n",
        "counts. The new variable gene selection should not be performed on the\n",
        "scaled data matrix."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "adata2 = adata.raw.to_adata() \n",
        "\n",
        "# in some versions of Anndata there is an issue with information on the logtransformation in the slot log1p.base so we set it to None to not get errors.\n",
        "adata2.uns['log1p']['base']=None\n",
        "\n",
        "# check that the matrix looks like normalized counts\n",
        "print(adata2.X[1:10,1:10])"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Detect variable genes\n",
        "\n",
        "Variable genes can be detected across the full dataset, but then we run\n",
        "the risk of getting many batch-specific genes that will drive a lot of\n",
        "the variation. Or we can select variable genes from each batch\n",
        "separately to get only celltype variation. In the dimensionality\n",
        "reduction exercise, we already selected variable genes, so they are\n",
        "already stored in `adata.var.highly_variable`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "var_genes_all = adata.var.highly_variable\n",
        "\n",
        "print(\"Highly variable genes: %d\"%sum(var_genes_all))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Detect variable genes in each dataset separately using the `batch_key`\n",
        "parameter."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pp.highly_variable_genes(adata2, min_mean=0.0125, max_mean=3, min_disp=0.5, batch_key = 'sample')\n",
        "\n",
        "print(\"Highly variable genes intersection: %d\"%sum(adata2.var.highly_variable_intersection))\n",
        "\n",
        "print(\"Number of batches where gene is variable:\")\n",
        "print(adata2.var.highly_variable_nbatches.value_counts())\n",
        "\n",
        "var_genes_batch = adata2.var.highly_variable_nbatches > 0"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Compare overlap of variable genes with batches or with all data."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "print(\"Any batch var genes: %d\"%sum(var_genes_batch))\n",
        "print(\"All data var genes: %d\"%sum(var_genes_all))\n",
        "print(\"Overlap: %d\"%sum(var_genes_batch & var_genes_all))\n",
        "print(\"Variable genes in all batches: %d\"%sum(adata2.var.highly_variable_nbatches == 6))\n",
        "print(\"Overlap batch instersection and all: %d\"%sum(var_genes_all & adata2.var.highly_variable_intersection))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Select all genes that are variable in at least 2 datasets and use for\n",
        "remaining analysis."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "var_select = adata2.var.highly_variable_nbatches > 2\n",
        "var_genes = var_select.index[var_select]\n",
        "len(var_genes)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## BBKNN\n",
        "\n",
        "First, we will run BBKNN that is implemented in scanpy."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import bbknn\n",
        "bbknn.bbknn(adata2,batch_key='sample')\n",
        "\n",
        "# then run umap on the integrated space\n",
        "sc.tl.umap(adata2)\n",
        "sc.tl.tsne(adata2)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now plot the unintegrated and the integrated space reduced\n",
        "dimensions."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "fig, axs = plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)\n",
        "sc.pl.tsne(adata2, color=\"sample\", title=\"BBKNN Corrected tsne\", ax=axs[0,0], show=False)\n",
        "sc.pl.tsne(adata, color=\"sample\", title=\"Uncorrected tsne\", ax=axs[0,1], show=False)\n",
        "sc.pl.umap(adata2, color=\"sample\", title=\"BBKNN Corrected umap\", ax=axs[1,0], show=False)\n",
        "sc.pl.umap(adata, color=\"sample\", title=\"Uncorrected umap\", ax=axs[1,1], show=False)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's save the integrated data for further analysis."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "save_file = './data/covid/results/scanpy_covid_qc_dr_bbknn.h5ad'\n",
        "adata2.write_h5ad(save_file)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Combat\n",
        "\n",
        "Batch correction can also be performed with combat. Note that ComBat\n",
        "batch correction requires a dense matrix format as input (which is\n",
        "already the case in this example)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# create a new object with lognormalized counts\n",
        "adata_combat = sc.AnnData(X=adata.raw.X, var=adata.raw.var, obs = adata.obs)\n",
        "\n",
        "# first store the raw data \n",
        "adata_combat.raw = adata_combat\n",
        "\n",
        "# run combat\n",
        "sc.pp.combat(adata_combat, key='sample')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Then we run the regular steps of dimensionality reduction on the combat\n",
        "corrected data. Variable gene selection, pca and umap with combat data."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pp.highly_variable_genes(adata_combat)\n",
        "print(\"Highly variable genes: %d\"%sum(adata_combat.var.highly_variable))\n",
        "sc.pl.highly_variable_genes(adata_combat)\n",
        "\n",
        "sc.pp.pca(adata_combat, n_comps=30, use_highly_variable=True, svd_solver='arpack')\n",
        "\n",
        "sc.pp.neighbors(adata_combat)\n",
        "\n",
        "sc.tl.umap(adata_combat)\n",
        "sc.tl.tsne(adata_combat)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# compare var_genes\n",
        "var_genes_combat = adata_combat.var.highly_variable\n",
        "print(\"With all data %d\"%sum(var_genes_all))\n",
        "print(\"With combat %d\"%sum(var_genes_combat))\n",
        "print(\"Overlap %d\"%sum(var_genes_all & var_genes_combat))\n",
        "\n",
        "print(\"With 2 batches %d\"%sum(var_select))\n",
        "print(\"Overlap %d\"%sum(var_genes_combat & var_select))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now plot the unintegrated and the integrated space reduced\n",
        "dimensions."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "fig, axs = plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)\n",
        "sc.pl.tsne(adata2, color=\"sample\", title=\"BBKNN tsne\", ax=axs[0,0], show=False)\n",
        "sc.pl.tsne(adata_combat, color=\"sample\", title=\"Combat tsne\", ax=axs[0,1], show=False)\n",
        "sc.pl.umap(adata2, color=\"sample\", title=\"BBKNN umap\", ax=axs[1,0], show=False)\n",
        "sc.pl.umap(adata_combat, color=\"sample\", title=\"Combat umap\", ax=axs[1,1], show=False)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's save the integrated data for further analysis."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#save to file\n",
        "save_file = './data/covid/results/scanpy_covid_qc_dr_combat.h5ad'\n",
        "adata_combat.write_h5ad(save_file)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Scanorama\n",
        "\n",
        "Try out [Scanorama](https://github.com/brianhie/scanorama) for data\n",
        "integration as well. First we need to create individual AnnData objects\n",
        "from each of the datasets."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# split per batch into new objects.\n",
        "batches = adata.obs['sample'].cat.categories.tolist()\n",
        "alldata = {}\n",
        "for batch in batches:\n",
        "    alldata[batch] = adata2[adata2.obs['sample'] == batch,]\n",
        "\n",
        "alldata   "
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import scanorama\n",
        "\n",
        "#subset the individual dataset to the variable genes we defined at the beginning\n",
        "alldata2 = dict()\n",
        "for ds in alldata.keys():\n",
        "    print(ds)\n",
        "    alldata2[ds] = alldata[ds][:,var_genes]\n",
        "\n",
        "#convert to list of AnnData objects\n",
        "adatas = list(alldata2.values())\n",
        "\n",
        "# run scanorama.integrate\n",
        "scanorama.integrate_scanpy(adatas, dimred = 50)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#scanorama adds the corrected matrix to adata.obsm in each of the datasets in adatas.\n",
        "adatas[0].obsm['X_scanorama'].shape"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Get all the integrated matrices.\n",
        "scanorama_int = [ad.obsm['X_scanorama'] for ad in adatas]\n",
        "\n",
        "# make into one matrix.\n",
        "all_s = np.concatenate(scanorama_int)\n",
        "print(all_s.shape)\n",
        "\n",
        "# add to the AnnData object, create a new object first\n",
        "adata_sc = adata.copy()\n",
        "adata_sc.obsm[\"Scanorama\"] = all_s"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# tsne and umap\n",
        "sc.pp.neighbors(adata_sc, n_pcs =30, use_rep = \"Scanorama\")\n",
        "sc.tl.umap(adata_sc)\n",
        "sc.tl.tsne(adata_sc, n_pcs = 30, use_rep = \"Scanorama\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now plot the unintegrated and the integrated space reduced\n",
        "dimensions."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "fig, axs = plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)\n",
        "sc.pl.tsne(adata2, color=\"sample\", title=\"BBKNN tsne\", ax=axs[0,0], show=False)\n",
        "sc.pl.tsne(adata_sc, color=\"sample\", title=\"Scanorama tsne\", ax=axs[0,1], show=False)\n",
        "sc.pl.umap(adata2, color=\"sample\", title=\"BBKNN umap\", ax=axs[1,0], show=False)\n",
        "sc.pl.umap(adata_sc, color=\"sample\", title=\"Scanorama umap\", ax=axs[1,1], show=False)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's save the integrated data for further analysis."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#save to file\n",
        "save_file = './data/covid/results/scanpy_covid_qc_dr_scanorama.h5ad'\n",
        "adata_sc.write_h5ad(save_file)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Overview all methods\n",
        "\n",
        "Now we will plot UMAPS with all three integration methods side by side."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "fig, axs = plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)\n",
        "sc.pl.umap(adata, color=\"sample\", title=\"Uncorrected\", ax=axs[0,0], show=False)\n",
        "sc.pl.umap(adata2, color=\"sample\", title=\"BBKNN\", ax=axs[0,1], show=False)\n",
        "sc.pl.umap(adata_combat, color=\"sample\", title=\"Combat\", ax=axs[1,0], show=False)\n",
        "sc.pl.umap(adata_sc, color=\"sample\", title=\"Scanorama\", ax=axs[1,1], show=False)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Discuss**\n",
        ">\n",
        "> Look at the different integration results, which one do you think\n",
        "> looks the best? How would you motivate selecting one method over the\n",
        "> other? How do you think you could best evaluate if the integration\n",
        "> worked well?\n",
        "\n",
        "</div>\n",
        "\n",
        "## Extra task\n",
        "\n",
        "Have a look at the documentation for\n",
        "[BBKNN](https://scanpy.readthedocs.io/en/latest/generated/scanpy.external.pp.bbknn.html#scanpy-external-pp-bbknn)\n",
        "\n",
        "Try changing some of the parameteres in BBKNN, such as distance metric,\n",
        "number of PCs and number of neighbors. How does the results change with\n",
        "different parameters? Can you explain why?\n",
        "\n",
        "## Session info\n",
        "\n",
        "```{=html}\n",
        "<details>\n",
        "```\n",
        "```{=html}\n",
        "<summary>\n",
        "```\n",
        "Click here\n",
        "```{=html}\n",
        "</summary>\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.logging.print_versions()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "```{=html}\n",
        "</details>\n",
        "```"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}