{
  "cells": [
    {
      "cell_type": "raw",
      "metadata": {},
      "source": [
        "---\n",
        "description: Combining single-cell gene expression data with spatial\n",
        "  information to reveal the spatial distribution of gene activity within\n",
        "  tissues.\n",
        "subtitle:  Scanpy Toolkit\n",
        "title:  Spatial Transcriptomics\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Note**\n",
        ">\n",
        "> Code chunks run Python commands unless it starts with `%%bash`, in\n",
        "> which case, those chunks run shell commands.\n",
        "\n",
        "</div>\n",
        "\n",
        "Adapted from tutorials by Giovanni Palla\n",
        "(https://scanpy-tutorials.readthedocs.io/en/latest/spatial/integration-scanorama.html)\n",
        "and Carlos Talavera-López\n",
        "(https://docs.scvi-tools.org/en/latest/tutorials/notebooks/stereoscope_heart_LV_tutorial.html)\n",
        "\n",
        "Spatial transcriptomic data with the Visium platform is in many ways\n",
        "similar to scRNAseq data. It contains UMI counts for 5-20 cells instead\n",
        "of single cells, but is still quite sparse in the same way as scRNAseq\n",
        "data is, but with the additional information about spatial location in\n",
        "the tissue.\\\n",
        "Here we will first run quality control in a similar manner to scRNAseq\n",
        "data, then QC filtering, dimensionality reduction, integration and\n",
        "clustering. Then we will use scRNAseq data from mouse cortex to run\n",
        "label transfer to predict celltypes in the Visium spots.\\\n",
        "We will use two **Visium** spatial transcriptomics dataset of the mouse\n",
        "brain (Sagittal), which are publicly available from the [10x genomics\n",
        "website](https://support.10xgenomics.com/spatial-gene-expression/datasets/).\n",
        "Note, that these dataset have already been filtered for spots that does\n",
        "not overlap with the tissue.\n",
        "\n",
        "## Preparation\n",
        "\n",
        "Load packages"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import scanpy as sc\n",
        "import anndata as an\n",
        "import pandas as pd\n",
        "import numpy as np\n",
        "import matplotlib as mpl\n",
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "import scanorama\n",
        "import warnings\n",
        "import os\n",
        "\n",
        "warnings.simplefilter(action=\"ignore\", category=Warning)\n",
        "\n",
        "sc.set_figure_params(facecolor=\"white\", figsize=(8, 8))\n",
        "sc.settings.verbosity = 3"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Load ST data\n",
        "\n",
        "The function `datasets.visium_sge()` downloads the dataset from 10x\n",
        "genomics and returns an AnnData object that contains counts, images and\n",
        "spatial coordinates. We will calculate standards QC metrics with\n",
        "`pp.calculate_qc_metrics()` and visualize them.\n",
        "\n",
        "When using your own Visium data, use Scanpy's read_visium() function to\n",
        "import it."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# download pre-computed data if missing or long compute\n",
        "fetch_data = True\n",
        "\n",
        "# url for source and intermediate data\n",
        "path_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n",
        "\n",
        "if not os.path.exists(\"./data/spatial/visium\"):\n",
        "    os.makedirs(\"./data/spatial/visium\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "adata_anterior = sc.datasets.visium_sge(\n",
        "    sample_id=\"V1_Mouse_Brain_Sagittal_Anterior\"\n",
        ")\n",
        "adata_posterior = sc.datasets.visium_sge(\n",
        "    sample_id=\"V1_Mouse_Brain_Sagittal_Posterior\"\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "adata_anterior.var_names_make_unique()\n",
        "adata_posterior.var_names_make_unique()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make sure that both images are included in the merged object, use\n",
        "uns_merge=\"unique\"."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# merge into one dataset\n",
        "library_names = [\"V1_Mouse_Brain_Sagittal_Anterior\", \"V1_Mouse_Brain_Sagittal_Posterior\"]\n",
        "\n",
        "adata = adata_anterior.concatenate(\n",
        "    adata_posterior,\n",
        "    batch_key=\"library_id\",\n",
        "    uns_merge=\"unique\",\n",
        "    batch_categories=library_names\n",
        ")\n",
        "\n",
        "adata"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see, we now have the slot spatial in obsm, which contains the\n",
        "spatial information from the Visium platform.\n",
        "\n",
        "## Quality control\n",
        "\n",
        "Similar to scRNA-seq we use statistics on number of counts, number of\n",
        "features and percent mitochondria for quality control."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# add info on mitochondrial and hemoglobin genes to the objects.\n",
        "adata.var['mt'] = adata.var_names.str.startswith('mt-') \n",
        "adata.var['hb'] = adata.var_names.str.contains((\"^Hb.*-\"))\n",
        "\n",
        "sc.pp.calculate_qc_metrics(adata, qc_vars=['mt','hb'], percent_top=None, log1p=False, inplace=True)\n",
        "\n",
        "sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_hb'], jitter=0.4, groupby = 'library_id', rotation= 45)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can also plot the same data onto the tissue section.\n",
        "\n",
        "In scanpy, this is a bit tricky when you have multiple sections, as you\n",
        "would have to subset and plot them separately."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# need to plot the two sections separately and specify the library_id\n",
        "for library in library_names:\n",
        "    sc.pl.spatial(adata[adata.obs.library_id == library,:], library_id=library, color = [\"total_counts\", \"n_genes_by_counts\",'pct_counts_mt', 'pct_counts_hb'])"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see, the spots with low number of counts/features and high\n",
        "mitochondrial content are mainly towards the edges of the tissue. It is\n",
        "quite likely that these regions are damaged tissue. You may also see\n",
        "regions within a tissue with low quality if you have tears or folds in\n",
        "your section.\\\n",
        "But remember, for some tissue types, the amount of genes expressed and\n",
        "proportion mitochondria may also be a biological features, so bear in\n",
        "mind what tissue you are working on and what these features mean.\n",
        "\n",
        "### Filter spots\n",
        "\n",
        "Select all spots with less than **25%** mitocondrial reads, less than\n",
        "**20%** hb-reads and **500** detected genes. You must judge for yourself\n",
        "based on your knowledge of the tissue what are appropriate filtering\n",
        "criteria for your dataset."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "keep = (adata.obs['pct_counts_hb'] < 20) & (adata.obs['pct_counts_mt'] < 25) & (adata.obs['n_genes_by_counts'] > 1000)\n",
        "print(sum(keep))\n",
        "\n",
        "adata = adata[keep,:]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "And replot onto tissue sections."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "for library in library_names:\n",
        "    sc.pl.spatial(adata[adata.obs.library_id == library,:], library_id=library, color = [\"total_counts\", \"n_genes_by_counts\",'pct_counts_mt', 'pct_counts_hb'])"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Top expressed genes\n",
        "\n",
        "As for scRNA-seq data, we will look at what the top expressed genes are."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pl.highest_expr_genes(adata, n_top=20)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see, the mitochondrial genes are among the top expressed\n",
        "genes. Also the lncRNA gene Bc1 (brain cytoplasmic RNA 1). Also one\n",
        "hemoglobin gene.\n",
        "\n",
        "### Filter genes\n",
        "\n",
        "We will remove the *Bc1* gene, hemoglobin genes (blood contamination)\n",
        "and the mitochondrial genes."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "mito_genes = adata.var_names.str.startswith('mt-')\n",
        "hb_genes = adata.var_names.str.contains('^Hb.*-')\n",
        "\n",
        "remove = np.add(mito_genes, hb_genes)\n",
        "remove[adata.var_names == \"Bc1\"] = True\n",
        "keep = np.invert(remove)\n",
        "print(sum(remove))\n",
        "\n",
        "adata = adata[:,keep]\n",
        "\n",
        "print(adata.n_obs, adata.n_vars)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Analysis\n",
        "\n",
        "We will proceed with the data in a very similar manner to scRNA-seq\n",
        "data.\n",
        "\n",
        "As we have two sections, we will select variable genes with\n",
        "batch_key=\"library_id\" and then take the union of variable genes for\n",
        "further analysis. The idea is to avoid including batch specific genes in\n",
        "the analysis."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# save the counts to a separate object for later, we need the normalized counts in raw for DEG dete\n",
        "counts_adata = adata.copy()\n",
        "\n",
        "sc.pp.normalize_total(adata, inplace=True)\n",
        "sc.pp.log1p(adata)\n",
        "# take 1500 variable genes per batch and then use the union of them.\n",
        "sc.pp.highly_variable_genes(adata, flavor=\"seurat\", n_top_genes=1500, inplace=True, batch_key=\"library_id\")\n",
        "\n",
        "# subset for variable genes\n",
        "adata.raw = adata\n",
        "adata = adata[:,adata.var.highly_variable_nbatches > 0]\n",
        "\n",
        "# scale data\n",
        "sc.pp.scale(adata)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we can plot gene expression of individual genes, the gene *Hpca* is\n",
        "a strong hippocampal marker and *Ttr* is a marker of the choroid plexus."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "for library in library_names:\n",
        "    sc.pl.spatial(adata[adata.obs.library_id == library,:], library_id=library, color = [\"Ttr\", \"Hpca\"])"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Dimensionality reduction and clustering\n",
        "\n",
        "We can then now run dimensionality reduction and clustering using the\n",
        "same workflow as we use for scRNA-seq analysis."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pp.neighbors(adata)\n",
        "sc.tl.umap(adata)\n",
        "sc.tl.leiden(adata, key_added=\"clusters\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can then plot clusters onto umap or onto the tissue section."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pl.umap(\n",
        "    adata, color=[\"clusters\", \"library_id\"], palette=sc.pl.palettes.default_20\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As we are plotting the two sections separately, we need to make sure\n",
        "that they get the same colors by fetching cluster colors from a dict."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "clusters_colors = dict(\n",
        "    zip([str(i) for i in range(len(adata.obs.clusters.cat.categories))], adata.uns[\"clusters_colors\"])\n",
        ")\n",
        "\n",
        "fig, axs = plt.subplots(1, 2, figsize=(15, 10))\n",
        "\n",
        "for i, library in enumerate(\n",
        "    [\"V1_Mouse_Brain_Sagittal_Anterior\", \"V1_Mouse_Brain_Sagittal_Posterior\"]\n",
        "):\n",
        "    ad = adata[adata.obs.library_id == library, :].copy()\n",
        "    sc.pl.spatial(\n",
        "        ad,\n",
        "        img_key=\"hires\",\n",
        "        library_id=library,\n",
        "        color=\"clusters\",\n",
        "        size=1.5,\n",
        "        palette=[\n",
        "            v\n",
        "            for k, v in clusters_colors.items()\n",
        "            if k in ad.obs.clusters.unique().tolist()\n",
        "        ],\n",
        "        legend_loc=None,\n",
        "        show=False,\n",
        "        ax=axs[i],\n",
        "    )\n",
        "\n",
        "plt.tight_layout()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Integration\n",
        "\n",
        "Quite often, there are strong batch effects between different ST\n",
        "sections, so it may be a good idea to integrate the data across\n",
        "sections.\n",
        "\n",
        "We will do a similar integration as in the Data Integration lab, here we\n",
        "will use Scanorama for integration."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "adatas = {}\n",
        "for batch in library_names:\n",
        "    adatas[batch] = adata[adata.obs['library_id'] == batch,]\n",
        "\n",
        "adatas "
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import scanorama\n",
        "\n",
        "#convert to list of AnnData objects\n",
        "adatas = list(adatas.values())\n",
        "\n",
        "# run scanorama.integrate\n",
        "scanorama.integrate_scanpy(adatas, dimred = 50)\n",
        "\n",
        "# Get all the integrated matrices.\n",
        "scanorama_int = [ad.obsm['X_scanorama'] for ad in adatas]\n",
        "\n",
        "# make into one matrix.\n",
        "all_s = np.concatenate(scanorama_int)\n",
        "print(all_s.shape)\n",
        "\n",
        "# add to the AnnData object\n",
        "adata.obsm[\"Scanorama\"] = all_s\n",
        "\n",
        "adata"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Then we run dimensionality reduction and clustering as before."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pp.neighbors(adata, use_rep=\"Scanorama\")\n",
        "sc.tl.umap(adata)\n",
        "sc.tl.leiden(adata, key_added=\"clusters\")\n",
        "\n",
        "sc.pl.umap(\n",
        "    adata, color=[\"clusters\", \"library_id\"], palette=sc.pl.palettes.default_20\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As we have new clusters, we again need to make a new dict for cluster\n",
        "colors"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "clusters_colors = dict(\n",
        "    zip([str(i) for i in range(len(adata.obs.clusters.cat.categories))], adata.uns[\"clusters_colors\"])\n",
        ")\n",
        "\n",
        "fig, axs = plt.subplots(1, 2, figsize=(15, 10))\n",
        "\n",
        "for i, library in enumerate(\n",
        "    [\"V1_Mouse_Brain_Sagittal_Anterior\", \"V1_Mouse_Brain_Sagittal_Posterior\"]\n",
        "):\n",
        "    ad = adata[adata.obs.library_id == library, :].copy()\n",
        "    sc.pl.spatial(\n",
        "        ad,\n",
        "        img_key=\"hires\",\n",
        "        library_id=library,\n",
        "        color=\"clusters\",\n",
        "        size=1.5,\n",
        "        palette=[\n",
        "            v\n",
        "            for k, v in clusters_colors.items()\n",
        "            if k in ad.obs.clusters.unique().tolist()\n",
        "        ],\n",
        "        legend_loc=None,\n",
        "        show=False,\n",
        "        ax=axs[i],\n",
        "    )\n",
        "\n",
        "plt.tight_layout()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Discuss**\n",
        ">\n",
        "> Do you see any differences between the integrated and non-integrated\n",
        "> clustering? Judge for yourself, which of the clusterings do you think\n",
        "> looks best? As a reference, you can compare to brain regions in the\n",
        "> [Allen brain\n",
        "> atlas](https://mouse.brain-map.org/experiment/thumbnails/100042147?image_type=atlas).\n",
        "\n",
        "</div>\n",
        "\n",
        "### Spatially Variable Features\n",
        "\n",
        "There are two main workflows to identify molecular features that\n",
        "correlate with spatial location within a tissue. The first is to perform\n",
        "differential expression based on spatially distinct clusters, the other\n",
        "is to find features that have spatial patterning without taking clusters\n",
        "or spatial annotation into account. First, we will do differential\n",
        "expression between clusters just as we did for the scRNAseq data before."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# run t-test \n",
        "sc.tl.rank_genes_groups(adata, \"clusters\", method=\"wilcoxon\")\n",
        "# plot as heatmap for cluster5 genes\n",
        "sc.pl.rank_genes_groups_heatmap(adata, groups=\"5\", n_genes=10, groupby=\"clusters\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# plot onto spatial location\n",
        "top_genes = sc.get.rank_genes_groups_df(adata, group='5',log2fc_min=0)['names'][:3]\n",
        "\n",
        "for library in [\"V1_Mouse_Brain_Sagittal_Anterior\", \"V1_Mouse_Brain_Sagittal_Posterior\"]:\n",
        "    sc.pl.spatial(adata[adata.obs.library_id == library,:], library_id=library, color = top_genes)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Spatial transcriptomics allows researchers to investigate how gene\n",
        "expression trends varies in space, thus identifying spatial patterns of\n",
        "gene expression. For this purpose there are multiple methods, such as\n",
        "SpatailDE, SPARK, Trendsceek, HMRF and Splotch.\n",
        "\n",
        "We use SpatialDE Svensson et al., a Gaussian process-based statistical\n",
        "framework that aims to identify spatially variable genes.\n",
        "\n",
        "<div>\n",
        "\n",
        "> **Caution**\n",
        ">\n",
        "> This is a slow compute intensive step, we will not run this now and\n",
        "> instead use a pre-computed file in the step below.\n",
        ">\n",
        "> ``` {python}\n",
        "> # | results: hide\n",
        "> # this code is not executed\n",
        ">\n",
        "> if not fetch_data:\n",
        ">     import NaiveDE\n",
        ">     import SpatialDE\n",
        ">\n",
        ">     counts = sc.get.obs_df(adata, keys=list(adata.var_names), use_raw=True)\n",
        ">     total_counts = sc.get.obs_df(adata, keys=[\"total_counts\"])\n",
        ">     norm_expr = NaiveDE.stabilize(counts.T).T\n",
        ">     resid_expr = NaiveDE.regress_out(\n",
        ">         total_counts, norm_expr.T, \"np.log(total_counts)\").T\n",
        ">     results = SpatialDE.run(adata.obsm[\"spatial\"], resid_expr)\n",
        ">\n",
        ">     import pickle\n",
        ">     with open('data/spatial/visium/scanpy_spatialde.pkl', 'wb') as file:\n",
        ">         pickle.dump(results, file)\n",
        "> ```\n",
        "\n",
        "</div>\n",
        "\n",
        "Download precomputed file."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "results": "hide"
      },
      "source": [
        "path_file = \"data/spatial/visium/scanpy_spatialde.pkl\"\n",
        "if fetch_data and not os.path.exists(path_file):\n",
        "    import urllib.request\n",
        "    file_url = os.path.join(\n",
        "        path_data, \"spatial/visium/results/scanpy_spatialde.pkl\")\n",
        "    urllib.request.urlretrieve(file_url, path_file)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "results": "hide"
      },
      "source": [
        "import pickle\n",
        "with open('data/spatial/visium/scanpy_spatialde.pkl', 'rb') as file:\n",
        "    results = pickle.load(file)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "results": "hide"
      },
      "source": [
        "#| eval: false\n",
        "# skip for now.\n",
        "\n",
        "# We concatenate the results with the DataFrame of annotations of variables: `adata.var`.\n",
        "results.index = results[\"g\"]\n",
        "adata.var = pd.concat(\n",
        "    [adata.var, results.loc[adata.var.index.values, :]], axis=1)\n",
        "adata.write_h5ad('./data/spatial/visium/adata_processed_sc.h5ad')\n",
        "\n",
        "# Then we can inspect significant genes that varies in space and visualize them with `sc.pl.spatial` function.\n",
        "results.sort_values(\"qval\").head(10)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Single cell data\n",
        "\n",
        "We can use a scRNA-seq dataset as a reference to predict the proportion\n",
        "of different celltypes in the Visium spots. Keep in mind that it is\n",
        "important to have a reference that contains all the celltypes you expect\n",
        "to find in your spots. Ideally it should be a scRNA-seq reference from\n",
        "the exact same tissue. We will use a reference scRNA-seq dataset of\n",
        "\\~14,000 adult mouse cortical cell taxonomy from the Allen Institute,\n",
        "generated with the SMART-Seq2 protocol.\n",
        "\n",
        "Conveniently, you can also download the pre-processed dataset in h5ad\n",
        "format."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import urllib.request\n",
        "import os\n",
        "\n",
        "path_file = \"data/spatial/visium/allen_cortex.h5ad\"\n",
        "if not os.path.exists(path_file):\n",
        "    file_url = os.path.join(\n",
        "        path_data, \"spatial/visium/allen_cortex.h5ad\")\n",
        "    urllib.request.urlretrieve(file_url, path_file)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "adata_cortex = sc.read_h5ad(\"data/spatial/visium/allen_cortex.h5ad\")\n",
        "adata_cortex"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Here is the metadata for the cell annotation:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "adata_cortex.obs"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "There is an issue with the raw matrix in this object that the gene names\n",
        "are not in the index, so we will put them back in."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "adata_cortex.raw.var.index = adata_cortex.raw.var._index\n",
        "adata_cortex.raw.var"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Then we run the regular pipline with normalization and dimensionality\n",
        "reduction."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pp.normalize_total(adata_cortex, target_sum=1e5)\n",
        "sc.pp.log1p(adata_cortex)\n",
        "sc.pp.highly_variable_genes(adata_cortex, min_mean=0.0125, max_mean=3, min_disp=0.5)\n",
        "sc.pp.scale(adata_cortex, max_value=10)\n",
        "sc.tl.pca(adata_cortex, svd_solver='arpack')\n",
        "sc.pp.neighbors(adata_cortex, n_pcs=30)\n",
        "sc.tl.umap(adata_cortex)\n",
        "sc.pl.umap(adata_cortex, color=\"subclass\", legend_loc='on data')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "adata_cortex.obs.subclass.value_counts()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For speed, and for a more fair comparison of the celltypes, we will\n",
        "subsample all celltypes to a maximum of 200 cells per class\n",
        "(`subclass`)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "target_cells = 200\n",
        "\n",
        "adatas2 = [adata_cortex[adata_cortex.obs.subclass == clust] for clust in adata_cortex.obs.subclass.cat.categories]\n",
        "\n",
        "for dat in adatas2:\n",
        "    if dat.n_obs > target_cells:\n",
        "          sc.pp.subsample(dat, n_obs=target_cells)\n",
        "\n",
        "adata_cortex = adatas2[0].concatenate(*adatas2[1:])\n",
        "\n",
        "adata_cortex.obs.subclass.value_counts()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pl.umap(\n",
        "    adata_cortex, color=[\"class\", \"subclass\", \"genotype\", \"brain_region\"], palette=sc.pl.palettes.default_28\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pl.umap(adata_cortex, color=\"subclass\", legend_loc = 'on data')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Subset ST for cortex\n",
        "\n",
        "Since the scRNAseq dataset was generated from the mouse cortex, we will\n",
        "subset the visium dataset in order to select mainly the spots part of\n",
        "the cortex. Note that the integration can also be performed on the whole\n",
        "brain slice, but it would give rise to false positive cell type\n",
        "assignments and therefore it should be interpreted with more care.\n",
        "\n",
        "For deconvolution we will need the counts data, so we will subset from\n",
        "the counts_adata object that we created earlier."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "lib_a = \"V1_Mouse_Brain_Sagittal_Anterior\"\n",
        "\n",
        "counts_adata.obs['clusters'] = adata.obs.clusters\n",
        "\n",
        "adata_anterior_subset = counts_adata[\n",
        "    (counts_adata.obs.library_id == lib_a) \n",
        "    & (counts_adata.obsm[\"spatial\"][:, 1] < 6000), :\n",
        "].copy()\n",
        "\n",
        "# select also the cortex clusters\n",
        "adata_anterior_subset = adata_anterior_subset[adata_anterior_subset.obs.clusters.isin(['3','5','6']),:]\n",
        "\n",
        "# plot to check that we have the correct regions\n",
        "\n",
        "sc.pl.spatial(\n",
        "    adata_anterior_subset,\n",
        "    img_key=\"hires\",\n",
        "    library_id = lib_a,\n",
        "    color=['clusters'],\n",
        "    size=1.5\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Deconvolution\n",
        "\n",
        "Deconvolution is a method to estimate the abundance (or proportion) of\n",
        "different celltypes in a bulkRNAseq dataset using a single cell\n",
        "reference. As the Visium data can be seen as a small bulk, we can both\n",
        "use methods for traditional bulkRNAseq as well as methods especially\n",
        "developed for Visium data. Some methods for deconvolution are DWLS,\n",
        "cell2location, Tangram, Stereoscope, RCTD, SCDC and many more.\n",
        "\n",
        "Here, we will use deconvolution with Stereoscope implemented in the\n",
        "SCVI-tools package. To read more about Stereoscope please check out this\n",
        "github page (https://github.com/almaan/stereoscope)\n",
        "\n",
        "### Select genes for deconvolution\n",
        "\n",
        "Most deconvolution methods does a prior gene selection and there are\n",
        "different options that are used: - Use variable genes in the SC data. -\n",
        "Use variable genes in both SC and ST data - DE genes between clusters in\n",
        "the SC data.\\\n",
        "In this case we will use top DE genes per cluster, so first we have to\n",
        "run DGE detection on the scRNAseq data."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.tl.rank_genes_groups(adata_cortex, 'subclass', method = \"t-test\", n_genes=100, use_raw=False)\n",
        "sc.pl.rank_genes_groups_dotplot(adata_cortex, n_genes=3)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.tl.filter_rank_genes_groups(adata_cortex, min_fold_change=1)\n",
        "\n",
        "genes = sc.get.rank_genes_groups_df(adata_cortex, group = None)\n",
        "genes"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "deg = genes.names.unique().tolist()\n",
        "print(len(deg))\n",
        "# check that the genes are also present in the ST data\n",
        "\n",
        "deg = np.intersect1d(deg,adata_anterior_subset.var.index).tolist()\n",
        "print(len(deg))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Train the model\n",
        "\n",
        "First, train the model using scRNAseq data.\n",
        "\n",
        "Stereoscope requires the data to be in counts, earlier in this tutorial\n",
        "we saved the spatial counts in a separate object counts_adata.\n",
        "\n",
        "In the single cell data we have the raw counts in the `raw.X` matrix so\n",
        "that one will be used. So here we create a new object with all the\n",
        "correct slots for scVI."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc_adata = adata_cortex.copy()\n",
        "sc_adata.X = adata_cortex.raw.X.copy()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Setup the anndata, the implementation requires the counts matrix to be\n",
        "in the \"counts\" layer as a copy."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import scvi\n",
        "# from scvi.data import register_tensor_from_anndata\n",
        "from scvi.external import RNAStereoscope, SpatialStereoscope\n",
        "\n",
        "# add counts layer\n",
        "sc_adata.layers[\"counts\"] = sc_adata.X.copy()\n",
        "\n",
        "# subset for the selected genes\n",
        "sc_adata = sc_adata[:, deg].copy()\n",
        "\n",
        "# create stereoscope object\n",
        "RNAStereoscope.setup_anndata(sc_adata, layer=\"counts\", labels_key=\"subclass\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# the model is saved to a file, so if is slow to run, you can simply reload it from disk by setting train = False\n",
        "\n",
        "train = True\n",
        "if train:\n",
        "    sc_model = RNAStereoscope(sc_adata)\n",
        "    sc_model.train(max_epochs=300)\n",
        "    sc_model.history[\"elbo_train\"][10:].plot()\n",
        "    sc_model.save(\"./data/spatial/visium/scanpy_scmodel\", overwrite=True)\n",
        "else:\n",
        "    sc_model = RNAStereoscope.load(\"./data/spatial/visium/scanpy_scmodel\", sc_adata)\n",
        "    print(\"Loaded RNA model from file!\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Predict proportions on the spatial data\n",
        "\n",
        "First create a new st object with the correct genes and counts as a\n",
        "layer."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "st_adata = adata_anterior_subset.copy()\n",
        "\n",
        "st_adata.layers[\"counts\"] = st_adata.X.copy()\n",
        "st_adata = st_adata[:, deg].copy()\n",
        "\n",
        "SpatialStereoscope.setup_anndata(st_adata, layer=\"counts\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "train = True\n",
        "if train:\n",
        "    spatial_model = SpatialStereoscope.from_rna_model(st_adata, sc_model)\n",
        "    spatial_model.train(max_epochs = 3000)\n",
        "    spatial_model.history[\"elbo_train\"][10:].plot()\n",
        "    spatial_model.save(\"./data/spatial/visium/scanpy_stmodel\", overwrite = True)\n",
        "else:\n",
        "    spatial_model = SpatialStereoscope.load(\"./data/spatial/visium/scanpy_stmodel\", st_adata)\n",
        "    print(\"Loaded Spatial model from file!\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Get the results from the model, also put them in the .obs slot."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "st_adata.obsm[\"deconvolution\"] = spatial_model.get_proportions()\n",
        "\n",
        "# also copy to the obsm data frame\n",
        "for ct in st_adata.obsm[\"deconvolution\"].columns:\n",
        "    st_adata.obs[ct] = st_adata.obsm[\"deconvolution\"][ct]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We are then able to explore how cell types in the scRNA-seq dataset are\n",
        "predicted onto the visium dataset. Let's first visualize the neurons\n",
        "cortical layers."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pl.spatial(\n",
        "    st_adata,\n",
        "    img_key=\"hires\",\n",
        "    color=[\"L2/3 IT\", \"L4\", \"L5 PT\", \"L6 CT\"],\n",
        "    library_id=lib_a,\n",
        "    size=1.5,\n",
        "    ncols=2\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can go ahead an visualize astrocytes and oligodendrocytes as well."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pl.spatial(\n",
        "    st_adata, img_key=\"hires\", color=[\"Oligo\", \"Astro\"], size=1.5, library_id=lib_a\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Keep in mind that the deconvolution results are just predictions,\n",
        "depending on how well your scRNAseq data covers the celltypes that are\n",
        "present in the ST data and on how parameters, gene selection etc. are\n",
        "tuned you may get different results."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.pl.violin(st_adata, [\"L2/3 IT\", \"L6 CT\",\"Oligo\",\"Astro\"],\n",
        "            jitter=0.4, groupby = 'clusters', rotation= 45)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Discuss**\n",
        ">\n",
        "> Subset for another region that does not contain cortex cells and check\n",
        "> what you get from the label transfer. Suggested region is the right\n",
        "> end of the posterial section that you can select like this:\n",
        ">\n",
        "> ``` {python}\n",
        ">\n",
        "> lib_p = \"V1_Mouse_Brain_Sagittal_Posterior\"\n",
        ">\n",
        "> adata_subregion = adata[\n",
        ">     (adata.obs.library_id == lib_p)\n",
        ">     & (adata.obsm[\"spatial\"][:, 0] > 6500),\n",
        ">     :,\n",
        "> ].copy()\n",
        ">\n",
        "> sc.pl.spatial(\n",
        ">     adata_subregion,\n",
        ">     img_key=\"hires\",\n",
        ">     library_id=lib_p,\n",
        ">     color=['n_genes_by_counts'],\n",
        ">     size=1.5\n",
        "> )\n",
        "> ```\n",
        "\n",
        "</div>\n",
        "\n",
        "## Session info\n",
        "\n",
        "```{=html}\n",
        "<details>\n",
        "```\n",
        "```{=html}\n",
        "<summary>\n",
        "```\n",
        "Click here\n",
        "```{=html}\n",
        "</summary>\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sc.logging.print_versions()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "```{=html}\n",
        "</details>\n",
        "```"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}