{
  "cells": [
    {
      "cell_type": "raw",
      "metadata": {},
      "source": [
        "---\n",
        "description: Reduce high-dimensional gene expression data from\n",
        "  individual cells into a lower-dimensional space for visualization.\n",
        "  This lab explores PCA, tSNE and UMAP.\n",
        "subtitle:  Scanpy Toolkit\n",
        "title:  Dimensionality Reduction\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Note**\n",
        ">\n",
        "> Code chunks run Python commands unless it starts with `%%bash`, in\n",
        "> which case, those chunks run shell commands.\n",
        "\n",
        "</div>\n",
        "\n",
        "## Data preparation\n",
        "\n",
        "First, let's load all necessary libraries and the QC-filtered dataset\n",
        "from the previous step."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: libraries\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import scanpy as sc\n",
        "import matplotlib.pyplot as plt\n",
        "import warnings\n",
        "import os\n",
        "import subprocess\n",
        "\n",
        "warnings.simplefilter(action=\"ignore\", category=Warning)\n",
        "\n",
        "# verbosity: errors (0), warnings (1), info (2), hints (3)\n",
        "sc.settings.verbosity = 3\n",
        "# sc.logging.print_versions()\n",
        "\n",
        "sc.settings.set_figure_params(dpi=80)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: fetch-data\n",
        "# download pre-computed data if missing or long compute\n",
        "fetch_data = True\n",
        "\n",
        "# url for source and intermediate data\n",
        "path_data = \"https://nextcloud.dc.scilifelab.se/public.php/webdav\"\n",
        "curl_upass = \"zbC5fr2LbEZ9rSE:scRNAseq2025\"\n",
        "\n",
        "\n",
        "path_results = \"data/covid/results\"\n",
        "if not os.path.exists(path_results):\n",
        "    os.makedirs(path_results, exist_ok=True)\n",
        "\n",
        "path_file = \"data/covid/results/scanpy_covid_qc.h5ad\"\n",
        "# if fetch_data is false and path_file doesn't exist\n",
        "\n",
        "if fetch_data and not os.path.exists(path_file):\n",
        "    file_url = os.path.join(path_data, \"covid/results_scanpy/scanpy_covid_qc.h5ad\")\n",
        "    subprocess.call([\"curl\", \"-u\", curl_upass, \"-o\", path_file, file_url ])\n",
        "\n",
        "adata = sc.read_h5ad(path_file)\n",
        "adata"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Before variable gene selection we need to normalize and log transform\n",
        "the data. Then store the full matrix in the `raw` slot before doing\n",
        "variable gene selection."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: normalize\n",
        "# normalize to depth 10 000\n",
        "sc.pp.normalize_total(adata, target_sum=1e4)\n",
        "\n",
        "# log transform\n",
        "sc.pp.log1p(adata)\n",
        "\n",
        "# store normalized counts in the raw slot, \n",
        "# we will subset adata.X for variable genes, but want to keep all genes matrix as well.\n",
        "adata.raw = adata\n",
        "\n",
        "adata"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Feature selection\n",
        "\n",
        "We first need to define which features/genes are important in our\n",
        "dataset to distinguish cell types. For this purpose, we need to find\n",
        "genes that are highly variable across cells, which in turn will also\n",
        "provide a good separation of the cell clusters."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: hvg\n",
        "# compute variable genes\n",
        "sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)\n",
        "print(\"Highly variable genes: %d\"%sum(adata.var.highly_variable))\n",
        "\n",
        "#plot variable genes\n",
        "sc.pl.highly_variable_genes(adata)\n",
        "\n",
        "# subset for variable genes in the dataset\n",
        "adata = adata[:, adata.var['highly_variable']]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Z-score transformation\n",
        "\n",
        "Now that the genes have been selected, we now proceed with PCA. Since\n",
        "each gene has a different expression level, it means that genes with\n",
        "higher expression values will naturally have higher variation that will\n",
        "be captured by PCA. This means that we need to somehow give each gene a\n",
        "similar weight when performing PCA (see below). The common practice is\n",
        "to center and scale each gene before performing PCA. This exact scaling\n",
        "called Z-score normalization is very useful for PCA, clustering and\n",
        "plotting heatmaps. Additionally, we can use regression to remove any\n",
        "unwanted sources of variation from the dataset, such as `cell cycle`,\n",
        "`sequencing depth`, `percent mitochondria` etc. This is achieved by\n",
        "doing a generalized linear regression using these parameters as\n",
        "co-variates in the model. Then the residuals of the model are taken as\n",
        "the *regressed data*. Although perhaps not in the best way, batch effect\n",
        "regression can also be done here. By default, variables are scaled in\n",
        "the PCA step and is not done separately. But it could be achieved by\n",
        "running the commands below:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: scale\n",
        "#run this line if you get the \"AttributeError: swapaxes not found\" \n",
        "# adata = adata.copy()\n",
        "\n",
        "# regress out unwanted variables\n",
        "sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])\n",
        "\n",
        "# scale data, clip values exceeding standard deviation 10.\n",
        "sc.pp.scale(adata, max_value=10)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## PCA\n",
        "\n",
        "Performing PCA has many useful applications and interpretations, which\n",
        "much depends on the data used. In the case of single-cell data, we want\n",
        "to segregate samples based on gene expression patterns in the data.\n",
        "\n",
        "To run PCA, you can use the function `pca()`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: pca\n",
        "sc.tl.pca(adata, svd_solver='arpack')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We then plot the first principal components."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: pca-plot\n",
        "# plot more PCS\n",
        "sc.pl.pca(adata, color='sample', components = ['1,2','3,4','5,6','7,8'], ncols=2)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To identify genes that contribute most to each PC, one can retrieve the\n",
        "loading matrix information."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: pca-loadings\n",
        "#Plot loadings\n",
        "sc.pl.pca_loadings(adata, components=[1,2,3,4,5,6,7,8])\n",
        "\n",
        "# OBS! only plots the positive axes genes from each PC!!"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The function to plot loading genes only plots genes on the positive\n",
        "axes. Instead plot as a heatmaps, with genes on both positive and\n",
        "negative side, one per pc, and plot their expression amongst cells\n",
        "ordered by their position along the pc."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: pca-loadings-heatmap\n",
        "# adata.obsm[\"X_pca\"] is the embeddings\n",
        "# adata.uns[\"pca\"] is pc variance\n",
        "# adata.varm['PCs'] is the loadings\n",
        "\n",
        "genes = adata.var['gene_ids']\n",
        "\n",
        "for pc in [1,2,3,4]:\n",
        "    g = adata.varm['PCs'][:,pc-1]\n",
        "    o = np.argsort(g)\n",
        "    sel = np.concatenate((o[:10],o[-10:])).tolist()\n",
        "    emb = adata.obsm['X_pca'][:,pc-1]\n",
        "    # order by position on that pc\n",
        "    tempdata = adata[np.argsort(emb),]\n",
        "    sc.pl.heatmap(tempdata, var_names = genes[sel].index.tolist(), groupby='predicted_doublets', swap_axes = True, use_raw=False)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can also plot the amount of variance explained by each PC."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: pca-elbow\n",
        "sc.pl.pca_variance_ratio(adata, log=True, n_pcs = 50)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Based on this plot, we can see that the top 8 PCs retain a lot of\n",
        "information, while other PCs contain progressively less. However, it is\n",
        "still advisable to use more PCs since they might contain information\n",
        "about rare cell types (such as platelets and DCs in this dataset)\n",
        "\n",
        "## tSNE\n",
        "\n",
        "We will now run [BH-tSNE](https://arxiv.org/abs/1301.3342)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: tsne\n",
        "sc.tl.tsne(adata, n_pcs = 30)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We plot the tSNE scatterplot colored by dataset. We can clearly see the\n",
        "effect of batches present in the dataset."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: tsne-plot\n",
        "sc.pl.tsne(adata, color='sample')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## UMAP\n",
        "\n",
        "The UMAP implementation in SCANPY uses a neighborhood graph as the\n",
        "distance matrix, so we need to first calculate the graph."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: neighbors\n",
        "sc.pp.neighbors(adata, n_pcs = 30, n_neighbors = 20)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now run [UMAP](https://arxiv.org/abs/1802.03426) for cell\n",
        "embeddings."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: umap\n",
        "sc.tl.umap(adata)\n",
        "sc.pl.umap(adata, color='sample')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "UMAP is plotted colored per dataset. Although less distinct as in the\n",
        "tSNE, we still see quite an effect of the different batches in the data."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: umap10\n",
        "# run with 10 components, save to a new object so that the umap with 2D is not overwritten.\n",
        "umap10 = sc.tl.umap(adata, n_components=10, copy=True)\n",
        "fig, axs = plt.subplots(1, 3, figsize=(10, 4), constrained_layout=True)\n",
        "\n",
        "sc.pl.umap(adata, color='sample',  title=\"UMAP\",\n",
        "           show=False, ax=axs[0], legend_loc=None)\n",
        "sc.pl.umap(umap10, color='sample', title=\"UMAP10\", show=False,\n",
        "           ax=axs[1], components=['1,2'], legend_loc=None)\n",
        "sc.pl.umap(umap10, color='sample', title=\"UMAP10\",\n",
        "           show=False, ax=axs[2], components=['3,4'], legend_loc=None)\n",
        "\n",
        "# we can also plot the umap with neighbor edges\n",
        "sc.pl.umap(adata, color='sample', title=\"UMAP\", edges=True)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now plot PCA, UMAP and tSNE side by side for comparison. Have a\n",
        "look at the UMAP and tSNE. What similarities/differences do you see? Can\n",
        "you explain the differences based on what you learned during the\n",
        "lecture? Also, we can conclude from the dimensionality reductions that\n",
        "our dataset contains a batch effect that needs to be corrected before\n",
        "proceeding to clustering and differential gene expression analysis."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: plot-all\n",
        "fig, axs = plt.subplots(2, 2, figsize=(10, 8), constrained_layout=True)\n",
        "sc.pl.pca(adata, color='sample', components=['1,2'], ax=axs[0, 0], show=False)\n",
        "sc.pl.tsne(adata, color='sample', components=['1,2'], ax=axs[0, 1], show=False)\n",
        "sc.pl.umap(adata, color='sample', components=['1,2'], ax=axs[1, 0], show=False)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, we can compare the PCA, tSNE and UMAP.\n",
        "\n",
        "<div>\n",
        "\n",
        "> **Discuss**\n",
        ">\n",
        "> We have now done Variable gene selection, PCA and UMAP with the\n",
        "> settings we selected for you. Test a few different ways of selecting\n",
        "> variable genes, number of PCs for UMAP and check how it influences\n",
        "> your embedding.\n",
        "\n",
        "</div>\n",
        "\n",
        "## Genes of interest\n",
        "\n",
        "Let's plot some marker genes for different cell types onto the\n",
        "embedding.\n",
        "\n",
        "  Markers                    Cell Type\n",
        "  -------------------------- -------------------\n",
        "  CD3E                       T cells\n",
        "  CD3E CD4                   CD4+ T cells\n",
        "  CD3E CD8A                  CD8+ T cells\n",
        "  GNLY, NKG7                 NK cells\n",
        "  MS4A1                      B cells\n",
        "  CD14, LYZ, CST3, MS4A7     CD14+ Monocytes\n",
        "  FCGR3A, LYZ, CST3, MS4A7   FCGR3A+ Monocytes\n",
        "  FCER1A, CST3               DCs"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: plot-markers\n",
        "sc.pl.umap(adata, color=[\"CD3E\", \"CD4\", \"CD8A\", \"GNLY\",\"NKG7\", \"MS4A1\",\"CD14\",\"LYZ\",\"CST3\",\"MS4A7\",\"FCGR3A\"])"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The default is to plot gene expression in the normalized and\n",
        "log-transformed data. You can also plot it on the scaled and corrected\n",
        "data by using `use_raw=False`. However, not all of these genes are\n",
        "included in the variable gene set, and hence are not included in the\n",
        "scaled `adata.X`, so we first need to filter them."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: plot-markers2\n",
        "genes  = [\"CD3E\", \"CD4\", \"CD8A\", \"GNLY\",\"NKG7\", \"MS4A1\",\"CD14\",\"LYZ\",\"CST3\",\"MS4A7\",\"FCGR3A\"]\n",
        "var_genes = adata.var.highly_variable\n",
        "var_genes.index[var_genes]\n",
        "varg = [x for x in genes if x in var_genes.index[var_genes]]\n",
        "sc.pl.umap(adata, color=varg, use_raw=False)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div>\n",
        "\n",
        "> **Discuss**\n",
        ">\n",
        "> Select some of your dimensionality reductions and plot some of the QC\n",
        "> stats that were calculated in the previous lab. Can you see if some of\n",
        "> the separation in your data is driven by quality of the cells?\n",
        "\n",
        "</div>"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: plot-qc\n",
        "sc.pl.umap(adata, color=['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo', 'pct_counts_hb'], ncols=3,use_raw=False)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Save data\n",
        "\n",
        "We can finally save the object for use in future steps."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: save\n",
        "adata.write_h5ad('data/covid/results/scanpy_covid_qc_dr.h5ad')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Note\n",
        "\n",
        "Just as a reminder, you need to keep in mind what you have in the X\n",
        "matrix. After these operations you have an X matrix with only variable\n",
        "genes, that are normalized, logtransformed and scaled.\n",
        "\n",
        "We stored the expression of all genes in `raw.X` after doing\n",
        "lognormalization so that matrix is a sparse matrix with logtransformed\n",
        "values."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: check-data\n",
        "print(adata.X.shape)\n",
        "print(adata.raw.X.shape)\n",
        "\n",
        "print(adata.X[:3,:3])\n",
        "print(adata.raw.X[:10,:10])"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Session info\n",
        "\n",
        "```{=html}\n",
        "<details>\n",
        "```\n",
        "```{=html}\n",
        "<summary>\n",
        "```\n",
        "Click here\n",
        "```{=html}\n",
        "</summary>\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "#| label: session\n",
        "sc.logging.print_versions()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "```{=html}\n",
        "</details>\n",
        "```"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}