{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scanpy: Quality control" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Get data\n", "\n", "In this tutorial, we will run all tutorials with a set of 6 PBMC 10x datasets from 3 covid-19 patients and 3 healthy controls, the samples have been subsampled to 1500 cells per sample. They are part of the github repo and if you have cloned the repo they should be available in folder: `labs/data/covid_data_GSE149689`. Instructions on how to download them can also be found in the Precourse material.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2023-01-26T13:35:03.598067Z", "iopub.status.busy": "2023-01-26T13:35:03.597882Z", "iopub.status.idle": "2023-01-26T13:35:03.659248Z", "shell.execute_reply": "2023-01-26T13:35:03.658749Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6\n", "total 44376\n", "drwxr-xr-x@ 8 asabjor staff 256 Nov 26 2020 \u001b[34m.\u001b[m\u001b[m\n", "drwxr-xr-x@ 18 asabjor staff 576 Jan 29 2021 \u001b[34m..\u001b[m\u001b[m\n", "-rw-r--r--@ 1 asabjor staff 4391693 Nov 26 2020 Normal_PBMC_13.h5\n", "-rw-r--r--@ 1 asabjor staff 3806925 Nov 26 2020 Normal_PBMC_14.h5\n", "-rw-r--r--@ 1 asabjor staff 3806384 Nov 26 2020 Normal_PBMC_5.h5\n", "-rw-r--r--@ 1 asabjor staff 3426598 Nov 26 2020 nCoV_PBMC_1.h5\n", "-rw-r--r--@ 1 asabjor staff 3169573 Nov 26 2020 nCoV_PBMC_15.h5\n", "-rw-r--r--@ 1 asabjor staff 4105636 Nov 26 2020 nCoV_PBMC_17.h5\n" ] } ], "source": [ "%%bash\n", "# create a data directory.\n", "mkdir -p data/raw\n", "\n", "# first check if the files are there\n", "count=$(ls -l data/raw/*.h5 | grep -v ^d | wc -l )\n", "echo $count\n", "\n", "# if not 4 files, fetch the files from github.\n", "if ((\"$count\" < 6)); then\n", " cd data/raw\n", " curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/Normal_PBMC_13.h5\n", " curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/Normal_PBMC_14.h5\n", " curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/Normal_PBMC_5.h5\n", " curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/nCoV_PBMC_15.h5\n", " curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/nCoV_PBMC_17.h5\n", " curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/nCoV_PBMC_1.h5\n", " cd ../..\n", "fi\n", "\n", "ls -lGa data/raw" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With data in place, now we can start loading libraries we will use in this tutorial.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-01-26T13:35:03.681013Z", "iopub.status.busy": "2023-01-26T13:35:03.680815Z", "iopub.status.idle": "2023-01-26T13:35:05.484090Z", "shell.execute_reply": "2023-01-26T13:35:05.483630Z" }, "scrolled": false }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import scanpy as sc\n", "\n", "\n", "sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)\n", "# gives error!! sc.logging.print_versions()\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-01-26T13:35:07.382602Z", "iopub.status.busy": "2023-01-26T13:35:07.382275Z", "iopub.status.idle": "2023-01-26T13:35:07.389589Z", "shell.execute_reply": "2023-01-26T13:35:07.389124Z" } }, "outputs": [], "source": [ "sc.settings.set_figure_params(dpi=80)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can first load the data individually by reading directly from HDF5 file format (.h5).\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-01-26T13:35:07.392048Z", "iopub.status.busy": "2023-01-26T13:35:07.391888Z", "iopub.status.idle": "2023-01-26T13:35:08.377362Z", "shell.execute_reply": "2023-01-26T13:35:08.376920Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reading ./data/raw/nCoV_PBMC_1.h5\n", " (0:00:00)\n", "reading ./data/raw/nCoV_PBMC_15.h5\n", " (0:00:00)\n", "reading ./data/raw/nCoV_PBMC_17.h5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/readwrite.py:281: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " adata = AnnData(\n", "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.\n", " utils.warn_names_duplicates(\"var\")\n", "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/readwrite.py:281: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " adata = AnnData(\n", "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.\n", " utils.warn_names_duplicates(\"var\")\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " (0:00:00)\n", "reading ./data/raw/Normal_PBMC_5.h5\n", " (0:00:00)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/readwrite.py:281: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " adata = AnnData(\n", "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.\n", " utils.warn_names_duplicates(\"var\")\n", "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/readwrite.py:281: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " adata = AnnData(\n", "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.\n", " utils.warn_names_duplicates(\"var\")\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "reading ./data/raw/Normal_PBMC_13.h5\n", " (0:00:00)\n", "reading ./data/raw/Normal_PBMC_14.h5\n", " (0:00:00)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/readwrite.py:281: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " adata = AnnData(\n", "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.\n", " utils.warn_names_duplicates(\"var\")\n", "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/readwrite.py:281: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " adata = AnnData(\n", "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.\n", " utils.warn_names_duplicates(\"var\")\n" ] } ], "source": [ "data_cov1 = sc.read_10x_h5('./data/raw/nCoV_PBMC_1.h5')\n", "data_cov1.var_names_make_unique()\n", "data_cov15 = sc.read_10x_h5('./data/raw/nCoV_PBMC_15.h5')\n", "data_cov15.var_names_make_unique()\n", "data_cov17 = sc.read_10x_h5('./data/raw/nCoV_PBMC_17.h5')\n", "data_cov17.var_names_make_unique()\n", "data_ctrl5 = sc.read_10x_h5('./data/raw/Normal_PBMC_5.h5')\n", "data_ctrl5.var_names_make_unique()\n", "data_ctrl13 = sc.read_10x_h5('./data/raw/Normal_PBMC_13.h5')\n", "data_ctrl13.var_names_make_unique()\n", "data_ctrl14 = sc.read_10x_h5('./data/raw/Normal_PBMC_14.h5')\n", "data_ctrl14.var_names_make_unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Create one merged object\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-01-26T13:35:08.379820Z", "iopub.status.busy": "2023-01-26T13:35:08.379665Z", "iopub.status.idle": "2023-01-26T13:35:08.746480Z", "shell.execute_reply": "2023-01-26T13:35:08.746028Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " [AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],\n", "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " [AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],\n", "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " [AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],\n", "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " [AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],\n", "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " [AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],\n", "/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " [AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],\n" ] } ], "source": [ "# add some metadata\n", "data_cov1.obs['type']=\"Covid\"\n", "data_cov1.obs['sample']=\"covid_1\"\n", "data_cov15.obs['type']=\"Covid\"\n", "data_cov15.obs['sample']=\"covid_15\"\n", "data_cov17.obs['type']=\"Covid\"\n", "data_cov17.obs['sample']=\"covid_17\"\n", "data_ctrl5.obs['type']=\"Ctrl\"\n", "data_ctrl5.obs['sample']=\"ctrl_5\"\n", "data_ctrl13.obs['type']=\"Ctrl\"\n", "data_ctrl13.obs['sample']=\"ctrl_13\"\n", "data_ctrl14.obs['type']=\"Ctrl\"\n", "data_ctrl14.obs['sample']=\"ctrl_14\"\n", "\n", "\n", "# merge into one object.\n", "adata = data_cov1.concatenate(data_cov15, data_cov17, data_ctrl5, data_ctrl13, data_ctrl14)\n", "\n", "# and delete individual datasets to save space\n", "del(data_cov1, data_cov15, data_cov17)\n", "del(data_ctrl5, data_ctrl13, data_ctrl14)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " You can print a summary of the datasets in the Scanpy object, or a summary of the whole object.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2023-01-26T13:35:08.749203Z", "iopub.status.busy": "2023-01-26T13:35:08.749028Z", "iopub.status.idle": "2023-01-26T13:35:08.753772Z", "shell.execute_reply": "2023-01-26T13:35:08.753384Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "covid_1 1500\n", "covid_15 1500\n", "covid_17 1500\n", "ctrl_5 1500\n", "ctrl_13 1500\n", "ctrl_14 1500\n", "Name: sample, dtype: int64\n" ] }, { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 9000 × 33538\n", " obs: 'type', 'sample', 'batch'\n", " var: 'gene_ids', 'feature_types', 'genome'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(adata.obs['sample'].value_counts())\n", "\n", "adata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Calculate QC\n", "\n", "Having the data in a suitable format, we can start calculating some quality metrics. We can for example calculate the percentage of mitocondrial and ribosomal genes per cell and add to the metadata. This will be helpfull to visualize them across different metadata parameteres (i.e. datasetID and chemistry version). There are several ways of doing this, and here manually calculate the proportion of mitochondrial reads and add to the metadata table.\n", "\n", "Citing from \"Simple Single Cell\" workflows (Lun, McCarthy & Marioni, 2017): \"High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.\"\n", "\n", " First, let Scanpy calculate some general qc-stats for genes and cells with the function `sc.pp.calculate_qc_metrics`, similar to `calculateQCmetrics` in Scater. It can also calculate proportion of counts for specific gene populations, so first we need to define which genes are mitochondrial, ribosomal and hemoglogin.\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2023-01-26T13:35:08.756686Z", "iopub.status.busy": "2023-01-26T13:35:08.756536Z", "iopub.status.idle": "2023-01-26T13:35:08.789166Z", "shell.execute_reply": "2023-01-26T13:35:08.788738Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " | gene_ids | \n", "feature_types | \n", "genome | \n", "mt | \n", "ribo | \n", "hb | \n", "
---|---|---|---|---|---|---|
MIR1302-2HG | \n", "ENSG00000243485 | \n", "Gene Expression | \n", "GRCh38 | \n", "False | \n", "False | \n", "False | \n", "
FAM138A | \n", "ENSG00000237613 | \n", "Gene Expression | \n", "GRCh38 | \n", "False | \n", "False | \n", "False | \n", "
OR4F5 | \n", "ENSG00000186092 | \n", "Gene Expression | \n", "GRCh38 | \n", "False | \n", "False | \n", "False | \n", "
AL627309.1 | \n", "ENSG00000238009 | \n", "Gene Expression | \n", "GRCh38 | \n", "False | \n", "False | \n", "False | \n", "
AL627309.3 | \n", "ENSG00000239945 | \n", "Gene Expression | \n", "GRCh38 | \n", "False | \n", "False | \n", "False | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
AC233755.2 | \n", "ENSG00000277856 | \n", "Gene Expression | \n", "GRCh38 | \n", "False | \n", "False | \n", "False | \n", "
AC233755.1 | \n", "ENSG00000275063 | \n", "Gene Expression | \n", "GRCh38 | \n", "False | \n", "False | \n", "False | \n", "
AC240274.1 | \n", "ENSG00000271254 | \n", "Gene Expression | \n", "GRCh38 | \n", "False | \n", "False | \n", "False | \n", "
AC213203.1 | \n", "ENSG00000277475 | \n", "Gene Expression | \n", "GRCh38 | \n", "False | \n", "False | \n", "False | \n", "
FAM231C | \n", "ENSG00000268674 | \n", "Gene Expression | \n", "GRCh38 | \n", "False | \n", "False | \n", "False | \n", "
33538 rows × 6 columns
\n", "