{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d08fc5ef-cc6f-48be-93ca-b73ddccc8ec4",
   "metadata": {},
   "source": [
    "# Detect doublets using Scrublet\n",
    "\n",
    "Our cell hashing processes catch and remove many doublets that are generated by mixing of cells from different samples, but some percentage of doublets (~7-8%) will be generated by collisions of cells from the same sample, and will not be detected.\n",
    "\n",
    "To detect and remove these, we'll utilize the Scrublet package. Scrublet's process for doublet identification is described in this publication:\n",
    "\n",
    "Wolock, S. L., Lopez, R. & Klein, A. M. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst 8, 281–291.e9 (2019)\n",
    "\n",
    "We'll use scrublet's integration into the scanpy package's [scanpy.external tools](https://scanpy.readthedocs.io/en/stable/generated/scanpy.external.pp.scrublet.html#scanpy.external.pp.scrublet).\n",
    "\n",
    "Here, we apply scrublet to each of our .h5 files, and store the results in HISE for downstream analysis steps."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fae0abc-31d9-483e-8081-8e5c82f8f633",
   "metadata": {},
   "source": [
    "## Load Packages\n",
    "\n",
    "`anndata`: Data structures for scRNA-seq  \n",
    "`concurrent.futures`: parallelization methods  \n",
    "`datetime`: date and time functions  \n",
    "`h5py`: HDF5 file I/O  \n",
    "`hisepy`: The HISE SDK for Python  \n",
    "`numpy`: Mathematical data structures and computation  \n",
    "`os`: operating system calls  \n",
    "`pandas`: DataFrame data structures  \n",
    "`re`: Regular expressions  \n",
    "`scanpy`: scRNA-seq analysis  \n",
    "`scipy.sparse`: Spare matrix data structures  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4ebc2cd4-cc2c-41ac-ab73-e72142c094fd",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.simplefilter(action='ignore', category=FutureWarning)\n",
    "warnings.simplefilter(action='ignore', category=UserWarning)\n",
    "\n",
    "import anndata\n",
    "import concurrent.futures\n",
    "from concurrent.futures import ThreadPoolExecutor\n",
    "from datetime import date\n",
    "import h5py\n",
    "import hisepy\n",
    "import numpy as np\n",
    "import os\n",
    "import pandas as pd \n",
    "import re\n",
    "import scanpy as sc\n",
    "import scanpy.external as sce\n",
    "import scipy.sparse as scs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e2862dd-94e1-492d-9718-d7b0d98cbeba",
   "metadata": {},
   "source": [
    "## Read sample metadata from HISE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b5079642-7422-4031-9a81-3558bc076145",
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_meta_file_uuid = '2da66a1a-17cc-498b-9129-6858cf639caf'\n",
    "file_query = hisepy.reader.read_files(\n",
    "    [sample_meta_file_uuid]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d44946c1-2a26-4c9d-9567-ea8212ab2f33",
   "metadata": {},
   "outputs": [],
   "source": [
    "meta_data = file_query['values']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36a8c51a-b8ee-4c03-b686-67efb98b7753",
   "metadata": {},
   "source": [
    "## Helper functions\n",
    "\n",
    "These functions will retrieve data from HISE and read as AnnData for use with scrublet, and for reading and applying scrublet to each file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "a0c3aded-0603-4602-a212-790a99ef9c56",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# define a function to read count data\n",
    "def read_mat(h5_con):\n",
    "    mat = scs.csc_matrix(\n",
    "        (h5_con['matrix']['data'][:], # Count values\n",
    "         h5_con['matrix']['indices'][:], # Row indices\n",
    "         h5_con['matrix']['indptr'][:]), # Pointers for column positions\n",
    "        shape = tuple(h5_con['matrix']['shape'][:]) # Matrix dimensions\n",
    "    )\n",
    "    return mat\n",
    "\n",
    "# define a function to read obeservation metadata (i.e. cell metadata)\n",
    "def read_obs(h5con):\n",
    "    bc = h5con['matrix']['barcodes'][:]\n",
    "    bc = [x.decode('UTF-8') for x in bc]\n",
    "\n",
    "    # Initialized the DataFrame with cell barcodes\n",
    "    obs_df = pd.DataFrame({ 'barcodes' : bc })\n",
    "    obs_df = obs_df.set_index('barcodes', drop = False)\n",
    "    obs_df['barcodes'] = obs_df['barcodes'].astype(\"category\")\n",
    "\n",
    "    return obs_df\n",
    "\n",
    "# define a function to construct anndata object from a h5 file\n",
    "def read_h5_anndata(h5_con):\n",
    "    #h5_con = h5py.File(h5_file, mode = 'r')\n",
    "    # extract the expression matrix\n",
    "    mat = read_mat(h5_con)\n",
    "    # extract gene names\n",
    "    genes = h5_con['matrix']['features']['name'][:]\n",
    "    genes = [x.decode('UTF-8') for x in genes]\n",
    "    # extract metadata\n",
    "    obs_df = read_obs(h5_con)\n",
    "    # construct anndata\n",
    "    adata = anndata.AnnData(mat.T,\n",
    "                            obs = obs_df)\n",
    "    # make sure the gene names aligned\n",
    "    adata.var_names = genes\n",
    "\n",
    "    adata.var_names_make_unique()\n",
    "    return adata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "95f61765-d761-4247-86ff-6a3c608c3178",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_adata(uuid):\n",
    "    # Load the file using HISE\n",
    "    res = hisepy.reader.read_files([uuid])\n",
    "\n",
    "    # If there's an error, read_files returns a list instead of a dictionary.\n",
    "    # We should raise and exception with the message when this happens.\n",
    "    if(isinstance(res, list)):\n",
    "        error_message = res[0]['message']\n",
    "        raise Exception(error_message)\n",
    "    \n",
    "    # Read the file to adata\n",
    "    h5_con = res['values'][0]\n",
    "    adata = read_h5_anndata(h5_con)\n",
    "    \n",
    "    # Clean up the file now that we're done with it\n",
    "    h5_con.close()\n",
    "\n",
    "    return(adata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "616e22b5-e5a0-4dd1-8417-746d35f72eab",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def process_file(file_uuid):\n",
    "    adata = get_adata(file_uuid)\n",
    "    sc.external.pp.scrublet(\n",
    "        adata,\n",
    "        random_state = 3030,\n",
    "        verbose = False\n",
    "    )\n",
    "    result = adata.obs[['barcodes','predicted_doublet','doublet_score']]\n",
    "    return result"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a01a10e0-ec49-485a-9ebb-06e0f76fb8ff",
   "metadata": {},
   "source": [
    "## Apply to each sample in parallel\n",
    "\n",
    "Here, we'll use `concurrent.futures` to apply the function above to our samples in parallel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "3d334e1a-a13c-41d2-9107-ec3e2fa4e490",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "results = []\n",
    "file_uuids = meta_data['file.id'].tolist()\n",
    "with ThreadPoolExecutor(max_workers = 20) as executor:  \n",
    "    for result in executor.map(process_file, file_uuids):\n",
    "        results.append(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "25d06af9-9857-4a99-88ce-83c0a1d99654",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "final_result = pd.concat(results, ignore_index = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "83f7adfb-8481-45aa-a729-5c196ee0c728",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "predicted_doublet\n",
       "False    2065171\n",
       "True       27907\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prediction_counts = final_result['predicted_doublet'].value_counts()\n",
    "prediction_counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "0f934524-85d3-4662-a671-ff4ab7c9cbd2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.013332995712534363"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prediction_counts[1] / sum(prediction_counts)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1beebca5-fc7d-4be8-9d9a-938ec6aa6abd",
   "metadata": {},
   "source": [
    "## Save results to CSV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "ff5a2914-f319-40a6-85c7-48d0c2cfa779",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "out_dir = 'output'\n",
    "if not os.path.isdir(out_dir):\n",
    "    os.makedirs(out_dir)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "53bb1f48-d0bc-41d4-a906-0c6ebc18568d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "out_file = 'output/ref_scrublet_results_{d}.csv'.format(d = date.today())\n",
    "final_result.to_csv(out_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22e34309-e898-49d6-bce1-5afcae065748",
   "metadata": {},
   "source": [
    "## Upload assembled data to HISE\n",
    "\n",
    "Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "ee59a238-6093-47f4-8a88-1c389a4c788c",
   "metadata": {},
   "outputs": [],
   "source": [
    "study_space_uuid = '64097865-486d-43b3-8f94-74994e0a72e0'\n",
    "title = 'Reference Scrublet results'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "47a49d96-16b9-4122-98a8-1c103d1f55c6",
   "metadata": {},
   "outputs": [],
   "source": [
    "in_files = [sample_meta_file_uuid] + meta_data['file.id'].to_list()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "4f96ed0f-036e-4700-ad4d-b93fa6155cd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "out_files = [out_file]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "3f3e0841-c366-448c-b096-d0c26031448e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "you are trying to upload file_ids... ['output/ref_scrublet_results_2024-02-18.csv']. Do you truly want to proceed?\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      "(y/n) y\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'trace_id': '6b0ec5e3-86ec-41a5-9900-9c52a464681c',\n",
       " 'files': ['output/ref_scrublet_results_2024-02-18.csv']}"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hisepy.upload.upload_files(\n",
    "    files = out_files,\n",
    "    study_space_id = study_space_uuid,\n",
    "    title = title,\n",
    "    input_file_ids = in_files\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "cdb4af83-9148-4c73-a229-274fcf777fb1",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<details>\n",
       "<summary>Click to view session information</summary>\n",
       "<pre>\n",
       "-----\n",
       "anndata             0.10.3\n",
       "h5py                3.10.0\n",
       "hisepy              0.3.0\n",
       "numpy               1.24.0\n",
       "pandas              2.1.4\n",
       "scanpy              1.9.6\n",
       "scipy               1.11.4\n",
       "session_info        1.0.0\n",
       "-----\n",
       "</pre>\n",
       "<details>\n",
       "<summary>Click to view modules imported as dependencies</summary>\n",
       "<pre>\n",
       "PIL                         10.0.1\n",
       "annoy                       NA\n",
       "anyio                       NA\n",
       "arrow                       1.3.0\n",
       "asttokens                   NA\n",
       "attr                        23.2.0\n",
       "attrs                       23.2.0\n",
       "babel                       2.14.0\n",
       "beatrix_jupyterlab          NA\n",
       "brotli                      NA\n",
       "cachetools                  5.3.1\n",
       "certifi                     2023.11.17\n",
       "cffi                        1.16.0\n",
       "charset_normalizer          3.3.2\n",
       "cloudpickle                 2.2.1\n",
       "colorama                    0.4.6\n",
       "comm                        0.1.4\n",
       "cryptography                41.0.7\n",
       "cycler                      0.10.0\n",
       "cython_runtime              NA\n",
       "dateutil                    2.8.2\n",
       "db_dtypes                   1.1.1\n",
       "debugpy                     1.8.0\n",
       "decorator                   5.1.1\n",
       "defusedxml                  0.7.1\n",
       "deprecated                  1.2.14\n",
       "exceptiongroup              1.2.0\n",
       "executing                   2.0.1\n",
       "fastjsonschema              NA\n",
       "fqdn                        NA\n",
       "google                      NA\n",
       "greenlet                    2.0.2\n",
       "grpc                        1.58.0\n",
       "grpc_status                 NA\n",
       "idna                        3.6\n",
       "igraph                      0.10.8\n",
       "importlib_metadata          NA\n",
       "ipykernel                   6.28.0\n",
       "ipython_genutils            0.2.0\n",
       "ipywidgets                  8.1.1\n",
       "isoduration                 NA\n",
       "jedi                        0.19.1\n",
       "jinja2                      3.1.2\n",
       "joblib                      1.3.2\n",
       "json5                       NA\n",
       "jsonpointer                 2.4\n",
       "jsonschema                  4.20.0\n",
       "jsonschema_specifications   NA\n",
       "jupyter_events              0.9.0\n",
       "jupyter_server              2.12.1\n",
       "jupyterlab_server           2.25.2\n",
       "jwt                         2.8.0\n",
       "kiwisolver                  1.4.5\n",
       "lazy_loader                 NA\n",
       "leidenalg                   0.10.1\n",
       "llvmlite                    0.41.0\n",
       "lz4                         4.3.2\n",
       "markupsafe                  2.1.3\n",
       "matplotlib                  3.8.0\n",
       "matplotlib_inline           0.1.6\n",
       "mpl_toolkits                NA\n",
       "mpmath                      1.3.0\n",
       "natsort                     8.4.0\n",
       "nbformat                    5.9.2\n",
       "numba                       0.58.0\n",
       "opentelemetry               NA\n",
       "overrides                   NA\n",
       "packaging                   23.2\n",
       "parso                       0.8.3\n",
       "pexpect                     4.8.0\n",
       "pickleshare                 0.7.5\n",
       "pkg_resources               NA\n",
       "platformdirs                4.1.0\n",
       "plotly                      5.18.0\n",
       "prettytable                 3.9.0\n",
       "prometheus_client           NA\n",
       "prompt_toolkit              3.0.42\n",
       "proto                       NA\n",
       "psutil                      NA\n",
       "ptyprocess                  0.7.0\n",
       "pure_eval                   0.2.2\n",
       "pyarrow                     13.0.0\n",
       "pycparser                   2.21\n",
       "pydev_ipython               NA\n",
       "pydevconsole                NA\n",
       "pydevd                      2.9.5\n",
       "pydevd_file_utils           NA\n",
       "pydevd_plugins              NA\n",
       "pydevd_tracing              NA\n",
       "pygments                    2.17.2\n",
       "pynvml                      NA\n",
       "pyparsing                   3.1.1\n",
       "pyreadr                     0.5.0\n",
       "pythonjsonlogger            NA\n",
       "pytz                        2023.3.post1\n",
       "referencing                 NA\n",
       "requests                    2.31.0\n",
       "rfc3339_validator           0.1.4\n",
       "rfc3986_validator           0.1.1\n",
       "rpds                        NA\n",
       "scrublet                    NA\n",
       "send2trash                  NA\n",
       "shapely                     1.8.5.post1\n",
       "six                         1.16.0\n",
       "skimage                     0.21.0\n",
       "sklearn                     1.3.2\n",
       "sniffio                     1.3.0\n",
       "socks                       1.7.1\n",
       "sparse                      0.14.0\n",
       "sql                         NA\n",
       "sqlalchemy                  2.0.21\n",
       "sqlparse                    0.4.4\n",
       "stack_data                  0.6.2\n",
       "sympy                       1.12\n",
       "termcolor                   NA\n",
       "texttable                   1.7.0\n",
       "threadpoolctl               3.2.0\n",
       "torch                       2.1.2+cu121\n",
       "torchgen                    NA\n",
       "tornado                     6.3.3\n",
       "tqdm                        4.66.1\n",
       "traitlets                   5.9.0\n",
       "typing_extensions           NA\n",
       "uri_template                NA\n",
       "urllib3                     1.26.18\n",
       "wcwidth                     0.2.12\n",
       "webcolors                   1.13\n",
       "websocket                   1.7.0\n",
       "wrapt                       1.15.0\n",
       "xarray                      2023.12.0\n",
       "yaml                        6.0.1\n",
       "zipp                        NA\n",
       "zmq                         25.1.2\n",
       "zoneinfo                    NA\n",
       "zstandard                   0.22.0\n",
       "</pre>\n",
       "</details> <!-- seems like this ends pre, so might as well be explicit -->\n",
       "<pre>\n",
       "-----\n",
       "IPython             8.19.0\n",
       "jupyter_client      8.6.0\n",
       "jupyter_core        5.6.1\n",
       "jupyterlab          4.0.10\n",
       "notebook            6.5.4\n",
       "-----\n",
       "Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]\n",
       "Linux-5.15.0-1051-gcp-x86_64-with-glibc2.31\n",
       "-----\n",
       "Session information updated at 2024-02-18 05:05\n",
       "</pre>\n",
       "</details>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import session_info\n",
    "session_info.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cded88d7-f4a5-4a73-bbcc-0f853d2d51e4",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}