{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e588eade-1a0b-4475-b64f-e34872402893",
   "metadata": {},
   "source": [
    "# Label cell types using CellTypist Models\n",
    "\n",
    "To build our reference, we start with labels derived from published cell type references.\n",
    "\n",
    "One approach is CellTypist, a model-based method for cell type annotation.  \n",
    "\n",
    "CellTypist is described [on their website](https://www.celltypist.org/), and in this publication:  \n",
    "\n",
    "Domínguez Conde, C. et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 376, eabl5197 (2022)\n",
    "\n",
    "Here, we'll load our cells in batches and assign cell types using 3 available CellTypist models (descriptions are from celltypist.org):  \n",
    "\n",
    "- Immune_All_High:\n",
    "    - 32 types\n",
    "    - immune populations combined from 20 tissues of 18 studies  \n",
    "- Immune_All_Low:  \n",
    "    - 98 types\n",
    "    - immune sub-populations combined from 20 tissues of 18 studies  \n",
    "- Healthy_COVID19_PBMC:\n",
    "    - 51 types\n",
    "    - peripheral blood mononuclear cell types from healthy and COVID-19 individuals"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3a83eb8-37dd-49ad-9670-fff2365d5d13",
   "metadata": {},
   "source": [
    "## Load Packages\n",
    "\n",
    "`anndata`: Data structures for scRNA-seq  \n",
    "`celltypist`: Model-based cell type annotation  \n",
    "`concurrent.futures`: Parallelization methods  \n",
    "`datetime`: Date and time functions  \n",
    "`h5py`: HDF5 file I/O  \n",
    "`hisepy`: The HISE SDK for Python  \n",
    "`numpy`: Mathematical data structures and computation  \n",
    "`os`: Operating system calls  \n",
    "`pandas`: DataFrame data structures  \n",
    "`re`: Regular expressions  \n",
    "`scanpy`: scRNA-seq analysis  \n",
    "`scipy.sparse`: Sparse matrix data structures  \n",
    "`shutil`: Shell utilities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "5bf3c687-3777-495f-8a9a-39e1e65fd449",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import anndata\n",
    "import celltypist\n",
    "from celltypist import models\n",
    "import concurrent.futures\n",
    "from datetime import date\n",
    "import h5py\n",
    "import hisepy\n",
    "import numpy as np\n",
    "import os\n",
    "import pandas as pd \n",
    "import re\n",
    "import scanpy as sc\n",
    "import scipy.sparse as scs\n",
    "import shutil"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1dd8976-9454-49e0-932f-e8a13dd8c768",
   "metadata": {},
   "source": [
    "## Obtain CellTypist Models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "2de69a63-45c8-40f4-aef7-738830d82e8f",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "📜 Retrieving model list from server https://celltypist.cog.sanger.ac.uk/models/models.json\n",
      "📚 Total models in list: 44\n",
      "📂 Storing models in /root/.celltypist/data/models\n",
      "💾 Total models to download: 3\n",
      "💾 Downloading model [1/3]: Immune_All_Low.pkl\n",
      "💾 Downloading model [2/3]: Immune_All_High.pkl\n",
      "💾 Downloading model [3/3]: Healthy_COVID19_PBMC.pkl\n"
     ]
    }
   ],
   "source": [
    "models.download_models(\n",
    "    force_update = True,\n",
    "    model = ['Immune_All_High.pkl',\n",
    "             'Immune_All_Low.pkl',\n",
    "             'Healthy_COVID19_PBMC.pkl']\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5a4a141-780e-4eab-b9c1-d7b5865dfabd",
   "metadata": {},
   "source": [
    "## Read sample metadata from HISE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6f55a5a2-a5e0-42ce-a34c-9add7cbef948",
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_meta_file_uuid = '2da66a1a-17cc-498b-9129-6858cf639caf'\n",
    "file_query = hisepy.reader.read_files(\n",
    "    [sample_meta_file_uuid]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "21ce05a3-f4dc-4475-923a-f7049b833326",
   "metadata": {},
   "outputs": [],
   "source": [
    "meta_data = file_query['values']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f1bbcac-82f0-4c06-b295-5655c85c0c96",
   "metadata": {},
   "source": [
    "## Helper functions\n",
    "\n",
    "These functions retrieve data for a batch of samples, assemble a joint AnnData object, normalize and log-transform the counts, then generate predictions with each of the 3 models downloaded above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8a81ae7c-7930-4b56-a3b2-0f6cbfe711c3",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# define a function to read count data\n",
    "def read_mat(h5_con):\n",
    "    mat = scs.csc_matrix(\n",
    "        (h5_con['matrix']['data'][:], # Count values\n",
    "         h5_con['matrix']['indices'][:], # Row indices\n",
    "         h5_con['matrix']['indptr'][:]), # Pointers for column positions\n",
    "        shape = tuple(h5_con['matrix']['shape'][:]) # Matrix dimensions\n",
    "    )\n",
    "    return mat\n",
    "\n",
    "# define a function to read observation metadata (i.e. cell metadata)\n",
    "def read_obs(h5con):\n",
    "    bc = h5con['matrix']['barcodes'][:]\n",
    "    bc = [x.decode('UTF-8') for x in bc]\n",
    "\n",
    "    # Initialize the DataFrame with cell barcodes\n",
    "    obs_df = pd.DataFrame({ 'barcodes' : bc })\n",
    "\n",
    "    # Get the list of available metadata columns\n",
    "    obs_columns = h5con['matrix']['observations'].keys()\n",
    "\n",
    "    # For each column\n",
    "    for col in obs_columns:\n",
    "        # Read the values\n",
    "        values = h5con['matrix']['observations'][col][:]\n",
    "        # Check for byte storage\n",
    "        if(isinstance(values[0], (bytes, bytearray))):\n",
    "            # Decode byte strings\n",
    "            values = [x.decode('UTF-8') for x in values]\n",
    "        # Add column to the DataFrame\n",
    "        obs_df[col] = values\n",
    "\n",
    "    obs_df = obs_df.set_index('barcodes', drop = False)\n",
    "    \n",
    "    return obs_df\n",
    "\n",
    "# define a function to construct anndata object from a h5 file\n",
    "def read_h5_anndata(h5_con):\n",
    "    #h5_con = h5py.File(h5_file, mode = 'r')\n",
    "    # extract the expression matrix\n",
    "    mat = read_mat(h5_con)\n",
    "    # extract gene names\n",
    "    genes = h5_con['matrix']['features']['name'][:]\n",
    "    genes = [x.decode('UTF-8') for x in genes]\n",
    "    # extract metadata\n",
    "    obs_df = read_obs(h5_con)\n",
    "    # construct anndata\n",
    "    adata = anndata.AnnData(mat.T,\n",
    "                             obs = obs_df)\n",
    "    # assign the gene names as variable names\n",
    "    adata.var_names = genes\n",
    "\n",
    "    adata.var_names_make_unique()\n",
    "    return adata"
   ]
  },
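  {
   "cell_type": "markdown",
   "id": "sketch-csc-layout",
   "metadata": {},
   "source": [
    "To make the layout that `read_mat()` reads more concrete, here is a minimal pure-Python sketch of how the `data`, `indices`, and `indptr` arrays of a compressed sparse column (CSC) matrix map to dense entries. The function `csc_to_dense()` and the toy values are hypothetical and are not part of this pipeline; `scipy.sparse.csc_matrix` handles this bookkeeping for us.\n",
    "\n",
    "```python\n",
    "def csc_to_dense(data, indices, indptr, shape):\n",
    "    # Expand CSC arrays into a dense list of rows (illustration only)\n",
    "    n_rows, n_cols = shape\n",
    "    dense = [[0] * n_cols for _ in range(n_rows)]\n",
    "    for col in range(n_cols):\n",
    "        # Entries for this column are data[indptr[col]:indptr[col + 1]]\n",
    "        for k in range(indptr[col], indptr[col + 1]):\n",
    "            # indices[k] is the row of the k-th stored value\n",
    "            dense[indices[k]][col] = data[k]\n",
    "    return dense\n",
    "\n",
    "# Toy example: 3 genes (rows) x 2 cells (columns)\n",
    "toy = csc_to_dense(\n",
    "    data = [5, 2, 7],\n",
    "    indices = [0, 2, 1],\n",
    "    indptr = [0, 2, 3],\n",
    "    shape = (3, 2)\n",
    ")\n",
    "```\n",
    "\n",
    "Because the stored matrix has genes as rows and cells as columns, `read_h5_anndata()` transposes it so that cells become the AnnData observations."
   ]
  },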
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "1bbb3440-41ca-4942-850d-1ad053da164d",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_adata(uuid):\n",
    "    # Load the file using HISE\n",
    "    res = hisepy.reader.read_files([uuid])\n",
    "\n",
    "    # If there's an error, read_files returns a list instead of a dictionary.\n",
    "    # We should raise an exception with the message when this happens.\n",
    "    if(isinstance(res, list)):\n",
    "        error_message = res[0]['message']\n",
    "        raise Exception(error_message)\n",
    "    \n",
    "    # Read the file to adata\n",
    "    h5_con = res['values'][0]\n",
    "    adata = read_h5_anndata(h5_con)\n",
    "    \n",
    "    # Close the file now that we're done with it\n",
    "    h5_con.close()\n",
    "\n",
    "    return(adata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "8bcd92d2-af9e-4e6f-8f51-450796d40476",
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_prediction(adata, model, model_name, out_dir = \"output\"):\n",
    "    # Perform prediction\n",
    "    predictions = celltypist.annotate(\n",
    "        adata, \n",
    "        model = model, \n",
    "        majority_voting = True)\n",
    "\n",
    "    # Make output directory\n",
    "    model_dir = \"{d}/{m}\".format(d = out_dir, m = model_name)\n",
    "    if not os.path.isdir(model_dir):\n",
    "        os.makedirs(model_dir)\n",
    "\n",
    "    # Write output per sample\n",
    "    samples = adata.obs['pbmc_sample_id'].unique()\n",
    "    for sample_id in samples:\n",
    "        barcodes = adata.obs[adata.obs['pbmc_sample_id'] == sample_id].index.tolist()\n",
    "        sample_results = predictions.predicted_labels.loc[barcodes,:]\n",
    "        out_file = \"{d}/{s}_{m}.csv\".format(d = model_dir, s = sample_id, m = model_name)\n",
    "        sample_results.to_csv(out_file)\n",
    "\n",
    "def process_data(meta_data_sub):\n",
    "    out_dir = \"output\"\n",
    "    \n",
    "    # Load cells from HISE .h5 files\n",
    "    results = []\n",
    "    for file_uuid in meta_data_sub:\n",
    "        result = get_adata(file_uuid)\n",
    "        results.append(result)\n",
    "    adata = anndata.concat(results)\n",
    "    del results\n",
    "    \n",
    "    # Normalize data\n",
    "    sc.pp.normalize_total(adata, target_sum=1e4)\n",
    "    sc.pp.log1p(adata)\n",
    "    adata.obs.index = adata.obs['barcodes']\n",
    "    \n",
    "    # Predict cell types\n",
    "    run_prediction(adata, \"Immune_All_Low.pkl\", \"Low\", out_dir)\n",
    "    run_prediction(adata, \"Immune_All_High.pkl\", \"High\", out_dir)\n",
    "    run_prediction(adata, \"Healthy_COVID19_PBMC.pkl\", \"Covid_Healthy\", out_dir)\n",
    "    \n",
    "    del adata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d79d0ec-ec3a-4644-a149-a3058255904f",
   "metadata": {},
   "source": [
    "## Apply across batches\n",
    "\n",
    "Here, we'll generate the batches, then use `concurrent.futures` to apply the function above to our batches in parallel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "4e28d2e7-a974-4721-8d86-ff79605aa155",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "out_dir = 'output'\n",
    "if not os.path.isdir(out_dir):\n",
    "    os.makedirs(out_dir)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "de4c140e-5ff1-45e4-9393-dd636d9a7eed",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "meta_data_subsets = []\n",
    "for i in range(0, len(meta_data), 10):\n",
    "    subset_uuids = meta_data[\"file.id\"][i:i + 10]\n",
    "    meta_data_subsets.append(subset_uuids)"
   ]
  },
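  {
   "cell_type": "markdown",
   "id": "sketch-batching",
   "metadata": {},
   "source": [
    "The batching step above slices the file IDs into consecutive groups of 10. As a small, self-contained illustration of the same slicing logic on a plain list, consider the sketch below. The helper `make_batches()` and the toy IDs are hypothetical and are not used elsewhere in this notebook.\n",
    "\n",
    "```python\n",
    "def make_batches(items, batch_size = 10):\n",
    "    # Slice into consecutive chunks; the last chunk may be shorter\n",
    "    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]\n",
    "\n",
    "# 25 toy IDs should yield batches of 10, 10, and 5\n",
    "toy_ids = ['file-{i}'.format(i = i) for i in range(25)]\n",
    "toy_batches = make_batches(toy_ids)\n",
    "batch_sizes = [len(b) for b in toy_batches]\n",
    "```\n",
    "\n",
    "Slicing past the end of a sequence is safe in Python, so no special handling is needed for the final, shorter batch."
   ]
  },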
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "06510b9e-62f9-4fd9-812c-5f94603b3b29",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "🔬 Input data has 156449 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🔬 Input data has 179276 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🔬 Input data has 193957 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🔬 Input data has 207788 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🔬 Input data has 199483 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🔬 Input data has 173666 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🔬 Input data has 187700 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🔬 Input data has 200104 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🔬 Input data has 198214 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🔬 Input data has 187235 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🔬 Input data has 209206 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Can not detect a neighborhood graph, will construct one before the over-clustering\n",
      "🖋️ Predicting labels\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Can not detect a neighborhood graph, will construct one before the over-clustering\n",
      "🖋️ Predicting labels\n",
      "🖋️ Predicting labels\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Can not detect a neighborhood graph, will construct one before the over-clustering\n",
      "✅ Prediction done!\n",
      "👀 Can not detect a neighborhood graph, will construct one before the over-clustering\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Can not detect a neighborhood graph, will construct one before the over-clustering\n",
      "🖋️ Predicting labels\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Can not detect a neighborhood graph, will construct one before the over-clustering\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Can not detect a neighborhood graph, will construct one before the over-clustering\n",
      "✅ Prediction done!\n",
      "👀 Can not detect a neighborhood graph, will construct one before the over-clustering\n",
      "✅ Prediction done!\n",
      "👀 Can not detect a neighborhood graph, will construct one before the over-clustering\n",
      "✅ Prediction done!\n",
      "👀 Can not detect a neighborhood graph, will construct one before the over-clustering\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Can not detect a neighborhood graph, will construct one before the over-clustering\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "⛓️ Over-clustering input data with resolution set to 30\n",
      "⛓️ Over-clustering input data with resolution set to 30\n",
      "⛓️ Over-clustering input data with resolution set to 30\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 156449 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 187235 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 187700 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🖋️ Predicting labels\n",
      "🔬 Input data has 198214 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 200104 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 209206 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 179276 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 30\n",
      "🖋️ Predicting labels\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 30\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 173666 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 187235 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 3443 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 156449 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 3443 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 207788 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🔬 Input data has 199483 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 30\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 193957 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 5967 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 187700 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 3443 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 198214 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 3443 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 200104 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 3443 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 209206 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 3443 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 30\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 30\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 179276 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 3443 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 173666 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 3443 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 207788 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 3443 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 199483 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 3443 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 30\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🔬 Input data has 193957 cells and 33538 genes\n",
      "🔗 Matching reference genes in the model\n",
      "🧬 3443 features used for prediction\n",
      "⚖️ Scaling input data\n",
      "🖋️ Predicting labels\n",
      "✅ Prediction done!\n",
      "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n",
      "⛓️ Over-clustering input data with resolution set to 25\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n",
      "🗳️ Majority voting the predictions\n",
      "✅ Majority voting done!\n"
     ]
    }
   ],
   "source": [
    "# Process each subset in parallel\n",
    "pool_executor = concurrent.futures.ProcessPoolExecutor(max_workers = 11)\n",
    "with pool_executor as executor:\n",
    "    \n",
    "    futures = []\n",
    "    for meta_data_sub in meta_data_subsets:\n",
    "        futures.append(executor.submit(process_data, meta_data_sub))\n",
    "\n",
    "    # Check for errors when parallel processes return results\n",
    "    for future in concurrent.futures.as_completed(futures):\n",
    "        try:\n",
    "            future.result()\n",
    "        except Exception as e:\n",
    "            print(f'Error: {e}')"
   ]
  },
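  {
   "cell_type": "markdown",
   "id": "sketch-futures-errors",
   "metadata": {},
   "source": [
    "The loop above prints any exception raised by a worker, but it does not record which batch failed. One possible variation, sketched below under stated assumptions, keeps a dictionary from each future back to its input. Here `toy_task()` is a hypothetical stand-in for `process_data()`, and `ThreadPoolExecutor` is swapped in for `ProcessPoolExecutor` only so that the sketch runs anywhere; both executors share the same `submit()` interface.\n",
    "\n",
    "```python\n",
    "import concurrent.futures\n",
    "\n",
    "def toy_task(x):\n",
    "    # Hypothetical stand-in for process_data(); fails on one input\n",
    "    if x == 3:\n",
    "        raise ValueError('bad input 3')\n",
    "    return x * 2\n",
    "\n",
    "results = {}\n",
    "errors = {}\n",
    "with concurrent.futures.ThreadPoolExecutor(max_workers = 2) as executor:\n",
    "    # Map each future back to its input so failures can be traced\n",
    "    future_to_input = {executor.submit(toy_task, x): x for x in range(5)}\n",
    "    for future in concurrent.futures.as_completed(future_to_input):\n",
    "        x = future_to_input[future]\n",
    "        try:\n",
    "            results[x] = future.result()\n",
    "        except Exception as e:\n",
    "            errors[x] = str(e)\n",
    "```\n",
    "\n",
    "After the pool exits, `errors` lists the inputs that would need to be re-run."
   ]
  },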
  {
   "cell_type": "markdown",
   "id": "74db89fd-2f92-4148-9a6d-f115c1efe52e",
   "metadata": {},
   "source": [
    "## Assemble results\n",
    "\n",
    "For each model, we'll assemble the results as a .csv file that we can utilize later for subclustering and analysis of major cell classes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "fa2c2832-1f92-4830-95c1-b87d44c6e8d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "models = ['High', 'Low', 'Covid_Healthy']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "942b3d87-2fbb-461d-8f04-89c13118d01c",
   "metadata": {},
   "outputs": [],
   "source": [
    "out_files = []\n",
    "for model in models:\n",
    "    model_path = 'output/{m}'.format(m = model)\n",
    "    model_files = os.listdir(model_path)\n",
    "    model_list = []\n",
    "    for model_file in model_files:\n",
    "        df = pd.read_csv('output/{m}/{f}'.format(m = model, f = model_file))\n",
    "        model_list.append(df)\n",
    "    model_df = pd.concat(model_list)\n",
    "\n",
    "    out_file = 'output/ref_celltypist_labels_{m}_{d}.csv'.format(m = model, d = date.today())\n",
    "    out_files.append(out_file)\n",
    "    \n",
    "    model_df.to_csv(out_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bd746b1-8066-44a5-8613-f401e6fce07e",
   "metadata": {},
   "source": [
    "## Upload assembled data to HISE\n",
    "\n",
    "Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "6ded63cd-7268-4c1c-8ce8-bc466dbd937c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on function upload_files in module hisepy.upload:\n",
      "\n",
      "upload_files(files: list, study_space_id: str = None, project: str = None, title: str = None, input_file_ids=None, input_sample_ids=None, file_types=None, store=None, destination=None, do_prompt: bool = True)\n",
      "    Uploads files to a specified study.\n",
      "    \n",
      "    Parameters:\n",
      "        files (list): absolute filepath of file to be uploaded\n",
      "        study_space_id (str): ID that pertains to a study in the collaboration space (optional)\n",
      "        project (str): project short name (required if study space is not specified)\n",
      "        title (str): 10+ character title for upload result \n",
      "        input_file_ids (list): fileIds from HISE that were utilized to generate a user's result\n",
      "        input_sample_ids (list): sampleIds from HISE that were utilized to generate a user's result\n",
      "        file_types (str): filetype of uploaded files \n",
      "        store (str): Which store ('project' or 'permanent') to use for the files (default in 'project')\n",
      "        destination (str): Destination folder for the files \n",
      "        do_prompt (bool): whether or not to prompt for user's input, asking to proceed.\n",
      "    Returns: \n",
      "        dictionary with keys [\"trace_id\", \"files\"]\n",
      "    Example: \n",
      "        hp.upload_files(files=['/home/jupyter/upload_file.csv'],\n",
      "                        study_space_id='f2f03ecb-5a1d-4995-8db9-56bd18a36aba',\n",
      "                        title='a upload title',\n",
      "                        input_file_ids=['9f6d7ab5-1c7b-4709-9455-3d8ffffbb6c8'])\n",
      "\n"
     ]
    }
   ],
   "source": [
    "help(hisepy.upload.upload_files)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "66e3e599-872c-4677-8bee-08d04003958e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Study space that will receive the uploaded results\n",
    "study_space_uuid = '64097865-486d-43b3-8f94-74994e0a72e0'\n",
    "# Upload title; the upload_files() help asks for 10+ characters\n",
    "title = 'Ref. CellTypist Predictions {d}'.format(d = date.today())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "4bffe38f-4223-4919-b46a-a4b8536dde06",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Input file IDs for provenance: the sample metadata file plus every file ID listed in meta_data\n",
    "in_files = [sample_meta_file_uuid] + meta_data['file.id'].to_list()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "ce9877bf-cc8a-4b92-8b9e-ba6f4964f79f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['2da66a1a-17cc-498b-9129-6858cf639caf',\n",
       " 'fec489f9-9a74-4635-aa91-d2bf09d1faec',\n",
       " '7c0c7979-eebd-4aba-b5b2-6e76b4643623',\n",
       " '40efd03a-cb2f-4677-af42-a056cbfe5a17',\n",
       " '68fbcd34-1d63-461d-8195-df5b8dc61b31']"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "in_files[0:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "544fb074-02d5-4ca4-968a-780710e5f48b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['output/ref_celltypist_labels_High_2024-02-18.csv',\n",
       " 'output/ref_celltypist_labels_Low_2024-02-18.csv',\n",
       " 'output/ref_celltypist_labels_Covid_Healthy_2024-02-18.csv']"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "out_files"
   ]
  },
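  {
   "cell_type": "markdown",
   "id": "pre-upload-check-sketch",
   "metadata": {},
   "source": [
    "Before uploading, it can help to confirm that every output file exists and is non-empty, and that the input file IDs are well-formed and not repeated. The block below is a minimal sketch of such a check using only the standard library. `check_upload_inputs()` is a hypothetical helper written for illustration; it is not part of `hisepy`, and the checks it performs are assumptions about what is worth verifying, not requirements of HISE.\n",
    "\n",
    "```python\n",
    "import os\n",
    "import uuid\n",
    "\n",
    "\n",
    "def check_upload_inputs(files, input_file_ids):\n",
    "    # Hypothetical helper (not part of hisepy): returns a list of problem\n",
    "    # descriptions; an empty list means nothing suspicious was found.\n",
    "    problems = []\n",
    "\n",
    "    # Each output file should exist on disk and contain at least one byte.\n",
    "    for path in files:\n",
    "        if not os.path.isfile(path):\n",
    "            problems.append('missing file: ' + path)\n",
    "        elif os.path.getsize(path) == 0:\n",
    "            problems.append('empty file: ' + path)\n",
    "\n",
    "    # Each input ID is assumed to be a string that parses as a UUID,\n",
    "    # and should appear only once.\n",
    "    seen = set()\n",
    "    for file_id in input_file_ids:\n",
    "        try:\n",
    "            uuid.UUID(file_id)\n",
    "        except ValueError:\n",
    "            problems.append('not a UUID: ' + file_id)\n",
    "        if file_id in seen:\n",
    "            problems.append('duplicate input id: ' + file_id)\n",
    "        seen.add(file_id)\n",
    "\n",
    "    return problems\n",
    "```\n",
    "\n",
    "In this notebook, it could be called as `check_upload_inputs(out_files, in_files)`; an empty list means no problems were found."
   ]
  },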
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "80a524b6-7c4c-48c1-8b8c-d05494737bb1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "you are trying to upload file_ids... ['output/ref_celltypist_labels_High_2024-02-18.csv', 'output/ref_celltypist_labels_Low_2024-02-18.csv', 'output/ref_celltypist_labels_Covid_Healthy_2024-02-18.csv']. Do you truly want to proceed?\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      "(y/n) y\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'trace_id': '60c20ada-f8aa-4c7f-ae24-8973a487a491',\n",
       " 'files': ['output/ref_celltypist_labels_High_2024-02-18.csv',\n",
       "  'output/ref_celltypist_labels_Low_2024-02-18.csv',\n",
       "  'output/ref_celltypist_labels_Covid_Healthy_2024-02-18.csv']}"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hisepy.upload.upload_files(\n",
    "    files = out_files,\n",
    "    study_space_id = study_space_uuid,\n",
    "    title = title,\n",
    "    input_file_ids = in_files\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "1df57b01-3914-48ae-9cc2-f03faec7473c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<details>\n",
       "<summary>Click to view session information</summary>\n",
       "<pre>\n",
       "-----\n",
       "anndata             0.10.3\n",
       "celltypist          1.6.1\n",
       "h5py                3.10.0\n",
       "hisepy              0.3.0\n",
       "numpy               1.24.0\n",
       "pandas              2.1.4\n",
       "scanpy              1.9.6\n",
       "scipy               1.11.4\n",
       "session_info        1.0.0\n",
       "-----\n",
       "</pre>\n",
       "<details>\n",
       "<summary>Click to view modules imported as dependencies</summary>\n",
       "<pre>\n",
       "PIL                         10.0.1\n",
       "anyio                       NA\n",
       "arrow                       1.3.0\n",
       "asttokens                   NA\n",
       "attr                        23.2.0\n",
       "attrs                       23.2.0\n",
       "babel                       2.14.0\n",
       "beatrix_jupyterlab          NA\n",
       "brotli                      NA\n",
       "cachetools                  5.3.1\n",
       "certifi                     2023.11.17\n",
       "cffi                        1.16.0\n",
       "charset_normalizer          3.3.2\n",
       "cloudpickle                 2.2.1\n",
       "colorama                    0.4.6\n",
       "comm                        0.1.4\n",
       "cryptography                41.0.7\n",
       "cycler                      0.10.0\n",
       "cython_runtime              NA\n",
       "dateutil                    2.8.2\n",
       "db_dtypes                   1.1.1\n",
       "debugpy                     1.8.0\n",
       "decorator                   5.1.1\n",
       "defusedxml                  0.7.1\n",
       "deprecated                  1.2.14\n",
       "exceptiongroup              1.2.0\n",
       "executing                   2.0.1\n",
       "fastjsonschema              NA\n",
       "fqdn                        NA\n",
       "google                      NA\n",
       "greenlet                    2.0.2\n",
       "grpc                        1.58.0\n",
       "grpc_status                 NA\n",
       "idna                        3.6\n",
       "igraph                      0.10.8\n",
       "importlib_metadata          NA\n",
       "ipykernel                   6.28.0\n",
       "ipython_genutils            0.2.0\n",
       "ipywidgets                  8.1.1\n",
       "isoduration                 NA\n",
       "jedi                        0.19.1\n",
       "jinja2                      3.1.2\n",
       "joblib                      1.3.2\n",
       "json5                       NA\n",
       "jsonpointer                 2.4\n",
       "jsonschema                  4.20.0\n",
       "jsonschema_specifications   NA\n",
       "jupyter_events              0.9.0\n",
       "jupyter_server              2.12.1\n",
       "jupyterlab_server           2.25.2\n",
       "jwt                         2.8.0\n",
       "kiwisolver                  1.4.5\n",
       "leidenalg                   0.10.1\n",
       "llvmlite                    0.41.0\n",
       "lz4                         4.3.2\n",
       "markupsafe                  2.1.3\n",
       "matplotlib                  3.8.0\n",
       "matplotlib_inline           0.1.6\n",
       "mpl_toolkits                NA\n",
       "mpmath                      1.3.0\n",
       "natsort                     8.4.0\n",
       "nbformat                    5.9.2\n",
       "numba                       0.58.0\n",
       "opentelemetry               NA\n",
       "overrides                   NA\n",
       "packaging                   23.2\n",
       "parso                       0.8.3\n",
       "pexpect                     4.8.0\n",
       "pickleshare                 0.7.5\n",
       "pkg_resources               NA\n",
       "platformdirs                4.1.0\n",
       "plotly                      5.18.0\n",
       "prettytable                 3.9.0\n",
       "prometheus_client           NA\n",
       "prompt_toolkit              3.0.42\n",
       "proto                       NA\n",
       "psutil                      NA\n",
       "ptyprocess                  0.7.0\n",
       "pure_eval                   0.2.2\n",
       "pyarrow                     13.0.0\n",
       "pydev_ipython               NA\n",
       "pydevconsole                NA\n",
       "pydevd                      2.9.5\n",
       "pydevd_file_utils           NA\n",
       "pydevd_plugins              NA\n",
       "pydevd_tracing              NA\n",
       "pygments                    2.17.2\n",
       "pynvml                      NA\n",
       "pyparsing                   3.1.1\n",
       "pyreadr                     0.5.0\n",
       "pythonjsonlogger            NA\n",
       "pytz                        2023.3.post1\n",
       "referencing                 NA\n",
       "requests                    2.31.0\n",
       "rfc3339_validator           0.1.4\n",
       "rfc3986_validator           0.1.1\n",
       "rpds                        NA\n",
       "send2trash                  NA\n",
       "shapely                     1.8.5.post1\n",
       "six                         1.16.0\n",
       "sklearn                     1.3.2\n",
       "sniffio                     1.3.0\n",
       "socks                       1.7.1\n",
       "sql                         NA\n",
       "sqlalchemy                  2.0.21\n",
       "sqlparse                    0.4.4\n",
       "stack_data                  0.6.2\n",
       "sympy                       1.12\n",
       "termcolor                   NA\n",
       "texttable                   1.7.0\n",
       "threadpoolctl               3.2.0\n",
       "torch                       2.1.2+cu121\n",
       "torchgen                    NA\n",
       "tornado                     6.3.3\n",
       "tqdm                        4.66.1\n",
       "traitlets                   5.9.0\n",
       "typing_extensions           NA\n",
       "uri_template                NA\n",
       "urllib3                     1.26.18\n",
       "wcwidth                     0.2.12\n",
       "webcolors                   1.13\n",
       "websocket                   1.7.0\n",
       "wrapt                       1.15.0\n",
       "xarray                      2023.12.0\n",
       "yaml                        6.0.1\n",
       "zipp                        NA\n",
       "zmq                         25.1.2\n",
       "zoneinfo                    NA\n",
       "zstandard                   0.22.0\n",
       "</pre>\n",
       "</details> <!-- seems like this ends pre, so might as well be explicit -->\n",
       "<pre>\n",
       "-----\n",
       "IPython             8.19.0\n",
       "jupyter_client      8.6.0\n",
       "jupyter_core        5.6.1\n",
       "jupyterlab          4.0.10\n",
       "notebook            6.5.4\n",
       "-----\n",
       "Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]\n",
       "Linux-5.15.0-1051-gcp-x86_64-with-glibc2.31\n",
       "-----\n",
       "Session information updated at 2024-02-18 03:29\n",
       "</pre>\n",
       "</details>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import session_info\n",
    "session_info.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}