{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6476a09c-5b62-437e-9f28-baba9d37df9c",
   "metadata": {},
   "source": [
    "# Combine raw data across all samples"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4b276cf-c764-46dd-bbf2-367a9376e37f",
   "metadata": {},
   "source": [
    "In this notebook, we'll use the file manifest we previously assembled to retrieve all of the human PBMC data that we'll use to assemble our reference dataset.\n",
    "\n",
    "Here, we'll retrieve data from each each individual sample, stored in HDF5 format in HISE, and concatenate them into a single AnnData object."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "efe71b68-9e3e-4860-99e3-7da0238565c5",
   "metadata": {},
   "source": [
    "## Load Packages\n",
    "\n",
    "anndata: Data structures for scRNA-seq  \n",
    "datetime: Date/Time methods  \n",
    "h5py: HDF5 file I/O  \n",
    "hisepy: The HISE SDK for Python  \n",
    "numpy: Mathematical data structures and computation  \n",
    "os: operating system calls  \n",
    "pandas: DataFrame data structures  \n",
    "re: Regular expressions  \n",
    "scanpy: scRNA-seq analysis  \n",
    "scipy.sparse: Spare matrix data structures  \n",
    "shutil: Shell utilities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "3f29b888-622d-41b4-838e-fe977defc551",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import anndata\n",
    "from datetime import date\n",
    "import h5py\n",
    "import hisepy\n",
    "import numpy as np\n",
    "import os\n",
    "import pandas as pd\n",
    "from pandas.api.types import is_object_dtype\n",
    "import re\n",
    "import scanpy as sc\n",
    "import scipy.sparse as scs\n",
    "import shutil"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a019a1f9-7e5e-406e-be95-6437bfb4251f",
   "metadata": {},
   "source": [
    "## Helper functions\n",
    "\n",
    "These functions assist in reading from our pipeline .h5 file format into AnnData format:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "80c98f5b-1226-4549-b360-50901a977a8b",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# define a function to read count data\n",
    "def read_mat(h5_con):\n",
    "    mat = scs.csc_matrix(\n",
    "        (h5_con['matrix']['data'][:], # Count values\n",
    "         h5_con['matrix']['indices'][:], # Row indices\n",
    "         h5_con['matrix']['indptr'][:]), # Pointers for column positions\n",
    "        shape = tuple(h5_con['matrix']['shape'][:]) # Matrix dimensions\n",
    "    )\n",
    "    return mat\n",
    "\n",
    "# define a function to read obeservation metadata (i.e. cell metadata)\n",
    "def read_obs(h5con):\n",
    "    bc = h5con['matrix']['barcodes'][:]\n",
    "    bc = [x.decode('UTF-8') for x in bc]\n",
    "\n",
    "    # Initialized the DataFrame with cell barcodes\n",
    "    obs_df = pd.DataFrame({ 'barcodes' : bc })\n",
    "\n",
    "    # Get the list of available metadata columns\n",
    "    obs_columns = h5con['matrix']['observations'].keys()\n",
    "\n",
    "    # For each column\n",
    "    for col in obs_columns:\n",
    "        # Read the values\n",
    "        values = h5con['matrix']['observations'][col][:]\n",
    "        # Check for byte storage\n",
    "        if(isinstance(values[0], (bytes, bytearray))):\n",
    "            # Decode byte strings\n",
    "            values = [x.decode('UTF-8') for x in values]\n",
    "        # Add column to the DataFrame\n",
    "        obs_df[col] = values\n",
    "\n",
    "    obs_df = obs_df.set_index('barcodes', drop = False)\n",
    "    \n",
    "    return obs_df\n",
    "\n",
    "# define a function to construct anndata object from a h5 file\n",
    "def read_h5_anndata(h5_con):\n",
    "    #h5_con = h5py.File(h5_file, mode = 'r')\n",
    "    # extract the expression matrix\n",
    "    mat = read_mat(h5_con)\n",
    "    # extract gene names\n",
    "    genes = h5_con['matrix']['features']['name'][:]\n",
    "    genes = [x.decode('UTF-8') for x in genes]\n",
    "    # extract metadata\n",
    "    obs_df = read_obs(h5_con)\n",
    "    # construct anndata\n",
    "    adata = anndata.AnnData(mat.T,\n",
    "                             obs = obs_df)\n",
    "    # make sure the gene names aligned\n",
    "    adata.var_names = genes\n",
    "\n",
    "    adata.var_names_make_unique()\n",
    "    return adata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d358bb45-6771-49eb-b5ed-9527b0622fe5",
   "metadata": {},
   "source": [
    "## Retrieve file list from HISE\n",
    "\n",
    "First, we'll pull the manifest of samples and associated files that we want to retrieve for building our reference dataset. We previously assembled this file and loaded it into HISE via a Watchfolder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "3224bf08-4707-4ee4-8d8f-dca07d127ff4",
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_meta_file_uuid = '2da66a1a-17cc-498b-9129-6858cf639caf'\n",
    "file_query = hisepy.reader.read_files(\n",
    "    [sample_meta_file_uuid]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "82a3f37c-a683-47ff-930b-07b085fc75f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "meta_data = file_query['values']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb0067c6-9f0d-4a3f-bb6c-4e6e17da895a",
   "metadata": {},
   "source": [
    "read_files will return a dictionary with an entry, `values`, that contains a list of `h5py.File()` objects. We can use these directly to read in each .h5 file to an AnnData object with our helper function, `read_h5_anndata()`, defined above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "cb069cb3-7785-4488-b25c-c4b63581c708",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_adata(uuid):\n",
    "    # Load the file using HISE\n",
    "    res = hisepy.reader.read_files([uuid])\n",
    "    \n",
    "    # Read the file to adata\n",
    "    h5_con = res['values'][0]\n",
    "    adata = read_h5_anndata(h5_con)\n",
    "    \n",
    "    # Clean up the file now that we're done with it\n",
    "    h5_con.close()\n",
    "\n",
    "    return(adata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d62b5249-cdca-43d3-862a-ed72b2794e0d",
   "metadata": {},
   "source": [
    "Here, we'll iterate over each file in our manifest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "0c2fb32d-7755-4029-a41e-eb809e096e28",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_list = []\n",
    "for i in range(meta_data.shape[0]):\n",
    "    uuid = meta_data['file.id'][i]\n",
    "    adata_list.append(get_adata(uuid))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f9d495a-1511-4d04-9bd3-54a667870e44",
   "metadata": {},
   "source": [
    "Concatenate all of the datasets into a single object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "2eaafbce-41aa-438a-89a3-08a84f82682d",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata = anndata.concat(adata_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e42db6f-c08d-4602-914d-95e78c11f19e",
   "metadata": {},
   "source": [
    "## Add sample metadata\n",
    "\n",
    "When retrieved from HISE, our .h5 files have a sample identifier (`pbmc_sample_id`), but don't carry other sample and subject metadata. Let's add this information from our `meta_data` DataFrame to make this information more complete.\n",
    "\n",
    "First, we'll convert `pbmc_sample_id` to `sample.sampleKitGuid` using a regular expression. PBMC samples are derived from kits in our LIMS system, so both share the same numerical core. The difference is that there can be multiple PBMC samples collected at the same time, so PBMC samples are prefixed with `PB` to indicate their sample type, and suffixed with `-XX` to indicate an aliquot number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "dfd0b514-c743-47c6-9e36-8437059dc68d",
   "metadata": {},
   "outputs": [],
   "source": [
    "obs = adata.obs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "76ad771b-9bf7-46af-85ed-fc26f93c111e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def sample_to_kit(sample):\n",
    "    kit = re.sub('PB([0-9]+)-.+','KT\\\\1',sample)\n",
    "    return(kit)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "2212d3c1-3481-4e07-9cdf-b72f8efb7a3d",
   "metadata": {},
   "outputs": [],
   "source": [
    "obs['sample.sampleKitGuid'] = [sample_to_kit(sample) for sample in obs['pbmc_sample_id']]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "489c9347-9d21-4186-9be6-f4d205472f15",
   "metadata": {},
   "source": [
    "We only need to keep some of the metadata columns that pertain to cohort, subject, and sample. We'll also keep the originating File GUID to help us keep track of provenance. Let's select just these columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "b71d77a2-6509-4098-951d-79ae0ddf3477",
   "metadata": {},
   "outputs": [],
   "source": [
    "keep_meta = [\n",
    "    'cohort.cohortGuid',\n",
    "    'subject.subjectGuid', 'subject.biologicalSex', \n",
    "    'subject.race', 'subject.ethnicity', 'subject.birthYear',\n",
    "    'sample.sampleKitGuid', 'sample.visitName', 'sample.drawDate',\n",
    "    'file.id'\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "7f78958b-5b7e-4b0f-b564-139558fff263",
   "metadata": {},
   "outputs": [],
   "source": [
    "selected_meta = meta_data[keep_meta]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1614b0cd-7792-4e4c-a7db-c0e6a21fe90a",
   "metadata": {},
   "source": [
    "Now, we'll join this metadata to our observations using those sample IDs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "636de8f0-3506-4411-88c1-3c34eb712614",
   "metadata": {},
   "outputs": [],
   "source": [
    "obs = obs.merge(\n",
    "    selected_meta,\n",
    "    how = 'left',\n",
    "    on = 'sample.sampleKitGuid'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f2c5f04-af26-45bd-85e3-9abe61799e8c",
   "metadata": {},
   "source": [
    "## Retrieve labs, labels and doublet detection from HISE\n",
    "\n",
    "In previous steps, we retrieved CMV and BMI clinical lab data and performed label transfer with CellTypist and Seurat, as well as doublet detection using Scrublet. We'll retrieve these results and add them to the observations in our AnnData file to assist with cell type labeling.\n",
    "\n",
    "For this purpose, we'll use the CellTypist labels generated using the Immune All Low model, which has (ironically) high-resolution cell type labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "c765e755-ec87-423a-b6cc-c5c5bedf6149",
   "metadata": {},
   "outputs": [],
   "source": [
    "obs_uuids = {\n",
    "    'cmv': 'be338216-0c90-47df-923c-7f4d7c35bac4',\n",
    "    'bmi': '9e39e86d-025c-4ad1-bbfb-63302b99c3f1',\n",
    "    'celltypist': 'dbe10e73-4959-480a-8618-e40652901ab7',\n",
    "    'seurat': 'ef5baaf1-1364-4e07-b01c-0252fd917fbb',\n",
    "    'scrublet': '69d4e3a1-ff03-4173-a7e1-e4acff1f3d09',\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "12a3688e-2067-4051-9e10-3acc69b517cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "obs_dfs = {}\n",
    "for name,uuid in obs_uuids.items():\n",
    "    hise_res = hisepy.reader.read_files([uuid])\n",
    "    hise_df = hise_res['values']\n",
    "    obs_dfs[name] = hise_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b26da4c-c410-47a7-861d-bfd90a516e0f",
   "metadata": {},
   "source": [
    "We only need to retain some of the columns from each to assemble our cell metadata. Let's make and apply some column selections:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "6031e2b2-e7ea-4bb4-b8e8-b9ca494399c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "keep_columns = {\n",
    "    'cmv': ['subject.subjectGuid', 'subject.cmv'],\n",
    "    'bmi': ['subject.subjectGuid', 'subject.bmi'],\n",
    "    'celltypist': ['barcodes', 'predicted_labels'],\n",
    "    'seurat': ['barcodes',\n",
    "               'predicted.celltype.l1', 'predicted.celltype.l1.score',\n",
    "               'predicted.celltype.l2', 'predicted.celltype.l2.score',\n",
    "               'predicted.celltype.l2.5', 'predicted.celltype.l2.5.score',\n",
    "               'predicted.celltype.l3', 'predicted.celltype.l3.score'\n",
    "              ],\n",
    "    'scrublet': ['barcodes', 'predicted_doublet', 'doublet_score']\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "2e913e01-1b32-4958-9fc0-661bc8e027d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "for name,df in obs_dfs.items():\n",
    "    keep_cols = keep_columns[name]\n",
    "    selected_df = df[keep_cols]\n",
    "    obs_dfs[name] = selected_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14410794-c00a-4f89-8f10-5de8bf559e57",
   "metadata": {},
   "source": [
    "To make the sources clear later, let's rename the celltypist and seurat columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "ccea7f10-75aa-4a29-9d98-b34521f4ed91",
   "metadata": {},
   "outputs": [],
   "source": [
    "obs_dfs['celltypist'] = obs_dfs['celltypist'].rename(\n",
    "    columns = {'predicted_labels': 'celltypist.low'}\n",
    ")\n",
    "obs_dfs['seurat'] = obs_dfs['seurat'].rename(\n",
    "    columns = {\n",
    "        'predicted.celltype.l1': 'seurat.l1', 'predicted.celltype.l1.score': 'seurat.l1.score',\n",
    "        'predicted.celltype.l2': 'seurat.l2', 'predicted.celltype.l2.score': 'seurat.l2.score',\n",
    "        'predicted.celltype.l2.5': 'seurat.l2.5', 'predicted.celltype.l2.5.score': 'seurat.l2.5.score',\n",
    "        'predicted.celltype.l3': 'seurat.l3', 'predicted.celltype.l3.score': 'seurat.l3.score'\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "509c7725-3695-4a81-bd03-19daa007fa0b",
   "metadata": {},
   "source": [
    "And now we can add our sample metadata and these labels to the observations stored in the anndata object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "815d8c82-e256-4e85-b4bd-f5958b78e966",
   "metadata": {},
   "outputs": [],
   "source": [
    "join_on_dict = {\n",
    "    'cmv': 'subject.subjectGuid',\n",
    "    'bmi': 'subject.subjectGuid',\n",
    "    'celltypist': 'barcodes',\n",
    "    'seurat': 'barcodes',\n",
    "    'scrublet': 'barcodes'\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "da8d7290-90aa-4987-b1d9-794e9f6f7433",
   "metadata": {},
   "outputs": [],
   "source": [
    "for name,df in obs_dfs.items():\n",
    "    join_col = join_on_dict[name]\n",
    "    obs = obs.merge(\n",
    "        df,\n",
    "        how = 'left',\n",
    "        on = join_col\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38e75da6-fd0d-4894-9360-ed995216d693",
   "metadata": {},
   "source": [
    "To keep things tidy, we'll also drop the `seurat_pbmc_type`, `seurat_pbmc_type_score`, and UMAP coordinates generated by our processing pipeline.\n",
    "\n",
    "These cell type assignments are from a now-outdated reference dataset, and the UMAP coordinates are generated for viewing individual samples - not helpful for our full dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "eaacf12f-e352-49da-a2a7-89b9b904b807",
   "metadata": {},
   "outputs": [],
   "source": [
    "obs = obs.drop([\n",
    "    'seurat_pbmc_type','seurat_pbmc_type_score',\n",
    "    'umap_1', 'umap_2'\n",
    "], axis = 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77e2677e-ec03-496b-988d-57d63ee463e4",
   "metadata": {},
   "source": [
    "Next, we'll convert all of our text columns to categorical. This is used to make storage of text data more efficient when we write our output file, as we'll only need to store a single instance of our strings.\n",
    "\n",
    "We do this for all columns except `barcodes`, which we need to retain as a string type for use as an index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "6acfe026-363c-4709-9f1a-d04bc5fed1ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_obs = obs\n",
    "for i in range(cat_obs.shape[1]):\n",
    "    col_name = cat_obs.dtypes.index.tolist()[i]\n",
    "    col_type = cat_obs.dtypes[col_name]\n",
    "    if col_name == 'barcodes':\n",
    "        cat_obs[col_name] = cat_obs[col_name].astype(str)\n",
    "    elif is_object_dtype(col_type):\n",
    "        cat_obs[col_name] = cat_obs[col_name].astype('category')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "e030708d-3946-4194-8564-11feab1ff308",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>barcodes</th>\n",
       "      <th>batch_id</th>\n",
       "      <th>cell_name</th>\n",
       "      <th>cell_uuid</th>\n",
       "      <th>chip_id</th>\n",
       "      <th>hto_barcode</th>\n",
       "      <th>hto_category</th>\n",
       "      <th>n_genes</th>\n",
       "      <th>n_mito_umis</th>\n",
       "      <th>n_reads</th>\n",
       "      <th>...</th>\n",
       "      <th>seurat.l1</th>\n",
       "      <th>seurat.l1.score</th>\n",
       "      <th>seurat.l2</th>\n",
       "      <th>seurat.l2.score</th>\n",
       "      <th>seurat.l2.5</th>\n",
       "      <th>seurat.l2.5.score</th>\n",
       "      <th>seurat.l3</th>\n",
       "      <th>seurat.l3.score</th>\n",
       "      <th>predicted_doublet</th>\n",
       "      <th>doublet_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>cf71f47048b611ea8957bafe6d70929e</td>\n",
       "      <td>B001</td>\n",
       "      <td>weathered_pernicious_polliwog</td>\n",
       "      <td>cf71f47048b611ea8957bafe6d70929e</td>\n",
       "      <td>B001-P1C1</td>\n",
       "      <td>TGATGGCCTATTGGG</td>\n",
       "      <td>singlet</td>\n",
       "      <td>1081</td>\n",
       "      <td>115</td>\n",
       "      <td>9307</td>\n",
       "      <td>...</td>\n",
       "      <td>other T</td>\n",
       "      <td>0.634406</td>\n",
       "      <td>MAIT</td>\n",
       "      <td>0.634406</td>\n",
       "      <td>MAIT</td>\n",
       "      <td>0.634406</td>\n",
       "      <td>MAIT</td>\n",
       "      <td>0.634406</td>\n",
       "      <td>False</td>\n",
       "      <td>0.045226</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>cf71f54248b611ea8957bafe6d70929e</td>\n",
       "      <td>B001</td>\n",
       "      <td>untidy_emulsive_hamadryad</td>\n",
       "      <td>cf71f54248b611ea8957bafe6d70929e</td>\n",
       "      <td>B001-P1C1</td>\n",
       "      <td>TGATGGCCTATTGGG</td>\n",
       "      <td>singlet</td>\n",
       "      <td>1923</td>\n",
       "      <td>178</td>\n",
       "      <td>22729</td>\n",
       "      <td>...</td>\n",
       "      <td>CD4 T</td>\n",
       "      <td>0.974953</td>\n",
       "      <td>CD4 TCM</td>\n",
       "      <td>0.974953</td>\n",
       "      <td>CD4 TCM</td>\n",
       "      <td>0.974953</td>\n",
       "      <td>CD4 TCM_1</td>\n",
       "      <td>0.974953</td>\n",
       "      <td>False</td>\n",
       "      <td>0.110978</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>cf71fa1048b611ea8957bafe6d70929e</td>\n",
       "      <td>B001</td>\n",
       "      <td>impatient_familial_cuckoo</td>\n",
       "      <td>cf71fa1048b611ea8957bafe6d70929e</td>\n",
       "      <td>B001-P1C1</td>\n",
       "      <td>TGATGGCCTATTGGG</td>\n",
       "      <td>singlet</td>\n",
       "      <td>1246</td>\n",
       "      <td>204</td>\n",
       "      <td>11107</td>\n",
       "      <td>...</td>\n",
       "      <td>Mono</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>CD14 Mono</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>CD14 Mono</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>CD14 Mono</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>False</td>\n",
       "      <td>0.047836</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>cf71fb7848b611ea8957bafe6d70929e</td>\n",
       "      <td>B001</td>\n",
       "      <td>long_weakminded_roebuck</td>\n",
       "      <td>cf71fb7848b611ea8957bafe6d70929e</td>\n",
       "      <td>B001-P1C1</td>\n",
       "      <td>TGATGGCCTATTGGG</td>\n",
       "      <td>singlet</td>\n",
       "      <td>1118</td>\n",
       "      <td>77</td>\n",
       "      <td>12990</td>\n",
       "      <td>...</td>\n",
       "      <td>CD4 T</td>\n",
       "      <td>0.995058</td>\n",
       "      <td>CD4 TCM</td>\n",
       "      <td>0.950569</td>\n",
       "      <td>CD4 TCM</td>\n",
       "      <td>0.950569</td>\n",
       "      <td>CD4 TCM_1</td>\n",
       "      <td>0.684051</td>\n",
       "      <td>False</td>\n",
       "      <td>0.040517</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>cf71ffba48b611ea8957bafe6d70929e</td>\n",
       "      <td>B001</td>\n",
       "      <td>dastardly_wintery_airedale</td>\n",
       "      <td>cf71ffba48b611ea8957bafe6d70929e</td>\n",
       "      <td>B001-P1C1</td>\n",
       "      <td>TGATGGCCTATTGGG</td>\n",
       "      <td>singlet</td>\n",
       "      <td>1965</td>\n",
       "      <td>363</td>\n",
       "      <td>15979</td>\n",
       "      <td>...</td>\n",
       "      <td>Mono</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>CD14 Mono</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>CD14 Mono</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>CD14 Mono</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>False</td>\n",
       "      <td>0.046076</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 38 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                           barcodes batch_id                      cell_name  \\\n",
       "0  cf71f47048b611ea8957bafe6d70929e     B001  weathered_pernicious_polliwog   \n",
       "1  cf71f54248b611ea8957bafe6d70929e     B001      untidy_emulsive_hamadryad   \n",
       "2  cf71fa1048b611ea8957bafe6d70929e     B001      impatient_familial_cuckoo   \n",
       "3  cf71fb7848b611ea8957bafe6d70929e     B001        long_weakminded_roebuck   \n",
       "4  cf71ffba48b611ea8957bafe6d70929e     B001     dastardly_wintery_airedale   \n",
       "\n",
       "                          cell_uuid    chip_id      hto_barcode hto_category  \\\n",
       "0  cf71f47048b611ea8957bafe6d70929e  B001-P1C1  TGATGGCCTATTGGG      singlet   \n",
       "1  cf71f54248b611ea8957bafe6d70929e  B001-P1C1  TGATGGCCTATTGGG      singlet   \n",
       "2  cf71fa1048b611ea8957bafe6d70929e  B001-P1C1  TGATGGCCTATTGGG      singlet   \n",
       "3  cf71fb7848b611ea8957bafe6d70929e  B001-P1C1  TGATGGCCTATTGGG      singlet   \n",
       "4  cf71ffba48b611ea8957bafe6d70929e  B001-P1C1  TGATGGCCTATTGGG      singlet   \n",
       "\n",
       "   n_genes  n_mito_umis  n_reads  ...  seurat.l1 seurat.l1.score  seurat.l2  \\\n",
       "0     1081          115     9307  ...    other T        0.634406       MAIT   \n",
       "1     1923          178    22729  ...      CD4 T        0.974953    CD4 TCM   \n",
       "2     1246          204    11107  ...       Mono        1.000000  CD14 Mono   \n",
       "3     1118           77    12990  ...      CD4 T        0.995058    CD4 TCM   \n",
       "4     1965          363    15979  ...       Mono        1.000000  CD14 Mono   \n",
       "\n",
       "  seurat.l2.score seurat.l2.5 seurat.l2.5.score  seurat.l3 seurat.l3.score  \\\n",
       "0        0.634406        MAIT          0.634406       MAIT        0.634406   \n",
       "1        0.974953     CD4 TCM          0.974953  CD4 TCM_1        0.974953   \n",
       "2        1.000000   CD14 Mono          1.000000  CD14 Mono        1.000000   \n",
       "3        0.950569     CD4 TCM          0.950569  CD4 TCM_1        0.684051   \n",
       "4        1.000000   CD14 Mono          1.000000  CD14 Mono        1.000000   \n",
       "\n",
       "  predicted_doublet doublet_score  \n",
       "0             False      0.045226  \n",
       "1             False      0.110978  \n",
       "2             False      0.047836  \n",
       "3             False      0.040517  \n",
       "4             False      0.046076  \n",
       "\n",
       "[5 rows x 38 columns]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_obs.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "e0362710-c7ba-4bd5-a021-9bf3c8b92b2b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['barcodes', 'batch_id', 'cell_name', 'cell_uuid', 'chip_id',\n",
       "       'hto_barcode', 'hto_category', 'n_genes', 'n_mito_umis', 'n_reads',\n",
       "       'n_umis', 'original_barcodes', 'pbmc_sample_id', 'pool_id', 'well_id',\n",
       "       'sample.sampleKitGuid', 'cohort.cohortGuid', 'subject.subjectGuid',\n",
       "       'subject.biologicalSex', 'subject.race', 'subject.ethnicity',\n",
       "       'subject.birthYear', 'sample.visitName', 'sample.drawDate', 'file.id',\n",
       "       'subject.cmv', 'subject.bmi', 'celltypist.low', 'seurat.l1',\n",
       "       'seurat.l1.score', 'seurat.l2', 'seurat.l2.score', 'seurat.l2.5',\n",
       "       'seurat.l2.5.score', 'seurat.l3', 'seurat.l3.score',\n",
       "       'predicted_doublet', 'doublet_score'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_obs.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f3b2216-484b-42b2-b5d9-f6855649ac55",
   "metadata": {},
   "source": [
    "Finally, we need to reinstate the cell barcodes as the index of the DataFrame, and substitute our new metadata for the observations in the AnnData object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "4373f2e7-d85a-4385-92cc-0830c32049c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_obs = cat_obs.set_index('barcodes', drop = False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "93cdba4e-6a2a-4724-af8b-216ca9cb307a",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.obs = cat_obs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "f86d17ee-9307-4d7a-9aed-13f25ea59f33",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "out_dir = 'output'\n",
    "if not os.path.isdir(out_dir):\n",
    "    os.makedirs(out_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e8791a8-3ab4-4e4a-9c06-54ecce67c173",
   "metadata": {},
   "source": [
    "Save combined object to .h5ad"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "ef9c2838-e688-444c-aef3-ce3e6cc980dd",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "out_h5ad = 'output/ref_pbmc_raw_{d}.h5ad'.format(d = date.today())\n",
    "adata.write_h5ad(out_h5ad)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8af22691-19e0-42a5-8a26-0a62eecdd442",
   "metadata": {},
   "source": [
    "We'll also save just the observations as a .csv file in case we need it for generating figures about the full, raw dataset and labels.\n",
    "\n",
    "For CSV, we don't need the version with categorical columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e8b0a2c-f148-491d-a97a-c870d1dccdd2",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "out_csv = 'output/ref_pbmc_raw_meta_{d}.csv'.format(d = date.today())\n",
    "obs.to_csv(out_csv)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7458418-4ecd-4b34-a318-aa7426613d02",
   "metadata": {},
   "source": [
    "## Upload assembled data to HISE\n",
    "\n",
    "Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "c77a54a4-598e-48c3-802f-c471682b2daa",
   "metadata": {},
   "outputs": [],
   "source": [
    "study_space_uuid = '64097865-486d-43b3-8f94-74994e0a72e0'\n",
    "title = 'Assembled Raw PBMC .h5ad {d}'.format(d = date.today())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "7ee25cfb-68b5-4b87-8ee1-54b1a59df1fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "obs_uuid_list = list(obs_uuids.values())[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "8b7bdf5d-cffb-400c-8fd2-41ce3b1ed39f",
   "metadata": {},
   "outputs": [],
   "source": [
    "in_files = [sample_meta_file_uuid] + [obs_uuid_list] + meta_data['file.id'].to_list()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "db1d5b7b-42e6-41c0-9a40-885a9e906228",
   "metadata": {},
   "outputs": [],
   "source": [
    "out_files = [out_h5ad, out_csv]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "4942284a-bf25-4327-aa13-0d54a006dd48",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "you are trying to upload file_ids... ['output/ref_pbmc_raw_2024-02-18.h5ad', 'output/ref_pbmc_raw_meta_2024-02-18.csv']. Do you truly want to proceed?\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      "(y/n) y\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'trace_id': 'a0719172-a68d-4b02-896e-9eaa02a0710b',\n",
       " 'files': ['output/ref_pbmc_raw_2024-02-18.h5ad',\n",
       "  'output/ref_pbmc_raw_meta_2024-02-18.csv']}"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hisepy.upload.upload_files(\n",
    "    files = out_files,\n",
    "    study_space_id = study_space_uuid,\n",
    "    title = title,\n",
    "    input_file_ids = in_files\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "9a681faf-5c56-42c5-8fb1-7987ac2384cf",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<details>\n",
       "<summary>Click to view session information</summary>\n",
       "<pre>\n",
       "-----\n",
       "anndata             0.10.3\n",
       "h5py                3.10.0\n",
       "hisepy              0.3.0\n",
       "numpy               1.24.0\n",
       "pandas              2.1.4\n",
       "scanpy              1.9.6\n",
       "scipy               1.11.4\n",
       "session_info        1.0.0\n",
       "-----\n",
       "</pre>\n",
       "<details>\n",
       "<summary>Click to view modules imported as dependencies</summary>\n",
       "<pre>\n",
       "PIL                         10.0.1\n",
       "anyio                       NA\n",
       "arrow                       1.3.0\n",
       "asttokens                   NA\n",
       "attr                        23.2.0\n",
       "attrs                       23.2.0\n",
       "babel                       2.14.0\n",
       "beatrix_jupyterlab          NA\n",
       "brotli                      NA\n",
       "cachetools                  5.3.1\n",
       "certifi                     2023.11.17\n",
       "cffi                        1.16.0\n",
       "charset_normalizer          3.3.2\n",
       "cloudpickle                 2.2.1\n",
       "colorama                    0.4.6\n",
       "comm                        0.1.4\n",
       "cryptography                41.0.7\n",
       "cycler                      0.10.0\n",
       "cython_runtime              NA\n",
       "dateutil                    2.8.2\n",
       "db_dtypes                   1.1.1\n",
       "debugpy                     1.8.0\n",
       "decorator                   5.1.1\n",
       "defusedxml                  0.7.1\n",
       "deprecated                  1.2.14\n",
       "exceptiongroup              1.2.0\n",
       "executing                   2.0.1\n",
       "fastjsonschema              NA\n",
       "fqdn                        NA\n",
       "google                      NA\n",
       "greenlet                    2.0.2\n",
       "grpc                        1.58.0\n",
       "grpc_status                 NA\n",
       "idna                        3.6\n",
       "igraph                      0.10.8\n",
       "importlib_metadata          NA\n",
       "ipykernel                   6.28.0\n",
       "ipython_genutils            0.2.0\n",
       "ipywidgets                  8.1.1\n",
       "isoduration                 NA\n",
       "jedi                        0.19.1\n",
       "jinja2                      3.1.2\n",
       "joblib                      1.3.2\n",
       "json5                       NA\n",
       "jsonpointer                 2.4\n",
       "jsonschema                  4.20.0\n",
       "jsonschema_specifications   NA\n",
       "jupyter_events              0.9.0\n",
       "jupyter_server              2.12.1\n",
       "jupyterlab_server           2.25.2\n",
       "jwt                         2.8.0\n",
       "kiwisolver                  1.4.5\n",
       "leidenalg                   0.10.1\n",
       "llvmlite                    0.41.0\n",
       "lz4                         4.3.2\n",
       "markupsafe                  2.1.3\n",
       "matplotlib                  3.8.0\n",
       "matplotlib_inline           0.1.6\n",
       "mpl_toolkits                NA\n",
       "mpmath                      1.3.0\n",
       "natsort                     8.4.0\n",
       "nbformat                    5.9.2\n",
       "numba                       0.58.0\n",
       "opentelemetry               NA\n",
       "overrides                   NA\n",
       "packaging                   23.2\n",
       "parso                       0.8.3\n",
       "pexpect                     4.8.0\n",
       "pickleshare                 0.7.5\n",
       "pkg_resources               NA\n",
       "platformdirs                4.1.0\n",
       "plotly                      5.18.0\n",
       "prettytable                 3.9.0\n",
       "prometheus_client           NA\n",
       "prompt_toolkit              3.0.42\n",
       "proto                       NA\n",
       "psutil                      NA\n",
       "ptyprocess                  0.7.0\n",
       "pure_eval                   0.2.2\n",
       "pyarrow                     13.0.0\n",
       "pydev_ipython               NA\n",
       "pydevconsole                NA\n",
       "pydevd                      2.9.5\n",
       "pydevd_file_utils           NA\n",
       "pydevd_plugins              NA\n",
       "pydevd_tracing              NA\n",
       "pygments                    2.17.2\n",
       "pynvml                      NA\n",
       "pyparsing                   3.1.1\n",
       "pyreadr                     0.5.0\n",
       "pythonjsonlogger            NA\n",
       "pytz                        2023.3.post1\n",
       "referencing                 NA\n",
       "requests                    2.31.0\n",
       "rfc3339_validator           0.1.4\n",
       "rfc3986_validator           0.1.1\n",
       "rpds                        NA\n",
       "send2trash                  NA\n",
       "shapely                     1.8.5.post1\n",
       "six                         1.16.0\n",
       "sklearn                     1.3.2\n",
       "sniffio                     1.3.0\n",
       "socks                       1.7.1\n",
       "sql                         NA\n",
       "sqlalchemy                  2.0.21\n",
       "sqlparse                    0.4.4\n",
       "stack_data                  0.6.2\n",
       "sympy                       1.12\n",
       "termcolor                   NA\n",
       "texttable                   1.7.0\n",
       "threadpoolctl               3.2.0\n",
       "torch                       2.1.2+cu121\n",
       "torchgen                    NA\n",
       "tornado                     6.3.3\n",
       "tqdm                        4.66.1\n",
       "traitlets                   5.9.0\n",
       "typing_extensions           NA\n",
       "uri_template                NA\n",
       "urllib3                     1.26.18\n",
       "wcwidth                     0.2.12\n",
       "webcolors                   1.13\n",
       "websocket                   1.7.0\n",
       "wrapt                       1.15.0\n",
       "xarray                      2023.12.0\n",
       "yaml                        6.0.1\n",
       "zipp                        NA\n",
       "zmq                         25.1.2\n",
       "zoneinfo                    NA\n",
       "zstandard                   0.22.0\n",
       "</pre>\n",
       "</details> <!-- seems like this ends pre, so might as well be explicit -->\n",
       "<pre>\n",
       "-----\n",
       "IPython             8.19.0\n",
       "jupyter_client      8.6.0\n",
       "jupyter_core        5.6.1\n",
       "jupyterlab          4.0.10\n",
       "notebook            6.5.4\n",
       "-----\n",
       "Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]\n",
       "Linux-5.15.0-1051-gcp-x86_64-with-glibc2.31\n",
       "-----\n",
       "Session information updated at 2024-02-18 22:17\n",
       "</pre>\n",
       "</details>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import session_info\n",
    "session_info.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9bcad40-7abe-44b9-b143-a6afa67f0bdf",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}