{ "cells": [ { "cell_type": "markdown", "id": "57326d56-45eb-4303-9795-d5d2e8681a0e", "metadata": {}, "source": [ "# Retrieve CMV and BMI from HISE\n", "\n", "The Python version of the HISE SDK allows retrieval of sample metadata whether or not a sample has gone through pipeline processing. We'll use this SDK to pull CMV status, height, and weight for our subjects." ] }, { "cell_type": "markdown", "id": "b6e5d026-459a-47d1-9e19-7ed0176b0f7c", "metadata": {}, "source": [ "## Load packages\n", "\n", "datetime: Used to add today's date to our output files  \n", "hisepy: the HISE SDK  \n", "os: operating system utilities (used to make an output folder)  \n", "pandas: DataFrames for Python  \n", "session_info: displays the versions of Python and all of the packages we used  \n", "warnings: Used to suppress FutureWarning messages that don't affect data retrieval" ] }, { "cell_type": "code", "execution_count": 1, "id": "6de7b123-3be3-4f50-af56-0ec25876e472", "metadata": { "tags": [] }, "outputs": [], "source": [ "from datetime import date\n", "import hisepy\n", "import os\n", "import pandas\n", "import session_info\n", "\n", "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)" ] }, { "cell_type": "code", "execution_count": 2, "id": "262fd40a-b683-45f3-97b2-5a7807e18d12", "metadata": { "tags": [] }, "outputs": [], "source": [ "out_dir = 'output'\n", "if not os.path.isdir(out_dir):\n", "    os.makedirs(out_dir)" ] }, { "cell_type": "markdown", "id": "f3133de8-a04e-4d5f-8804-6d85ef30d317", "metadata": {}, "source": [ "## Read sample metadata from HISE" ] }, { "cell_type": "code", "execution_count": 3, "id": "68586ea2-533d-4986-b95e-15ac90ff78f1", "metadata": {}, "outputs": [], "source": [ "sample_meta_file_uuid = '2da66a1a-17cc-498b-9129-6858cf639caf'\n", "res = hisepy.reader.read_files([sample_meta_file_uuid])" ] }, { "cell_type": "code", "execution_count": 4, "id": "c50a9dd8-5a35-44ed-bf5f-7d3141185578", "metadata": {}, "outputs": [], "source": [ "sample_meta = res['values']" ] }, { "cell_type": 
"markdown", "id": "31bb9bb5-d4b9-490b-a9a1-2117e9f549b9", "metadata": {}, "source": [ "## Query HISE to get data\n", "\n", "First, we make a dictionary (with curly braces) that defines what we want to get. In this case, we want to get all of the samples from the BR2 cohort. In this query, cohort is found as 'cohortGuid'.\n", "\n", "Each entry in the dictionary has to be a list (square braces), even if it has a single entry." ] }, { "cell_type": "code", "execution_count": 5, "id": "1cca0a1d-6c36-4c87-baac-9d56e1785362", "metadata": { "tags": [] }, "outputs": [], "source": [ "subject_ids = sample_meta['subject.subjectGuid'].tolist()\n", "query_dict = {\n", " 'subjectGuid': subject_ids\n", "}" ] }, { "cell_type": "markdown", "id": "e31f5a9d-13e3-4b12-9b8e-b04a143fd4d3", "metadata": {}, "source": [ "Now, we send this dictionary to HISE via hisepy" ] }, { "cell_type": "code", "execution_count": 6, "id": "064bbc80-759e-4ce2-a233-1d69d88d0722", "metadata": { "tags": [] }, "outputs": [], "source": [ "sample_data = hisepy.reader.read_samples(\n", " query_dict = query_dict\n", ")" ] }, { "cell_type": "markdown", "id": "d27644b0-29a2-4791-89c9-70f56122f934", "metadata": {}, "source": [ "What we get back is a dictionary containing multiple kinds of information.\n", "\n", "We can see what these are called with they `.keys()` method:" ] }, { "cell_type": "code", "execution_count": 7, "id": "f83c118e-5729-45ba-a5fd-020698c31563", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "dict_keys(['metadata', 'specimens', 'survey', 'labResults'])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.keys()" ] }, { "cell_type": "markdown", "id": "3fd8f27f-2b67-4f66-a737-e06213772e74", "metadata": {}, "source": [ "CMV and the weight and height data we'll need for our analysis are available in labResults.\n", "\n", "CMV status is stored as 'CMV IgG Serology Result Interpretation' or 'CMV Ab Screen Result'. 
\n", "\n", "Height and weight are simply 'Height' and 'Weight'.\n", "\n", "Let's select columns from these results to get just subject, sample kit, CMV status, height, and weight." ] }, { "cell_type": "markdown", "id": "37c8b8e1-3d87-430a-8bdd-f7c6f30cad4f", "metadata": {}, "source": [ "For CMV status, the results can be assigned to different samples, and to different columns. We'll select only rows where there's a result and then check that we have results for all of our subjects." ] }, { "cell_type": "code", "execution_count": 8, "id": "e67191f2-dee3-4f8b-b693-1144a0875211", "metadata": {}, "outputs": [], "source": [ "keep_columns = ['subjectGuid', 'CMV IgG Serology Result Interpretation', 'CMV Ab Screen Result']\n", "cmv_data = sample_data['labResults'][keep_columns]" ] }, { "cell_type": "markdown", "id": "c1880740-ec52-46bf-9fa9-9c162fe4eef2", "metadata": {}, "source": [ "Keep rows where either of the result columns has a Positive or Negative value:" ] }, { "cell_type": "code", "execution_count": 9, "id": "31bf4bca-46f8-41b5-87bb-b228fb5f2171", "metadata": {}, "outputs": [], "source": [ "keep_rows = []\n", "keep_values = ['Negative','Positive']\n", "for i in range(cmv_data.shape[0]):\n", "    ig = cmv_data['CMV IgG Serology Result Interpretation'][i]\n", "    ab = cmv_data['CMV Ab Screen Result'][i]\n", "    if (ig in keep_values) or (ab in keep_values):\n", "        keep_rows.append(True)\n", "    else:\n", "        keep_rows.append(False)" ] }, { "cell_type": "code", "execution_count": 10, "id": "216fe404-e8c3-4d6a-91e5-4525cd819ab6", "metadata": {}, "outputs": [], "source": [ "cmv_data = cmv_data.iloc[keep_rows,:]" ] }, { "cell_type": "markdown", "id": "c0bf9a52-45aa-484f-89f0-d0b7f77384f3", "metadata": {}, "source": [ "Keep rows where the subjects correspond to the subjects under study:" ] }, { "cell_type": "code", "execution_count": 11, "id": "e4561f06-9881-4d0d-81ec-26b4f6c6a5f0", "metadata": {}, "outputs": [], "source": [ "keep_rows = 
cmv_data['subjectGuid'].isin(sample_meta['subject.subjectGuid'])\n", "cmv_data = cmv_data.iloc[keep_rows.tolist(),:]" ] }, { "cell_type": "code", "execution_count": 12, "id": "75631c0b-0009-4511-a181-7c225f7f6566", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(115, 3)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cmv_data.shape" ] }, { "cell_type": "markdown", "id": "52ea58dd-b4b9-433d-9a29-bfa046ca5774", "metadata": {}, "source": [ "Some subjects have multiple entries. We'll combine these, and assign Positive if any entry is Positive, and Negative otherwise." ] }, { "cell_type": "code", "execution_count": 13, "id": "097e28da-c417-439e-90c3-77507a1d5574", "metadata": {}, "outputs": [], "source": [ "ig_results = []\n", "ab_results = []\n", "cmv_results = []\n", "for s in sample_meta['subject.subjectGuid'].tolist():\n", "    sample_cmv = cmv_data.iloc[cmv_data.subjectGuid.isin([s]).tolist(),:]\n", "    ig = sample_cmv['CMV IgG Serology Result Interpretation'].tolist()\n", "    ab = sample_cmv['CMV Ab Screen Result'].tolist()\n", "    if 'Positive' in ig:\n", "        ig_results.append('Positive')\n", "    elif 'Negative' in ig:\n", "        ig_results.append('Negative')\n", "    else:\n", "        ig_results.append('NA')\n", "\n", "    if 'Positive' in ab:\n", "        ab_results.append('Positive')\n", "    elif 'Negative' in ab:\n", "        ab_results.append('Negative')\n", "    else:\n", "        ab_results.append('NA')\n", "\n", "    both = ig + ab\n", "    if 'Positive' in both:\n", "        cmv_results.append('Positive')\n", "    else:\n", "        cmv_results.append('Negative')" ] }, { "cell_type": "code", "execution_count": 14, "id": "fb683c10-2240-468b-b5d7-42eb2f8ac7f4", "metadata": {}, "outputs": [], "source": [ "cmv_df = pandas.DataFrame(\n", "    {\n", "        'subject.subjectGuid': sample_meta['subject.subjectGuid'].tolist(),\n", "        'CMV IgG Serology Result Interpretation': ig_results,\n", "        'CMV Ab Screen Result': ab_results,\n", "        'subject.cmv': cmv_results\n", "    }\n", ")" ] }, { 
"cell_type": "code", "execution_count": 15, "id": "8bddce94-e3f7-4a94-9fc8-713d7851ae4c", "metadata": {}, "outputs": [], "source": [ "cmv_file = 'output/ref_subject_cmv_lab_results_{date}.csv'.format(date = date.today())\n", "cmv_df.to_csv(cmv_file)" ] }, { "cell_type": "markdown", "id": "9fd6f744-e3ea-4c37-a667-313644d3c66d", "metadata": {}, "source": [ "Height and weight can be selected using the specific sampleKitGuid that corresponds to the samples in our reference." ] }, { "cell_type": "code", "execution_count": 16, "id": "e75eef75-b367-4b32-be89-5e537345b359", "metadata": { "tags": [] }, "outputs": [], "source": [ "keep_columns = ['subjectGuid', 'sampleKitGuid', 'Height', 'Weight']\n", "hw_data = sample_data['labResults'][keep_columns]" ] }, { "cell_type": "code", "execution_count": 17, "id": "6c2735b7-cb0e-4dfb-8ac0-4f7bc9e21d6a", "metadata": {}, "outputs": [], "source": [ "keep_rows = hw_data['sampleKitGuid'].isin(sample_meta['sample.sampleKitGuid']).tolist()\n", "hw_data = hw_data.iloc[keep_rows,:]" ] }, { "cell_type": "markdown", "id": "5c291440-ddee-4a92-972b-98a400cfdf3e", "metadata": {}, "source": [ "There are some duplicate entries with missing values. We'll remove these." 
] }, { "cell_type": "code", "execution_count": 18, "id": "21af9f98-a6a7-4906-a1f1-132c77be2aa3", "metadata": {}, "outputs": [], "source": [ "keep_rows = [not x for x in hw_data['Height'].isnull()]\n", "hw_data = hw_data.iloc[keep_rows,:]" ] }, { "cell_type": "code", "execution_count": 19, "id": "895f3157-24d5-492f-8e03-f074880c8c40", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(104, 4)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hw_data.shape" ] }, { "cell_type": "markdown", "id": "9e6b9072-8d0a-494f-a325-33ea658b6f87", "metadata": {}, "source": [ "Based on the number of remaining rows, we'll be missing values for 4 samples.\n", "\n", "We can compute BMI based on Height (in cm) and Weight (in kg):  \n", "BMI = weight / (height^2) where weight is in kg and height is in meters.  \n", "So, we'll use BMI = Weight / ( (Height / 100)^2 ) to account for the use of cm in height." ] }, { "cell_type": "code", "execution_count": 20, "id": "060d5796-b8de-4e81-8846-8eecf4362b21", "metadata": {}, "outputs": [], "source": [ "h_results = []\n", "w_results = []\n", "bmi_results = []\n", "for s in sample_meta['subject.subjectGuid'].tolist():\n", "    sample_hw = hw_data.iloc[hw_data.subjectGuid.isin([s]).tolist(),:]\n", "\n", "    if sample_hw.shape[0] == 0:\n", "        h_results.append('NA')\n", "        w_results.append('NA')\n", "        bmi_results.append('NA')\n", "    else:\n", "        h = sample_hw['Height'].tolist()[0]\n", "        if h == '':\n", "            h_results.append('NA')\n", "        else:\n", "            h = float(h)\n", "            h_results.append(h)\n", "        \n", "        w = sample_hw['Weight'].tolist()[0]\n", "        if w == '':\n", "            w_results.append('NA')\n", "        else:\n", "            w = float(w)\n", "            w_results.append(w)\n", "\n", "        if isinstance(h, str) or isinstance(w, str):\n", "            bmi_results.append('NA')\n", "        else:\n", "            bmi = w / ( pow(h / 100,2) )\n", "            bmi_results.append(bmi)" ] }, { "cell_type": "code", "execution_count": 21, "id": "021ba663-d05b-4a28-8acf-92af9c45b4fb", "metadata": {}, 
"outputs": [ { "data": { "text/plain": [ "108" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(bmi_results)" ] }, { "cell_type": "code", "execution_count": 22, "id": "330d2e99-692d-4da2-b47c-0060da814664", "metadata": {}, "outputs": [], "source": [ "hw_df = pandas.DataFrame(\n", " {\n", " 'subject.subjectGuid': sample_meta['subject.subjectGuid'].tolist(),\n", " 'sample.sampleKitGuid': sample_meta['sample.sampleKitGuid'].tolist(),\n", " 'Height': h_results,\n", " 'Weight': w_results,\n", " 'subject.bmi': bmi_results\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 23, "id": "f2d3a7b1-562d-462d-a6d8-5c04ac8061df", "metadata": {}, "outputs": [], "source": [ "hw_df = hw_df.sort_values('subject.subjectGuid')" ] }, { "cell_type": "code", "execution_count": 24, "id": "0ef6884f-89fb-4ad1-97f2-14d37b0d3689", "metadata": {}, "outputs": [], "source": [ "hw_file = 'output/ref_subject_bmi_results_{date}.csv'.format(date = date.today())\n", "hw_df.to_csv(hw_file)" ] }, { "cell_type": "markdown", "id": "bd609830-e93c-4c01-ad46-dd49d8e00b18", "metadata": {}, "source": [ "## Upload results to HISE\n", "\n", "Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps." ] }, { "cell_type": "code", "execution_count": 25, "id": "cde4e3fd-2bb4-4f1b-870e-b257f78720ce", "metadata": {}, "outputs": [], "source": [ "study_space_uuid = '64097865-486d-43b3-8f94-74994e0a72e0'\n", "title = 'PBMC Ref. 
CMV and BMI clinical labs {d}'.format(d = date.today())" ] }, { "cell_type": "code", "execution_count": 26, "id": "8f84f4a4-a460-4f4e-95e3-dde293d3b4ec", "metadata": {}, "outputs": [], "source": [ "in_files = [sample_meta_file_uuid]" ] }, { "cell_type": "code", "execution_count": 27, "id": "0d44f303-85fd-4d1e-bd2d-72592f8cd4d8", "metadata": {}, "outputs": [], "source": [ "out_files = [cmv_file, hw_file]" ] }, { "cell_type": "code", "execution_count": 29, "id": "d209d7f8-891a-4de1-81d4-9ed322c48c5a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "you are trying to upload file_ids... ['output/ref_subject_cmv_lab_results_2024-02-18.csv', 'output/ref_subject_bmi_results_2024-02-18.csv']. Do you truly want to proceed?\n" ] }, { "name": "stdin", "output_type": "stream", "text": [ "(y/n) y\n" ] }, { "data": { "text/plain": [ "{'trace_id': 'f8ff0c05-b47c-4476-a98c-fa580728f317',\n", " 'files': ['output/ref_subject_cmv_lab_results_2024-02-18.csv',\n", " 'output/ref_subject_bmi_results_2024-02-18.csv']}" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hisepy.upload.upload_files(\n", " files = out_files,\n", " study_space_id = study_space_uuid,\n", " title = title,\n", " input_file_ids = in_files\n", ")" ] }, { "cell_type": "code", "execution_count": 30, "id": "9fc1fdb1-90ae-4ab0-b8f6-778260f7ef63", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "Click to view session information\n", "
\n",
       "-----\n",
       "hisepy              0.3.0\n",
       "pandas              2.1.4\n",
       "session_info        1.0.0\n",
       "-----\n",
       "
\n", "
\n", "Click to view modules imported as dependencies\n", "
\n",
       "PIL                         10.0.1\n",
       "anyio                       NA\n",
       "arrow                       1.3.0\n",
       "asttokens                   NA\n",
       "attr                        23.2.0\n",
       "attrs                       23.2.0\n",
       "babel                       2.14.0\n",
       "beatrix_jupyterlab          NA\n",
       "brotli                      NA\n",
       "cachetools                  5.3.1\n",
       "certifi                     2023.11.17\n",
       "cffi                        1.16.0\n",
       "charset_normalizer          3.3.2\n",
       "cloudpickle                 2.2.1\n",
       "colorama                    0.4.6\n",
       "comm                        0.1.4\n",
       "cryptography                41.0.7\n",
       "cycler                      0.10.0\n",
       "cython_runtime              NA\n",
       "dateutil                    2.8.2\n",
       "db_dtypes                   1.1.1\n",
       "debugpy                     1.8.0\n",
       "decorator                   5.1.1\n",
       "defusedxml                  0.7.1\n",
       "deprecated                  1.2.14\n",
       "exceptiongroup              1.2.0\n",
       "executing                   2.0.1\n",
       "fastjsonschema              NA\n",
       "fqdn                        NA\n",
       "google                      NA\n",
       "greenlet                    2.0.2\n",
       "grpc                        1.58.0\n",
       "grpc_status                 NA\n",
       "h5py                        3.10.0\n",
       "idna                        3.6\n",
       "importlib_metadata          NA\n",
       "ipykernel                   6.28.0\n",
       "ipython_genutils            0.2.0\n",
       "ipywidgets                  8.1.1\n",
       "isoduration                 NA\n",
       "jedi                        0.19.1\n",
       "jinja2                      3.1.2\n",
       "json5                       NA\n",
       "jsonpointer                 2.4\n",
       "jsonschema                  4.20.0\n",
       "jsonschema_specifications   NA\n",
       "jupyter_events              0.9.0\n",
       "jupyter_server              2.12.1\n",
       "jupyterlab_server           2.25.2\n",
       "jwt                         2.8.0\n",
       "kiwisolver                  1.4.5\n",
       "markupsafe                  2.1.3\n",
       "matplotlib                  3.8.0\n",
       "matplotlib_inline           0.1.6\n",
       "mpl_toolkits                NA\n",
       "nbformat                    5.9.2\n",
       "numpy                       1.26.2\n",
       "opentelemetry               NA\n",
       "overrides                   NA\n",
       "packaging                   23.2\n",
       "parso                       0.8.3\n",
       "pexpect                     4.8.0\n",
       "pickleshare                 0.7.5\n",
       "pkg_resources               NA\n",
       "platformdirs                4.1.0\n",
       "plotly                      5.18.0\n",
       "prettytable                 3.9.0\n",
       "prometheus_client           NA\n",
       "prompt_toolkit              3.0.42\n",
       "proto                       NA\n",
       "psutil                      NA\n",
       "ptyprocess                  0.7.0\n",
       "pure_eval                   0.2.2\n",
       "pyarrow                     13.0.0\n",
       "pydev_ipython               NA\n",
       "pydevconsole                NA\n",
       "pydevd                      2.9.5\n",
       "pydevd_file_utils           NA\n",
       "pydevd_plugins              NA\n",
       "pydevd_tracing              NA\n",
       "pygments                    2.17.2\n",
       "pyparsing                   3.1.1\n",
       "pyreadr                     0.5.0\n",
       "pythonjsonlogger            NA\n",
       "pytz                        2023.3.post1\n",
       "referencing                 NA\n",
       "requests                    2.31.0\n",
       "rfc3339_validator           0.1.4\n",
       "rfc3986_validator           0.1.1\n",
       "rpds                        NA\n",
       "send2trash                  NA\n",
       "shapely                     1.8.5.post1\n",
       "six                         1.16.0\n",
       "sniffio                     1.3.0\n",
       "socks                       1.7.1\n",
       "sql                         NA\n",
       "sqlalchemy                  2.0.21\n",
       "sqlparse                    0.4.4\n",
       "stack_data                  0.6.2\n",
       "termcolor                   NA\n",
       "tornado                     6.3.3\n",
       "tqdm                        4.66.1\n",
       "traitlets                   5.9.0\n",
       "typing_extensions           NA\n",
       "uri_template                NA\n",
       "urllib3                     1.26.18\n",
       "wcwidth                     0.2.12\n",
       "webcolors                   1.13\n",
       "websocket                   1.7.0\n",
       "wrapt                       1.15.0\n",
       "xarray                      2023.12.0\n",
       "yaml                        6.0.1\n",
       "zipp                        NA\n",
       "zmq                         25.1.2\n",
       "zoneinfo                    NA\n",
       "zstandard                   0.22.0\n",
       "
\n", "
\n", "
\n",
       "-----\n",
       "IPython             8.19.0\n",
       "jupyter_client      8.6.0\n",
       "jupyter_core        5.6.1\n",
       "jupyterlab          4.0.10\n",
       "notebook            6.5.4\n",
       "-----\n",
       "Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]\n",
       "Linux-5.15.0-1051-gcp-x86_64-with-glibc2.31\n",
       "-----\n",
       "Session information updated at 2024-02-18 02:15\n",
       "
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "session_info.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "a9a72aa3-e50a-4f31-8b22-84179ae6bec4", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }