{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CytoMod\n",
"\n",
"Welcome to the CytoMod example Jupyter Notebook!
\n",
"The full paper describing the method can be found at
\n",
"https://www.frontiersin.org/articles/10.3389/fimmu.2019.01338/
\n",
"This notebook runs over an example dataset from the FLU09 study.
\n",
"In order to run the notebook yourself with your own data, download the CytoMod folder from https://github.com/liel-cohen/CytoMod. The folder contains a folder data_files/data that contains files named cytokine_data.xlsx and patient_data.xlsx, which hold the data for the notebook analysis. In the folder you have downloaded, you can replace these files with your own data files while following the format instructions bellow.\n",
"\n",
"* cytokine_data.xlsx: cytokine_data.xlsx: the first column is the subject IDs\n",
"(named PTID in the example file) which will be converted\n",
"to row indexes. If your dataset has no subject IDs,\n",
"change 'indexCol=0' to 'indexCol=None', in both cy_data and patients_data\n",
"initialization (under the Import data header).\n",
"The next columns are your cytokines data. Each column should have\n",
"the raw cytokine measurment for the subject indicated in the specific row.
\n",
"\n",
"* patient_data.xlsx: Optional file for patient outcomes, for the associations to outcomes analysis. The first column is the subject IDs (named PTID in the example file). The instructions regarding the IDs are the same as for the cytokine_data.xlsx file. This dataframe should contain outcome variables to be analyzed in the associations to outcomes analysis. It may also contain covariate variables for controlling the regression models built for the associations calculation.
\n",
"Make sure binary columns contain 0 and 1 values, or True and False values\n",
"(and cells with unknown values are left empty)
\n",
"\n",
"A folder named 'output' will be created by the code inside the data_files folder. The code writes all results and figures into this folder.
\n",
"See the Define arguments section for further instructions for this analysis.
\n",
"\n",
"The code was written using the Anaconda3 Python interpreter and packages.
\n",
"Recommended versions: Python 3.7.1, Pandas 0.23.4, Numpy 1.16.2
\n",
"The palettable module (https://pypi.org/project/palettable/) must also be installed.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import pandas as pd\n",
"sys.path.append(os.path.join(os.getcwd(), 'cytomod', 'otherTools'))\n",
"import matplotlib.pyplot as plt\n",
"import cytomod\n",
"import cytomod.run_gap_statistic as gap_stat\n",
"import cytomod.assoc_to_outcome as outcome\n",
"from cytomod import plotting as cyplot\n",
"from hclusterplot import plotHColCluster\n",
"import tools\n",
"import numpy as np\n",
"import random"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"warnings.simplefilter('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Define input arguments\n",
"* args.name_data\n",
" Name of dataset/cohort (for writing files)\n",
"* args.name_compartment\n",
" Name of compartment from which cytokines were extracted, e.g., serum (for writing files)\n",
"* args.cytokines\n",
" List of cytokines to be analyzed. If None, will analyze all cytokines in the cytokine_data file.\n",
"* args.log_transform\n",
" Boolean indicating whether to perform a log (base 10) transformation (True) or not (False).\n",
"* args.outcomes\n",
" Optional. Names of outcome variables from the patients_data.xlsx data-frame to be analyzed.\n",
" If list is left empty (i.e., []), will not perform the associations to outcomes analysis.\n",
" This code supports binary or continuous outcome variables. Note - binary and continuous variables must be analyzed separately, i.e., you can analyze binary *or* continuous variables.\n",
" (Binary outcome columns should only contain 0 and 1, or True and False values.)\n",
"* args.logistic\n",
" Set to True if outcomes variables are binary (then, logistic regression will be used).\n",
" Set to False if outcomes variables are continuous (then, linear regression will be used).\n",
" *According to chosen value, see \"associations to outcomes\" -> \"figures\" for correct definition of colorscale_value and colorscale_labels variables, which define the colorscale for associations figures\n",
"* args.covariates\n",
" Optional. Names of covariate variables (columns) from the patients_data.xlsx data-frame \n",
" to be controlled for in the regression models. If list is left empty (i.e., []), \n",
" will not controll the associations to outcomes analysis with any covariate variables.\n",
" Categorical covariates should be inserted as dummy variables. Binary columns should only contain 0 and 1, or True and False values.\n",
"* args.log_column_names\n",
" List with names of covariate columns to be log-transformed, only if args.log_transform = True. \n",
" If there are no columns you wish to transform, leave empty (i.e., [])\n",
"* args.max_testing_k\n",
" Maximal number of clusters to test the gap statistic for.\n",
"* args.max_final_k\n",
" The maximal number of clusters that can be chosen based on the\n",
" gap statistic.\n",
"* args.recalculate_best_k \n",
" Boolean. Set to True if you want the gap statistic (for finding the best K)\n",
" to be recalcultaed in the current run.\n",
" After calculation, the calculated best K will be saved to files,\n",
" and used by the code until the next time you decide to recalculate them,\n",
" or if the best K files are deleted. If no best K files are found, will\n",
" recalculate best K anyway.\n",
"* args.seed\t\n",
" Seed for random numbers stream set before cytomod calculations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 Manually define input arguments "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"args = tools.Object()\n",
"\n",
"args.name_data = 'FLU09'\n",
"args.name_compartment = 'Plasma'\n",
"\n",
"args.log_transform = True\n",
"args.max_testing_k = 8\n",
"args.max_final_k = 6 # Must be <= max_testing_k\n",
"args.recalculate_modules = True\n",
"args.outcomes = ['FluPositive'] # names of outcome columns\n",
"args.logistic = True # True if outcomes are binary. False if outcomes are continuous.\n",
"args.covariates = ['Age'] # names of regression covariates to control for\n",
"args.log_column_names = ['Age'] # or empty list: []\n",
"args.cytokines = None # if none, will analyze all\n",
"\n",
"args.seed = 1234"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Folder Paths"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data files are read from folder: C:\\Users\\liel-\\Dropbox\\PyCharm\\PycharmProjectsNew\\CytoMod_git\\data_files\\data \n",
"\n",
"Output will be saved to folder: C:\\Users\\liel-\\Dropbox\\PyCharm\\PycharmProjectsNew\\CytoMod_git\\data_files\\output\n"
]
}
],
"source": [
"args.path_files = os.path.join(os.getcwd(), 'data_files')\n",
"\n",
"args.paths = {'files': os.path.join(os.getcwd(), 'data_files'),\n",
" 'data': os.path.join(os.getcwd(), 'data_files', 'data'),\n",
" 'gap_statistic': os.path.join(os.getcwd(), 'data_files', 'output', 'gap_statistic'),\n",
" 'clustering': os.path.join(os.getcwd(), 'data_files', 'output', 'clustering'),\n",
" 'clustering_info': os.path.join(os.getcwd(), 'data_files', 'output', 'clustering', 'info'),\n",
" 'clustering_figures': os.path.join(os.getcwd(), 'data_files', 'output', 'clustering', 'figures'),\n",
" 'correlation_figures': os.path.join(os.getcwd(), 'data_files', 'output', 'correlations'),\n",
" 'association_figures': os.path.join(os.getcwd(), 'data_files', 'output', 'associations'),\n",
" }\n",
"\n",
"tools.create_folder(args.paths['gap_statistic'])\n",
"tools.create_folder(args.paths['clustering_info'])\n",
"tools.create_folder(args.paths['clustering_figures'])\n",
"tools.create_folder(args.paths['correlation_figures'])\n",
"tools.create_folder(args.paths['association_figures'])\n",
"\n",
"print('Data files are read from folder:', os.path.join(os.getcwd(), 'data_files', 'data'),'\\n')\n",
"print('Output will be saved to folder:', os.path.join(os.getcwd(), 'data_files', 'output'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 Make sure input arguments are valid"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"assert type(args.name_data) is str\n",
"assert type(args.name_compartment) is str\n",
"assert type(args.log_transform) is bool\n",
"assert type(args.logistic) is bool\n",
"assert type(args.max_testing_k) is int\n",
"assert type(args.max_final_k) is int\n",
"assert args.max_final_k <= args.max_testing_k\n",
"assert type(args.outcomes) is list\n",
"assert type(args.covariates) is list\n",
"\n",
"for col_name in args.outcomes + args.covariates + args.log_column_names:\n",
" assert type(col_name) is str\n",
" tools.assert_column_exists_in_path(os.path.join(args.paths['data'], 'patient_data.xlsx'), col_name)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Args are valid!\n"
]
}
],
"source": [
"# If you got here -\n",
"print(\"Args are valid!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Import data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.1 Cytokine data"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"cy_data = tools.read_excel(os.path.join(args.paths['data'], 'cytokine_data.xlsx'), indexCol=0)\n",
"cy_data.dropna(axis='index', how='all', inplace=True)\n",
"\n",
"if args.cytokines is None:\n",
" args.cytokines = list(cy_data.columns)\n",
"\n",
"# Only cytokines contained in args.cytokines list\n",
"cy_data = cy_data[args.cytokines]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | EGF | \n", "Eotaxin | \n", "FGF-2 | \n", "FLT3L | \n", "FKN | \n", "GCSF | \n", "GM-CSF | \n", "GRO | \n", "IFNα2 | \n", "IFNγ | \n", "... | \n", "MCP1 | \n", "MCP3 | \n", "MDC | \n", "MIP1α | \n", "MIP1β | \n", "sCD40-L | \n", "TGFα | \n", "TNFα | \n", "TNFβ | \n", "VEGF | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ID | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
3200 | \n", "5.24 | \n", "24.34 | \n", "39.02 | \n", "1.94 | \n", "20.86 | \n", "39.52 | \n", "6.60 | \n", "98.41 | \n", "8.12 | \n", "9.97 | \n", "... | \n", "199.33 | \n", "9.66 | \n", "775.28 | \n", "3.01 | \n", "18.40 | \n", "530.95 | \n", "1.39 | \n", "15.25 | \n", "2.09 | \n", "113.09 | \n", "
3202 | \n", "179.16 | \n", "39.41 | \n", "28.62 | \n", "2.50 | \n", "49.11 | \n", "20.92 | \n", "356.27 | \n", "1019.90 | \n", "27.21 | \n", "8.36 | \n", "... | \n", "2737.32 | \n", "67.07 | \n", "1158.00 | \n", "171.51 | \n", "150.73 | \n", "10859.95 | \n", "3.61 | \n", "15.93 | \n", "1.61 | \n", "77.21 | \n", "
3204 | \n", "191.72 | \n", "42.49 | \n", "7.68 | \n", "0.67 | \n", "33.36 | \n", "10.44 | \n", "22.71 | \n", "2038.28 | \n", "25.98 | \n", "7.36 | \n", "... | \n", "159.52 | \n", "6.66 | \n", "1867.52 | \n", "8.43 | \n", "35.67 | \n", "11849.41 | \n", "2.36 | \n", "4.91 | \n", "0.53 | \n", "45.34 | \n", "
3206 | \n", "132.00 | \n", "93.76 | \n", "33.89 | \n", "0.47 | \n", "128.00 | \n", "67.87 | \n", "147.00 | \n", "4132.00 | \n", "27.43 | \n", "13.54 | \n", "... | \n", "301.00 | \n", "23.69 | \n", "1139.00 | \n", "208.00 | \n", "110.00 | \n", "13420.00 | \n", "24.26 | \n", "14.16 | \n", "0.59 | \n", "32.60 | \n", "
3209 | \n", "12.91 | \n", "18.75 | \n", "17.37 | \n", "2.50 | \n", "2.62 | \n", "12.83 | \n", "231.27 | \n", "149.67 | \n", "6.48 | \n", "63.64 | \n", "... | \n", "98.76 | \n", "7.09 | \n", "965.21 | \n", "49.08 | \n", "1.51 | \n", "250.07 | \n", "1.90 | \n", "0.68 | \n", "1.61 | \n", "400.35 | \n", "
5 rows × 37 columns
\n", "