{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_quickly_compare_cohorts.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "74MKlxovBc4m"
   },
   "source": [
    "# ISB-CGC Community Notebooks\n",
    "\n",
    "Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!\n",
    "\n",
    "```\n",
    "Title:   How to quickly compare cohorts\n",
    "Author:  Lauren Hagen\n",
    "Created: 2020-04-09\n",
    "URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_quickly_compare_cohorts.ipython\n",
    "Purpose: Comparing cohorts with survival curves and histograms with data from BigQuery or the WebApp.\n",
    "Notes: \n",
    "```\n",
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "vjpTjuMgCGZX"
   },
   "source": [
    "# Overview\n",
    "\n",
    "In this notebook, we will compare two cohorts with survival curves and feature comparisons. We will be using data from the [Lung Adenocarcinoma](https://portal.gdc.cancer.gov/projects/TCGA-LUAD) (LUAD) and [Lung Squamous Cell Carcinoma](https://portal.gdc.cancer.gov/projects/TCGA-LUSC) (LUSC) projects from the TCGA program.\n",
    "\n",
    "This notebook can handle multiple or different cohorts from the TCGA program. Add or change the cohorts with the 'Load Cohort' code block."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "pKgzqmeWHD0X"
   },
   "source": [
    "# Authorization, install or load packages, and create client\n",
    "\n",
    "Before we get started, we will need to load the BigQuery module, authenticate ourselves, create a client variable, and import and load necessary libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "colab_type": "code",
    "id": "TG0KUon3BVa9",
    "outputId": "7ecaf1f1-210c-41bf-ea71-54ee023d43c8"
   },
   "outputs": [],
   "source": [
    "# Load the BigQuery Module\n",
    "from google.cloud import bigquery\n",
    "\n",
    "from google.colab import auth\n",
    "try:\n",
    "  auth.authenticate_user()\n",
    "  print('You have been successfully authenticated!')\n",
    "except:\n",
    "  print('You have not been authenticated.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "colab_type": "code",
    "id": "zm2e4pBjeqJh",
    "outputId": "e6f78a99-c7c2-4d6f-c676-063243fd66b2"
   },
   "outputs": [],
   "source": [
    "# Create a variable for which client to use with BigQuery\n",
    "project_num = 'your_project_number' # Replace with your project ID\n",
    "if project_num == 'your_project_number':\n",
    "  print('Please update the project number with your Google Cloud Project')\n",
    "else:\n",
    "  client = bigquery.Client(project_num)\n",
    "  print('Client Variable Created')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "colab_type": "code",
    "id": "1MKk7D8JjXXj",
    "outputId": "ac99f520-416b-461a-d9c3-4d32b1157631"
   },
   "outputs": [],
   "source": [
    "# Install the lifelines package, if necessary\n",
    "!pip install lifelines -q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "EK8B0CGgZm4G"
   },
   "outputs": [],
   "source": [
    "# Import libraries\n",
    "from lifelines import KaplanMeierFitter\n",
    "kmf = KaplanMeierFitter()\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "SPt6Tq7CIh8t"
   },
   "source": [
    "# Load Cohort\n",
    "First we will need to load our cohort into a data frame. The cohort can either be made using the WebApp or a SQL query to the TCGA clinical tables. A cohort is a list of case barcodes from the TCGA program. The first code block can be updated to include 2 or more cohorts. We are going to use the sample queries filled in below as an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "fl2U89WQVR-i"
   },
   "outputs": [],
   "source": [
    "# Load Cohort\n",
    "# Update this code bolck with your cohorts\n",
    "# Fill in how the cohort will be created\n",
    "# SQL or File\n",
    "cohort_type = \"SQL\"\n",
    "\n",
    "# Load a list of case_barocodes from our WebApp\n",
    "\n",
    "file1 = 'your_first_file_location_here' # update with your file location, if using one\n",
    "# Or insert a SQL query\n",
    "query1 = \"\"\"\n",
    "  SELECT\n",
    "    case_barcode\n",
    "  FROM\n",
    "   `isb-cgc.TCGA_bioclin_v0.Clinical`\n",
    "  WHERE\n",
    "    disease_code = 'LUAD'\n",
    "  \"\"\"\n",
    "file2 = 'your_second_file_location_here' # update with your file location, if using one\n",
    "# Or insert a SQL query\n",
    "query2 = \"\"\"\n",
    "  SELECT\n",
    "    case_barcode\n",
    "  FROM\n",
    "   `isb-cgc.TCGA_bioclin_v0.Clinical`\n",
    "  WHERE\n",
    "    disease_code = 'LUSC'\n",
    "  \"\"\"\n",
    "\n",
    "# Update with files or queries used\n",
    "cohort = {\"Cohort_1\": query1, \"Cohort_2\": query2}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "24ftqstiQ8Kb"
   },
   "outputs": [],
   "source": [
    "# Function to create a combined labled list of cohorts for comparison\n",
    "def load_cohort(cohorts, cohort_type):\n",
    "  final_list = pd.DataFrame()\n",
    "\n",
    "  for cohort in cohorts.keys():\n",
    "    if cohort_type == 'SQL':\n",
    "      cohort_query_request = client.query(cohorts.get(cohort))\n",
    "      barcodes = cohort_query_request.result().to_dataframe()\n",
    "    else:\n",
    "      cohort_raw = pd.read_csv(cohorts.get(cohort), header=1)\n",
    "      barcodes = pd.DataFrame({'case_barcode': cohort_raw['Case Barcode']})\n",
    "    barcodes['cohort'] = cohort\n",
    "    final_list = pd.concat([final_list, barcodes])\n",
    "  return final_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "TQFaY-KCRwsY"
   },
   "outputs": [],
   "source": [
    "combined_list = load_cohort(cohort, cohort_type)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "skFfn1EmWmmm"
   },
   "source": [
    "# Survival Curve\n",
    "First, we will compare groups with a survival curve. We will be using a Kaplan Meier Curve from the [`lifelines` package](lifelines.readthedocs.io).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "4IF9rY7JWDXP"
   },
   "outputs": [],
   "source": [
    "# Pull the vital status, days to death, and grouping column\n",
    "# into a data frame\n",
    "sc_query = \"\"\"\n",
    "SELECT\n",
    "  case_barcode,\n",
    "  vital_status,\n",
    "  days_to_death\n",
    "FROM\n",
    "  `isb-cgc.TCGA_bioclin_v0.Clinical`\n",
    "WHERE\n",
    "  case_barcode IN ('{}') AND\n",
    "  vital_status IS NOT NULL\n",
    "\"\"\".format(\"','\".join(combined_list['case_barcode']))\n",
    "sc_query_request = client.query(sc_query)\n",
    "survival_curve = pd.merge(sc_query_request.result().to_dataframe(),\n",
    "                          combined_list, how=\"inner\", on=\"case_barcode\")\n",
    "\n",
    "# Fill in NAs in days_to_death with the max from the days to death\n",
    "T = survival_curve['days_to_death'].fillna(survival_curve['days_to_death'].max())\n",
    "# Convert the vital status to numbers\n",
    "E = survival_curve['vital_status'].replace({'Alive':0, 'Dead':1})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 583
    },
    "colab_type": "code",
    "id": "gpqtgKcwkLND",
    "outputId": "f30faaf0-823a-4b64-cb00-b13bd07ac106"
   },
   "outputs": [],
   "source": [
    "# Plot the survivial curve\n",
    "fig=plt.figure(figsize=(13, 8), dpi= 80)\n",
    "plt.style.use('seaborn-colorblind')\n",
    "ax = plt.subplot(111,\n",
    "                 title = \"Survival Curve\")\n",
    "\n",
    "for r in survival_curve['cohort'].sort_values().unique():\n",
    "  if (r != None):\n",
    "    cohort = survival_curve['cohort'] == r\n",
    "    kmf.fit(T.loc[cohort], E.loc[cohort], label=r)\n",
    "    kmf.plot(ax=ax, )\n",
    "  else:\n",
    "    print(\"\")\n",
    "\n",
    "ax.set_ylabel(\"Percent Survival\")\n",
    "ax.set_xlabel(\"Days\")\n",
    "\n",
    "# if you would like to save this plot as an image, uncomment the next two lines\n",
    "# survival_curve_file = \"file_name_here.png\"\n",
    "# plt.savefig(survival_curve_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "30vBFYJDPPrX"
   },
   "source": [
    "## Survival Curve Summary\n",
    "\n",
    "We now have our two cohorts plotted with a survival curve. We can see that the percent survival falls faster for Cohort 2 (LUSC) than Cohort 1 (LUAD). Below is a section on compare the case counts for different features within the data.\n",
    "\n",
    "Listed below are some useful resources on Survival Curves:\n",
    "- [Kaplan-Meier Estimator Wikipedia](https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator)\n",
    "- [Kaplan Meier curves: an introduction](https://towardsdatascience.com/kaplan-meier-curves-c5768e349479)\n",
    "- [Nature: A practical guide to understanding Kaplan-Meier curves by Rich J, Neely J, Paniello R, Voelker C, Nussenbaum B, Wang E](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3932959/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "7i4N-7VY8gaF"
   },
   "source": [
    "# Feature Comparisons\n",
    "\n",
    "We may also like to view the distribution of the patient's ages, gender, and vital status between the two data sets. We can do this by creating bar charts for each feature.\n",
    "\n",
    "First, we will get the data from BigQuery for the desired features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ZCh8WaYxWqOD"
   },
   "outputs": [],
   "source": [
    "# Create a query to retrieve the cohort features\n",
    "hist_query = \"\"\"\n",
    "SELECT\n",
    "  case_barcode,\n",
    "  gender,\n",
    "  vital_status,\n",
    "  age_at_diagnosis,\n",
    "  ethnicity,\n",
    "  pathologic_stage\n",
    "FROM\n",
    "  `isb-cgc.TCGA_bioclin_v0.Clinical`\n",
    "WHERE\n",
    "  case_barcode IN ('{}')\n",
    "\"\"\".format(\"', '\".join(combined_list['case_barcode']))\n",
    "hist_query_request = client.query(hist_query)\n",
    "compare = hist_query_request.result().to_dataframe()\n",
    "compare = pd.merge(compare, combined_list, how=\"inner\", on=\"case_barcode\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tvPLDbrQ4uCU"
   },
   "source": [
    "Next we will want to view if there are any missing values in our cohorts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 156
    },
    "colab_type": "code",
    "id": "TQgN6XQ429FT",
    "outputId": "6d39e0ab-d26a-4011-dc59-0860185cab77"
   },
   "outputs": [],
   "source": [
    "# View the number of missing records in the data set\n",
    "compare.isna().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "3HA5En1Y_MCS"
   },
   "source": [
    "Then we will need to transform the data a little bit, so that we are able to make the categorical histogram."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Fd5YOh3_29fF"
   },
   "outputs": [],
   "source": [
    "comp_gender = pd.DataFrame()\n",
    "comp_vital_status = pd.DataFrame()\n",
    "comp_ethnicity = pd.DataFrame()\n",
    "comp_pathologic_stage = pd.DataFrame()\n",
    "\n",
    "for i in compare['cohort'].sort_values().unique():\n",
    "  df = compare['cohort'] == i\n",
    "  comp_gender[i] = compare['gender'].loc[df].value_counts()\n",
    "  comp_vital_status[i] = compare['vital_status'].loc[df].value_counts()\n",
    "  comp_pathologic_stage[i] = compare['pathologic_stage'].loc[df].value_counts()\n",
    "  comp_ethnicity[i] = compare['ethnicity'].loc[df].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ZoJ5Vu0s_eOR"
   },
   "source": [
    "Finally, we will plot the features on bar charts.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 887
    },
    "colab_type": "code",
    "id": "RYqvIE5lZysv",
    "outputId": "adce6129-5834-4b23-bd1f-ee6deee60bdd"
   },
   "outputs": [],
   "source": [
    "# Set the style of the plots\n",
    "fig=plt.figure(dpi=100, figsize=(15, 8))\n",
    "plt.style.use('seaborn-colorblind')\n",
    "\n",
    "# Create the Age of Diagnosis Comparison plot\n",
    "sub1 = plt.subplot(151)\n",
    "for i in compare['cohort'].unique():\n",
    "  df = compare['cohort'] == i\n",
    "  df2 = compare.loc[df]\n",
    "  df2.hist('age_at_diagnosis', ax=sub1)\n",
    "\n",
    "sub1.set_title(\"Age at Diagnosis Comparison\")\n",
    "sub1.set_ylabel(\"Count of Cases\")\n",
    "sub1.set_xlabel(\"Age at Diagnosis\")\n",
    "\n",
    "# Create the Gender Comparison plot\n",
    "sub2 = plt.subplot(152)\n",
    "plot2 = comp_gender.plot(kind='bar', title = \"Gender Comparison\", ax=sub2)\n",
    "plot2.set_xlabel(\"Gender\")\n",
    "\n",
    "# Create the Ethnicity Comparison plot\n",
    "sub3 = plt.subplot(153)\n",
    "plot3 = comp_ethnicity.plot(kind='bar', title = \"Ethnicity Comparison\", ax=sub3)\n",
    "plot3.set_xlabel(\"ethnicity\")\n",
    "\n",
    "# Create the Vital Status Comparison plt\n",
    "sub4 = plt.subplot(154)\n",
    "plot4 = comp_vital_status.plot(kind=\"bar\", title = \"Vital Status Comparison\", ax=sub4)\n",
    "plot4.set_xlabel(\"Vital Status\")\n",
    "\n",
    "# Create the Pathological Stage Comparison plot\n",
    "sub5 = plt.subplot(155)\n",
    "plot5 = comp_pathologic_stage.plot(kind='bar', title = \"Pathological Stage Comparison\", ax=sub5)\n",
    "plot5.set_xlabel(\"pathologic_stage\")\n",
    "\n",
    "# if you would like to save this plot as an image, uncomment the next two lines\n",
    "# histogram_file = \"file_name_here.png\"\n",
    "# plt.savefig(histogram_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "kPxJpRVXgQHg"
   },
   "source": [
    "# That's all folks!\n",
    "\n",
    "So, that's it for a quick cohort comparison. If you need help or have a comment, reach out to us at feedback@isb-cgc.org\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Qz97oVzTx4Ri"
   },
   "source": [
    "This notebook was inspired by the [ICGC Cohort Comparison Analysis tool](https://dcc.icgc.org/analysis)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "authorship_tag": "ABX9TyPWCutpBt22fl/LGy42gNb+",
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "How_to_quickly_compare_cohorts.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}