{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ona5p1IgVCrb"
      },
      "source": [
        "# ISB-CGC Community Notebooks\n",
        "\n",
        "Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!\n",
        "\n",
        "```\n",
        "Title:   How to create cohorts\n",
        "Author:  Lauren Hagen\n",
        "Created: 2019-06-20\n",
        "Updated: 2023-12\n",
        "Purpose: Basic overview of creating cohorts with IDB-CGC BigQuery\n",
        "URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_create_cohorts.ipynb\n",
        "Notes:   This notebook was adapted from work by Sheila Reynolds, 'How to Create TCGA Cohorts part 1' https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb.\n",
        "```\n",
        "***"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "<a href=\"https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_create_cohorts.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ],
      "metadata": {
        "id": "LlRBaxSIkc53"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# How to Create Cohorts\n",
        "\n",
        "This notebook demonstrates how to create a cohort (list) of patients from the [Genomic Data Commons (GDC)](https://portal.gdc.cancer.gov/) using ISB-CGC public BigQuery tables. We will use the clinical and file tables to create a curated list of patients to discover associated files and data. More information on the tables and data that this notebook explores can be found in our Documentation at:\n",
        "- [ISB-CGC BigQuery Tables](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/BigQuery.html)\n",
        "- [Programs and Data Sets](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/Hosted-Data.html)\n",
        "- [Case and File Metadata](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/data/FileMetadata.html)\n",
        "- [Clinical, Biospecimen and Processed -Omics Data Sets](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/Hosted-Data.html#clinical-biospecimen-and-processed-omics-data-sets)\n"
      ],
      "metadata": {
        "id": "EswFBsCacRsG"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Initialize Notebook Environment\n",
        "Before beginning, we first need to load dependencies and authenticate to BigQuery. You will need to have access to a [Google Cloud Platform (GCP)](https://cloud.google.com/?hl=en) project in order to use BigQuery."
      ],
      "metadata": {
        "id": "EIHjFVg1kymy"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Install Dependencies"
      ],
      "metadata": {
        "id": "ElAE7F9Dk95_"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# GCP libraries\n",
        "from google.cloud import bigquery\n",
        "from google.colab import auth"
      ],
      "metadata": {
        "id": "Bg0hBpGMk7CI"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Authenticate\n",
        "In order to utilize BigQuery, we must obtain authorization to BigQuery and Google Cloud.\n",
        "\n",
        "[Alternative authentication methods](https://googleapis.github.io/google-cloud-python/latest/core/auth.html)"
      ],
      "metadata": {
        "id": "bLLYm1XFlAs0"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# if you're using Google Colab, authenticate to gcloud with the following\n",
        "auth.authenticate_user()\n",
        "\n",
        "# alternatively, use the gcloud SDK\n",
        "#!gcloud auth application-default login"
      ],
      "metadata": {
        "id": "VbutWXlMlDCd"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Google project ID\n",
        "Set your own Google project ID for use with this notebook."
      ],
      "metadata": {
        "id": "XT8fbJwdlF5u"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Create a variable for which client to use with BigQuery\n",
        "project_id = 'YOUR_PROJECT_ID_CHANGE_ME' # Update with your Google Project number"
      ],
      "metadata": {
        "id": "V5mN_upblIfi"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### BigQuery Client"
      ],
      "metadata": {
        "id": "x6sSNl2nlKrY"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "if project_id == 'YOUR_PROJECT_ID_CHANGE_ME': # checking that project id was changed\n",
        "  print('Please update the project number with your Google Cloud Project')\n",
        "else: client = bigquery.Client(project_id)"
      ],
      "metadata": {
        "id": "PmlBR6_MlMoP"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Create a Simple Cohort\n",
        "To create this cohort, we are going to explore TCGA data to find projects with patients (cases) that have a history of smoking and the number of years they smoked. We then will explore the data types available in the top three projects."
      ],
      "metadata": {
        "id": "-Aje4HeMCgyq"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Determine which Projects to Use\n",
        "We will use the TCGA clinical table to find projects with the most data for the number of years smoked. This query can be used to look at any feature in the clinical table. The query uses COUNT and GROUP BY to calculate the number of patients with data."
      ],
      "metadata": {
        "id": "8CxFbJwlFhMr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Create query string\n",
        "query_clinical = \"\"\"\n",
        "  SELECT\n",
        "    proj__project_id,\n",
        "    COUNT(proj__project_id) AS n\n",
        "  FROM\n",
        "    `isb-cgc-bq.TCGA_versioned.clinical_gdc_r37`\n",
        "  WHERE\n",
        "    exp__years_smoked is not Null\n",
        "  GROUP BY\n",
        "    proj__project_id\n",
        "  ORDER BY\n",
        "    n DESC\n",
        " \"\"\"\n",
        "\n",
        "# Query BigQuery with an output to a dataframe\n",
        "result_clinical_projects = client.query(query_clinical).to_dataframe()\n",
        "\n",
        "# Create a variable of the three most populated projects\n",
        "project_ids = \"', '\".join(result_clinical_projects['proj__project_id'][0:3])\n",
        "\n",
        "# Print results\n",
        "print(result_clinical_projects)\n",
        "print(f\"The number of total rows: {len(result_clinical_projects)}\")"
      ],
      "metadata": {
        "id": "sb1WYFYgPeRq"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Create a List of Cases with Selected Projects\n",
        "The [TCGA-LUSC (Lung Squamous Cell Carcinoma)](https://portal.gdc.cancer.gov/projects/TCGA-LUSC), [TCGA-LUAD (Lung Adenocarcinoma)](https://portal.gdc.cancer.gov/projects/TCGA-LUAD), and [TCGA-HNSC (Head and Neck Squamous Cell Carcinoma)](https://portal.gdc.cancer.gov/projects/TCGA-HNSC) projects are the top three projects with patients that have data for number of years smoking. The patients in these projects will form the base of our cohort. A cohort can be created by a number of different filters such as patients that smoked more than 10 years or the primary disease type."
      ],
      "metadata": {
        "id": "gM3z9bljqc1C"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Create query string\n",
        "query_case = f\"\"\"\n",
        "  SELECT\n",
        "    proj__project_id,\n",
        "    case_id,\n",
        "    exp__years_smoked\n",
        "  FROM\n",
        "    `isb-cgc-bq.TCGA_versioned.clinical_gdc_r37`\n",
        "  WHERE\n",
        "    proj__project_id IN ('{project_ids}') AND exp__years_smoked is not Null\n",
        "  \"\"\"\n",
        "\n",
        "# Query BigQuery with an output to a dataframe\n",
        "clinical_case_ids = client.query(query_case).to_dataframe()\n",
        "\n",
        "# Create a variable of the case ids\n",
        "case_ids = \"', '\".join(clinical_case_ids['case_id'])\n",
        "\n",
        "# Print results\n",
        "print(clinical_case_ids.head(5))\n",
        "print(f\"The number of total rows: {len(clinical_case_ids)}\")"
      ],
      "metadata": {
        "id": "rwmimB-hG3ji"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### View the Data Types Available\n",
        "Now that we have a cohort of case ids, we can use that list to discover the available data types with the file table."
      ],
      "metadata": {
        "id": "Q_OIEJXHw2Yy"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "query_file_types = f\"\"\"\n",
        "  SELECT\n",
        "    file.project_short_name,\n",
        "    file.data_category,\n",
        "    file.data_type,\n",
        "    file.data_format,\n",
        "    COUNT(file.file_gdc_id) as n\n",
        "  FROM\n",
        "    `isb-cgc-bq.GDC_case_file_metadata.fileData_active_current` AS file\n",
        "  WHERE\n",
        "    associated_entities__case_gdc_id IN ('{case_ids}')\n",
        "  GROUP BY\n",
        "    file.project_short_name,\n",
        "    file.data_category,\n",
        "    file.data_type,\n",
        "    file.data_format\n",
        "  ORDER BY file.data_type, n DESC\n",
        "  \"\"\"\n",
        "result_file_types = client.query(query_file_types).to_dataframe().head(5)\n",
        "print(result_file_types)"
      ],
      "metadata": {
        "id": "vMwK5jQfE03T"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "We can then use the cohort of case ids to create a table with RNA seq data for the associated aliquots.\n",
        "\n",
        "*Note: We used `LIMIT 50` in the query to limit the amount of data that is returned by the query*"
      ],
      "metadata": {
        "id": "Z8PSTiTkZ7dT"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "query_rna = f\"\"\"\n",
        "  SELECT\n",
        "    aliquot_gdc_id,\n",
        "    fpkm_unstranded\n",
        "  FROM\n",
        "    `isb-cgc-bq.TCGA_versioned.RNAseq_hg38_gdc_r35`\n",
        "  WHERE\n",
        "    \tcase_gdc_id IN ('{case_ids}')\n",
        "  LIMIT 50\n",
        "\"\"\"\n",
        "results_rna_seq = client.query(query_rna).to_dataframe()\n",
        "print(results_rna_seq.head(5))"
      ],
      "metadata": {
        "id": "wzZLtPS5fkX_"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The final query to create the table of RNA seq data can also be created with one query with a join between the clinical table and the RNA seq table.\n",
        "\n",
        "```\n",
        "  SELECT\n",
        "    c.case_id,\n",
        "    c.exp__years_smoked,\n",
        "    r.fpkm_unstranded\n",
        "  FROM\n",
        "    `isb-cgc-bq.TCGA_versioned.clinical_gdc_r37` AS c\n",
        "  JOIN\n",
        "    `isb-cgc-bq.TCGA_versioned.RNAseq_hg38_gdc_r35` AS r\n",
        "  ON\n",
        "    c.case_id = r.case_gdc_id\n",
        "  WHERE\n",
        "    proj__project_id IN ('TCGA-LUSC',\n",
        "      'TCGA-LUAD',\n",
        "      'TCGA-HNSC')\n",
        "    AND exp__years_smoked IS NOT NULL\n",
        "```"
      ],
      "metadata": {
        "id": "Toq0M5HhdeLg"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Closing\n",
        "Thank you for working through this notebook. We hope that you found this exercise to be helpful in finding relevant cohorts for your studies. Please explore the ISB-CGC ecosystem at [isb-cgc.org](isb-cgc.org).\n",
        "\n",
        "For questions, comments, or troubleshooting, please contact us at feedback@isb-cgc.org. We are especially keen on learning about your particular use-cases, and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions. Also, check out our virtual [Office Hours on Tuesdays and Thursdays](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/office_hours.html)."
      ],
      "metadata": {
        "id": "Vs9gHPheAz2N"
      }
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}