{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "_thGR656x6aK"
},
"source": [
"# ISB-CGC Community Notebooks\n",
"\n",
"Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!\n",
"\n",
"```\n",
"Title: How to create a complext cohort\n",
"Author: Lauren Hagen\n",
"Created: 2019-08-12\n",
"Purpose: More complex overview of creating cohorts\n",
"URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_create_a_complex_cohort.ipynb\n",
"Notes: This notebook was adapted from work by Sheila Reynolds, 'How to Create TCGA Cohorts part 3' https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb.\n",
"```\n",
"***"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "w9Gaxslcx3sl"
},
"source": [
"This notebook will construct a cohort for a single tumor type based on data availability, while also taking into consideration annotations about the patients or samples.\n",
"\n",
"As you've seen already, in order to work with BigQuery, the first thing we need to do is import the bigquery module:\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "Ve3ratDweAWv"
},
"outputs": [],
"source": [
"from google.cloud import bigquery"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ny1gH0J1fEJn"
},
"source": [
"And we will need to Authenticate ourselves."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 360
},
"colab_type": "code",
"id": "HaNQWUAJfDwi",
"outputId": "6f3cc6c2-73e1-4fd8-cb94-7be6f1a8135b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Go to the following link in your browser:\n",
"\n",
" https://accounts.google.com/o/oauth2/auth?code_challenge=acrfkWr7xOoV243XaGJFWtUaFIPXXMWRTbZtsKTMmGE&prompt=select_account&code_challenge_method=S256&access_type=offline&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth\n",
"\n",
"\n",
"Enter verification code: 4/qQGcKBsJfiBHp-IaAtot6ViuE6xFuruVxGq60zoersfxKp3re1Qfy6k\n",
"\n",
"Credentials saved to file: [/content/.config/application_default_credentials.json]\n",
"\n",
"These credentials will be used by any library that requests\n",
"Application Default Credentials.\n",
"\n",
"To generate an access token for other uses, run:\n",
" gcloud auth application-default print-access-token\n",
"\n",
"\n",
"To take a quick anonymous survey, run:\n",
" $ gcloud alpha survey\n",
"\n"
]
}
],
"source": [
"!gcloud auth application-default login"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "9bUoHgrLeAWy"
},
"source": [
"Just so that this doesn't get buried in the code below, we are going to specify our tumor-type of interest here. In TCGA each tumor-type is also a separate *study* within the TCGA *project*. The studies are referred to based on the 2-4 letter tumor-type abbreviation. A complete list of all study abbreviations, with the full study name can be found on this [page](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations). For this particular exercise, we will look at the \"Breast invasive carcinoma\" study, abbreviated BRCA:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "G-bO1RKYeAWz"
},
"outputs": [],
"source": [
"studyName = \"TCGA-BRCA\""
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "_79s6zMPeAW1"
},
"source": [
"More information the the BRCA study can be found [here](https://portal.gdc.cancer.gov/projects/TCGA-BRCA). In this notebook, we are going to wind up making use of all of the available data types, so let's have a look at the entire **`TCGA_hg38_data_v0`** dataset in the query below. The tables and data sets available from ISB-CGC in BigQuery can also be explored without login with the [ISB-CGC BigQuery Table Searcher](https://isb-cgc.appspot.com/bq_meta_search/)."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "Md6FYl7asQrO"
},
"outputs": [],
"source": [
"# Create a client to access the data within BigQuery\n",
"client = bigquery.Client('isb-cgc')\n",
"\n",
"# Create a variable of datasets \n",
"datasets = list(client.list_datasets())\n",
"# Create a variable for the name of the project\n",
"project = client.project"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 646
},
"colab_type": "code",
"id": "rtFEmuPJeAW1",
"outputId": "35be998d-2787-49cb-eb48-2d5f024fabcf"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tables:\n",
"\tCopy_Number_Segment_Masked\n",
"\tCopy_Number_Segment_Masked_r14\n",
"\tDNA_Methylation\n",
"\tDNA_Methylation_chr1\n",
"\tDNA_Methylation_chr10\n",
"\tDNA_Methylation_chr11\n",
"\tDNA_Methylation_chr12\n",
"\tDNA_Methylation_chr13\n",
"\tDNA_Methylation_chr14\n",
"\tDNA_Methylation_chr15\n",
"\tDNA_Methylation_chr16\n",
"\tDNA_Methylation_chr17\n",
"\tDNA_Methylation_chr18\n",
"\tDNA_Methylation_chr19\n",
"\tDNA_Methylation_chr2\n",
"\tDNA_Methylation_chr20\n",
"\tDNA_Methylation_chr21\n",
"\tDNA_Methylation_chr22\n",
"\tDNA_Methylation_chr3\n",
"\tDNA_Methylation_chr4\n",
"\tDNA_Methylation_chr5\n",
"\tDNA_Methylation_chr6\n",
"\tDNA_Methylation_chr7\n",
"\tDNA_Methylation_chr8\n",
"\tDNA_Methylation_chr9\n",
"\tDNA_Methylation_chrX\n",
"\tDNA_Methylation_chrY\n",
"\tProtein_Expression\n",
"\tRNAseq_Gene_Expression\n",
"\tSomatic_Mutation\n",
"\tSomatic_Mutation_DR10\n",
"\tSomatic_Mutation_DR6\n",
"\tSomatic_Mutation_DR7\n",
"\tmiRNAseq_Expression\n",
"\tmiRNAseq_Isoform_Expression\n",
"\ttcga_metadata_data_hg38_r14\n"
]
}
],
"source": [
"print(\"Tables:\")\n",
"# Create a variable with the list of tables in the dataset\n",
"tables = list(client.list_tables('TCGA_hg38_data_v0'))\n",
"# If there are tables then print their names,\n",
"# else print that there are no tables\n",
"if tables:\n",
" for table in tables:\n",
" print(\"\\t{}\".format(table.table_id))\n",
"else:\n",
" print(\"\\tThis dataset does not contain any tables.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "IIhtHstReAW9"
},
"source": [
"In this next code cell, we define an SQL query called **`DNU_patients`** which finds all patients in the Annotations table which have either been 'redacted' or had 'unacceptable prior treatment'. It will display the output of the query and then save the data to a Pandas DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "gDDETPRpeAW_"
},
"source": [
"Now we'll use the query defined above to get the \"Do Not Use\" list of participants (aka patients):\n",
"\n",
"** Note: you will need to update 'your_project_number' with your project number before continuing with the notebook **"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "tMNFfZGueAW9"
},
"outputs": [],
"source": [
"%%bigquery DNU_patients --your_project_number\n",
"\n",
"SELECT\n",
" case_barcode,\n",
" category AS categoryName,\n",
" classification AS classificationName\n",
"FROM\n",
" `isb-cgc.TCGA_bioclin_v0.Annotations`\n",
"WHERE\n",
" ( entity_type=\"Patient\"\n",
" AND (category=\"History of unacceptable prior treatment related to a prior/other malignancy\"\n",
" OR classification=\"Redaction\" ) )\n",
"GROUP BY\n",
" case_barcode,\n",
" categoryName,\n",
" classificationName\n",
"ORDER BY\n",
" case_barcode"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 173
},
"colab_type": "code",
"id": "29MqTudpeAXA",
"outputId": "4e344d7c-1401-4203-c4f5-fdec226e2212"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" case_barcode | \n",
" categoryName | \n",
" classificationName | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 212 | \n",
" 212 | \n",
" 212 | \n",
"
\n",
" \n",
" unique | \n",
" 212 | \n",
" 8 | \n",
" 2 | \n",
"
\n",
" \n",
" top | \n",
" TCGA-BP-5006 | \n",
" History of unacceptable prior treatment relate... | \n",
" Notification | \n",
"
\n",
" \n",
" freq | \n",
" 1 | \n",
" 137 | \n",
" 137 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" case_barcode ... classificationName\n",
"count 212 ... 212\n",
"unique 212 ... 2\n",
"top TCGA-BP-5006 ... Notification\n",
"freq 1 ... 137\n",
"\n",
"[4 rows x 3 columns]"
]
},
"execution_count": 7,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"DNU_patients.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "KkbvIZ5aeAXG"
},
"source": [
"Now we're gong to use the query defined previously in a function that builds a \"clean\" list of patients in the specified study, with available molecular data, and without any disqualifying annotations."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "AyCOJDeQeAXG"
},
"outputs": [],
"source": [
"def buildCleanBarcodeList ( studyName, bqDataset, barcodeType, DNUList):\n",
" print(\"in buildCleanBarcodeList ... \", studyName) # Print a start statement\n",
" ULists = [] # Define an empty list for Unique Barcodes\n",
" print(\" --> looping over data tables: \") # Print an indication of when the loop is beginning\n",
" barcodeField = barcodeType # Set the barcodeField for each loop\n",
" # For each dataset in the bqDataset table\n",
" for t in bqDataset:\n",
" currTable = \"`isb-cgc.TCGA_hg38_data_v0.\" + t.table_id + \"`\" # Set the current table\n",
" barcodeField = barcodeType # reset the barcode Field for each loop\n",
" # Try the simple query that will get the barcode list from most of the tables\n",
" try:\n",
" # Set query string\n",
" get_barcode_list = \"\"\"\n",
" SELECT {} as barcode\n",
" FROM {}\n",
" WHERE project_short_name=\"{}\"\n",
" GROUP BY {}\n",
" ORDER BY {}\n",
" \"\"\".format(barcodeField, currTable, studyName, barcodeField, barcodeField)\n",
" # Query BigQuery\n",
" results = bigquery.Client('isb-cgc-02-0001').query(get_barcode_list)\n",
" # Store the results into a dataframe\n",
" results = results.result().to_dataframe()\n",
" # If the simple query does not run, try joining the two tables to add the\n",
" # project_short_name to the the query as it is not in some of the Methylation tables\n",
" except:\n",
" try:\n",
" get_barcode_list = \"\"\"\n",
" WITH a AS(\n",
" SELECT {}\n",
" FROM {}),\n",
" b AS ( SELECT {}, project_short_name\n",
" FROM `isb-cgc.TCGA_hg38_data_v0.Copy_Number_Segment_Masked`)\n",
" SELECT {} as barcode\n",
" FROM (\n",
" SELECT a.{} AS {}, b.project_short_name AS project_short_name\n",
" FROM a\n",
" JOIN b\n",
" ON a.{} = b.{})\n",
" WHERE project_short_name='{}'\n",
" GROUP BY {}\n",
" ORDER BY {}\n",
" \"\"\".format(barcodeField, currTable, barcodeField, barcodeField, barcodeField, barcodeField, barcodeField, barcodeField, studyName, barcodeField, barcodeField)\n",
" # Query BigQuery\n",
" results = bigquery.Client('isb-cgc-02-0001').query(get_barcode_list)\n",
" # Store the results into a dataframe\n",
" results = results.result().to_dataframe()\n",
" except:\n",
" try:\n",
" barcodeField = \"sample_barcode_tumor\"\n",
" get_barcode_list = \"\"\"\n",
" SELECT {} as barcode\n",
" FROM {}\n",
" WHERE project_short_name=\"{}\"\n",
" GROUP BY {}\n",
" ORDER BY {}\n",
" \"\"\".format(barcodeField, currTable, studyName, barcodeField, barcodeField)\n",
" # Query BigQuery\n",
" results = bigquery.Client('isb-cgc-02-0001').query(get_barcode_list)\n",
" # Store the results into a dataframe\n",
" results = results.result().to_dataframe()\n",
" except:\n",
" # Create an empty dataframe if none of the above queries return values\n",
" results = pd.DataFrame()\n",
" # If there is data in the result dataframe, add the results to ULists\n",
" if ( len(results) > 0):\n",
" print(\" \", t.table_id, \" --> \", len(results[\"barcode\"]), \"unique barcodes.\")\n",
" ULists += [ results ]\n",
" else:\n",
" print(\" \", t.table_id, \" --> no data returned\")\n",
" print(\" --> we have {} lists to merge\".format(len(ULists)))\n",
" masterList = [] # Create a list for the master list of barcodes\n",
" # Add barcodes to the master list with no repeating barcodes\n",
" for aList in ULists:\n",
" for aBarcode in aList[\"barcode\"]:\n",
" if ( aBarcode not in masterList ):\n",
" masterList += [aBarcode]\n",
" print(\" --> which results in a single list with {} barcodes\".format(len(masterList)))\n",
" print(\" --> removing DNU barcodes\")\n",
" cleanList = []\n",
" # For each barcode in the master check to see if it is in the DNU\n",
" # list and then add it to the clean list\n",
" for aBarcode in masterList:\n",
" if ( aBarcode not in DNUList[barcodeField].tolist() ):\n",
" cleanList += [ aBarcode ]\n",
" else:\n",
" print(\" excluding this barcode\", aBarcode)\n",
" print(\" --> returning a clean list with {} barcodes\".format(len(cleanList)))\n",
" return(cleanList)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 918
},
"colab_type": "code",
"id": "i7afHYgZeAXJ",
"outputId": "96ac43e9-47f0-4815-cf01-bcb12a25f280"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"in buildCleanBarcodeList ... TCGA-BRCA\n",
" --> looping over data tables: \n",
" Copy_Number_Segment_Masked --> 1096 unique barcodes.\n",
" Copy_Number_Segment_Masked_r14 --> 1098 unique barcodes.\n",
" DNA_Methylation --> 1095 unique barcodes.\n",
" DNA_Methylation_chr1 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr10 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr11 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr12 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr13 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr14 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr15 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr16 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr17 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr18 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr19 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr2 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr20 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr21 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr22 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr3 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr4 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr5 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr6 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr7 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr8 --> 1095 unique barcodes.\n",
" DNA_Methylation_chr9 --> 1095 unique barcodes.\n",
" DNA_Methylation_chrX --> 1095 unique barcodes.\n",
" DNA_Methylation_chrY --> 1095 unique barcodes.\n",
" Protein_Expression --> 887 unique barcodes.\n",
" RNAseq_Gene_Expression --> 1092 unique barcodes.\n",
" Somatic_Mutation --> 986 unique barcodes.\n",
" Somatic_Mutation_DR10 --> 986 unique barcodes.\n",
" Somatic_Mutation_DR6 --> 986 unique barcodes.\n",
" Somatic_Mutation_DR7 --> 986 unique barcodes.\n",
" miRNAseq_Expression --> 1079 unique barcodes.\n",
" miRNAseq_Isoform_Expression --> 1079 unique barcodes.\n",
" tcga_metadata_data_hg38_r14 --> 1098 unique barcodes.\n",
" --> we have 36 lists to merge\n",
" --> which results in a single list with 1098 barcodes\n",
" --> removing DNU barcodes\n",
" excluding this barcode TCGA-5L-AAT1\n",
" excluding this barcode TCGA-A8-A084\n",
" excluding this barcode TCGA-A8-A08F\n",
" excluding this barcode TCGA-A8-A08S\n",
" excluding this barcode TCGA-A8-A09E\n",
" excluding this barcode TCGA-A8-A09K\n",
" excluding this barcode TCGA-AR-A2LL\n",
" excluding this barcode TCGA-AR-A2LR\n",
" excluding this barcode TCGA-BH-A0B6\n",
" excluding this barcode TCGA-BH-A1F5\n",
" excluding this barcode TCGA-D8-A146\n",
" --> returning a clean list with 1087 barcodes\n"
]
}
],
"source": [
"barcodeType = \"case_barcode\"\n",
"cleanPatientList = buildCleanBarcodeList ( studyName, tables, barcodeType, DNU_patients )"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "8ySjCPXceAXM"
},
"source": [
"Now we are going to repeat the same process, but at the sample barcode level. Most patients will have provided two samples, a \"primary tumor\" sample, and a \"normal blood\" sample, but in some cases additional or different types of samples may have been provided, and sample-level annotations may exist that should result in samples being excluded from most downstream analyses."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "xKKnb0NHeAXN"
},
"outputs": [],
"source": [
"%%bigquery DNUsamples --project your_project_number\n",
"\n",
"# there are many different types of annotations that are at the \"sample\" level\n",
"# in the Annotations table, and most of them seem like they should be \"disqualifying\"\n",
"# annotations, so for now we will just return all sample barcodes with sample-level\n",
"# annotations\n",
"SELECT\n",
" sample_barcode\n",
"FROM\n",
" `isb-cgc.TCGA_bioclin_v0.Annotations`\n",
"WHERE\n",
" (entity_type=\"Sample\")\n",
"GROUP BY\n",
" sample_barcode\n",
"ORDER BY\n",
" sample_barcode"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 173
},
"colab_type": "code",
"id": "I-90_WPieAXQ",
"outputId": "02075509-5297-4b39-d177-55610c324051"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sample_barcode | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 113 | \n",
"
\n",
" \n",
" unique | \n",
" 113 | \n",
"
\n",
" \n",
" top | \n",
" TCGA-06-0149-01A | \n",
"
\n",
" \n",
" freq | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sample_barcode\n",
"count 113\n",
"unique 113\n",
"top TCGA-06-0149-01A\n",
"freq 1"
]
},
"execution_count": 16,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"DNUsamples.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "VC1Pg0cseAXS"
},
"source": [
"And now we can re-use the previously defined function get a clean list of sample-level barcodes:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 748
},
"colab_type": "code",
"id": "xHERsvlyeAXT",
"outputId": "053ac9f7-8988-4939-fa50-f41a65ede742"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"in buildCleanBarcodeList ... TCGA-BRCA\n",
" --> looping over data tables: \n",
" Copy_Number_Segment_Masked --> 2219 unique barcodes.\n",
" Copy_Number_Segment_Masked_r14 --> 2224 unique barcodes.\n",
" DNA_Methylation --> 1205 unique barcodes.\n",
" DNA_Methylation_chr1 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr10 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr11 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr12 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr13 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr14 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr15 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr16 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr17 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr18 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr19 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr2 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr20 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr21 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr22 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr3 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr4 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr5 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr6 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr7 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr8 --> 1205 unique barcodes.\n",
" DNA_Methylation_chr9 --> 1205 unique barcodes.\n",
" DNA_Methylation_chrX --> 1205 unique barcodes.\n",
" DNA_Methylation_chrY --> 1205 unique barcodes.\n",
" Protein_Expression --> 937 unique barcodes.\n",
" RNAseq_Gene_Expression --> 1217 unique barcodes.\n",
" Somatic_Mutation --> 986 unique barcodes.\n",
" Somatic_Mutation_DR10 --> 986 unique barcodes.\n",
" Somatic_Mutation_DR6 --> 986 unique barcodes.\n",
" Somatic_Mutation_DR7 --> 986 unique barcodes.\n",
" miRNAseq_Expression --> 1202 unique barcodes.\n",
" miRNAseq_Isoform_Expression --> 1202 unique barcodes.\n",
" tcga_metadata_data_hg38_r14 --> 3298 unique barcodes.\n",
" --> we have 36 lists to merge\n",
" --> which results in a single list with 3339 barcodes\n",
" --> removing DNU barcodes\n",
" excluding this barcode TCGA-B6-A1KC-01A\n",
" --> returning a clean list with 3338 barcodes\n"
]
}
],
"source": [
"barcodeType = \"sample_barcode\"\n",
"cleanSampleList = buildCleanBarcodeList ( studyName, tables, barcodeType, DNUsamples )"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "8CzJFiO7eAXV"
},
"source": [
"Now we're going to double-check first that we keep only sample-level barcodes that correspond to patients in the \"clean\" list of patients, and then we'll also filter the list of patients in case there are patients with no samples remaining in the \"clean\" list of samples."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 595
},
"colab_type": "code",
"id": "BGYtt3xVeAXW",
"outputId": "1a663043-236a-413e-d987-f96b4fd5ff47"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" excluding this sample in the final pass: TCGA-5L-AAT1-01A\n",
" excluding this sample in the final pass: TCGA-5L-AAT1-10A\n",
" excluding this sample in the final pass: TCGA-A8-A084-01A\n",
" excluding this sample in the final pass: TCGA-A8-A084-10A\n",
" excluding this sample in the final pass: TCGA-A8-A08F-01A\n",
" excluding this sample in the final pass: TCGA-A8-A08F-10A\n",
" excluding this sample in the final pass: TCGA-A8-A08S-01A\n",
" excluding this sample in the final pass: TCGA-A8-A08S-10A\n",
" excluding this sample in the final pass: TCGA-A8-A09E-01A\n",
" excluding this sample in the final pass: TCGA-A8-A09E-10A\n",
" excluding this sample in the final pass: TCGA-A8-A09K-01A\n",
" excluding this sample in the final pass: TCGA-A8-A09K-10A\n",
" excluding this sample in the final pass: TCGA-AR-A2LL-01A\n",
" excluding this sample in the final pass: TCGA-AR-A2LL-10A\n",
" excluding this sample in the final pass: TCGA-AR-A2LR-01A\n",
" excluding this sample in the final pass: TCGA-AR-A2LR-10A\n",
" excluding this sample in the final pass: TCGA-BH-A0B6-01A\n",
" excluding this sample in the final pass: TCGA-BH-A0B6-10A\n",
" excluding this sample in the final pass: TCGA-BH-A1F5-01A\n",
" excluding this sample in the final pass: TCGA-BH-A1F5-11A\n",
" excluding this sample in the final pass: TCGA-D8-A146-01A\n",
" excluding this sample in the final pass: TCGA-D8-A146-10A\n",
" excluding this sample in the final pass: TCGA-5L-AAT1-01Z\n",
" excluding this sample in the final pass: TCGA-A8-A084-01Z\n",
" excluding this sample in the final pass: TCGA-A8-A08F-01Z\n",
" excluding this sample in the final pass: TCGA-A8-A08S-01Z\n",
" excluding this sample in the final pass: TCGA-A8-A09E-01Z\n",
" excluding this sample in the final pass: TCGA-A8-A09K-01Z\n",
" excluding this sample in the final pass: TCGA-AR-A2LL-01Z\n",
" excluding this sample in the final pass: TCGA-AR-A2LR-01Z\n",
" excluding this sample in the final pass: TCGA-BH-A0B6-01Z\n",
" excluding this sample in the final pass: TCGA-BH-A1F5-01Z\n",
" excluding this sample in the final pass: TCGA-D8-A146-01Z\n",
" Length of final sample list: 3305 \n"
]
}
],
"source": [
"finalSampleList = []\n",
"for aSample in cleanSampleList:\n",
" aPatient = aSample[:12]\n",
" if ( aPatient not in cleanPatientList ):\n",
" print(\" excluding this sample in the final pass: \", aSample)\n",
" else:\n",
" finalSampleList += [aSample]\n",
" \n",
"print(\" Length of final sample list: {} \".format(len(finalSampleList)))"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "FkN18GRQeAXZ",
"outputId": "9db732be-91cf-49f6-c193-88d620febc3b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Lenth of final patient list: 1087 \n"
]
}
],
"source": [
"finalPatientList = []\n",
"for aSample in finalSampleList:\n",
" aPatient = aSample[:12]\n",
" if ( aPatient not in finalPatientList ):\n",
" finalPatientList += [ aPatient ]\n",
" \n",
"print(\" Lenth of final patient list: {} \".format(len(finalPatientList)))\n",
"\n",
"for aPatient in cleanPatientList:\n",
" if ( aPatient not in finalPatientList ):\n",
" print(\" --> patient removed in final pass: \", aPatient)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ZA40Mh-zeAXd"
},
"source": [
"We're also interested in knowing what *types* of samples we have. The codes for the major types of samples are:\n",
"- **01** : primary solid tumor\n",
"- **02** : recurrent solid tumor\n",
"- **03** : primary blood derived cancer\n",
"- **06** : metastatic\n",
"- **10** : blood derived normal\n",
"- **11** : solid tissue normal\n",
"\n",
"and a complete list of all sample type codes and their definitions can be found [here](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes)."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 85
},
"colab_type": "code",
"id": "dUF0Io0neAXe",
"outputId": "dbb28f5f-872f-4dac-c85d-d61ea1afbbe0"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 2145 samples of type 01 \n",
" 8 samples of type 06 \n",
" 991 samples of type 10 \n",
" 161 samples of type 11 \n"
]
}
],
"source": [
"sampleCounts = {}\n",
"for aSample in finalSampleList:\n",
" sType = str(aSample[13:15])\n",
" if ( sType not in sampleCounts ): sampleCounts[sType] = 0\n",
" sampleCounts[sType] += 1\n",
" \n",
"for aKey in sorted(sampleCounts):\n",
" print(\" {} samples of type {} \".format(sampleCounts[aKey], aKey))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Ts0EQNfgeAXi"
},
"source": [
"Now we are going to create a simple dataframe with all of the sample barcodes and the associated patient (participant) barcodes so that we can write this to a BigQuery \"cohort\" table."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 173
},
"colab_type": "code",
"id": "W_2U7LfzeAXj",
"outputId": "3bb33916-9b58-4693-ecc3-7f6f1e69064b"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ParticipantBarcode | \n",
" SampleBarcode | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 3305 | \n",
" 3305 | \n",
"
\n",
" \n",
" unique | \n",
" 1087 | \n",
" 3305 | \n",
"
\n",
" \n",
" top | \n",
" TCGA-A7-A0DB | \n",
" TCGA-AN-A0XT-01A | \n",
"
\n",
" \n",
" freq | \n",
" 5 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ParticipantBarcode SampleBarcode\n",
"count 3305 3305\n",
"unique 1087 3305\n",
"top TCGA-A7-A0DB TCGA-AN-A0XT-01A\n",
"freq 5 1"
]
},
"execution_count": 31,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"patientBarcodes = []\n",
"sampleBarcodes = []\n",
"for aSample in finalSampleList:\n",
" sampleBarcodes += [aSample]\n",
" patientBarcodes += [aSample[:12]]\n",
"df = pd.DataFrame ( { 'ParticipantBarcode': patientBarcodes,\n",
" 'SampleBarcode': sampleBarcodes } )\n",
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "MVW3R0LsGb5s"
},
"source": [
"As a next step you may want to consider is to put the data into your GCP. An example of how to move files in and out of GCP and BigQuery can be found [here](https://github.com/isb-cgc/Community-Notebooks/tree/master/Notebooks) along with other tutorial notebooks."
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "How to create a complex cohort",
"provenance": [],
"version": "0.3.2"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}