{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "LEtKzHMe1U9K"
},
"source": [
"# ISB-CGC Community Notebooks\n",
"\n",
"Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!\n",
"\n",
"```\n",
"Title: How to Find GDC File Locations\n",
"Author: Lauren Hagen\n",
"Created: 2019-08-13\n",
"Purpose: Demonstrate how to find GDC file locations using manifests available in BigQuery\n",
"URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_Find_GDC_File_Locations.ipynb\n",
"Notes: \n",
"```\n",
"***"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "7aaiBGof2PQY"
},
"source": [
"# How to Find GDC File Locations\n",
"\n",
"In this notebook, we will explore the data sets available in the [Genomic Data Commons](https://gdc.cancer.gov/) using the GDC metadata tables in BigQuery and find the file location within GDC. The metadata tables are useful because several of the available data sets in GDC are not yet available in the ISB-CGC WebApp or as BigQuery tables. This also means that the data sets can't be used with the ISB-CGC API's. The metadata tables can help you find which data sets are available in GDC along with their locations and available file and sequencing types. The available metadata tables along with other data sets and tables from ISB-CGC can be explored without login with the [ISB-CGC BigQuery Table Searcher](https://isb-cgc.appspot.com/bq_meta_search/).\n",
"\n",
"This notebook has been designed to keep itself up to date when new metadata tables releases as new data sets are added to GDC or updated every few months though we will be using the tables from release 14 in the examples.\n",
"\n",
"But first things first is to load the BigQuery module, authenticate ourselves, and create a client variable."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "p0l_vYzJIFzS"
},
"outputs": [],
"source": [
"# Load the BigQuery Module\n",
"from google.cloud import bigquery"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 360
},
"colab_type": "code",
"id": "FWnl9YROIJHI",
"outputId": "a824676b-1b28-4305-850b-2e8506bb08cf"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Go to the following link in your browser:\n",
"\n",
" https://accounts.google.com/o/oauth2/auth?redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&prompt=select_account&response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&access_type=offline\n",
"\n",
"\n",
"Enter verification code: 4/pAFKsa-5H-VSrDxsHGX_MbCErcPGRkCOLK1KVm1fEJpAUueWjjGBEg4\n",
"\n",
"Credentials saved to file: [/content/.config/application_default_credentials.json]\n",
"\n",
"These credentials will be used by any library that requests\n",
"Application Default Credentials.\n",
"\n",
"To generate an access token for other uses, run:\n",
" gcloud auth application-default print-access-token\n",
"\n",
"\n",
"To take a quick anonymous survey, run:\n",
" $ gcloud alpha survey\n",
"\n"
]
}
],
"source": [
"# Authenticate ourselves\n",
"!gcloud auth application-default login"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "4eu5c5quvtyN"
},
"outputs": [],
"source": [
"# Create a variable for which client to use with BigQuery\n",
"client = bigquery.Client('isb-cgc-02-0001') # Replace isb-cgc-02-0001 with your project ID"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "J2UVmG2YJKj2"
},
"source": [
"Let us first explore the available metadata tables in BigQuery that are hosted by ISB-CGC. The data set in BigQuery is `isb-cgc.GDC_metadata` and tables are listed with the release of data that they are from. For example: `isb-cgc.GDC_metadata.rel14_fileData_current` is from release 14 of data in GDC. An explanation of what data is associated with each release can be found on the [GDC Data Release](https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/) and [ISB-CGC Data Releases and Future Plans](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/updates_and_releases/Data_Releases.html) pages. The manifests that are list with each release is what is used to create the metadata within BigQuery that is hosted by ISB-CGC, so that you do not need to download and tidy the release manifests thus saving you time!\n",
"\n",
"Now let us view which releases are available currently in BigQuery:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 731
},
"colab_type": "code",
"id": "zIfr3k5Fvqpw",
"outputId": "f574f244-e231-4e6e-a1be-b3e692ee5676"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GDC Metadata:\n",
"DLBC_affected_files\n",
"GDC_sync_active_20190104\n",
"GDC_sync_active_20190115\n",
"GDC_sync_legacy_20190104\n",
"GDC_sync_legacy_20190115\n",
"GDC_sync_obsolete_20190115\n",
"PanCanAtlas_manifest\n",
"rel12_GDCfileID_to_GCSurl\n",
"rel12_aliquot2caseIDmap\n",
"rel12_caseData\n",
"rel12_fileData_current\n",
"rel12_fileData_legacy\n",
"rel12_slide2caseIDmap\n",
"rel13_GDCfileID_to_GCSurl\n",
"rel13_aliquot2caseIDmap\n",
"rel13_caseData\n",
"rel13_fileData_current\n",
"rel13_fileData_legacy\n",
"rel13_slide2caseIDmap\n",
"rel14_GDCfileID_to_GCSurl\n",
"rel14_GDCfileID_to_GCSurl_NEW\n",
"rel14_aliquot2caseIDmap\n",
"rel14_caseData\n",
"rel14_fileData_current\n",
"rel14_fileData_legacy\n",
"rel14_slide2caseIDmap\n",
"rel15_aliquot2caseIDmap\n",
"rel15_caseData\n",
"rel15_fileData_current\n",
"rel15_fileData_legacy\n",
"rel15_slide2caseIDmap\n",
"rel16_aliquot2caseIDmap\n",
"rel16_caseData\n",
"rel16_fileData_active\n",
"rel16_fileData_legacy\n",
"rel16_slide2caseIDmap\n",
"rel17_aliquot2caseIDmap\n",
"rel17_caseData\n",
"rel17_fileData_active\n",
"rel17_fileData_legacy\n",
"rel17_slide2caseIDmap\n"
]
}
],
"source": [
"# For each table in the dataset print the number of rows,\n",
"# number of bytes and the name of the table\n",
"print(\"GDC Metadata:\")\n",
"tables = list()\n",
"for t in list(client.list_tables('isb-cgc.GDC_metadata')):\n",
" print(t.table_id)\n",
" tables.append(t.table_id)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "QdOYxgUC0Ybp"
},
"source": [
"Let us get the most recent data release available in the `isb-cgc.GDC_metadata` data set using the python module `re`."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "jCOTx1Xd0j1B"
},
"outputs": [],
"source": [
"import re"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"id": "Aztdp8xCPZvG",
"outputId": "9997c3de-0982-455c-b72c-e083bbd070ea"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"All Available Releases:\n",
"\trel12\n",
"\trel13\n",
"\trel14\n",
"\trel15\n",
"\trel16\n",
"\trel17\n",
"\n",
"Current Available Release:\n",
"\trel17\n"
]
}
],
"source": [
"releases = list()\n",
"print(\"All Available Releases:\")\n",
"for table in tables:\n",
" r = re.search(r\"(rel\\d*)_\", table)\n",
" if r:\n",
" rel = r.group(1)\n",
" if rel not in releases:\n",
" releases.append(rel)\n",
" print(\"\\t\" +rel)\n",
"curr_release = releases[len(releases)-1]\n",
"print(\"\\nCurrent Available Release:\\n\\t\" + curr_release)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "N0n0rhJk3DWr"
},
"source": [
"The next code block can be turned off to use other release sets but the examples in the rest of the notebook may need to updated for the approapriate release and column names."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "qBGrY-fh29E7"
},
"outputs": [],
"source": [
"# We are going to use release 14 as our example\n",
"# If this code block is turn off, please check that the examples are using the\n",
"# correct column names.\n",
"curr_release = \"rel14\""
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "9EOEoF2YwNKn"
},
"source": [
"Each release is split up into several tables based on the data that they help to faciltiate finding.\n",
"\n",
"|Table|Description|\n",
"|-----------|----------|\n",
"|rel#_caseData|List of all of the cases in GDC|\n",
"|rel#_fileData_current or rel#_fileData_active|List of the currently active cases in GDC along with information related to those cases|\n",
"|rel#_fileData_legacy|Same as the previous table but with legacy data instead|\n",
"|rel#_aliquot2caseIDmap|“helper” table to help map between identifiers at different levels of aliquot data. The intrinsic hierarchy is program > project > case > sample > portion > analyte > aliquot|\n",
"|rel#_slide2caseIDmap|“helper” table to help map between identifiers at different levels of tissue slide data. The intrinsic hierarchy is program > project > case > sample > portion > slide|\n",
"|rel#_GDCfileID_to_GCSurl|Gives the Google Cloud Storage location for each file|\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "LcL-PT-43M9Y"
},
"source": [
"Let us see which data sets are available within the GDC."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 235
},
"colab_type": "code",
"id": "b3QJAxWC3NhD",
"outputId": "e3645a70-5981-47b9-b8c0-d28f2ed8a391"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" program_name | \n",
" Num | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TCGA | \n",
" 314354 | \n",
"
\n",
" \n",
" 1 | \n",
" FM | \n",
" 36134 | \n",
"
\n",
" \n",
" 2 | \n",
" TARGET | \n",
" 7138 | \n",
"
\n",
" \n",
" 3 | \n",
" NCICCR | \n",
" 957 | \n",
"
\n",
" \n",
" 4 | \n",
" CTSP | \n",
" 89 | \n",
"
\n",
" \n",
" 5 | \n",
" VAREPOP | \n",
" 7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" program_name Num\n",
"0 TCGA 314354\n",
"1 FM 36134\n",
"2 TARGET 7138\n",
"3 NCICCR 957\n",
"4 CTSP 89\n",
"5 VAREPOP 7"
]
},
"execution_count": 18,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"datasets_query = (\"SELECT program_name, COUNT(program_name) AS Num \"\n",
" \"FROM `isb-cgc.GDC_metadata.\" + curr_release + \"_fileData_current` \"\n",
" \"GROUP BY program_name \"\n",
" \"ORDER BY Num DESC\")\n",
"# Query the table\n",
"datasets_result = client.query(datasets_query)\n",
"# Put the results into a pandas dataframe\n",
"datasets_result = datasets_result.result().to_dataframe()\n",
"datasets_result"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "W8kWBynBme8Z"
},
"source": [
"We are going to explore the Foundation Medicine Adult Cancer Clinical Data set (FM-AD), one of the newer data sets to the GDC. More information can be found on the ['FM-AD Data Set' page](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/data/FM-AD_about.html) in the ISB-CGC documentation.\n",
"\n",
"Let us first take a look at the schema for the `isb-cgc.GDC_metadata.rel14_fileData_current` table as it has the most interesting data for figuring out which files are available in the GDC."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"colab_type": "code",
"id": "gCTjTnQxttZF",
"outputId": "e493e542-7dfc-4980-d9ce-4e9bc7a24c5c"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" column_name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" dbName | \n",
"
\n",
" \n",
" 1 | \n",
" file_gdc_id | \n",
"
\n",
" \n",
" 2 | \n",
" access | \n",
"
\n",
" \n",
" 3 | \n",
" acl | \n",
"
\n",
" \n",
" 4 | \n",
" analysis_input_file_gdc_ids | \n",
"
\n",
" \n",
" 5 | \n",
" analysis_workflow_link | \n",
"
\n",
" \n",
" 6 | \n",
" analysis_workflow_type | \n",
"
\n",
" \n",
" 7 | \n",
" associated_entities__case_gdc_id | \n",
"
\n",
" \n",
" 8 | \n",
" associated_entities__entity_gdc_id | \n",
"
\n",
" \n",
" 9 | \n",
" associated_entities__entity_submitter_id | \n",
"
\n",
" \n",
" 10 | \n",
" associated_entities__entity_type | \n",
"
\n",
" \n",
" 11 | \n",
" case_gdc_id | \n",
"
\n",
" \n",
" 12 | \n",
" project_dbgap_accession_number | \n",
"
\n",
" \n",
" 13 | \n",
" project_disease_type | \n",
"
\n",
" \n",
" 14 | \n",
" project_name | \n",
"
\n",
" \n",
" 15 | \n",
" program_dbgap_accession_number | \n",
"
\n",
" \n",
" 16 | \n",
" program_name | \n",
"
\n",
" \n",
" 17 | \n",
" project_short_name | \n",
"
\n",
" \n",
" 18 | \n",
" created_datetime | \n",
"
\n",
" \n",
" 19 | \n",
" data_category | \n",
"
\n",
" \n",
" 20 | \n",
" data_format | \n",
"
\n",
" \n",
" 21 | \n",
" data_type | \n",
"
\n",
" \n",
" 22 | \n",
" downstream_analyses__output_file_gdc_ids | \n",
"
\n",
" \n",
" 23 | \n",
" downstream_analyses__workflow_link | \n",
"
\n",
" \n",
" 24 | \n",
" downstream_analyses__workflow_type | \n",
"
\n",
" \n",
" 25 | \n",
" experimental_strategy | \n",
"
\n",
" \n",
" 26 | \n",
" file_name | \n",
"
\n",
" \n",
" 27 | \n",
" file_size | \n",
"
\n",
" \n",
" 28 | \n",
" file_id | \n",
"
\n",
" \n",
" 29 | \n",
" index_file_gdc_id | \n",
"
\n",
" \n",
" 30 | \n",
" index_file_name | \n",
"
\n",
" \n",
" 31 | \n",
" index_file_size | \n",
"
\n",
" \n",
" 32 | \n",
" md5sum | \n",
"
\n",
" \n",
" 33 | \n",
" platform | \n",
"
\n",
" \n",
" 34 | \n",
" file_state | \n",
"
\n",
" \n",
" 35 | \n",
" file_submitter_id | \n",
"
\n",
" \n",
" 36 | \n",
" file_type | \n",
"
\n",
" \n",
" 37 | \n",
" updated_datetime | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" column_name\n",
"0 dbName\n",
"1 file_gdc_id\n",
"2 access\n",
"3 acl\n",
"4 analysis_input_file_gdc_ids\n",
"5 analysis_workflow_link\n",
"6 analysis_workflow_type\n",
"7 associated_entities__case_gdc_id\n",
"8 associated_entities__entity_gdc_id\n",
"9 associated_entities__entity_submitter_id\n",
"10 associated_entities__entity_type\n",
"11 case_gdc_id\n",
"12 project_dbgap_accession_number\n",
"13 project_disease_type\n",
"14 project_name\n",
"15 program_dbgap_accession_number\n",
"16 program_name\n",
"17 project_short_name\n",
"18 created_datetime\n",
"19 data_category\n",
"20 data_format\n",
"21 data_type\n",
"22 downstream_analyses__output_file_gdc_ids\n",
"23 downstream_analyses__workflow_link\n",
"24 downstream_analyses__workflow_type\n",
"25 experimental_strategy\n",
"26 file_name\n",
"27 file_size\n",
"28 file_id\n",
"29 index_file_gdc_id\n",
"30 index_file_name\n",
"31 index_file_size\n",
"32 md5sum\n",
"33 platform\n",
"34 file_state\n",
"35 file_submitter_id\n",
"36 file_type\n",
"37 updated_datetime"
]
},
"execution_count": 21,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"fields_query = (\"SELECT column_name \"\n",
" \"FROM `isb-cgc.GDC_metadata.INFORMATION_SCHEMA.COLUMNS` \"\n",
" \"WHERE table_name = '\" + curr_release + \"_fileData_current'\")\n",
"# Query the table\n",
"fields_result = client.query(fields_query)\n",
"# Put the results into a pandas dataframe\n",
"fields_result = fields_result.result().to_dataframe()\n",
"fields_result"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "kJdxnZw4tnXx"
},
"source": [
"Then we will find out which files formats are available on GDC with a SQL query. Since, we might want to look at different fields and their counts, we will write a function that will take the field, table, release, and program name that will then tell us the count of the field as a pandas data frame using the [Google Cloud Client Libraries](https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries#client-libraries-install-python)."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "8Bdm3F2c7dKf"
},
"outputs": [],
"source": [
"def variable_count(field, table_type, release, program):\n",
" # Create a variable for the whole table path\n",
" table_path = \"isb-cgc.GDC_metadata.\" + release + table_type\n",
" # Create a variable with the SQL query\n",
" count_query = (\"SELECT \" + field + \", \"\n",
" \"COUNT(\" + field + \") AS Num FROM `\" + table_path + \"` \"\n",
" \"WHERE program_name = '\" + program + \"' \"\n",
" \"GROUP BY \" + field)\n",
" # Query the table\n",
" query = client.query(count_query)\n",
" # Put the results into a pandas dataframe\n",
" result = query.result().to_dataframe()\n",
" # return the dataframe\n",
" return result"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "6_wAwgNQe95Z"
},
"source": [
"Let us now test the function to view the available data formats for the FM-AD data set."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"colab_type": "code",
"id": "nZ7qtCtg8Amz",
"outputId": "e7bc28ce-6791-42b3-8595-349b2ccc227e"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" data_format | \n",
" Num | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" MAF | \n",
" 42 | \n",
"
\n",
" \n",
" 1 | \n",
" TSV | \n",
" 84 | \n",
"
\n",
" \n",
" 2 | \n",
" VCF | \n",
" 36008 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" data_format Num\n",
"0 MAF 42\n",
"1 TSV 84\n",
"2 VCF 36008"
]
},
"execution_count": 23,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"# Set the field that we are interested in\n",
"field = \"data_format\"\n",
"\n",
"# Set the program that we are interested in\n",
"program = \"FM\"\n",
"\n",
"# Set which table we are going to query\n",
"table_type = \"_fileData_current\"\n",
"\n",
"# run the function with the defined variables\n",
"query_result = variable_count(field, table_type, curr_release, program)\n",
"\n",
"# display the query results\n",
"query_result"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "kuTweprftOOn"
},
"source": [
"It seems that most of the files for FM-AD are VCF files. Next we are going to see how many of the files available are controlled access vs open access."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 111
},
"colab_type": "code",
"id": "unGAb9Kt9YJp",
"outputId": "96e10fe5-c249-46cf-b4d8-ac1e50db7cf3"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" access | \n",
" Num | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" controlled | \n",
" 36050 | \n",
"
\n",
" \n",
" 1 | \n",
" open | \n",
" 84 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" access Num\n",
"0 controlled 36050\n",
"1 open 84"
]
},
"execution_count": 24,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"# Change the field to access type\n",
"field = \"access\"\n",
"\n",
"# run the function with the defined variables\n",
"query_result = variable_count(field, table_type, curr_release, program)\n",
"\n",
"# display the query results\n",
"query_result"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Ksde5Ghqus05"
},
"source": [
"Wow, the majority of the files are controlled access files. You'll want to review that you have the correct permissions such as dbGaP authorization to access controlled data with GDC before proceeding with attempting to use the data. More information can be found in the ISB-CGC documentation on the ['Accessing Controlled Data' page](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/Gaining-Access-To-Controlled-Access-Data.html)."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "r_gUpA3czcrM"
},
"source": [
"A few other fields that we might be interested are `data_category`, `data_type`, `experimental_strategy`, `file_size`, and `platform`. We can modify the query by modifying the variables `field`, `dataset`, `curr_release`, and `table_type` to view different tables and field counts. Let's update the query to look at the `data_type` field:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"colab_type": "code",
"id": "D0fZ-G1I00bD",
"outputId": "7401f937-f095-4f39-e2da-cd60fef5adc4"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" data_type | \n",
" Num | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Aggregated Somatic Mutation | \n",
" 42 | \n",
"
\n",
" \n",
" 1 | \n",
" Biospecimen Supplement | \n",
" 42 | \n",
"
\n",
" \n",
" 2 | \n",
" Clinical Supplement | \n",
" 42 | \n",
"
\n",
" \n",
" 3 | \n",
" Raw Simple Somatic Mutation | \n",
" 18004 | \n",
"
\n",
" \n",
" 4 | \n",
" Annotated Somatic Mutation | \n",
" 18004 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" data_type Num\n",
"0 Aggregated Somatic Mutation 42\n",
"1 Biospecimen Supplement 42\n",
"2 Clinical Supplement 42\n",
"3 Raw Simple Somatic Mutation 18004\n",
"4 Annotated Somatic Mutation 18004"
]
},
"execution_count": 25,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"# Change the field to data type\n",
"field = \"data_type\"\n",
"\n",
"# run the function with the defined variables\n",
"query_result = variable_count(field, table_type, curr_release, program)\n",
"\n",
"# display the query results\n",
"query_result"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "zlT1QZxd-ENE"
},
"source": [
"Now that we have looked over different fields in the table, let us create a set of files that we are interested in. For this example, we are going to create a query that will find the GDC file id associated with the FM-AD data set that are VCF files and have the file type of 'simple_somatic_mutation'."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "7wnUHag3GyB3"
},
"outputs": [],
"source": [
"FM_gdc_file_query = (\"SELECT file_gdc_id \"\n",
" \"FROM `isb-cgc.GDC_metadata.\" + curr_release + \"_fileData_current` \"\n",
" \"WHERE program_name = 'FM' AND data_format = 'VCF' AND file_type = 'simple_somatic_mutation'\")\n",
"# Query the table\n",
"gdc_file_id = client.query(FM_gdc_file_query)\n",
"# Put the results into a pandas dataframe\n",
"gdc_file_id = gdc_file_id.result().to_dataframe()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"colab_type": "code",
"id": "_C0D1wVtHqrl",
"outputId": "b2ee25f8-c648-45c6-d574-943adae856db"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" file_gdc_id | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 7c9a2c2b-4401-47c6-918f-c0032f9c72a7 | \n",
"
\n",
" \n",
" 1 | \n",
" 2c9d1e76-1c74-41dd-aa38-43d2f2f55e4e | \n",
"
\n",
" \n",
" 2 | \n",
" a7e3fd83-49be-4633-8e46-9fbdfe7ba741 | \n",
"
\n",
" \n",
" 3 | \n",
" ac26d85c-efa2-42a2-b63a-58b89f0c9c1c | \n",
"
\n",
" \n",
" 4 | \n",
" a1f089d2-d5dc-496a-8ec2-faaac1225d86 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" file_gdc_id\n",
"0 7c9a2c2b-4401-47c6-918f-c0032f9c72a7\n",
"1 2c9d1e76-1c74-41dd-aa38-43d2f2f55e4e\n",
"2 a7e3fd83-49be-4633-8e46-9fbdfe7ba741\n",
"3 ac26d85c-efa2-42a2-b63a-58b89f0c9c1c\n",
"4 a1f089d2-d5dc-496a-8ec2-faaac1225d86"
]
},
"execution_count": 29,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"gdc_file_id.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "wUTFQlmuH1ca",
"outputId": "10cccb4f-1519-4695-b08c-996b7253b537"
},
"outputs": [
{
"data": {
"text/plain": [
"18004"
]
},
"execution_count": 30,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"len(gdc_file_id)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "lErPykTOH6Wf"
},
"source": [
"Now that we have a list of the GDC file ids, we can join it with the GCP urls from the `rel14_GDCfileID_to_GCSurl` table."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "YS40U7geIfz8"
},
"outputs": [],
"source": [
"url_query = (\"WITH id AS (SELECT file_gdc_id \"\n",
" \"FROM `isb-cgc.GDC_metadata.\"+ curr_release + \"_fileData_current` \"\n",
" \"WHERE program_name = '\" + program + \"' \"\n",
" \"AND data_format = 'VCF' AND file_type = 'simple_somatic_mutation') \"\n",
" \"SELECT t2.file_gdc_url \"\n",
" \"FROM id AS t1 \"\n",
" \"INNER JOIN `isb-cgc.GDC_metadata.\" + curr_release +\"_GDCfileID_to_GCSurl_NEW` AS t2 \"\n",
" \"ON t1.file_gdc_id = t2.file_gdc_id\")\n",
"\n",
"# Query the table\n",
"result = client.query(url_query)\n",
"# Put the results into a pandas dataframe\n",
"gdc_file_url = result.result().to_dataframe()"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"colab_type": "code",
"id": "ebMIng1rKJ7T",
"outputId": "eb61187a-c098-4c0f-8daf-bbdad827b2e0"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" file_gdc_url | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" gs://gdc-fm-phs001179-controlled/a9130aed-b49d... | \n",
"
\n",
" \n",
" 1 | \n",
" gs://gdc-fm-phs001179-controlled/b1c15a06-12d2... | \n",
"
\n",
" \n",
" 2 | \n",
" gs://gdc-fm-phs001179-controlled/01f4eddf-15df... | \n",
"
\n",
" \n",
" 3 | \n",
" gs://gdc-fm-phs001179-controlled/45bbf05a-3ad2... | \n",
"
\n",
" \n",
" 4 | \n",
" gs://gdc-fm-phs001179-controlled/646fb13b-6471... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" file_gdc_url\n",
"0 gs://gdc-fm-phs001179-controlled/a9130aed-b49d...\n",
"1 gs://gdc-fm-phs001179-controlled/b1c15a06-12d2...\n",
"2 gs://gdc-fm-phs001179-controlled/01f4eddf-15df...\n",
"3 gs://gdc-fm-phs001179-controlled/45bbf05a-3ad2...\n",
"4 gs://gdc-fm-phs001179-controlled/646fb13b-6471..."
]
},
"execution_count": 24,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"gdc_file_url.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "hrlkxATEKPlz",
"outputId": "7fb93437-224f-449f-9e06-9e0ff5c3b53c"
},
"outputs": [
{
"data": {
"text/plain": [
"18004"
]
},
"execution_count": 76,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"len(gdc_file_url)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "GtAoAb29TLma"
},
"source": [
"Now that we have some basics about the metadata tables, it would be good to go over a more complicated SQL query to combine a set of cohort case_barcodes and then find the associated GDC urls by building and joining many tables. The first portion of the query below is from the [How to Create Cohorts Notebook](https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_create_cohorts.ipynb) in the [ISB-CGC Community Notebook Repository](https://github.com/isb-cgc/Community-Notebooks/tree/master/Notebooks)."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "QBlfiDy5VeBd"
},
"outputs": [],
"source": [
"# Magic command of bigquery with the project id as isb-cgc-02-0001 and create a Pandas Dataframe\n",
"# Change isb-cgc-02-0001 to your project ID\n",
"%%bigquery cohort_to_url --project isb-cgc-02-0001\n",
"-- SQL Query from the How to Create Cohorts Notebook\n",
"WITH\n",
" select_on_annotations AS (\n",
" SELECT\n",
" case_barcode,\n",
" category AS categoryName,\n",
" classification AS classificationName\n",
" FROM\n",
" `isb-cgc.TCGA_bioclin_v0.Annotations`\n",
" WHERE\n",
" ( entity_type=\"Patient\"\n",
" AND (category=\"History of unacceptable prior treatment related to a prior/other malignancy\"\n",
" OR classification=\"Redaction\" ) )\n",
" GROUP BY\n",
" case_barcode,\n",
" categoryName,\n",
" classificationName ),\n",
" select_on_clinical AS (\n",
" SELECT\n",
" case_barcode,\n",
" vital_status,\n",
" days_to_last_known_alive,\n",
" ethnicity,\n",
" histological_type,\n",
" menopause_status,\n",
" race\n",
" FROM\n",
" `isb-cgc.TCGA_bioclin_v0.Clinical`\n",
" WHERE\n",
" ( disease_code = \"BRCA\"\n",
" AND age_at_diagnosis<=50\n",
" AND gender=\"FEMALE\" ) ),\n",
"-- Combine the cohort with the metadata tables to create a list of GDC urls\n",
" cohort AS (\n",
" SELECT\n",
" case_barcode\n",
" FROM (\n",
" SELECT\n",
" a.categoryName,\n",
" a.classificationName,\n",
" c.case_barcode\n",
" FROM\n",
" select_on_annotations AS a\n",
" FULL JOIN\n",
" select_on_clinical AS c\n",
" ON\n",
" a.case_barcode = c.case_barcode\n",
" WHERE\n",
" a.case_barcode IS NOT NULL\n",
" OR c.case_barcode IS NOT NULL\n",
" ORDER BY\n",
" a.classificationName,\n",
" a.categoryName,\n",
" c.case_barcode )\n",
" WHERE\n",
" categoryName IS NULL\n",
" AND classificationName IS NULL\n",
" AND case_barcode IS NOT NULL\n",
" ORDER BY\n",
" case_barcode),\n",
" gdc AS (SELECT a.case_barcode, b.case_gdc_id\n",
" FROM cohort AS a\n",
" INNER JOIN `isb-cgc.GDC_metadata.rel14_caseData` AS b\n",
" ON a.case_barcode = b.case_barcode),\n",
" curr AS (SELECT c.case_barcode, c.case_gdc_id, d.file_gdc_id\n",
" FROM gdc as c\n",
" INNER JOIN `isb-cgc.GDC_metadata.rel14_fileData_current` AS d\n",
" ON c.case_gdc_id = d.case_gdc_id),\n",
" url AS ( SELECT e.case_barcode, e.case_gdc_id, e.file_gdc_id, f.file_gdc_url\n",
" FROM curr AS e\n",
" INNER JOIN `isb-cgc.GDC_metadata.rel14_GDCfileID_to_GCSurl_NEW` AS f\n",
" ON e.file_gdc_id = f.file_gdc_id)\n",
"SELECT case_barcode, file_gdc_url FROM url ORDER BY case_barcode"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"colab_type": "code",
"id": "cbXXRb0vVn97",
"outputId": "9abb10b5-2332-4e94-bdc1-f048587f002b"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" case_barcode | \n",
" file_gdc_url | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TCGA-3C-AALI | \n",
" gs://gdc-tcga-phs000178-open/c993c79e-ac58-423... | \n",
"
\n",
" \n",
" 1 | \n",
" TCGA-3C-AALI | \n",
" gs://gdc-tcga-phs000178-open/fe3ceb38-aecc-4ef... | \n",
"
\n",
" \n",
" 2 | \n",
" TCGA-3C-AALI | \n",
" gs://gdc-tcga-phs000178-controlled/a736c2d0-2c... | \n",
"
\n",
" \n",
" 3 | \n",
" TCGA-3C-AALI | \n",
" gs://gdc-tcga-phs000178-open/00737aa1-d1da-438... | \n",
"
\n",
" \n",
" 4 | \n",
" TCGA-3C-AALI | \n",
" gs://gdc-tcga-phs000178-open/ad8e6a02-4483-404... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" case_barcode file_gdc_url\n",
"0 TCGA-3C-AALI gs://gdc-tcga-phs000178-open/c993c79e-ac58-423...\n",
"1 TCGA-3C-AALI gs://gdc-tcga-phs000178-open/fe3ceb38-aecc-4ef...\n",
"2 TCGA-3C-AALI gs://gdc-tcga-phs000178-controlled/a736c2d0-2c...\n",
"3 TCGA-3C-AALI gs://gdc-tcga-phs000178-open/00737aa1-d1da-438...\n",
"4 TCGA-3C-AALI gs://gdc-tcga-phs000178-open/ad8e6a02-4483-404..."
]
},
"execution_count": 8,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"cohort_to_url.head(5)"
]
}
],
"metadata": {
"colab": {
"name": "How_to_Find_GDC_File_Locations.ipynb",
"provenance": [],
"version": "0.3.2"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}