{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "view-in-github"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "c_WwxKQHlY5u"
},
"source": [
"# ISB-CGC Community Notebooks\n",
"\n",
"Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!\n",
"\n",
"```\n",
"Title: How to use PyPika to build a query\n",
"Author: Lauren Hagen\n",
"Created: 2020-02-13\n",
"URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How to use PyPika_to_create_a_BigQuery_SQL_query.ipynb\n",
"Purpose: Demonstrate query creation using PyPika with BigQuery\n",
"Notes: \n",
"```\n",
"***"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Z3O_ygSonqTf"
},
"source": [
"# Introduction\n",
"\n",
"## Overview\n",
"In this notebook, we are going to use the PyPika package to build a query and use that query with the ISB-CGC BigQuery tables. First we'll create a cohort by selecting clinical features, then we'll use the cohort to filter the RNA and miRNA expression tables from the TARGET data set.\n",
"\n",
"## What is PyPika?\n",
"\n",
"[PyPika](https://pypika.readthedocs.io/en/latest/index.html) is a Python package designed to construct SQL queries. This package is useful for people who want to integrate BigQuery into their python code. "
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "_gP8n99Cs-lB"
},
"source": [
"Before we get started, we need to load the BigQuery module, authenticate ourselves, create a client variable, and load the necessary libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "ETmm5nIT0PLw"
},
"outputs": [],
"source": [
"# Load the BigQuery Module\n",
"from google.cloud import bigquery"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "OFD7t1Ii0WR4"
},
"outputs": [],
"source": [
"# Authenticate ourselves\n",
"!gcloud auth application-default login"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "OdRUnslx0W1Q"
},
"outputs": [],
"source": [
"# Create the client object\n",
"project_num = 'your_project_number' # Replace with your project ID\n",
"if project_num == 'your_project_number':\n",
" print('Please update the project number with your Google Cloud Project')\n",
"else:\n",
" client = bigquery.Client(project_num)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "otwoIztiz8kN"
},
"outputs": [],
"source": [
"# If needed, install PyPika\n",
"!pip install pypika"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "OkpvhZe40R1o"
},
"outputs": [],
"source": [
"# Import from PyPika\n",
"from pypika import Query, Table, Field, Order"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "je16_yBCtX9R"
},
"source": [
"# Query Building\n",
"\n",
"In this notebook, we will query each table individually first and then join them for a final query. We'll use a selection of clinical data with molecular data from the TARGET data set.\n",
"\n",
"## Patient Clinical Data Query\n",
"\n",
"We want to write a query that filters the TARGET data set for AML with columns for the case barcode and the remission status of the patient for our cohort. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "DgUUORtA0hCa"
},
"outputs": [],
"source": [
"# Pass the Clincal table name to the Table class to create a variable\n",
"clin_table = Table('`isb-cgc.TARGET_bioclin_v0.Clinical`')\n",
"\n",
"# Create query with PyPika\n",
"clin_query = Query.from_(clin_table) \\\n",
" .select('case_barcode, CR_status_at_end_of_course_1, CR_status_at_end_of_course_2') \\\n",
" .where(clin_table.disease_code=='AML')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 55
},
"colab_type": "code",
"id": "be_0nDxO-F4z",
"outputId": "3a1d9183-627a-4fdb-c662-ba160889aa38"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SELECT \"case_barcode, CR_status_at_end_of_course_1, CR_status_at_end_of_course_2\" FROM \"`isb-cgc.TARGET_bioclin_v0.Clinical`\" WHERE \"disease_code\"='AML'\n"
]
}
],
"source": [
"# Print returned query\n",
"print(clin_query)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "AO9gxoHT-BWx"
},
"source": [
"PyPika returns the query with quotation marks around some of the inputs. These need to be removed before the querying BigQuery."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "YXC_CRyA0uR6"
},
"outputs": [],
"source": [
"# Remove \" from the string\n",
"clin_query_clean = str(clin_query).replace('\"', \"\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"colab_type": "code",
"id": "56JGPj8f1k_w",
"outputId": "4009e605-52ce-4d85-89e9-6854c014d268"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SELECT case_barcode, CR_status_at_end_of_course_1, CR_status_at_end_of_course_2 FROM `isb-cgc.TARGET_bioclin_v0.Clinical` WHERE disease_code='AML'\n"
]
}
],
"source": [
"# Print the query that PyPika created to check that it looks good.\n",
"# We can also copy the query into the BigQuery UI to check for errors\n",
"print(clin_query_clean)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "36mFrlOA_KR-"
},
"source": [
"We can query BigQuery, then create a data frame with the results though this isn't necessary for creating the final query."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "06aQH3mU00Ya"
},
"outputs": [],
"source": [
"# Query BigQuery and create a data frame with the results\n",
"clin = client.query(clin_query_clean).to_dataframe()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 202
},
"colab_type": "code",
"id": "4vExUiCO2YE8",
"outputId": "350b90aa-75c3-44cd-a018-3b32f6801738"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" case_barcode | \n",
" CR_status_at_end_of_course_1 | \n",
" CR_status_at_end_of_course_2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TARGET-20-PATDMY | \n",
" CR | \n",
" CR | \n",
"
\n",
" \n",
" 1 | \n",
" TARGET-20-PARHSA | \n",
" CR | \n",
" CR | \n",
"
\n",
" \n",
" 2 | \n",
" TARGET-20-PARLSW | \n",
" CR | \n",
" CR | \n",
"
\n",
" \n",
" 3 | \n",
" TARGET-20-PASLHH | \n",
" CR | \n",
" CR | \n",
"
\n",
" \n",
" 4 | \n",
" TARGET-20-PATKUG | \n",
" CR | \n",
" CR | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" case_barcode CR_status_at_end_of_course_1 CR_status_at_end_of_course_2\n",
"0 TARGET-20-PATDMY CR CR\n",
"1 TARGET-20-PARHSA CR CR\n",
"2 TARGET-20-PARLSW CR CR\n",
"3 TARGET-20-PASLHH CR CR\n",
"4 TARGET-20-PATKUG CR CR"
]
},
"execution_count": 13,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"# View the first several lines of the returned table\n",
"clin.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "NeyqY0hbjFEV"
},
"source": [
"## Molecular Data Query\n",
"\n",
"Now that we have a list of cases with some clinical information, we can join that table to one of the molecular data sets, such as the TARGET gene expression data."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ZkJTgyGBcMo0"
},
"source": [
"We will now build the query for the molecular data set. We are not going to query BigQuery at this point, but it is good to make sure the query looks correct before joining it with another table."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "1OhOJtqwf7H1"
},
"outputs": [],
"source": [
"# Pass the Clincal data table name to the Table class to create a variable\n",
"expr_table = Table('`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`')\n",
"\n",
"# Create a query with PyPika\n",
"expr_query = Query.from_(expr_table) \\\n",
" .select('case_barcode, HTSeq__FPKM_UQ, Ensembl_gene_id, gene_name') \\\n",
" .orderby('HTSeq__FPKM_UQ')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "EkvIotGMgn3x"
},
"outputs": [],
"source": [
"# Remove \" from the string\n",
"expr_query_clean = str(expr_query).replace('\"', \"\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "eVcp0gFXgs_h",
"outputId": "f3e93587-f91e-45ee-e114-cbb63d974d14"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SELECT case_barcode, HTSeq__FPKM_UQ, Ensembl_gene_id, gene_name FROM `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression` ORDER BY HTSeq__FPKM_UQ\n"
]
}
],
"source": [
"# View the created query\n",
"print(expr_query_clean)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "lpztZOEziHH_"
},
"source": [
"# Create the Final Query\n",
"\n",
"Finally, we will create a query to join the two tables. BigQuery and PyPika support all join types, though, for this query, we are using the standard inner join. This query returns a large number of lines and can be slow to bring into Collaboratory, so we will limit the number of lines returned to 100."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "JDo0XlKYjEE4"
},
"outputs": [],
"source": [
"# Create a query with PyPika\n",
"join_query = Query.from_(expr_table) \\\n",
" .join(clin_table).on(expr_table.case_barcode == clin_table.case_barcode) \\\n",
" .select('case_barcode', 'HTSeq__FPKM_UQ', 'Ensembl_gene_id', 'gene_name') \\\n",
" .select(clin_table.CR_status_at_end_of_course_1, clin_table.CR_status_at_end_of_course_2) \\\n",
" .where(clin_table.disease_code=='AML') \\\n",
" .limit(100)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "LBUB5EmMkLJl"
},
"outputs": [],
"source": [
"# Replace the \" in the query\n",
"join_query_clean = str(join_query).replace('\"', \"\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 54
},
"colab_type": "code",
"id": "q3e8414pkQEU",
"outputId": "5b9f0e0d-101a-4221-8b0d-cfbac05334bd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SELECT `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.case_barcode,`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.HTSeq__FPKM_UQ,`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.Ensembl_gene_id,`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.gene_name,`isb-cgc.TARGET_bioclin_v0.Clinical`.CR_status_at_end_of_course_1,`isb-cgc.TARGET_bioclin_v0.Clinical`.CR_status_at_end_of_course_2 FROM `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression` JOIN `isb-cgc.TARGET_bioclin_v0.Clinical` ON `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.case_barcode=`isb-cgc.TARGET_bioclin_v0.Clinical`.case_barcode WHERE `isb-cgc.TARGET_bioclin_v0.Clinical`.disease_code='AML' LIMIT 100\n"
]
}
],
"source": [
"# View the created query\n",
"print(join_query_clean)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "CX6Kh5pK2r0L"
},
"source": [
"There! We now have a query that joins the two tables. We can now query BigQuery and view the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "6y6KUk-t3O7D"
},
"outputs": [],
"source": [
"# Query BigQuery and create a data frame\n",
"final = client.query(join_query_clean).to_dataframe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"colab_type": "code",
"id": "Noo9eRhe3WvF",
"outputId": "8fbcb238-890b-4769-fdf9-0cc001c0573d"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" case_barcode | \n",
" HTSeq__FPKM_UQ | \n",
" Ensembl_gene_id | \n",
" gene_name | \n",
" CR_status_at_end_of_course_1 | \n",
" CR_status_at_end_of_course_2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TARGET-20-PAPWHS | \n",
" 392731.987867 | \n",
" ENSG00000173598 | \n",
" NUDT4 | \n",
" CR | \n",
" CR | \n",
"
\n",
" \n",
" 1 | \n",
" TARGET-20-PAPWHS | \n",
" 13617.578556 | \n",
" ENSG00000204745 | \n",
" AC083899.3 | \n",
" CR | \n",
" CR | \n",
"
\n",
" \n",
" 2 | \n",
" TARGET-20-PAPWHS | \n",
" 87978.648259 | \n",
" ENSG00000235183 | \n",
" RP11-613C6.4 | \n",
" CR | \n",
" CR | \n",
"
\n",
" \n",
" 3 | \n",
" TARGET-20-PAPWHS | \n",
" 258701.117197 | \n",
" ENSG00000177917 | \n",
" ARL6IP6 | \n",
" CR | \n",
" CR | \n",
"
\n",
" \n",
" 4 | \n",
" TARGET-20-PAPWHS | \n",
" 8689.796007 | \n",
" ENSG00000181085 | \n",
" MAPK15 | \n",
" CR | \n",
" CR | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" case_barcode ... CR_status_at_end_of_course_2\n",
"0 TARGET-20-PAPWHS ... CR\n",
"1 TARGET-20-PAPWHS ... CR\n",
"2 TARGET-20-PAPWHS ... CR\n",
"3 TARGET-20-PAPWHS ... CR\n",
"4 TARGET-20-PAPWHS ... CR\n",
"\n",
"[5 rows x 6 columns]"
]
},
"execution_count": 139,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"# View the first several lines of the data frame\n",
"final.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "UdODK5dqPwHT"
},
"source": [
"It's that simple! Please let us know if you have any questions at feedback@isb-cgc.org."
]
}
],
"metadata": {
"colab": {
"authorship_tag": "ABX9TyMWalUgBuKVyDhcpwSFF79G",
"include_colab_link": true,
"name": "How to use PyPika_to_create_a_BigQuery_SQL_query.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}