{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "sEYqWcUtqlRe"
},
"source": [
"# ISB-CGC Community Notebooks\n",
"\n",
"```\n",
"Title: How to Create a Random Sample in BigQuery\n",
"Author: Lauren Hagen\n",
"Created: 2019-10-17\n",
"Purpose: Demonstrates how to split a data set into multiple groups randomly with BigQuery\n",
"```\n",
"***"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "WhUTXNzMxy3k"
},
"source": [
"# How to Create a Random Sample in BigQuery\n",
"\n",
"In this notebook, we will be using BigQuery to create random samples for predicting an outcome with test and training data sets such as in machine learning. In this notebook, we assume that you have set up your GCP and accessed the ISB-CGC WebApp. If not, please visit the [How To Get Started on ISB-CGC](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html) or the [Community Notebook Repository](https://github.com/isb-cgc/Community-Notebooks) for guides on how to get started.\n",
"\n",
"We will go over two methods to create the subset data:\n",
"- `RAND()` Function\n",
"- `MOD()` and `FARM_FINGERPRINT` Functions"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "eQaCa1jtzMzE"
},
"source": [
"Before we can begin working with BigQuery, we will need to load the BigQuery module and authenticate ourselves."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "fXpa3o8Gigo2",
"scrolled": true
},
"outputs": [],
"source": [
"from google.cloud import bigquery"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 275
},
"colab_type": "code",
"id": "LOEQ1QGihzqg",
"outputId": "e3017369-1bbd-4349-f5e6-c5a1a43c01d3",
"scrolled": true
},
"outputs": [],
"source": [
"!gcloud auth application-default login"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "rzAxA6Cp0ESA"
},
"source": [
"## `RAND()` Function for Randomly Splitting data\n",
"A simple way to create a random sample with BigQuery is to use the `RAND()` function. The `RAND()` function creates a seemingly random set of numbers and then the query can select to create a random sample of rows.\n",
"\n",
"We will create two queries with the `RAND()` function. The first will return a random sample of the data and the second will return all of a cohort with the rows labeled for which subset they belong to.\n",
"\n",
"We are going to start with a simple query to create a cohort. Cohorts can be created previously in the WebApp or through other means instead of within this query. For more information on creating cohorts, please see the [ISB-CGC Web Interface (Web App) documentation](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/Web-UI.html) and the [Community Notebook Repository](https://github.com/isb-cgc/Community-Notebooks).\n",
"\n",
"\n",
"** Note: you will need to update 'your_project_number' with your project number before continuing with the notebook **"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "O2_isLlK0DnZ",
"scrolled": true
},
"outputs": [],
"source": [
"# Create a query with the cohort information\n",
"# This can be replaced with your own cohort\n",
"%%bigquery cohort --project your_project_number\n",
"SELECT case_barcode, project_short_name, case_gdc_id\n",
"FROM `isb-cgc.TARGET_bioclin_v0.Clinical`\n",
"WHERE project_short_name = \"TARGET-NBL\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"colab_type": "code",
"id": "LNoeLC9BLYAM",
"outputId": "d24adfb5-a434-46ca-beec-e9b5910059a2",
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" case_barcode | \n",
" project_short_name | \n",
" case_gdc_id | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TARGET-30-PATSRD | \n",
" TARGET-NBL | \n",
" fef92ed0-242b-5564-ad92-6b35c21c3bd5 | \n",
"
\n",
" \n",
" 1 | \n",
" TARGET-30-PATTEF | \n",
" TARGET-NBL | \n",
" fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb | \n",
"
\n",
" \n",
" 2 | \n",
" TARGET-30-PARJAR | \n",
" TARGET-NBL | \n",
" feb97edc-ce83-5fd5-94e3-261ce244ac52 | \n",
"
\n",
" \n",
" 3 | \n",
" TARGET-30-PAUAZA | \n",
" TARGET-NBL | \n",
" fe831368-c7ce-5e2b-b0fd-c35216a7761d | \n",
"
\n",
" \n",
" 4 | \n",
" TARGET-30-PAIJGC | \n",
" TARGET-NBL | \n",
" fe58a8bf-8306-5aaf-99d1-b65c20fedc58 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" case_barcode project_short_name case_gdc_id\n",
"0 TARGET-30-PATSRD TARGET-NBL fef92ed0-242b-5564-ad92-6b35c21c3bd5\n",
"1 TARGET-30-PATTEF TARGET-NBL fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb\n",
"2 TARGET-30-PARJAR TARGET-NBL feb97edc-ce83-5fd5-94e3-261ce244ac52\n",
"3 TARGET-30-PAUAZA TARGET-NBL fe831368-c7ce-5e2b-b0fd-c35216a7761d\n",
"4 TARGET-30-PAIJGC TARGET-NBL fe58a8bf-8306-5aaf-99d1-b65c20fedc58"
]
},
"execution_count": 4,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"# View the first 5 rows of the cohort\n",
"cohort.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "hBG1MUiZON7d",
"outputId": "99863f3e-a4c4-43a0-87d8-4d837e554e5e",
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This cohort has 1180 cases (rows).\n"
]
}
],
"source": [
"print(\"This cohort has \" + str(len(cohort)) + \" cases (rows).\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "xR5TRP5xOV81"
},
"source": [
"The next part of the query is creating a random sample. This line can be adjusted to return any percent of the data table into a random sample. For this example, it is set to create a random sample of ~25% of the data."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "3EbowSgDOVi9",
"scrolled": true
},
"outputs": [],
"source": [
"%%bigquery sample --project your_project_number\n",
"\n",
"--- Create Cohort\n",
"WITH table1 AS (\n",
"SELECT case_barcode, project_short_name, case_gdc_id, 0 as table_num\n",
"FROM `isb-cgc.TARGET_bioclin_v0.Clinical`\n",
"WHERE project_short_name = \"TARGET-NBL\")\n",
"\n",
"--- Select a random sample that is ~25% of the data\n",
"SELECT case_barcode, project_short_name, case_gdc_id, 1 as table_num\n",
"FROM table1\n",
"-- Count the number of rows in the cohort, then find how many of them will be 25%\n",
"-- of the cohort. Divid that number by the total number of rows in the data \n",
"-- To change the %, change the 0.25 to what ever precentage you need\n",
"WHERE RAND() < ((SELECT COUNT(*) FROM table1)*0.25)/(SELECT COUNT(*) FROM table1)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "ntrs5zZXRY-e",
"outputId": "93655a7a-123e-42f8-d287-dba9eccf1e2e",
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The random sample of the cohort with 300 cases which is around 25.4% of the cohort.\n"
]
}
],
"source": [
"print(\"The random sample of the cohort with \" + str(len(sample)) + \" cases which is around \" + str(round((len(sample)/len(cohort)*100),1)) + \"% of the cohort.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "7Elxb9tMQa_U"
},
"source": [
"This query is nice if we just wanted to grab a smaller sample of the cohort to do some initial analysis before moving to a larger data set but it is not useful if you want to create two separate subsets of data for training a model and then testing the model. The final query joins the new random sample table back with the main table and preserving the split of data with a column for which subset it belongs to."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "0ZwYS1v6zEQr",
"scrolled": true
},
"outputs": [],
"source": [
"%%bigquery dataset --project your_project_number\n",
"--- Create Cohort\n",
"WITH table1 AS (\n",
"SELECT case_barcode, project_short_name, case_gdc_id, 0 as table_num\n",
"FROM `isb-cgc.TARGET_bioclin_v0.Clinical`\n",
"WHERE project_short_name = \"TARGET-NBL\"),\n",
"\n",
"--- Select a random sample that is ~25% of the data\n",
"table2 AS (\n",
"SELECT case_barcode, project_short_name, case_gdc_id, 1 as table_num\n",
"FROM table1\n",
"-- Count the number of rows in the cohort, then find how many of them will be 25%\n",
"-- of the cohort. Divid that number by the total number of rows in the data\n",
"-- To change the %, change the 0.25 to what ever precentage you need\n",
"WHERE RAND() < ((SELECT COUNT(*) FROM table1)*0.25)/(SELECT COUNT(*) FROM table1)\n",
")\n",
"\n",
"--- Join the random sample table back to the main table\n",
"SELECT a.case_barcode, a.project_short_name, a.case_gdc_id, IFNULL(b.table_num,2) AS table_num\n",
"FROM table1 AS a\n",
"FULL OUTER JOIN table2 AS b\n",
"ON a.case_barcode = b.case_barcode"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"colab_type": "code",
"id": "6dKOnYcEMYDL",
"outputId": "053398d7-9f12-41eb-9c8d-860cab227434",
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" case_barcode | \n",
" project_short_name | \n",
" case_gdc_id | \n",
" table_num | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TARGET-30-PAUELT | \n",
" TARGET-NBL | \n",
" e785d22c-f983-55ff-be60-554ac487cf8c | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" TARGET-30-PADINC | \n",
" TARGET-NBL | \n",
" d190884d-551b-57ca-8d95-79cd809af8db | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" TARGET-30-PAUYDE | \n",
" TARGET-NBL | \n",
" ccfab7a6-e356-51c4-a803-04082cb67d49 | \n",
" 2 | \n",
"
\n",
" \n",
" 3 | \n",
" TARGET-30-PAPUWY | \n",
" TARGET-NBL | \n",
" 95d7372e-2734-5068-8f7d-7597d92c1737 | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" TARGET-30-PANYGR | \n",
" TARGET-NBL | \n",
" 7d624f31-49fa-55d7-a0a1-6b32cc17a333 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" case_barcode ... table_num\n",
"0 TARGET-30-PAUELT ... 1\n",
"1 TARGET-30-PADINC ... 1\n",
"2 TARGET-30-PAUYDE ... 2\n",
"3 TARGET-30-PAPUWY ... 1\n",
"4 TARGET-30-PANYGR ... 2\n",
"\n",
"[5 rows x 4 columns]"
]
},
"execution_count": 9,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"dataset.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "WnvdHJYbNIYp",
"outputId": "0b9cba6f-ebd0-47d7-cf12-a387cfe49f84",
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The final table has 1180 cases.\n"
]
}
],
"source": [
"print(\"The final table has \" + str(len(dataset)) + \" cases.\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "VH8O7eW10s71",
"outputId": "59abae07-1c06-498b-a58f-8b76859b69c2",
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 11,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"# Create a list with the sorted initial barcodes\n",
"case_barcode_initial = list(cohort.case_barcode.sort_values())\n",
"# Create a list with the sorted final barcodes\n",
"case_barcode_final = list(dataset.case_barcode.sort_values())\n",
"# Compare the two lists, if TRUE, no barcodes were lost\n",
"case_barcode_initial == case_barcode_final "
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "CInSjdzQQaL2"
},
"source": [
"Each query will have a different set of random samples because each time `RAND()` is run, it generates a new set of random numbers. This could be a problem if you want reproducible results each time you run the query. Another way to solve this problem is to use `FARM_FINGERPRINT()` with `MOD()` which we will cover next."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "lpt-vNk2hMev"
},
"source": [
"## `MOD()` and `FARM_FINGERPRINT` Functions\n",
"\n",
"`FARM_FINGERPRINT()` will compute a string of BYTES or STRING and will never change. The `MOD()` function will return the remainder of the farm fingerprint number and a number. This method will always return the same values for each subset.\n",
"\n",
"For example, we want three approximately equal subsets of the data sets of our cohort. We will create a `WHERE` statement that has the `case_barcode` in the `FARM_FINGERPRINT` function, take the absolute value, put that number into the `MOD()`, and set that equal to 0.\n",
"\n",
"Note: You need to run `FARM_FINGERPRINT()` on a column that has unique values for each row or combine two columns with `CONCAT` to create a unique value row.\n",
"\n",
"For more information visit the BigQuery documentation:\n",
"- [`FARM_FINGERPRINT()`](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#farm_fingerprint)\n",
"- [`MOD()`](https://cloud.google.com/dataprep/docs/html/MOD-Function_57344691)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"colab_type": "code",
"id": "n0bpAC5nfFAc",
"outputId": "fef07df4-46bc-4fd6-e021-fd3bae2680b3",
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" case_barcode | \n",
" project_short_name | \n",
" case_gdc_id | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TARGET-30-PATTEF | \n",
" TARGET-NBL | \n",
" fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb | \n",
"
\n",
" \n",
" 1 | \n",
" TARGET-30-PARJAR | \n",
" TARGET-NBL | \n",
" feb97edc-ce83-5fd5-94e3-261ce244ac52 | \n",
"
\n",
" \n",
" 2 | \n",
" TARGET-30-PANUVK | \n",
" TARGET-NBL | \n",
" fd756a5f-0f9a-57ca-9879-2b18e5fa0b54 | \n",
"
\n",
" \n",
" 3 | \n",
" TARGET-30-PARZHA | \n",
" TARGET-NBL | \n",
" fc9d5f9e-af43-5134-95e7-2d434831580b | \n",
"
\n",
" \n",
" 4 | \n",
" TARGET-30-PAPRXW | \n",
" TARGET-NBL | \n",
" fc3288a3-ad98-5be3-ab70-e02bf4b8fc7c | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 367 | \n",
" TARGET-30-PAHSRS | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 368 | \n",
" TARGET-30-PAHYBI | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 369 | \n",
" TARGET-30-PADGWJ | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 370 | \n",
" TARGET-30-PADTYM | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 371 | \n",
" TARGET-30-PALGNJ | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
"
\n",
"
372 rows × 3 columns
\n",
"
"
],
"text/plain": [
" case_barcode project_short_name case_gdc_id\n",
"0 TARGET-30-PATTEF TARGET-NBL fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb\n",
"1 TARGET-30-PARJAR TARGET-NBL feb97edc-ce83-5fd5-94e3-261ce244ac52\n",
"2 TARGET-30-PANUVK TARGET-NBL fd756a5f-0f9a-57ca-9879-2b18e5fa0b54\n",
"3 TARGET-30-PARZHA TARGET-NBL fc9d5f9e-af43-5134-95e7-2d434831580b\n",
"4 TARGET-30-PAPRXW TARGET-NBL fc3288a3-ad98-5be3-ab70-e02bf4b8fc7c\n",
".. ... ... ...\n",
"367 TARGET-30-PAHSRS TARGET-NBL None\n",
"368 TARGET-30-PAHYBI TARGET-NBL None\n",
"369 TARGET-30-PADGWJ TARGET-NBL None\n",
"370 TARGET-30-PADTYM TARGET-NBL None\n",
"371 TARGET-30-PALGNJ TARGET-NBL None\n",
"\n",
"[372 rows x 3 columns]"
]
},
"execution_count": 12,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project your_project_number\n",
"SELECT case_barcode, project_short_name, case_gdc_id\n",
"FROM `isb-cgc.TARGET_bioclin_v0.Clinical`\n",
"WHERE project_short_name = \"TARGET-NBL\" AND MOD(ABS(FARM_FINGERPRINT(case_barcode)),3) = 0"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "q75My5qinjZd"
},
"source": [
"The three in the `MOD()` statement is what splits the cohort into three subsets. If we wanted to do ~20% split of the data, we would change the three to a ten and then take any row that less than or equal to 1."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"colab_type": "code",
"id": "DMKocoSxoMsO",
"outputId": "91022df0-5610-4cc7-e553-644d8bc39fee",
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" case_barcode | \n",
" project_short_name | \n",
" case_gdc_id | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TARGET-30-PASWIJ | \n",
" TARGET-NBL | \n",
" fe0b727f-3843-5b70-b9c1-8a207b837fc4 | \n",
"
\n",
" \n",
" 1 | \n",
" TARGET-30-PARABN | \n",
" TARGET-NBL | \n",
" fc9d0307-a5f5-51c4-ad9e-6e9ed60f0eba | \n",
"
\n",
" \n",
" 2 | \n",
" TARGET-30-PAPRXW | \n",
" TARGET-NBL | \n",
" fc3288a3-ad98-5be3-ab70-e02bf4b8fc7c | \n",
"
\n",
" \n",
" 3 | \n",
" TARGET-30-PAREAG | \n",
" TARGET-NBL | \n",
" fb071e74-40dc-5f67-b4c3-e61dd2c2ef88 | \n",
"
\n",
" \n",
" 4 | \n",
" TARGET-30-PASKSX | \n",
" TARGET-NBL | \n",
" faaed289-3f28-5a16-b18b-5ee164f05f50 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 216 | \n",
" TARGET-30-PAMXWK | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 217 | \n",
" TARGET-30-PAHZRF | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 218 | \n",
" TARGET-30-PADKLJ | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 219 | \n",
" TARGET-30-PAIFAU | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 220 | \n",
" TARGET-30-PALGNJ | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
"
\n",
"
221 rows × 3 columns
\n",
"
"
],
"text/plain": [
" case_barcode project_short_name case_gdc_id\n",
"0 TARGET-30-PASWIJ TARGET-NBL fe0b727f-3843-5b70-b9c1-8a207b837fc4\n",
"1 TARGET-30-PARABN TARGET-NBL fc9d0307-a5f5-51c4-ad9e-6e9ed60f0eba\n",
"2 TARGET-30-PAPRXW TARGET-NBL fc3288a3-ad98-5be3-ab70-e02bf4b8fc7c\n",
"3 TARGET-30-PAREAG TARGET-NBL fb071e74-40dc-5f67-b4c3-e61dd2c2ef88\n",
"4 TARGET-30-PASKSX TARGET-NBL faaed289-3f28-5a16-b18b-5ee164f05f50\n",
".. ... ... ...\n",
"216 TARGET-30-PAMXWK TARGET-NBL None\n",
"217 TARGET-30-PAHZRF TARGET-NBL None\n",
"218 TARGET-30-PADKLJ TARGET-NBL None\n",
"219 TARGET-30-PAIFAU TARGET-NBL None\n",
"220 TARGET-30-PALGNJ TARGET-NBL None\n",
"\n",
"[221 rows x 3 columns]"
]
},
"execution_count": 13,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project your_project_number\n",
"SELECT case_barcode, project_short_name, case_gdc_id\n",
"FROM `isb-cgc.TARGET_bioclin_v0.Clinical`\n",
"WHERE project_short_name = \"TARGET-NBL\" AND MOD(ABS(FARM_FINGERPRINT(case_barcode)),10) <= 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "3_G51QCzocut"
},
"source": [
"With this method, we can also create a query that will retrieve the remaining subset of data. For the first example, we would change the `=` to `!=` as shown below:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"colab_type": "code",
"id": "PjLVlcLuo1p-",
"outputId": "59fc4a99-cacf-40dd-b128-041143ac1772",
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" case_barcode | \n",
" project_short_name | \n",
" case_gdc_id | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TARGET-30-PATSRD | \n",
" TARGET-NBL | \n",
" fef92ed0-242b-5564-ad92-6b35c21c3bd5 | \n",
"
\n",
" \n",
" 1 | \n",
" TARGET-30-PAUAZA | \n",
" TARGET-NBL | \n",
" fe831368-c7ce-5e2b-b0fd-c35216a7761d | \n",
"
\n",
" \n",
" 2 | \n",
" TARGET-30-PAIJGC | \n",
" TARGET-NBL | \n",
" fe58a8bf-8306-5aaf-99d1-b65c20fedc58 | \n",
"
\n",
" \n",
" 3 | \n",
" TARGET-30-PASWIJ | \n",
" TARGET-NBL | \n",
" fe0b727f-3843-5b70-b9c1-8a207b837fc4 | \n",
"
\n",
" \n",
" 4 | \n",
" TARGET-30-PANBMJ | \n",
" TARGET-NBL | \n",
" fdfb389d-eb9a-5014-b391-9cd5f908720d | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 803 | \n",
" TARGET-30-PAKKMP | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 804 | \n",
" TARGET-30-PADKLJ | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 805 | \n",
" TARGET-30-PAKHWS | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 806 | \n",
" TARGET-30-PAHPEL | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 807 | \n",
" TARGET-30-PAIFAU | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
"
\n",
"
808 rows × 3 columns
\n",
"
"
],
"text/plain": [
" case_barcode project_short_name case_gdc_id\n",
"0 TARGET-30-PATSRD TARGET-NBL fef92ed0-242b-5564-ad92-6b35c21c3bd5\n",
"1 TARGET-30-PAUAZA TARGET-NBL fe831368-c7ce-5e2b-b0fd-c35216a7761d\n",
"2 TARGET-30-PAIJGC TARGET-NBL fe58a8bf-8306-5aaf-99d1-b65c20fedc58\n",
"3 TARGET-30-PASWIJ TARGET-NBL fe0b727f-3843-5b70-b9c1-8a207b837fc4\n",
"4 TARGET-30-PANBMJ TARGET-NBL fdfb389d-eb9a-5014-b391-9cd5f908720d\n",
".. ... ... ...\n",
"803 TARGET-30-PAKKMP TARGET-NBL None\n",
"804 TARGET-30-PADKLJ TARGET-NBL None\n",
"805 TARGET-30-PAKHWS TARGET-NBL None\n",
"806 TARGET-30-PAHPEL TARGET-NBL None\n",
"807 TARGET-30-PAIFAU TARGET-NBL None\n",
"\n",
"[808 rows x 3 columns]"
]
},
"execution_count": 14,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project your_project_number\n",
"SELECT case_barcode, project_short_name, case_gdc_id\n",
"FROM `isb-cgc.TARGET_bioclin_v0.Clinical`\n",
"WHERE project_short_name = \"TARGET-NBL\" AND MOD(ABS(FARM_FINGERPRINT(case_barcode)),3) != 0"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "8BpXyl4Yo9xv"
},
"source": [
"For the second example, we would change the `<=` to `>` as shown below:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"colab_type": "code",
"id": "DLTI11UYor-I",
"outputId": "bc39b30c-37a1-4063-e23c-8d33e5f8de27",
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" case_barcode | \n",
" project_short_name | \n",
" case_gdc_id | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TARGET-30-PATSRD | \n",
" TARGET-NBL | \n",
" fef92ed0-242b-5564-ad92-6b35c21c3bd5 | \n",
"
\n",
" \n",
" 1 | \n",
" TARGET-30-PATTEF | \n",
" TARGET-NBL | \n",
" fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb | \n",
"
\n",
" \n",
" 2 | \n",
" TARGET-30-PARJAR | \n",
" TARGET-NBL | \n",
" feb97edc-ce83-5fd5-94e3-261ce244ac52 | \n",
"
\n",
" \n",
" 3 | \n",
" TARGET-30-PAUAZA | \n",
" TARGET-NBL | \n",
" fe831368-c7ce-5e2b-b0fd-c35216a7761d | \n",
"
\n",
" \n",
" 4 | \n",
" TARGET-30-PAIJGC | \n",
" TARGET-NBL | \n",
" fe58a8bf-8306-5aaf-99d1-b65c20fedc58 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 954 | \n",
" TARGET-30-PAHXGX | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 955 | \n",
" TARGET-30-PAKEVZ | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 956 | \n",
" TARGET-30-PAKKMP | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 957 | \n",
" TARGET-30-PAKHWS | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
" 958 | \n",
" TARGET-30-PAHPEL | \n",
" TARGET-NBL | \n",
" None | \n",
"
\n",
" \n",
"
\n",
"
959 rows × 3 columns
\n",
"
"
],
"text/plain": [
" case_barcode project_short_name case_gdc_id\n",
"0 TARGET-30-PATSRD TARGET-NBL fef92ed0-242b-5564-ad92-6b35c21c3bd5\n",
"1 TARGET-30-PATTEF TARGET-NBL fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb\n",
"2 TARGET-30-PARJAR TARGET-NBL feb97edc-ce83-5fd5-94e3-261ce244ac52\n",
"3 TARGET-30-PAUAZA TARGET-NBL fe831368-c7ce-5e2b-b0fd-c35216a7761d\n",
"4 TARGET-30-PAIJGC TARGET-NBL fe58a8bf-8306-5aaf-99d1-b65c20fedc58\n",
".. ... ... ...\n",
"954 TARGET-30-PAHXGX TARGET-NBL None\n",
"955 TARGET-30-PAKEVZ TARGET-NBL None\n",
"956 TARGET-30-PAKKMP TARGET-NBL None\n",
"957 TARGET-30-PAKHWS TARGET-NBL None\n",
"958 TARGET-30-PAHPEL TARGET-NBL None\n",
"\n",
"[959 rows x 3 columns]"
]
},
"execution_count": 15,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project your_project_number\n",
"SELECT case_barcode, project_short_name, case_gdc_id\n",
"FROM `isb-cgc.TARGET_bioclin_v0.Clinical`\n",
"WHERE project_short_name = \"TARGET-NBL\" AND MOD(ABS(FARM_FINGERPRINT(case_barcode)),10) > 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "9ejEUj4mwgdo"
},
"source": [
"If you want to label the entire cohort and only create one query instead of two, a column can be added to label each case with the subset number."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"colab_type": "code",
"id": "VUbIaY88w5Om",
"outputId": "2d7f3e35-1f12-4b97-bd60-83befac67dc3",
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" case_barcode | \n",
" project_short_name | \n",
" case_gdc_id | \n",
" subset | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TARGET-30-PATSRD | \n",
" TARGET-NBL | \n",
" fef92ed0-242b-5564-ad92-6b35c21c3bd5 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" TARGET-30-PATTEF | \n",
" TARGET-NBL | \n",
" fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" TARGET-30-PARJAR | \n",
" TARGET-NBL | \n",
" feb97edc-ce83-5fd5-94e3-261ce244ac52 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" TARGET-30-PAUAZA | \n",
" TARGET-NBL | \n",
" fe831368-c7ce-5e2b-b0fd-c35216a7761d | \n",
" 2 | \n",
"
\n",
" \n",
" 4 | \n",
" TARGET-30-PAIJGC | \n",
" TARGET-NBL | \n",
" fe58a8bf-8306-5aaf-99d1-b65c20fedc58 | \n",
" 2 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 1175 | \n",
" TARGET-30-PADKLJ | \n",
" TARGET-NBL | \n",
" None | \n",
" 2 | \n",
"
\n",
" \n",
" 1176 | \n",
" TARGET-30-PAKHWS | \n",
" TARGET-NBL | \n",
" None | \n",
" 2 | \n",
"
\n",
" \n",
" 1177 | \n",
" TARGET-30-PAHPEL | \n",
" TARGET-NBL | \n",
" None | \n",
" 2 | \n",
"
\n",
" \n",
" 1178 | \n",
" TARGET-30-PAIFAU | \n",
" TARGET-NBL | \n",
" None | \n",
" 2 | \n",
"
\n",
" \n",
" 1179 | \n",
" TARGET-30-PALGNJ | \n",
" TARGET-NBL | \n",
" None | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
1180 rows × 4 columns
\n",
"
"
],
"text/plain": [
" case_barcode ... subset\n",
"0 TARGET-30-PATSRD ... 1\n",
"1 TARGET-30-PATTEF ... 0\n",
"2 TARGET-30-PARJAR ... 0\n",
"3 TARGET-30-PAUAZA ... 2\n",
"4 TARGET-30-PAIJGC ... 2\n",
"... ... ... ...\n",
"1175 TARGET-30-PADKLJ ... 2\n",
"1176 TARGET-30-PAKHWS ... 2\n",
"1177 TARGET-30-PAHPEL ... 2\n",
"1178 TARGET-30-PAIFAU ... 2\n",
"1179 TARGET-30-PALGNJ ... 0\n",
"\n",
"[1180 rows x 4 columns]"
]
},
"execution_count": 16,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project your_project_number\n",
"SELECT case_barcode, project_short_name, case_gdc_id, MOD(ABS(FARM_FINGERPRINT(case_barcode)),3) as subset\n",
"FROM `isb-cgc.TARGET_bioclin_v0.Clinical`\n",
"WHERE project_short_name = \"TARGET-NBL\""
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "7SXSobW1xDM2"
},
"source": [
"The subsets can then be filtered out and manipulated in python. Have fun random sampling the data! Please let us know if you have questions by emailing us at feedback@isb-cgc.org."
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "How_to_create_a_random_samples_in_bigquery",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}