{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "V6hHrkja94oQ"
},
"source": [
"# ISB-CGC Community Notebooks\n",
"Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!\n",
"\n",
"```\n",
"Title: How to transform a NCBI GEO data set to a BigQuery table\n",
"Author: David L Gibbs\n",
"Created: 2019-06-14\n",
"Purpose: Demonstrate how to make BigQuery tables from external data sets\n",
"URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_make_NCBI_GEO_BigQuery_tables.ipynb\n",
"Notes: \n",
"```\n",
"***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tranforming NCBI GEO data sets to BigQuery (GEO2BQ)\n",
"\n",
" In this example, we're going to retrieve data from NCBI GEO, and create a BigQuery table."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "wYNMaJPlMEBA"
},
"source": [
"### Authenticate with Google\n",
"Our first step is to authenticate with Google -- you will need to be a member of a Google Cloud Platform (GCP) project, with authorization to run BigQuery jobs in order to run this notebook. If you don't have access to a GCP project, please contact the ISB-CGC team for help (www.isb-cgc.org)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "FyrcP8Zs-VkB",
"outputId": "c6401a7b-9efc-4202-a398-94fa55f81c4f"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"You have been successfully authenticated!\n"
]
}
],
"source": [
"from google.colab import auth\n",
"try:\n",
" auth.authenticate_user()\n",
" print('You have been successfully authenticated!')\n",
"except:\n",
" print('You have not been authenticated.')"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "jPYmhcmQNA5u"
},
"source": [
"### Initialize connection to BigQuery\n",
"Once you're authenticated, we'll begin getting set up to pull data out of BigQuery. \n",
"\n",
"The first step is to initialize the BigQuery client. This requires specifying a Google Cloud Platform (GCP) **project id** in which you have the necessary privileges (also referred to as \"roles\") to execute queries and access the data used by this notebook.\n",
"\n",
"\n",
"---\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "92CNdnfsNJUI",
"outputId": "f0846857-53bd-442a-b7cd-31d41274a4c5"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"BigQuery client successfully initialized\n"
]
}
],
"source": [
"from google.cloud import bigquery\n",
"try:\n",
" project_id = 'your_project_number' # Update your_project_number with your project number\n",
" bqclient = bigquery.Client(project=project_id)\n",
" print('BigQuery client successfully initialized')\n",
"except:\n",
" print('Failure to initialize BigQuery client')"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "qRbwf2M8Rqa9"
},
"source": [
"## Install the GEOparse library ##"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 374
},
"colab_type": "code",
"id": "5QzUkfC0Rek-",
"outputId": "b22a3119-486a-45e1-fd83-e45ce1194dad"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting GEOparse\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/67/f6/9206e1acda1858fa9a117ae91d9541e011735e672d58be58a5ee0947ef13/GEOparse-1.1.0.tar.gz (189kB)\n",
"\u001b[K |████████████████████████████████| 194kB 2.8MB/s \n",
"\u001b[?25hRequirement already satisfied: numpy>=1.7 in /usr/local/lib/python3.6/dist-packages (from GEOparse) (1.16.4)\n",
"Requirement already satisfied: pandas>=0.17 in /usr/local/lib/python3.6/dist-packages (from GEOparse) (0.24.2)\n",
"Collecting wgetter>=0.6 (from GEOparse)\n",
" Downloading https://files.pythonhosted.org/packages/8e/ce/7f160ed9f0e16a5365bcbac1dbc6bad1631e9fc91610a444fbdebede3e8b/wgetter-0.7.tar.gz\n",
"Collecting biopython>=1.71 (from GEOparse)\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/28/15/8ac646ff24cfa2588b4d5e5ea51e8d13f3d35806bd9498fbf40ef79026fd/biopython-1.73-cp36-cp36m-manylinux1_x86_64.whl (2.2MB)\n",
"\u001b[K |████████████████████████████████| 2.2MB 36.9MB/s \n",
"\u001b[?25hRequirement already satisfied: python-dateutil>=2.5.0 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.17->GEOparse) (2.5.3)\n",
"Requirement already satisfied: pytz>=2011k in /usr/local/lib/python3.6/dist-packages (from pandas>=0.17->GEOparse) (2018.9)\n",
"Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.5.0->pandas>=0.17->GEOparse) (1.12.0)\n",
"Building wheels for collected packages: GEOparse, wgetter\n",
" Building wheel for GEOparse (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Stored in directory: /root/.cache/pip/wheels/f3/aa/77/45a2f1517e7545aaabce83d4ad371e4f58aa818e4ee38691cd\n",
" Building wheel for wgetter (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Stored in directory: /root/.cache/pip/wheels/89/ce/c2/d0f63a63aec734d8af33209050a861f9a8583e11063dc27419\n",
"Successfully built GEOparse wgetter\n",
"Installing collected packages: wgetter, biopython, GEOparse\n",
"Successfully installed GEOparse-1.1.0 biopython-1.73 wgetter-0.7\n"
]
}
],
"source": [
"!pip install GEOparse"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "b748HDJoQ-Mc"
},
"source": [
"##Step 1: Find data on NCBI GEO##\n",
"\n",
"\n",
"Here's a data set I found by searching for pancreatic cancer:\n",
"\n",
"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE28735"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "A75BvtYlSdS-"
},
"source": [
"## Step 2: Retrieve the data and make a long Pandas table ##"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "lQ3it12NRFkC"
},
"outputs": [],
"source": [
"import GEOparse\n",
"import pandas as pd\n",
"\n",
"## docs: https://geoparse.readthedocs.io/en/latest/usage.html"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "8F7_H2tCR_9w"
},
"outputs": [],
"source": [
"gse = GEOparse.get_GEO(geo=\"GSE28735\", destdir=\"./\")\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1380
},
"colab_type": "code",
"id": "W6rAsMqLSDTC",
"outputId": "60c20d02-68d6-4197-e987-7ee9ae97ff50"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"GSM example:\n",
"Name: GSM711904\n",
"Metadata:\n",
" - title : human pancreatic tumor tissue, patient sample 1\n",
" - geo_accession : GSM711904\n",
" - status : Public on Jul 09 2012\n",
" - submission_date : Apr 20 2011\n",
" - last_update_date : Jul 09 2012\n",
" - type : RNA\n",
" - channel_count : 1\n",
" - source_name_ch1 : tumor tissue, patient 1\n",
" - organism_ch1 : Homo sapiens\n",
" - taxid_ch1 : 9606\n",
" - characteristics_ch1 : tissue: T, survival_month: 51, cancer_death: 1\n",
" - molecule_ch1 : total RNA\n",
" - extract_protocol_ch1 : Tissues were flash frozen immediately after surgery.RNA from frozen tissue samples was extracted using standard TRIZOL (Invitrogen) protocol. RNA quality was confirmed with the Agilent 2100 Bioanalyzer (Agilent Technologies) before the microarray gene expression profiling\n",
" - label_ch1 : biotin\n",
" - label_protocol_ch1 : Samples were enzymatically fragmented and biotinylated using the WT Terminal Labeling Kit (Affymetrix)\n",
" - hyb_protocol : Samples were hybridized using Affymetrix GeneChip Hybridization, Wash, and Stain Kit and according to the manufacturer's protocol.\n",
" - scan_protocol : Affymetrix Gene ChIP Scanner 3000 7G\n",
" - description : cancer related death\n",
" - data_processing : All arrays were RMA normalized and gene expression summaries were created for each gene by averaging all probe sets for each gene using Partek Genomics Suite 6.5. All data analysis was performed on gene summarized data.\n",
" - platform_id : GPL6244\n",
" - contact_name : Perwez,,Hussain\n",
" - contact_email : hussainp@mail.nih.gov\n",
" - contact_institute : NCI/NIH\n",
" - contact_address : 37 Convent Drive\n",
" - contact_city : Bethesda\n",
" - contact_zip/postal_code : 20892\n",
" - contact_country : USA\n",
" - supplementary_file : ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM711nnn/GSM711904/suppl/GSM711904.CEL.gz\n",
" - series_id : GSE28735\n",
" - data_row_count : 28869\n",
"Table data:\n",
" ID_REF VALUE\n",
"0 8157281 2.87856\n",
"1 7997332 4.81907\n",
"2 8072798 4.79498\n",
"3 7972808 3.91443\n",
"4 8157283 6.54982\n",
"\n",
"GPL example:\n",
"Name: GPL6244\n",
"Metadata:\n",
" - title : [HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [transcript (gene) version]\n",
" - geo_accession : GPL6244\n",
" - status : Public on Dec 05 2007\n",
" - submission_date : Dec 05 2007\n",
" - last_update_date : Jul 26 2018\n",
" - technology : in situ oligonucleotide\n",
" - distribution : commercial\n",
" - organism : Homo sapiens\n",
" - taxid : 9606\n",
" - manufacturer : Affymetrix\n",
" - manufacture_protocol : See manufacturer's web site, \n",
" - description : Affymetrix submissions are typically submitted to GEO using the GEOarchive method described at http://www.ncbi.nlm.nih.gov/projects/geo/info/geo_affy.html, , June 03, 2009: annotation table updated with netaffx build 28, June 18, 2012: annotation table updated with netaffx build 32, July 01, 2016: annotation table updated with netaffx build 35\n",
" - web_link : http://www.affymetrix.com/support/technical/byproduct.affx?product=hugene-1_0-st-v1, http://www.affymetrix.com/support/technical/libraryfilesmain.affx\n",
" - contact_name : ,,Affymetrix, Inc.\n",
" - contact_email : geo@ncbi.nlm.nih.gov, support@affymetrix.com\n",
" - contact_phone : 888-362-2447\n",
" - contact_institute : Affymetrix, Inc.\n",
" - contact_address : \n",
" - contact_city : Santa Clara\n",
" - contact_state : CA\n",
" - contact_zip/postal_code : 95051\n",
" - contact_country : USA\n",
" - contact_web_link : http://www.affymetrix.com/index.affx\n",
" - relation : Alternative to: GPL10063 (Alternative CDF), Alternative to: GPL10666 (Alternative CDF), Alternative to: GPL11209 (Alternative CDF), Alternative to: GPL13243 (Alternative CDF [HuGene10stv1_Hs_ENTREZG_11.0.1]), Alternative to: GPL14010 (Alternative CDF), Alternative to: GPL15034 (Alternative CDF [HuGene10stv1_Hs_ENTREZG_v14]), Alternative to: GPL15424 (alternative CDF [HuGene10stv1_Hs_ENST]), Alternative to: GPL15648 (Alternative CDF), Alternative to: GPL15969 (Alternative CDF [HuGene1_0_genecentric]), Alternative to: GPL16239 (Alternative CDF), Alternative to: GPL16332 (Alternative CDF [MMBGX-Ensembl64]), Alternative to: GPL16522 (Alternative CDF [HuGene10stv1_Hs_ENTREZG_15.1.0]), Alternative to: GPL16786 (Alternative CDF [HuGene10stv1_Hs_ENSG_v13.0.0]), Alternative to: GPL16987 (Alternative CDF [GATExplorer_Ensembl v57]), Alternative to: GPL17047 ( Alternative CDF [HuGene10stv1_Hs_ENTREZG_15.0.0]), Alternative to: GPL17244 (Alternative CDF [HuGene10stv1_Hs_ENTREZG_16.0.0]), Alternative to: GPL17556 (alternative), Alternative to: GPL17737 (Alternative CDF), Alternative to: GPL18401 (AltAnalyze probeset-to-Ensembl), Alternative to: GPL18412 (Alternative CDF [hugene10stv1_74_030]), Alternative to: GPL18695 (alternative CDF[Hugene10st_Hs_ENST_16.0.0]), Alternative to: GPL19145 (Alternative CDF [hugene10st_Hs_ENSG_16.0.0]), Alternative to: GPL19433 (Alternative CDF [hugene10st_Hs_ENSGvs1500.cdf]), Alternative to: GPL20171 (Alternative CDF [hugene10st_Hs_ENTREZG_19.0.0]), Alternative to: GPL23526 (Alternative CDF [hugene10st_Hs_ENTREZG_20])\n",
" - data_row_count : 33297\n",
"Table data:\n",
" ID ... category\n",
"0 7896736 ... main\n",
"1 7896738 ... main\n",
"2 7896740 ... main\n",
"3 7896742 ... main\n",
"4 7896744 ... main\n",
"\n",
"[5 rows x 12 columns]\n"
]
}
],
"source": [
"\n",
"print()\n",
"print(\"GSM example:\")\n",
"for gsm_name, gsm in gse.gsms.items():\n",
" print(\"Name: \", gsm_name)\n",
" print(\"Metadata:\",)\n",
" for key, value in gsm.metadata.items():\n",
" print(\" - %s : %s\" % (key, \", \".join(value)))\n",
" print (\"Table data:\",)\n",
" print (gsm.table.head())\n",
" break\n",
"\n",
"print()\n",
"print(\"GPL example:\")\n",
"for gpl_name, gpl in gse.gpls.items():\n",
" print(\"Name: \", gpl_name)\n",
" print(\"Metadata:\",)\n",
" for key, value in gpl.metadata.items():\n",
" print(\" - %s : %s\" % (key, \", \".join(value)))\n",
" print(\"Table data:\",)\n",
" print(gpl.table.head())\n",
" break"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "9Au9tS2USmb2",
"outputId": "38878913-342f-4fd6-bcf1-cb1828b19941"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"adc.json GSE28735_family.soft.gz sample_data\n"
]
}
],
"source": [
"\n",
"### Let's check if the data is present ... ###\n",
"!ls"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 54
},
"colab_type": "code",
"id": "Kihf3RrsS_VB",
"outputId": "c6b4718b-19c7-47c9-afde-69c8ed5cf1de"
},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['GSM711904', 'GSM711905', 'GSM711906', 'GSM711907', 'GSM711908', 'GSM711909', 'GSM711910', 'GSM711911', 'GSM711912', 'GSM711913', 'GSM711914', 'GSM711915', 'GSM711916', 'GSM711917', 'GSM711918', 'GSM711919', 'GSM711920', 'GSM711921', 'GSM711922', 'GSM711923', 'GSM711924', 'GSM711925', 'GSM711926', 'GSM711927', 'GSM711928', 'GSM711929', 'GSM711930', 'GSM711931', 'GSM711932', 'GSM711933', 'GSM711934', 'GSM711935', 'GSM711936', 'GSM711937', 'GSM711938', 'GSM711939', 'GSM711940', 'GSM711941', 'GSM711942', 'GSM711943', 'GSM711944', 'GSM711945', 'GSM711946', 'GSM711947', 'GSM711948', 'GSM711949', 'GSM711950', 'GSM711951', 'GSM711952', 'GSM711953', 'GSM711954', 'GSM711955', 'GSM711956', 'GSM711957', 'GSM711958', 'GSM711959', 'GSM711960', 'GSM711961', 'GSM711962', 'GSM711963', 'GSM711964', 'GSM711965', 'GSM711966', 'GSM711967', 'GSM711968', 'GSM711969', 'GSM711970', 'GSM711971', 'GSM711972', 'GSM711973', 'GSM711974', 'GSM711975', 'GSM711976', 'GSM711977', 'GSM711978', 'GSM711979', 'GSM711980', 'GSM711981', 'GSM711982', 'GSM711983', 'GSM711984', 'GSM711985', 'GSM711986', 'GSM711987', 'GSM711988', 'GSM711989', 'GSM711990', 'GSM711991', 'GSM711992', 'GSM711993'])"
]
},
"execution_count": 20,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"### Here's the sample names ###\n",
"\n",
"gse.gsms.keys()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "KCEK3E3-Tl0I"
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "oLMDSxbZTbAp",
"outputId": "7621422d-d159-4031-fe32-58f1ee5a0580"
},
"outputs": [
{
"data": {
"text/plain": [
"(28869, 90)"
]
},
"execution_count": 56,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"### To make a pandas table\n",
"\n",
"dataList = []\n",
"keyList = []\n",
"\n",
"\n",
"for gsm_name, gsm in gse.gsms.items():\n",
" dataList.append(gsm.table['VALUE'])\n",
" keyList.append(gsm_name) \n",
" \n",
"dt = pd.concat(dataList, axis=1, keys=keyList)\n",
"\n",
"dt.shape"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 284
},
"colab_type": "code",
"id": "1tpaj86DUwFZ",
"outputId": "015361f5-2347-406a-a46f-31778f2b94ff"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" GSM711904 | \n",
" GSM711905 | \n",
" GSM711906 | \n",
" GSM711907 | \n",
" GSM711908 | \n",
" GSM711909 | \n",
" GSM711910 | \n",
" GSM711911 | \n",
" GSM711912 | \n",
" GSM711913 | \n",
" GSM711914 | \n",
" GSM711915 | \n",
" GSM711916 | \n",
" GSM711917 | \n",
" GSM711918 | \n",
" GSM711919 | \n",
" GSM711920 | \n",
" GSM711921 | \n",
" GSM711922 | \n",
" GSM711923 | \n",
" GSM711924 | \n",
" GSM711925 | \n",
" GSM711926 | \n",
" GSM711927 | \n",
" GSM711928 | \n",
" GSM711929 | \n",
" GSM711930 | \n",
" GSM711931 | \n",
" GSM711932 | \n",
" GSM711933 | \n",
" GSM711934 | \n",
" GSM711935 | \n",
" GSM711936 | \n",
" GSM711937 | \n",
" GSM711938 | \n",
" GSM711939 | \n",
" GSM711940 | \n",
" GSM711941 | \n",
" GSM711942 | \n",
" GSM711943 | \n",
" ... | \n",
" GSM711954 | \n",
" GSM711955 | \n",
" GSM711956 | \n",
" GSM711957 | \n",
" GSM711958 | \n",
" GSM711959 | \n",
" GSM711960 | \n",
" GSM711961 | \n",
" GSM711962 | \n",
" GSM711963 | \n",
" GSM711964 | \n",
" GSM711965 | \n",
" GSM711966 | \n",
" GSM711967 | \n",
" GSM711968 | \n",
" GSM711969 | \n",
" GSM711970 | \n",
" GSM711971 | \n",
" GSM711972 | \n",
" GSM711973 | \n",
" GSM711974 | \n",
" GSM711975 | \n",
" GSM711976 | \n",
" GSM711977 | \n",
" GSM711978 | \n",
" GSM711979 | \n",
" GSM711980 | \n",
" GSM711981 | \n",
" GSM711982 | \n",
" GSM711983 | \n",
" GSM711984 | \n",
" GSM711985 | \n",
" GSM711986 | \n",
" GSM711987 | \n",
" GSM711988 | \n",
" GSM711989 | \n",
" GSM711990 | \n",
" GSM711991 | \n",
" GSM711992 | \n",
" GSM711993 | \n",
"
\n",
" \n",
" ID_REF | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 8157281 | \n",
" 2.87856 | \n",
" 3.11238 | \n",
" 2.53830 | \n",
" 3.06354 | \n",
" 3.00862 | \n",
" 2.86031 | \n",
" 2.79610 | \n",
" 2.70925 | \n",
" 2.93254 | \n",
" 2.63424 | \n",
" 2.95893 | \n",
" 2.87200 | \n",
" 2.68794 | \n",
" 2.87043 | \n",
" 3.00570 | \n",
" 2.88930 | \n",
" 2.67934 | \n",
" 2.57238 | \n",
" 3.27607 | \n",
" 2.64049 | \n",
" 3.06408 | \n",
" 2.73406 | \n",
" 2.77169 | \n",
" 2.82460 | \n",
" 2.53917 | \n",
" 2.89397 | \n",
" 3.02304 | \n",
" 2.54533 | \n",
" 2.68656 | \n",
" 2.71329 | \n",
" 2.78667 | \n",
" 3.05640 | \n",
" 2.64101 | \n",
" 2.65976 | \n",
" 2.59429 | \n",
" 2.48407 | \n",
" 2.76008 | \n",
" 2.80774 | \n",
" 2.87721 | \n",
" 2.65070 | \n",
" ... | \n",
" 2.87857 | \n",
" 2.56825 | \n",
" 2.83285 | \n",
" 2.79112 | \n",
" 2.54681 | \n",
" 3.67406 | \n",
" 2.50075 | \n",
" 2.78080 | \n",
" 2.78987 | \n",
" 3.04386 | \n",
" 2.71545 | \n",
" 2.80313 | \n",
" 2.77076 | \n",
" 2.47390 | \n",
" 3.23515 | \n",
" 3.36002 | \n",
" 2.94759 | \n",
" 2.70444 | \n",
" 2.78727 | \n",
" 2.74598 | \n",
" 2.85543 | \n",
" 2.75771 | \n",
" 2.59462 | \n",
" 2.68376 | \n",
" 2.70517 | \n",
" 2.55055 | \n",
" 3.08985 | \n",
" 2.68454 | \n",
" 2.82559 | \n",
" 2.46607 | \n",
" 2.95069 | \n",
" 3.08348 | \n",
" 2.53686 | \n",
" 2.59509 | \n",
" 2.81026 | \n",
" 2.73692 | \n",
" 2.56131 | \n",
" 2.55399 | \n",
" 2.33127 | \n",
" 2.84898 | \n",
"
\n",
" \n",
" 7997332 | \n",
" 4.81907 | \n",
" 5.11957 | \n",
" 3.84534 | \n",
" 4.88599 | \n",
" 4.21256 | \n",
" 5.21558 | \n",
" 4.44358 | \n",
" 4.84622 | \n",
" 4.39662 | \n",
" 5.31573 | \n",
" 4.04326 | \n",
" 4.90359 | \n",
" 4.66811 | \n",
" 4.61300 | \n",
" 3.99221 | \n",
" 4.26957 | \n",
" 4.76269 | \n",
" 5.55366 | \n",
" 4.36159 | \n",
" 4.55511 | \n",
" 4.68806 | \n",
" 4.54581 | \n",
" 4.68843 | \n",
" 4.36278 | \n",
" 4.00679 | \n",
" 5.50511 | \n",
" 3.78917 | \n",
" 5.88371 | \n",
" 4.68404 | \n",
" 4.68971 | \n",
" 4.26084 | \n",
" 5.25235 | \n",
" 4.81340 | \n",
" 5.05444 | \n",
" 5.69069 | \n",
" 5.44038 | \n",
" 4.59510 | \n",
" 4.92991 | \n",
" 4.88100 | \n",
" 4.55881 | \n",
" ... | \n",
" 4.56882 | \n",
" 4.55533 | \n",
" 4.81932 | \n",
" 5.96177 | \n",
" 4.58744 | \n",
" 5.35493 | \n",
" 4.00745 | \n",
" 5.07203 | \n",
" 4.30874 | \n",
" 5.60549 | \n",
" 4.39004 | \n",
" 5.03262 | \n",
" 4.19159 | \n",
" 4.66786 | \n",
" 4.41098 | \n",
" 5.50174 | \n",
" 4.37206 | \n",
" 4.25277 | \n",
" 4.55099 | \n",
" 5.58128 | \n",
" 4.25922 | \n",
" 5.78943 | \n",
" 4.64110 | \n",
" 5.39257 | \n",
" 4.51210 | \n",
" 4.94286 | \n",
" 4.66434 | \n",
" 5.09756 | \n",
" 4.48413 | \n",
" 4.87422 | \n",
" 4.99665 | \n",
" 5.13753 | \n",
" 4.71479 | \n",
" 5.75471 | \n",
" 4.96034 | \n",
" 5.30360 | \n",
" 4.73077 | \n",
" 4.74355 | \n",
" 4.92834 | \n",
" 5.39472 | \n",
"
\n",
" \n",
" 8072798 | \n",
" 4.79498 | \n",
" 3.33289 | \n",
" 4.46755 | \n",
" 4.21402 | \n",
" 3.96569 | \n",
" 5.93537 | \n",
" 5.32265 | \n",
" 5.61217 | \n",
" 4.00068 | \n",
" 4.75637 | \n",
" 4.22405 | \n",
" 4.82199 | \n",
" 4.26785 | \n",
" 4.43000 | \n",
" 3.93890 | \n",
" 4.68159 | \n",
" 4.57920 | \n",
" 5.63977 | \n",
" 4.40216 | \n",
" 4.90237 | \n",
" 4.90333 | \n",
" 3.87491 | \n",
" 4.55239 | \n",
" 5.81248 | \n",
" 4.43769 | \n",
" 3.87557 | \n",
" 4.11975 | \n",
" 4.04842 | \n",
" 4.39872 | \n",
" 5.36391 | \n",
" 4.92992 | \n",
" 3.55703 | \n",
" 4.05370 | \n",
" 4.16998 | \n",
" 3.95932 | \n",
" 4.28915 | \n",
" 4.84668 | \n",
" 4.96392 | \n",
" 4.97775 | \n",
" 5.06941 | \n",
" ... | \n",
" 4.98281 | \n",
" 5.00488 | \n",
" 4.94275 | \n",
" 4.51843 | \n",
" 5.72931 | \n",
" 3.82826 | \n",
" 5.29366 | \n",
" 3.90850 | \n",
" 4.85564 | \n",
" 3.70039 | \n",
" 4.79108 | \n",
" 4.93422 | \n",
" 4.48444 | \n",
" 4.66008 | \n",
" 4.48499 | \n",
" 4.10805 | \n",
" 5.17420 | \n",
" 5.20986 | \n",
" 4.47858 | \n",
" 4.63110 | \n",
" 5.60834 | \n",
" 4.55652 | \n",
" 4.35247 | \n",
" 4.86911 | \n",
" 4.66227 | \n",
" 4.94809 | \n",
" 4.68328 | \n",
" 4.49687 | \n",
" 4.93756 | \n",
" 5.40975 | \n",
" 3.97622 | \n",
" 3.83820 | \n",
" 4.40752 | \n",
" 3.56449 | \n",
" 4.46081 | \n",
" 4.52637 | \n",
" 5.26350 | \n",
" 5.08635 | \n",
" 4.60016 | \n",
" 4.87154 | \n",
"
\n",
" \n",
" 7972808 | \n",
" 3.91443 | \n",
" 3.73899 | \n",
" 3.56541 | \n",
" 3.65571 | \n",
" 3.81255 | \n",
" 3.47989 | \n",
" 3.97206 | \n",
" 3.23493 | \n",
" 3.65640 | \n",
" 3.47709 | \n",
" 3.24363 | \n",
" 4.19606 | \n",
" 3.71938 | \n",
" 3.41108 | \n",
" 3.60706 | \n",
" 3.46097 | \n",
" 3.47632 | \n",
" 3.46287 | \n",
" 3.42344 | \n",
" 3.48008 | \n",
" 3.32608 | \n",
" 3.93251 | \n",
" 3.24573 | \n",
" 3.59813 | \n",
" 3.77408 | \n",
" 4.04420 | \n",
" 3.71794 | \n",
" 3.26882 | \n",
" 3.20658 | \n",
" 3.43510 | \n",
" 3.45698 | \n",
" 3.92237 | \n",
" 3.58156 | \n",
" 3.60522 | \n",
" 3.45792 | \n",
" 3.67040 | \n",
" 3.26171 | \n",
" 3.25425 | \n",
" 3.47135 | \n",
" 3.26036 | \n",
" ... | \n",
" 3.61229 | \n",
" 3.46821 | \n",
" 3.44855 | \n",
" 3.39108 | \n",
" 3.44181 | \n",
" 4.25334 | \n",
" 3.42747 | \n",
" 4.48789 | \n",
" 3.50292 | \n",
" 3.65136 | \n",
" 3.32469 | \n",
" 3.46204 | \n",
" 3.52742 | \n",
" 3.38724 | \n",
" 3.61432 | \n",
" 3.59659 | \n",
" 3.63159 | \n",
" 3.28400 | \n",
" 3.37072 | \n",
" 3.24871 | \n",
" 3.70275 | \n",
" 3.46703 | \n",
" 3.64926 | \n",
" 3.37321 | \n",
" 3.42203 | \n",
" 3.09829 | \n",
" 3.49502 | \n",
" 3.30841 | \n",
" 3.58856 | \n",
" 3.25516 | \n",
" 3.66669 | \n",
" 3.56927 | \n",
" 3.84543 | \n",
" 3.58422 | \n",
" 3.60664 | \n",
" 3.31744 | \n",
" 3.82820 | \n",
" 3.71242 | \n",
" 3.39796 | \n",
" 3.48399 | \n",
"
\n",
" \n",
" 8157283 | \n",
" 6.54982 | \n",
" 6.00254 | \n",
" 6.52508 | \n",
" 6.14602 | \n",
" 6.33795 | \n",
" 6.46008 | \n",
" 6.90400 | \n",
" 6.79776 | \n",
" 6.57530 | \n",
" 6.28204 | \n",
" 6.72710 | \n",
" 6.32471 | \n",
" 6.40431 | \n",
" 6.65186 | \n",
" 6.66155 | \n",
" 6.99649 | \n",
" 6.80546 | \n",
" 6.56490 | \n",
" 6.39291 | \n",
" 6.92403 | \n",
" 6.75612 | \n",
" 6.47914 | \n",
" 6.49044 | \n",
" 6.62358 | \n",
" 6.46638 | \n",
" 6.03631 | \n",
" 6.42088 | \n",
" 6.41698 | \n",
" 6.47206 | \n",
" 6.80258 | \n",
" 6.65255 | \n",
" 6.36393 | \n",
" 6.55296 | \n",
" 6.35654 | \n",
" 6.23710 | \n",
" 6.08812 | \n",
" 6.62913 | \n",
" 6.39054 | \n",
" 7.02084 | \n",
" 6.89889 | \n",
" ... | \n",
" 6.13243 | \n",
" 6.57235 | \n",
" 6.55251 | \n",
" 6.51254 | \n",
" 6.64045 | \n",
" 6.69901 | \n",
" 6.63028 | \n",
" 6.05628 | \n",
" 6.70075 | \n",
" 6.35385 | \n",
" 6.73580 | \n",
" 6.47082 | \n",
" 6.51137 | \n",
" 7.07220 | \n",
" 6.60414 | \n",
" 6.46276 | \n",
" 6.59864 | \n",
" 6.59402 | \n",
" 7.10924 | \n",
" 6.78148 | \n",
" 6.59497 | \n",
" 6.24138 | \n",
" 6.27252 | \n",
" 6.54214 | \n",
" 6.80291 | \n",
" 6.89463 | \n",
" 6.79239 | \n",
" 6.66943 | \n",
" 6.83302 | \n",
" 6.59577 | \n",
" 5.84439 | \n",
" 6.26302 | \n",
" 6.56643 | \n",
" 6.22398 | \n",
" 6.71857 | \n",
" 6.49037 | \n",
" 6.82428 | \n",
" 6.57863 | \n",
" 6.79043 | \n",
" 6.65992 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 90 columns
\n",
"
"
],
"text/plain": [
" GSM711904 GSM711905 GSM711906 ... GSM711991 GSM711992 GSM711993\n",
"ID_REF ... \n",
"8157281 2.87856 3.11238 2.53830 ... 2.55399 2.33127 2.84898\n",
"7997332 4.81907 5.11957 3.84534 ... 4.74355 4.92834 5.39472\n",
"8072798 4.79498 3.33289 4.46755 ... 5.08635 4.60016 4.87154\n",
"7972808 3.91443 3.73899 3.56541 ... 3.71242 3.39796 3.48399\n",
"8157283 6.54982 6.00254 6.52508 ... 6.57863 6.79043 6.65992\n",
"\n",
"[5 rows x 90 columns]"
]
},
"execution_count": 57,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"dt[0:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "M8UcRNQbaAPo"
},
"outputs": [],
"source": [
"\n",
"# Here we convert the wide matrix to a long format #\n",
"longTable = dt.unstack().reset_index() "
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"colab_type": "code",
"id": "4g0ZxivwaLQQ",
"outputId": "ffb7cfda-74ee-43f1-c23b-adaa1d8da3c6"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" level_0 | \n",
" ID_REF | \n",
" 0 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" GSM711904 | \n",
" 8157281 | \n",
" 2.87856 | \n",
"
\n",
" \n",
" 1 | \n",
" GSM711904 | \n",
" 7997332 | \n",
" 4.81907 | \n",
"
\n",
" \n",
" 2 | \n",
" GSM711904 | \n",
" 8072798 | \n",
" 4.79498 | \n",
"
\n",
" \n",
" 3 | \n",
" GSM711904 | \n",
" 7972808 | \n",
" 3.91443 | \n",
"
\n",
" \n",
" 4 | \n",
" GSM711904 | \n",
" 8157283 | \n",
" 6.54982 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" level_0 ID_REF 0\n",
"0 GSM711904 8157281 2.87856\n",
"1 GSM711904 7997332 4.81907\n",
"2 GSM711904 8072798 4.79498\n",
"3 GSM711904 7972808 3.91443\n",
"4 GSM711904 8157283 6.54982"
]
},
"execution_count": 74,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"longTable[0:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "zu1UvoHXbB4v"
},
"outputs": [],
"source": [
"# We can rename the columns\n",
"longTable = longTable.set_axis(['Sample', 'Entrez', 'Value'], axis=1, inplace=False)\n"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"colab_type": "code",
"id": "RLpsVSF5cOEW",
"outputId": "7e35a5b8-8630-4ae2-a35f-30aeeb18e98e"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sample | \n",
" Entrez | \n",
" Value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" GSM711904 | \n",
" 8157281 | \n",
" 2.87856 | \n",
"
\n",
" \n",
" 1 | \n",
" GSM711904 | \n",
" 7997332 | \n",
" 4.81907 | \n",
"
\n",
" \n",
" 2 | \n",
" GSM711904 | \n",
" 8072798 | \n",
" 4.79498 | \n",
"
\n",
" \n",
" 3 | \n",
" GSM711904 | \n",
" 7972808 | \n",
" 3.91443 | \n",
"
\n",
" \n",
" 4 | \n",
" GSM711904 | \n",
" 8157283 | \n",
" 6.54982 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sample Entrez Value\n",
"0 GSM711904 8157281 2.87856\n",
"1 GSM711904 7997332 4.81907\n",
"2 GSM711904 8072798 4.79498\n",
"3 GSM711904 7972808 3.91443\n",
"4 GSM711904 8157283 6.54982"
]
},
"execution_count": 78,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"longTable[0:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "vRR-ybzhcMTy"
},
"outputs": [],
"source": [
"pd."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "hNq5wMdVc80E"
},
"source": [
"## Step 3: Upload to BigQuery Table"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "zLk6INaqd43s"
},
"outputs": [],
"source": [
"import pandas_gbq as gbq\n",
"\n",
"gbq.to_gbq(longTable, 'Daves_working_area.my_table', project_id, if_exists='fail')\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "dLi1ZUAdfMQ-"
},
"source": [
"And that's it!"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [
"wYNMaJPlMEBA",
"jPYmhcmQNA5u",
"qRbwf2M8Rqa9"
],
"name": "GEO2BQ.ipynb",
"provenance": [],
"version": "0.3.2"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}