{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial 3: Joining dataframes with `cptac`\n", "\n", "In this tutorial, we provide several examples of how to use the built-in `cptac` functions for joining different dataframes.\n", "\n", "We will do this on data for endometrial carcinoma. First we need to import the package and create an endometrial data object, which we call `en`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Data type Available sources\n", "0 CNV [bcm, washu]\n", "1 circular_RNA [bcm]\n", "2 miRNA [bcm, washu]\n", "3 proteomics [bcm, umich]\n", "4 transcriptomics [bcm, broad, washu]\n", "5 ancestry_prediction [harmonized]\n", "6 somatic_mutation [harmonized, washu]\n", "7 clinical [mssm]\n", "8 follow-up [mssm]\n", "9 medical_history [mssm]\n", "10 acetylproteomics [umich]\n", "11 phosphoproteomics [umich]\n", "12 cibersort [washu]\n", "13 hla_typing [washu]\n", "14 tumor_purity [washu]\n", "15 xcell [washu]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Start by importing the cptac package\n", "import cptac\n", "\n", "# Create an endometrial data object, named 'en'\n", "en = cptac.Ucec()\n", "\n", "# List the available data sources\n", "en.list_data_sources()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`en.list_data_sources()` shows the types of data available in the dataset and their respective sources. For example, proteomics data is available from bcm and umich, and transcriptomics data is available from bcm, broad, and washu." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Name A1BG A1BG-AS1 A1CF \\\n", "Database_ID ENSG00000121410.12 ENSG00000268895.6 ENSG00000148584.15 \n", "Patient_ID \n", "C3L-00006 2.54 5.11 3.60 \n", "C3L-00008 4.40 4.63 5.49 \n", "C3L-00032 4.83 7.26 3.73 \n", "C3L-00084 4.73 6.01 5.37 \n", "C3L-00090 4.14 6.24 5.69 \n", "\n", "Name A2M A2M-AS1 A2ML1 \\\n", "Database_ID ENSG00000175899.15 ENSG00000245105.4 ENSG00000166535.20 \n", "Patient_ID \n", "C3L-00006 13.75 6.45 7.08 \n", "C3L-00008 13.89 6.61 6.97 \n", "C3L-00032 14.48 6.91 9.56 \n", "C3L-00084 15.17 7.93 3.86 \n", "C3L-00090 13.87 6.79 4.32 \n", "\n", "Name A2ML1-AS1 A2ML1-AS2 A2MP1 \\\n", "Database_ID ENSG00000256661.1 ENSG00000256904.1 ENSG00000256069.7 \n", "Patient_ID \n", "C3L-00006 1.80 0.00 2.60 \n", "C3L-00008 0.00 2.74 3.25 \n", "C3L-00032 0.98 0.00 3.26 \n", "C3L-00084 0.00 0.00 3.73 \n", "C3L-00090 0.00 0.00 3.23 \n", "\n", "Name A3GALT2 ... ZXDB ZXDC \\\n", "Database_ID ENSG00000184389.9 ... ENSG00000198455.4 ENSG00000070476.15 \n", "Patient_ID ... \n", "C3L-00006 1.16 ... 10.17 10.61 \n", "C3L-00008 0.00 ... 9.79 10.48 \n", "C3L-00032 0.00 ... 9.43 9.97 \n", "C3L-00084 1.15 ... 9.23 10.37 \n", "C3L-00090 0.00 ... 
9.69 9.64 \n", "\n", "Name ZYG11A ZYG11AP1 ZYG11B \\\n", "Database_ID ENSG00000203995.10 ENSG00000232242.2 ENSG00000162378.13 \n", "Patient_ID \n", "C3L-00006 5.54 0.0 11.85 \n", "C3L-00008 7.79 0.0 12.28 \n", "C3L-00032 6.48 0.0 11.72 \n", "C3L-00084 7.47 0.0 11.86 \n", "C3L-00090 7.60 0.0 11.98 \n", "\n", "Name ZYX ZYXP1 ZZEF1 \\\n", "Database_ID ENSG00000159840.16 ENSG00000274572.1 ENSG00000074755.15 \n", "Patient_ID \n", "C3L-00006 10.60 0.0 11.87 \n", "C3L-00008 11.28 0.0 11.93 \n", "C3L-00032 10.37 0.0 11.70 \n", "C3L-00084 10.13 0.0 11.19 \n", "C3L-00090 10.31 0.0 11.45 \n", "\n", "Name hsa-mir-1253 hsa-mir-423 \n", "Database_ID ENSG00000272920.1 ENSG00000266919.3 \n", "Patient_ID \n", "C3L-00006 0.0 0.0 \n", "C3L-00008 0.0 0.0 \n", "C3L-00032 0.0 0.0 \n", "C3L-00084 0.0 0.0 \n", "C3L-00090 0.0 0.0 \n", "\n", "[5 rows x 59286 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Retrieve the transcriptomics data from bcm\n", "bcm_data = en.get_transcriptomics('bcm')\n", "\n", "# Display the first few rows of the dataframe\n", "bcm_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above code, get_transcriptomics('bcm') is used to retrieve the transcriptomics data from bcm. Each row represents a different patient, and each column corresponds to a different gene." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## General format\n", "\n", "cptac has a helpful function called `multi_join`. It allows data from several different cptac dataframes to be joined at the same time.\n", "\n", "To use `multi_join`, you specify the dataframes you want to join by passing a dictionary of their names to the function call. 
The function will automatically check that the dataframes whose names you provided are valid for the join function, and print an error message if they aren't.\n", "\n", "Whenever a column from an -omics dataframe is included in a joined table, the source and name of the -omics dataframe it came from are appended to the column header (for example, `ARF5_umich_proteomics`), to avoid confusion.\n", "\n", "If you wish to only include particular columns in the join, include them as values in the dictionary. All values will accept either a single column name as a string, or a list of column name strings. In this use case, we will usually only select specific columns for readability, but you could select the whole dataframe in all these cases, except for the mutations dataframe.\n", "\n", "The join functions use logic analogous to an SQL INNER JOIN." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join dictionary\n", "\n", "The main parameter for the `multi_join` function is a dictionary with source and datatype as a key, and specific columns as a value. Because a datatype can be available from more than one source, the desired source needs to be included. This can be done in two different ways. The first is by using a string that contains the source, a space, and then the datatype. The second is by using a tuple formatted (source, datatype). For example, using:\n", "\n", "`{('umich', 'proteomics'): ''}`\n", "\n", "or\n", "\n", "`{\"umich proteomics\": ''}`\n", "\n", "as the join dictionary would each result in `multi_join` returning a dataframe containing only umich proteomics data.\n", "\n", "You'll notice the value in the key:value pair is an empty string. Because a dictionary needs to have a value for each key, an empty string or an empty list means we want everything from the specified dataframe. If a string or list of strings is specified, the joined dataframe will only contain the specified columns. See below for more examples." 
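, "\n", "\n", "
The two key styles and the accepted value types can be illustrated with a minimal sketch in plain Python. The helper `normalize_join_dict` below is hypothetical and is not part of `cptac`; it only mimics the conventions described above (string or tuple keys, and an empty string, empty list, single name, or list of names as values) and makes no claim about how `multi_join` works internally.

```python
def normalize_join_dict(join_dict):
    # Hypothetical helper (not a cptac function): turn every key into a
    # (source, datatype) tuple and every value into a list of column names.
    normalized = {}
    for key, columns in join_dict.items():
        if isinstance(key, tuple):
            source, datatype = key
        else:
            # String keys hold the source, a space, then the datatype.
            source, datatype = key.split(' ', 1)
        if isinstance(columns, str):
            # A single column name becomes a one-item list; '' becomes [].
            columns = [columns] if columns else []
        # In this sketch an empty list stands for 'select every column'.
        normalized[(source, datatype)] = list(columns)
    return normalized


string_keys = normalize_join_dict({'umich proteomics': '', 'bcm transcriptomics': 'A1BG'})
tuple_keys = normalize_join_dict({('umich', 'proteomics'): [], ('bcm', 'transcriptomics'): ['A1BG']})

print(string_keys == tuple_keys)  # True
print(string_keys[('bcm', 'transcriptomics')])  # ['A1BG']
```

Both spellings normalize to the same dictionary in this sketch, mirroring the statement above that either key style may be used.
"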
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join omics to omics\n", "\n", "`multi_join` can join two -omics dataframes to each other. Types of -omics data valid for use with this function are acetylproteomics, CNV, phosphoproteomics, phosphoproteomics_gene, proteomics, and transcriptomics." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cptac warning: Your version of cptac (1.5.1) is out-of-date. Latest is 1.5.0. Please run 'pip install --upgrade cptac' to update it. (C:\\Users\\sabme\\anaconda3\\lib\\threading.py, line 910)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "Name ARF5_umich_proteomics M6PR_umich_proteomics \\\n", "Database_ID ENSP00000000233.5 ENSP00000000412.3 \n", "Patient_ID \n", "C3L-00006 -0.056513 0.016557 \n", "C3L-00008 0.549959 -0.206129 \n", "C3L-00032 0.088681 -0.154447 \n", "C3L-00084 -0.846555 0.027740 \n", "C3L-00090 0.539019 0.956619 \n", "\n", "Name ESRRA_umich_proteomics FKBP4_umich_proteomics \\\n", "Database_ID ENSP00000000442.6 ENSP00000001008.4 \n", "Patient_ID \n", "C3L-00006 0.002569 0.389819 \n", "C3L-00008 0.905784 -0.303631 \n", "C3L-00032 -0.190515 0.170753 \n", "C3L-00084 NaN 0.178700 \n", "C3L-00090 -0.039516 0.323656 \n", "\n", "Name NDUFAF7_umich_proteomics FUCA2_umich_proteomics \\\n", "Database_ID ENSP00000002125.4 ENSP00000002165.5 \n", "Patient_ID \n", "C3L-00006 0.603610 -0.332543 \n", "C3L-00008 0.018767 0.503513 \n", "C3L-00032 0.196356 0.544194 \n", "C3L-00084 0.264054 -0.183548 \n", "C3L-00090 0.064605 0.173433 \n", "\n", "Name DBNDD1_umich_proteomics SEMA3F_umich_proteomics \\\n", "Database_ID ENSP00000002501.6 ENSP00000002829.3 \n", "Patient_ID \n", "C3L-00006 -0.790426 NaN \n", "C3L-00008 0.950955 0.080142 \n", "C3L-00032 -0.179078 NaN \n", "C3L-00084 0.077215 -0.247164 \n", "C3L-00090 -0.524325 -0.038590 \n", "\n", "Name CFTR_umich_proteomics CYP51A1_umich_proteomics ... \\\n", "Database_ID ENSP00000003084.6 ENSP00000003100.8 ... \n", "Patient_ID ... \n", "C3L-00006 0.822732 0.039134 ... \n", "C3L-00008 NaN -0.063213 ... \n", "C3L-00032 NaN 0.377405 ... \n", "C3L-00084 0.152277 -0.279549 ... \n", "C3L-00090 -0.311486 0.309905 ... 
\n", "\n", "Name ZXDB_bcm_transcriptomics ZXDC_bcm_transcriptomics \\\n", "Database_ID ENSG00000198455.4 ENSG00000070476.15 \n", "Patient_ID \n", "C3L-00006 10.17 10.61 \n", "C3L-00008 9.79 10.48 \n", "C3L-00032 9.43 9.97 \n", "C3L-00084 9.23 10.37 \n", "C3L-00090 9.69 9.64 \n", "\n", "Name ZYG11A_bcm_transcriptomics ZYG11AP1_bcm_transcriptomics \\\n", "Database_ID ENSG00000203995.10 ENSG00000232242.2 \n", "Patient_ID \n", "C3L-00006 5.54 0.0 \n", "C3L-00008 7.79 0.0 \n", "C3L-00032 6.48 0.0 \n", "C3L-00084 7.47 0.0 \n", "C3L-00090 7.60 0.0 \n", "\n", "Name ZYG11B_bcm_transcriptomics ZYX_bcm_transcriptomics \\\n", "Database_ID ENSG00000162378.13 ENSG00000159840.16 \n", "Patient_ID \n", "C3L-00006 11.85 10.60 \n", "C3L-00008 12.28 11.28 \n", "C3L-00032 11.72 10.37 \n", "C3L-00084 11.86 10.13 \n", "C3L-00090 11.98 10.31 \n", "\n", "Name ZYXP1_bcm_transcriptomics ZZEF1_bcm_transcriptomics \\\n", "Database_ID ENSG00000274572.1 ENSG00000074755.15 \n", "Patient_ID \n", "C3L-00006 0.0 11.87 \n", "C3L-00008 0.0 11.93 \n", "C3L-00032 0.0 11.70 \n", "C3L-00084 0.0 11.19 \n", "C3L-00090 0.0 11.45 \n", "\n", "Name hsa-mir-1253_bcm_transcriptomics hsa-mir-423_bcm_transcriptomics \n", "Database_ID ENSG00000272920.1 ENSG00000266919.3 \n", "Patient_ID \n", "C3L-00006 0.0 0.0 \n", "C3L-00008 0.0 0.0 \n", "C3L-00032 0.0 0.0 \n", "C3L-00084 0.0 0.0 \n", "C3L-00090 0.0 0.0 \n", "\n", "[5 rows x 71948 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Joining two -omics dataframes together using multi_join\n", "prot_and_tran = en.multi_join({\"umich proteomics\":'', \"bcm transcriptomics\":''})\n", "prot_and_tran.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, multi_join is used to join proteomics data from umich and transcriptomics data from bcm into one combined dataframe." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Name ARF5_umich_proteomics A1BG_bcm_transcriptomics\n", "Database_ID ENSP00000000233.5 ENSG00000121410.12\n", "Patient_ID \n", "C3L-00006 -0.056513 2.54\n", "C3L-00008 0.549959 4.40\n", "C3L-00032 0.088681 4.83\n", "C3L-00084 -0.846555 4.73\n", "C3L-00090 0.539019 4.14" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Using multi_join with specified columns\n", "prot_and_tran_selected = en.multi_join({\"umich proteomics\":'ARF5', \"bcm transcriptomics\":'A1BG'})\n", "prot_and_tran_selected.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, multi_join is used again, but this time only the 'ARF5' column from the proteomics data and the 'A1BG' column from the transcriptomics data are included in the resulting dataframe." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join metadata to omics\n", "\n", "The `multi_join` function can also join a metadata dataframe (e.g. clinical or derived_molecular) with an -omics dataframe:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Name tumor_code discovery_study type_of_analyzed_samples_mssm_clinical \\\n", "Database_ID \n", "Patient_ID \n", "C3L-00006 UCEC Yes Tumor_and_Normal \n", "C3L-00008 UCEC Yes Tumor \n", "C3L-00032 UCEC Yes Tumor \n", "C3L-00084 UCEC Yes Tumor \n", "C3L-00090 UCEC Yes Tumor \n", "\n", "Name confirmatory_study type_of_analyzed_samples_mssm_clinical age \\\n", "Database_ID \n", "Patient_ID \n", "C3L-00006 NaN NaN 64 \n", "C3L-00008 NaN NaN 58 \n", "C3L-00032 NaN NaN 50 \n", "C3L-00084 NaN NaN 74 \n", "C3L-00090 NaN NaN 75 \n", "\n", "Name sex race ethnicity \\\n", "Database_ID \n", "Patient_ID \n", "C3L-00006 Female White Not Hispanic or Latino \n", "C3L-00008 Female White Not Hispanic or Latino \n", "C3L-00032 Female White Not Hispanic or Latino \n", "C3L-00084 Female White Not Hispanic or Latino \n", "C3L-00090 Female White Not Hispanic or Latino \n", "\n", "Name ethnicity_race_ancestry_identified ... ZXDB_bcm_transcriptomics \\\n", "Database_ID ... ENSG00000198455.4 \n", "Patient_ID ... \n", "C3L-00006 White ... 10.17 \n", "C3L-00008 White ... 9.79 \n", "C3L-00032 White ... 9.43 \n", "C3L-00084 White ... 9.23 \n", "C3L-00090 White ... 
9.69 \n", "\n", "Name ZXDC_bcm_transcriptomics ZYG11A_bcm_transcriptomics \\\n", "Database_ID ENSG00000070476.15 ENSG00000203995.10 \n", "Patient_ID \n", "C3L-00006 10.61 5.54 \n", "C3L-00008 10.48 7.79 \n", "C3L-00032 9.97 6.48 \n", "C3L-00084 10.37 7.47 \n", "C3L-00090 9.64 7.60 \n", "\n", "Name ZYG11AP1_bcm_transcriptomics ZYG11B_bcm_transcriptomics \\\n", "Database_ID ENSG00000232242.2 ENSG00000162378.13 \n", "Patient_ID \n", "C3L-00006 0.0 11.85 \n", "C3L-00008 0.0 12.28 \n", "C3L-00032 0.0 11.72 \n", "C3L-00084 0.0 11.86 \n", "C3L-00090 0.0 11.98 \n", "\n", "Name ZYX_bcm_transcriptomics ZYXP1_bcm_transcriptomics \\\n", "Database_ID ENSG00000159840.16 ENSG00000274572.1 \n", "Patient_ID \n", "C3L-00006 10.60 0.0 \n", "C3L-00008 11.28 0.0 \n", "C3L-00032 10.37 0.0 \n", "C3L-00084 10.13 0.0 \n", "C3L-00090 10.31 0.0 \n", "\n", "Name ZZEF1_bcm_transcriptomics hsa-mir-1253_bcm_transcriptomics \\\n", "Database_ID ENSG00000074755.15 ENSG00000272920.1 \n", "Patient_ID \n", "C3L-00006 11.87 0.0 \n", "C3L-00008 11.93 0.0 \n", "C3L-00032 11.70 0.0 \n", "C3L-00084 11.19 0.0 \n", "C3L-00090 11.45 0.0 \n", "\n", "Name hsa-mir-423_bcm_transcriptomics \n", "Database_ID ENSG00000266919.3 \n", "Patient_ID \n", "C3L-00006 0.0 \n", "C3L-00008 0.0 \n", "C3L-00032 0.0 \n", "C3L-00084 0.0 \n", "C3L-00090 0.0 \n", "\n", "[5 rows x 59410 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Join a metadata dataframe with an -omics dataframe\n", "clin_and_tran = en.multi_join({\"mssm clinical\":'', \"bcm transcriptomics\":''})\n", "clin_and_tran.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Joining only specific columns:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Name age Overall survival, days ZYX_bcm_transcriptomics \\\n", "Database_ID ENSG00000159840.16 \n", "Patient_ID \n", "C3L-00006 64 737.0 10.60 \n", "C3L-00008 58 898.0 11.28 \n", "C3L-00032 50 1710.0 10.37 \n", "C3L-00084 74 335.0 10.13 \n", "C3L-00090 75 1281.0 10.31 \n", "\n", "Name ZZEF1_bcm_transcriptomics \n", "Database_ID ENSG00000074755.15 \n", "Patient_ID \n", "C3L-00006 11.87 \n", "C3L-00008 11.93 \n", "C3L-00032 11.70 \n", "C3L-00084 11.19 \n", "C3L-00090 11.45 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clin_and_tran = en.multi_join({\"mssm clinical\": [\"age\", \"Overall survival, days\"], \"bcm transcriptomics\": [\"ZYX\", 'ZZEF1']})\n", "clin_and_tran.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join metadata to metadata\n", "\n", "Two metadata dataframes (e.g. clinical or derived_molecular) can be joined together in the same way. Passing a column name as the value selects only that column, whereas passing an empty string `''` or an empty list `[]` selects the entire dataframe. The cell below passes empty values for both of its dataframes, mssm clinical and bcm transcriptomics, so every column from each is included." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Name tumor_code discovery_study type_of_analyzed_samples_mssm_clinical \\\n", "Database_ID \n", "Patient_ID \n", "C3L-00006 UCEC Yes Tumor_and_Normal \n", "C3L-00008 UCEC Yes Tumor \n", "C3L-00032 UCEC Yes Tumor \n", "C3L-00084 UCEC Yes Tumor \n", "C3L-00090 UCEC Yes Tumor \n", "\n", "Name confirmatory_study type_of_analyzed_samples_mssm_clinical age \\\n", "Database_ID \n", "Patient_ID \n", "C3L-00006 NaN NaN 64 \n", "C3L-00008 NaN NaN 58 \n", "C3L-00032 NaN NaN 50 \n", "C3L-00084 NaN NaN 74 \n", "C3L-00090 NaN NaN 75 \n", "\n", "Name sex race ethnicity \\\n", "Database_ID \n", "Patient_ID \n", "C3L-00006 Female White Not Hispanic or Latino \n", "C3L-00008 Female White Not Hispanic or Latino \n", "C3L-00032 Female White Not Hispanic or Latino \n", "C3L-00084 Female White Not Hispanic or Latino \n", "C3L-00090 Female White Not Hispanic or Latino \n", "\n", "Name ethnicity_race_ancestry_identified ... ZXDB_bcm_transcriptomics \\\n", "Database_ID ... ENSG00000198455.4 \n", "Patient_ID ... \n", "C3L-00006 White ... 10.17 \n", "C3L-00008 White ... 9.79 \n", "C3L-00032 White ... 9.43 \n", "C3L-00084 White ... 9.23 \n", "C3L-00090 White ... 
9.69 \n", "\n", "Name ZXDC_bcm_transcriptomics ZYG11A_bcm_transcriptomics \\\n", "Database_ID ENSG00000070476.15 ENSG00000203995.10 \n", "Patient_ID \n", "C3L-00006 10.61 5.54 \n", "C3L-00008 10.48 7.79 \n", "C3L-00032 9.97 6.48 \n", "C3L-00084 10.37 7.47 \n", "C3L-00090 9.64 7.60 \n", "\n", "Name ZYG11AP1_bcm_transcriptomics ZYG11B_bcm_transcriptomics \\\n", "Database_ID ENSG00000232242.2 ENSG00000162378.13 \n", "Patient_ID \n", "C3L-00006 0.0 11.85 \n", "C3L-00008 0.0 12.28 \n", "C3L-00032 0.0 11.72 \n", "C3L-00084 0.0 11.86 \n", "C3L-00090 0.0 11.98 \n", "\n", "Name ZYX_bcm_transcriptomics ZYXP1_bcm_transcriptomics \\\n", "Database_ID ENSG00000159840.16 ENSG00000274572.1 \n", "Patient_ID \n", "C3L-00006 10.60 0.0 \n", "C3L-00008 11.28 0.0 \n", "C3L-00032 10.37 0.0 \n", "C3L-00084 10.13 0.0 \n", "C3L-00090 10.31 0.0 \n", "\n", "Name ZZEF1_bcm_transcriptomics hsa-mir-1253_bcm_transcriptomics \\\n", "Database_ID ENSG00000074755.15 ENSG00000272920.1 \n", "Patient_ID \n", "C3L-00006 11.87 0.0 \n", "C3L-00008 11.93 0.0 \n", "C3L-00032 11.70 0.0 \n", "C3L-00084 11.19 0.0 \n", "C3L-00090 11.45 0.0 \n", "\n", "Name hsa-mir-423_bcm_transcriptomics \n", "Database_ID ENSG00000266919.3 \n", "Patient_ID \n", "C3L-00006 0.0 \n", "C3L-00008 0.0 \n", "C3L-00032 0.0 \n", "C3L-00084 0.0 \n", "C3L-00090 0.0 \n", "\n", "[5 rows x 59410 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clin_and_tran = en.multi_join({\n", " \"mssm clinical\": \"\",\n", " \"bcm transcriptomics\": '' # Note that by using an empty string or list as the value, we join the entire dataframe\n", "})\n", "\n", "clin_and_tran.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join many datatypes together\n", "\n", "If you need data from three or more dataframes, they can all simply be added to the joining dictionary. 
There is no fixed limit on the number of dataframes that the joining dictionary parameter for `multi_join` can take." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameARF5_umich_proteomicsA1BG_bcm_transcriptomicstumor_codediscovery_studytype_of_analyzed_samples_mssm_clinicalconfirmatory_studytype_of_analyzed_samples_mssm_clinicalagesexrace...additional_treatment_immuno_for_new_tumornumber_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regionalnumber_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasisRecurrence-free survival, daysRecurrence-free survival from collection, daysRecurrence status (1, yes; 0, no)Overall survival, daysOverall survival from collection, daysSurvival status (1, dead; 0, alive)Sample_Status
Database_IDENSP00000000233.5ENSG00000121410.12...
Patient_ID
C3L-00006-0.0565132.54UCECYesTumor_and_NormalNaNNaN64FemaleWhite...NaNNaNNaNNaNNaN0.0737.0737.00.0Tumor
C3L-000080.5499594.40UCECYesTumorNaNNaN58FemaleWhite...NaNNaNNaNNaNNaN0.0898.0898.00.0Tumor
C3L-000320.0886814.83UCECYesTumorNaNNaN50FemaleWhite...NaNNaNNaNNaNNaN0.01710.01710.00.0Tumor
C3L-00084-0.8465554.73UCECYesTumorNaNNaN74FemaleWhite...NaNNaNNaNNaNNaN0.0335.0335.00.0Tumor
C3L-000900.5390194.14UCECYesTumorNaNNaN75FemaleWhite...NoNaNNaN50.056.01.01281.01287.01.0Tumor
\n", "

5 rows × 127 columns

\n", "
" ], "text/plain": [ "Name ARF5_umich_proteomics A1BG_bcm_transcriptomics tumor_code \\\n", "Database_ID ENSP00000000233.5 ENSG00000121410.12 \n", "Patient_ID \n", "C3L-00006 -0.056513 2.54 UCEC \n", "C3L-00008 0.549959 4.40 UCEC \n", "C3L-00032 0.088681 4.83 UCEC \n", "C3L-00084 -0.846555 4.73 UCEC \n", "C3L-00090 0.539019 4.14 UCEC \n", "\n", "Name discovery_study type_of_analyzed_samples_mssm_clinical \\\n", "Database_ID \n", "Patient_ID \n", "C3L-00006 Yes Tumor_and_Normal \n", "C3L-00008 Yes Tumor \n", "C3L-00032 Yes Tumor \n", "C3L-00084 Yes Tumor \n", "C3L-00090 Yes Tumor \n", "\n", "Name confirmatory_study type_of_analyzed_samples_mssm_clinical age \\\n", "Database_ID \n", "Patient_ID \n", "C3L-00006 NaN NaN 64 \n", "C3L-00008 NaN NaN 58 \n", "C3L-00032 NaN NaN 50 \n", "C3L-00084 NaN NaN 74 \n", "C3L-00090 NaN NaN 75 \n", "\n", "Name sex race ... additional_treatment_immuno_for_new_tumor \\\n", "Database_ID ... \n", "Patient_ID ... \n", "C3L-00006 Female White ... NaN \n", "C3L-00008 Female White ... NaN \n", "C3L-00032 Female White ... NaN \n", "C3L-00084 Female White ... NaN \n", "C3L-00090 Female White ... 
No \n", "\n", "Name number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional \\\n", "Database_ID \n", "Patient_ID \n", "C3L-00006 NaN \n", "C3L-00008 NaN \n", "C3L-00032 NaN \n", "C3L-00084 NaN \n", "C3L-00090 NaN \n", "\n", "Name number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis \\\n", "Database_ID \n", "Patient_ID \n", "C3L-00006 NaN \n", "C3L-00008 NaN \n", "C3L-00032 NaN \n", "C3L-00084 NaN \n", "C3L-00090 NaN \n", "\n", "Name Recurrence-free survival, days \\\n", "Database_ID \n", "Patient_ID \n", "C3L-00006 NaN \n", "C3L-00008 NaN \n", "C3L-00032 NaN \n", "C3L-00084 NaN \n", "C3L-00090 50.0 \n", "\n", "Name Recurrence-free survival from collection, days \\\n", "Database_ID \n", "Patient_ID \n", "C3L-00006 NaN \n", "C3L-00008 NaN \n", "C3L-00032 NaN \n", "C3L-00084 NaN \n", "C3L-00090 56.0 \n", "\n", "Name Recurrence status (1, yes; 0, no) Overall survival, days \\\n", "Database_ID \n", "Patient_ID \n", "C3L-00006 0.0 737.0 \n", "C3L-00008 0.0 898.0 \n", "C3L-00032 0.0 1710.0 \n", "C3L-00084 0.0 335.0 \n", "C3L-00090 1.0 1281.0 \n", "\n", "Name Overall survival from collection, days \\\n", "Database_ID \n", "Patient_ID \n", "C3L-00006 737.0 \n", "C3L-00008 898.0 \n", "C3L-00032 1710.0 \n", "C3L-00084 335.0 \n", "C3L-00090 1287.0 \n", "\n", "Name Survival status (1, dead; 0, alive) Sample_Status \n", "Database_ID \n", "Patient_ID \n", "C3L-00006 0.0 Tumor \n", "C3L-00008 0.0 Tumor \n", "C3L-00032 0.0 Tumor \n", "C3L-00084 0.0 Tumor \n", "C3L-00090 1.0 Tumor \n", "\n", "[5 rows x 127 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joining_dictionary = {\"umich proteomics\": \"ARF5\", \"bcm transcriptomics\": \"A1BG\", \"mssm clinical\": [], \"washu somatic_mutation\": []}\n", "en.multi_join(joining_dictionary).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"`multi_join` does not have to join different dataframes. It is also useful when you only want a few columns from a single dataframe." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Nametype_of_analyzed_samples_mssm_clinicaltype_of_analyzed_samples_mssm_clinicaldiscovery_study
Patient_ID
C3L-00006Tumor_and_NormalNaNYes
C3L-00008TumorNaNYes
C3L-00032TumorNaNYes
C3L-00084TumorNaNYes
C3L-00090TumorNaNYes
\n", "
" ], "text/plain": [ "Name type_of_analyzed_samples_mssm_clinical \\\n", "Patient_ID \n", "C3L-00006 Tumor_and_Normal \n", "C3L-00008 Tumor \n", "C3L-00032 Tumor \n", "C3L-00084 Tumor \n", "C3L-00090 Tumor \n", "\n", "Name type_of_analyzed_samples_mssm_clinical discovery_study \n", "Patient_ID \n", "C3L-00006 NaN Yes \n", "C3L-00008 NaN Yes \n", "C3L-00032 NaN Yes \n", "C3L-00084 NaN Yes \n", "C3L-00090 NaN Yes " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_type_and_discovery = en.multi_join({\"mssm clinical\": ['type_of_analyzed_samples', 'discovery_study']})\n", "sample_type_and_discovery.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join omics to mutations\n", "\n", "The `join_omics_to_mutations` function joins an -omics dataframe with the mutation data for a specified gene or genes. Because there might be multiple mutations for one gene in a single sample, the mutation type and location data are returned in lists by default, even if there is only one mutation.\n", "\n", "For samples with no mutation for a particular gene, the mutation list will contain either \"Wildtype_Tumor\" or \"Wildtype_Normal\", depending on whether the sample is a tumor or normal one, and the location list will contain \"No_mutation\". The mutation status column will contain either \"Single_mutation\", \"Multiple_mutation\", \"Wildtype_Tumor\", or \"Wildtype_Normal\", which aids with parsing.\n", "\n", "Let's consider an example:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 325)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameARF5_umich_proteomicsM6PR_umich_proteomicsSHANK2_MutationSHANK2_LocationSHANK2_Mutation_StatusSample_Status
Patient_ID
C3L-00006-0.0565130.016557[Missense_Mutation][p.S1692R]Single_mutationTumor
C3L-000080.549959-0.206129[Wildtype_Tumor][No_mutation]Wildtype_TumorTumor
C3L-000320.088681-0.154447[Wildtype_Tumor][No_mutation]Wildtype_TumorTumor
C3L-00084-0.8465550.027740[Wildtype_Tumor][No_mutation]Wildtype_TumorTumor
C3L-000900.5390190.956619[Wildtype_Tumor][No_mutation]Wildtype_TumorTumor
C3L-00098-0.0173700.125574[Wildtype_Tumor][No_mutation]Wildtype_TumorTumor
C3L-001360.2303470.575436[Wildtype_Tumor][No_mutation]Wildtype_TumorTumor
C3L-001370.1919150.113577[Wildtype_Tumor][No_mutation]Wildtype_TumorTumor
C3L-00139-0.4101420.381355[Wildtype_Tumor][No_mutation]Wildtype_TumorTumor
C3L-00143-0.1705141.008577[Wildtype_Tumor][No_mutation]Wildtype_TumorTumor
\n", "
" ], "text/plain": [ "Name ARF5_umich_proteomics M6PR_umich_proteomics SHANK2_Mutation \\\n", "Patient_ID \n", "C3L-00006 -0.056513 0.016557 [Missense_Mutation] \n", "C3L-00008 0.549959 -0.206129 [Wildtype_Tumor] \n", "C3L-00032 0.088681 -0.154447 [Wildtype_Tumor] \n", "C3L-00084 -0.846555 0.027740 [Wildtype_Tumor] \n", "C3L-00090 0.539019 0.956619 [Wildtype_Tumor] \n", "C3L-00098 -0.017370 0.125574 [Wildtype_Tumor] \n", "C3L-00136 0.230347 0.575436 [Wildtype_Tumor] \n", "C3L-00137 0.191915 0.113577 [Wildtype_Tumor] \n", "C3L-00139 -0.410142 0.381355 [Wildtype_Tumor] \n", "C3L-00143 -0.170514 1.008577 [Wildtype_Tumor] \n", "\n", "Name SHANK2_Location SHANK2_Mutation_Status Sample_Status \n", "Patient_ID \n", "C3L-00006 [p.S1692R] Single_mutation Tumor \n", "C3L-00008 [No_mutation] Wildtype_Tumor Tumor \n", "C3L-00032 [No_mutation] Wildtype_Tumor Tumor \n", "C3L-00084 [No_mutation] Wildtype_Tumor Tumor \n", "C3L-00090 [No_mutation] Wildtype_Tumor Tumor \n", "C3L-00098 [No_mutation] Wildtype_Tumor Tumor \n", "C3L-00136 [No_mutation] Wildtype_Tumor Tumor \n", "C3L-00137 [No_mutation] Wildtype_Tumor Tumor \n", "C3L-00139 [No_mutation] Wildtype_Tumor Tumor \n", "C3L-00143 [No_mutation] Wildtype_Tumor Tumor " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "somatic_mutations = en.get_somatic_mutation('harmonized')\n", "selected_prot_and_som_mut = en.join_omics_to_mutations(\n", " omics_name = \"proteomics\",\n", " mutations_genes = \"SHANK2\",\n", " omics_genes = [\"ARF5\", \"M6PR\"],\n", " omics_source = 'umich',\n", " mutations_source = 'harmonized')\n", "selected_prot_and_som_mut.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the code above, we're joining proteomics data and somatic mutation data. The gene for the mutation data is \"SHANK2\" and the genes for the proteomics data are \"ARF5\" and \"M6PR\"." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering multiple mutations\n", "\n", "If a sample has multiple mutations in a gene, you can use the `mutations_filter` parameter of the `multi_join` function to filter them. The parameter lets you specify certain mutation types or locations to prioritize, and the function applies a default sorting hierarchy to all other mutations.\n", "\n", "Here are some examples:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 525)\n", "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\\Users\\sabme\\AppData\\Local\\Temp\\ipykernel_2264\\3972322211.py, line 1)\n", "cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 525)\n", "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\\Users\\sabme\\AppData\\Local\\Temp\\ipykernel_2264\\3972322211.py, line 5)\n", "cptac warning: Filter value p.R130Q does not exist in the mutations data for the SHANK2 gene, though it exists for other genes. (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 525)\n", "cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. 
(C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 525)\n", "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\\Users\\sabme\\AppData\\Local\\Temp\\ipykernel_2264\\3972322211.py, line 9)\n" ] } ], "source": [ "SHANK2_default_filter = en.multi_join({\"umich proteomics\": [\"ARF5\", \"M6PR\"],\n", " \"harmonized somatic_mutation\": \"SHANK2\"},\n", " mutations_filter=[])\n", "\n", "SHANK2_simple_filter = en.multi_join({\"umich proteomics\": [\"ARF5\", \"M6PR\"],\n", " \"harmonized somatic_mutation\": \"SHANK2\"},\n", " mutations_filter=[\"Missense_Mutation\"])\n", "\n", "SHANK2_complex_filter = en.multi_join({\"umich proteomics\": [\"ARF5\", \"M6PR\"],\n", " \"harmonized somatic_mutation\": \"SHANK2\"},\n", " mutations_filter=[\"p.R130Q\", \"Nonsense_Mutation\"])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `mutations_filter` parameter allows you to specify the mutations you're interested in. If you don't provide any specific mutations (i.e., you pass an empty list), it will use a default hierarchy, choosing truncation mutations over missense mutations, and silent mutations last of all. If there are multiple mutations of the same type, it chooses the mutation occurring earlier in the sequence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join metadata to mutations\n", "\n", "Joining metadata to mutation data follows the same process as joining other datatypes. You can also use the `mutations_filter` parameter to filter multiple mutations.\n", "\n", "For instance, you can use the `get_clinical` function to retrieve the clinical data, as shown below:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Nametumor_codediscovery_studytype_of_analyzed_samplesconfirmatory_studytype_of_analyzed_samplesagesexraceethnicityethnicity_race_ancestry_identified...additional_treatment_pharmaceutical_therapy_for_new_tumoradditional_treatment_immuno_for_new_tumornumber_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regionalnumber_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasisRecurrence-free survival, daysRecurrence-free survival from collection, daysRecurrence status (1, yes; 0, no)Overall survival, daysOverall survival from collection, daysSurvival status (1, dead; 0, alive)
Patient_ID
C3L-00006UCECYesTumor_and_NormalNaNNaN64FemaleWhiteNot Hispanic or LatinoWhite...NaNNaNNaNNaNNaNNaN0737.0737.00.0
C3L-00008UCECYesTumorNaNNaN58FemaleWhiteNot Hispanic or LatinoWhite...NaNNaNNaNNaNNaNNaN0898.0898.00.0
C3L-00032UCECYesTumorNaNNaN50FemaleWhiteNot Hispanic or LatinoWhite...NaNNaNNaNNaNNaNNaN01710.01710.00.0
C3L-00084UCECYesTumorNaNNaN74FemaleWhiteNot Hispanic or LatinoWhite...NaNNaNNaNNaNNaNNaN0335.0335.00.0
C3L-00090UCECYesTumorNaNNaN75FemaleWhiteNot Hispanic or LatinoWhite...YesNoNaNNaN50.056.011281.01287.01.0
..................................................................
C3N-01520UCECYesTumorNaNNaN69FemaleUnknownUnknownSlavonic...NaNNaNNaNNaNNaNNaN0287.0278.01.0
C3N-01521UCECYesTumorNaNNaN75FemaleUnknownUnknownSlavonic...NaNNaNNaNNaNNaNNaN0728.0681.00.0
C3N-01537UCECYesTumorNaNNaN74FemaleUnknownUnknownSlavonic...YesNo62.0NaN58.031.01698.0671.00.0
C3N-01802UCECYesTumorNaNNaN85FemaleBlack or African AmericanNot Hispanic or LatinoAmerican...NoNoNaNNaN598.0563.01775.0740.00.0
C3N-01825UCECYesTumorNaNNaN70FemaleUnknownUnknownSlavonic...NaNNaNNaNNaNNaNNaN0687.0661.00.0
\n", "

103 rows × 124 columns

\n", "
" ], "text/plain": [ "Name tumor_code discovery_study type_of_analyzed_samples \\\n", "Patient_ID \n", "C3L-00006 UCEC Yes Tumor_and_Normal \n", "C3L-00008 UCEC Yes Tumor \n", "C3L-00032 UCEC Yes Tumor \n", "C3L-00084 UCEC Yes Tumor \n", "C3L-00090 UCEC Yes Tumor \n", "... ... ... ... \n", "C3N-01520 UCEC Yes Tumor \n", "C3N-01521 UCEC Yes Tumor \n", "C3N-01537 UCEC Yes Tumor \n", "C3N-01802 UCEC Yes Tumor \n", "C3N-01825 UCEC Yes Tumor \n", "\n", "Name confirmatory_study type_of_analyzed_samples age sex \\\n", "Patient_ID \n", "C3L-00006 NaN NaN 64 Female \n", "C3L-00008 NaN NaN 58 Female \n", "C3L-00032 NaN NaN 50 Female \n", "C3L-00084 NaN NaN 74 Female \n", "C3L-00090 NaN NaN 75 Female \n", "... ... ... .. ... \n", "C3N-01520 NaN NaN 69 Female \n", "C3N-01521 NaN NaN 75 Female \n", "C3N-01537 NaN NaN 74 Female \n", "C3N-01802 NaN NaN 85 Female \n", "C3N-01825 NaN NaN 70 Female \n", "\n", "Name race ethnicity \\\n", "Patient_ID \n", "C3L-00006 White Not Hispanic or Latino \n", "C3L-00008 White Not Hispanic or Latino \n", "C3L-00032 White Not Hispanic or Latino \n", "C3L-00084 White Not Hispanic or Latino \n", "C3L-00090 White Not Hispanic or Latino \n", "... ... ... \n", "C3N-01520 Unknown Unknown \n", "C3N-01521 Unknown Unknown \n", "C3N-01537 Unknown Unknown \n", "C3N-01802 Black or African American Not Hispanic or Latino \n", "C3N-01825 Unknown Unknown \n", "\n", "Name ethnicity_race_ancestry_identified ... \\\n", "Patient_ID ... \n", "C3L-00006 White ... \n", "C3L-00008 White ... \n", "C3L-00032 White ... \n", "C3L-00084 White ... \n", "C3L-00090 White ... \n", "... ... ... \n", "C3N-01520 Slavonic ... \n", "C3N-01521 Slavonic ... \n", "C3N-01537 Slavonic ... \n", "C3N-01802 American ... \n", "C3N-01825 Slavonic ... \n", "\n", "Name additional_treatment_pharmaceutical_therapy_for_new_tumor \\\n", "Patient_ID \n", "C3L-00006 NaN \n", "C3L-00008 NaN \n", "C3L-00032 NaN \n", "C3L-00084 NaN \n", "C3L-00090 Yes \n", "... ... 
\n", "C3N-01520 NaN \n", "C3N-01521 NaN \n", "C3N-01537 Yes \n", "C3N-01802 No \n", "C3N-01825 NaN \n", "\n", "Name additional_treatment_immuno_for_new_tumor \\\n", "Patient_ID \n", "C3L-00006 NaN \n", "C3L-00008 NaN \n", "C3L-00032 NaN \n", "C3L-00084 NaN \n", "C3L-00090 No \n", "... ... \n", "C3N-01520 NaN \n", "C3N-01521 NaN \n", "C3N-01537 No \n", "C3N-01802 No \n", "C3N-01825 NaN \n", "\n", "Name number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional \\\n", "Patient_ID \n", "C3L-00006 NaN \n", "C3L-00008 NaN \n", "C3L-00032 NaN \n", "C3L-00084 NaN \n", "C3L-00090 NaN \n", "... ... \n", "C3N-01520 NaN \n", "C3N-01521 NaN \n", "C3N-01537 62.0 \n", "C3N-01802 NaN \n", "C3N-01825 NaN \n", "\n", "Name number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis \\\n", "Patient_ID \n", "C3L-00006 NaN \n", "C3L-00008 NaN \n", "C3L-00032 NaN \n", "C3L-00084 NaN \n", "C3L-00090 NaN \n", "... ... \n", "C3N-01520 NaN \n", "C3N-01521 NaN \n", "C3N-01537 NaN \n", "C3N-01802 NaN \n", "C3N-01825 NaN \n", "\n", "Name Recurrence-free survival, days \\\n", "Patient_ID \n", "C3L-00006 NaN \n", "C3L-00008 NaN \n", "C3L-00032 NaN \n", "C3L-00084 NaN \n", "C3L-00090 50.0 \n", "... ... \n", "C3N-01520 NaN \n", "C3N-01521 NaN \n", "C3N-01537 58.0 \n", "C3N-01802 598.0 \n", "C3N-01825 NaN \n", "\n", "Name Recurrence-free survival from collection, days \\\n", "Patient_ID \n", "C3L-00006 NaN \n", "C3L-00008 NaN \n", "C3L-00032 NaN \n", "C3L-00084 NaN \n", "C3L-00090 56.0 \n", "... ... \n", "C3N-01520 NaN \n", "C3N-01521 NaN \n", "C3N-01537 31.0 \n", "C3N-01802 563.0 \n", "C3N-01825 NaN \n", "\n", "Name Recurrence status (1, yes; 0, no) Overall survival, days \\\n", "Patient_ID \n", "C3L-00006 0 737.0 \n", "C3L-00008 0 898.0 \n", "C3L-00032 0 1710.0 \n", "C3L-00084 0 335.0 \n", "C3L-00090 1 1281.0 \n", "... ... ... 
\n", "C3N-01520 0 287.0 \n", "C3N-01521 0 728.0 \n", "C3N-01537 1 698.0 \n", "C3N-01802 1 775.0 \n", "C3N-01825 0 687.0 \n", "\n", "Name Overall survival from collection, days \\\n", "Patient_ID \n", "C3L-00006 737.0 \n", "C3L-00008 898.0 \n", "C3L-00032 1710.0 \n", "C3L-00084 335.0 \n", "C3L-00090 1287.0 \n", "... ... \n", "C3N-01520 278.0 \n", "C3N-01521 681.0 \n", "C3N-01537 671.0 \n", "C3N-01802 740.0 \n", "C3N-01825 661.0 \n", "\n", "Name Survival status (1, dead; 0, alive) \n", "Patient_ID \n", "C3L-00006 0.0 \n", "C3L-00008 0.0 \n", "C3L-00032 0.0 \n", "C3L-00084 0.0 \n", "C3L-00090 1.0 \n", "... ... \n", "C3N-01520 1.0 \n", "C3N-01521 0.0 \n", "C3N-01537 0.0 \n", "C3N-01802 0.0 \n", "C3N-01825 0.0 \n", "\n", "[103 rows x 124 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "en.get_clinical('mssm')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 525)\n", "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 92 samples for the SHANK2 gene (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 437)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameagesexraceSHANK2_MutationSHANK2_LocationSHANK2_Mutation_StatusSample_Status
Patient_ID
C3L-0000664FemaleWhiteMissense_Mutationp.S1692RSingle_mutationTumor
C3L-0000858FemaleWhiteWildtype_TumorNo_mutationWildtype_TumorTumor
C3L-0003250FemaleWhiteWildtype_TumorNo_mutationWildtype_TumorTumor
C3L-0008474FemaleWhiteWildtype_TumorNo_mutationWildtype_TumorTumor
C3L-0009075FemaleWhiteWildtype_TumorNo_mutationWildtype_TumorTumor
........................
C3N-0152069FemaleUnknownMissense_Mutationp.P1586SSingle_mutationTumor
C3N-0152175FemaleUnknownWildtype_TumorNo_mutationWildtype_TumorTumor
C3N-0153774FemaleUnknownWildtype_TumorNo_mutationWildtype_TumorTumor
C3N-0180285FemaleBlack or African AmericanWildtype_TumorNo_mutationWildtype_TumorTumor
C3N-0182570FemaleUnknownWildtype_TumorNo_mutationWildtype_TumorTumor
\n", "

103 rows × 7 columns

\n", "
" ], "text/plain": [ "Name age sex race SHANK2_Mutation \\\n", "Patient_ID \n", "C3L-00006 64 Female White Missense_Mutation \n", "C3L-00008 58 Female White Wildtype_Tumor \n", "C3L-00032 50 Female White Wildtype_Tumor \n", "C3L-00084 74 Female White Wildtype_Tumor \n", "C3L-00090 75 Female White Wildtype_Tumor \n", "... .. ... ... ... \n", "C3N-01520 69 Female Unknown Missense_Mutation \n", "C3N-01521 75 Female Unknown Wildtype_Tumor \n", "C3N-01537 74 Female Unknown Wildtype_Tumor \n", "C3N-01802 85 Female Black or African American Wildtype_Tumor \n", "C3N-01825 70 Female Unknown Wildtype_Tumor \n", "\n", "Name SHANK2_Location SHANK2_Mutation_Status Sample_Status \n", "Patient_ID \n", "C3L-00006 p.S1692R Single_mutation Tumor \n", "C3L-00008 No_mutation Wildtype_Tumor Tumor \n", "C3L-00032 No_mutation Wildtype_Tumor Tumor \n", "C3L-00084 No_mutation Wildtype_Tumor Tumor \n", "C3L-00090 No_mutation Wildtype_Tumor Tumor \n", "... ... ... ... \n", "C3N-01520 p.P1586S Single_mutation Tumor \n", "C3N-01521 No_mutation Wildtype_Tumor Tumor \n", "C3N-01537 No_mutation Wildtype_Tumor Tumor \n", "C3N-01802 No_mutation Wildtype_Tumor Tumor \n", "C3N-01825 No_mutation Wildtype_Tumor Tumor \n", "\n", "[103 rows x 7 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "en.join_metadata_to_mutations(\n", " metadata_name=\"clinical\",\n", " metadata_source=\"mssm\",\n", " metadata_cols=[\"age\", \"sex\", \"race\"],\n", " mutations_source=\"harmonized\",\n", " mutations_genes=\"SHANK2\",\n", " mutations_filter=[\"Missense_Mutation\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This command joins the age, sex, and race metadata with the mutation data for the SHANK2 gene. 
When a sample has multiple SHANK2 mutations, the filter prioritizes a Missense_Mutation, and the default hierarchy is applied to all other mutations.\n", "\n", "If you need to join metadata to a larger number of mutation genes, the `multi_join` function can be useful. 
Below, we join the same metadata with the mutation data for the SHANK2, PTEN, and TP53 genes, without filtering mutations. Remember that when the `mutations_filter` parameter is left at its default, `multi_join` behaves the same as the `join_metadata_to_mutations` function: it returns all mutations as lists in the output dataframe, regardless of the number of mutations for a given sample." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 92 samples for the SHANK2 gene, 28 samples for the PTEN gene, 80 samples for the TP53 gene (C:\\Users\\sabme\\AppData\\Local\\Temp\\ipykernel_2264\\3189298179.py, line 1)\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>Name</th><th>age</th><th>sex</th><th>race</th><th>SHANK2_Mutation</th><th>SHANK2_Location</th><th>SHANK2_Mutation_Status</th><th>PTEN_Mutation</th><th>PTEN_Location</th><th>PTEN_Mutation_Status</th><th>TP53_Mutation</th><th>TP53_Location</th><th>TP53_Mutation_Status</th><th>Sample_Status</th></tr>\n",
"    <tr><th>Patient_ID</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>C3L-00006</th><td>64</td><td>Female</td><td>White</td><td>[Missense_Mutation]</td><td>[p.S1692R]</td><td>Single_mutation</td><td>[Missense_Mutation, Nonsense_Mutation]</td><td>[p.R130Q, p.R233*]</td><td>Multiple_mutation</td><td>[Missense_Mutation]</td><td>[p.R248W]</td><td>Single_mutation</td><td>Tumor</td></tr>\n",
"    <tr><th>C3L-00008</th><td>58</td><td>Female</td><td>White</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>[Missense_Mutation]</td><td>[p.G127R]</td><td>Single_mutation</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>Tumor</td></tr>\n",
"    <tr><th>C3L-00032</th><td>50</td><td>Female</td><td>White</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>[Nonsense_Mutation]</td><td>[p.W111*]</td><td>Single_mutation</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>Tumor</td></tr>\n",
"    <tr><th>C3L-00084</th><td>74</td><td>Female</td><td>White</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>Tumor</td></tr>\n",
"    <tr><th>C3L-00090</th><td>75</td><td>Female</td><td>White</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>[Missense_Mutation]</td><td>[p.R130G]</td><td>Single_mutation</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>Tumor</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>C3N-01520</th><td>69</td><td>Female</td><td>Unknown</td><td>[Missense_Mutation]</td><td>[p.P1586S]</td><td>Single_mutation</td><td>[Frame_Shift_Del, Frame_Shift_Ins]</td><td>[p.N323fs, p.D268fs]</td><td>Multiple_mutation</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>Tumor</td></tr>\n",
"    <tr><th>C3N-01521</th><td>75</td><td>Female</td><td>Unknown</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>[Missense_Mutation]</td><td>[p.H193L]</td><td>Single_mutation</td><td>Tumor</td></tr>\n",
"    <tr><th>C3N-01537</th><td>74</td><td>Female</td><td>Unknown</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>Tumor</td></tr>\n",
"    <tr><th>C3N-01802</th><td>85</td><td>Female</td><td>Black or African American</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>[Missense_Mutation]</td><td>[p.P27S]</td><td>Single_mutation</td><td>Tumor</td></tr>\n",
"    <tr><th>C3N-01825</th><td>70</td><td>Female</td><td>Unknown</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>[Wildtype_Tumor]</td><td>[No_mutation]</td><td>Wildtype_Tumor</td><td>[Missense_Mutation]</td><td>[p.R175H]</td><td>Single_mutation</td><td>Tumor</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>103 rows × 13 columns</p>\n",
"</div>
" ], "text/plain": [ "Name age sex race SHANK2_Mutation \\\n", "Patient_ID \n", "C3L-00006 64 Female White [Missense_Mutation] \n", "C3L-00008 58 Female White [Wildtype_Tumor] \n", "C3L-00032 50 Female White [Wildtype_Tumor] \n", "C3L-00084 74 Female White [Wildtype_Tumor] \n", "C3L-00090 75 Female White [Wildtype_Tumor] \n", "... .. ... ... ... \n", "C3N-01520 69 Female Unknown [Missense_Mutation] \n", "C3N-01521 75 Female Unknown [Wildtype_Tumor] \n", "C3N-01537 74 Female Unknown [Wildtype_Tumor] \n", "C3N-01802 85 Female Black or African American [Wildtype_Tumor] \n", "C3N-01825 70 Female Unknown [Wildtype_Tumor] \n", "\n", "Name SHANK2_Location SHANK2_Mutation_Status \\\n", "Patient_ID \n", "C3L-00006 [p.S1692R] Single_mutation \n", "C3L-00008 [No_mutation] Wildtype_Tumor \n", "C3L-00032 [No_mutation] Wildtype_Tumor \n", "C3L-00084 [No_mutation] Wildtype_Tumor \n", "C3L-00090 [No_mutation] Wildtype_Tumor \n", "... ... ... \n", "C3N-01520 [p.P1586S] Single_mutation \n", "C3N-01521 [No_mutation] Wildtype_Tumor \n", "C3N-01537 [No_mutation] Wildtype_Tumor \n", "C3N-01802 [No_mutation] Wildtype_Tumor \n", "C3N-01825 [No_mutation] Wildtype_Tumor \n", "\n", "Name PTEN_Mutation PTEN_Location \\\n", "Patient_ID \n", "C3L-00006 [Missense_Mutation, Nonsense_Mutation] [p.R130Q, p.R233*] \n", "C3L-00008 [Missense_Mutation] [p.G127R] \n", "C3L-00032 [Nonsense_Mutation] [p.W111*] \n", "C3L-00084 [Wildtype_Tumor] [No_mutation] \n", "C3L-00090 [Missense_Mutation] [p.R130G] \n", "... ... ... 
\n", "C3N-01520 [Frame_Shift_Del, Frame_Shift_Ins] [p.N323fs, p.D268fs] \n", "C3N-01521 [Wildtype_Tumor] [No_mutation] \n", "C3N-01537 [Wildtype_Tumor] [No_mutation] \n", "C3N-01802 [Wildtype_Tumor] [No_mutation] \n", "C3N-01825 [Wildtype_Tumor] [No_mutation] \n", "\n", "Name PTEN_Mutation_Status TP53_Mutation TP53_Location \\\n", "Patient_ID \n", "C3L-00006 Multiple_mutation [Missense_Mutation] [p.R248W] \n", "C3L-00008 Single_mutation [Wildtype_Tumor] [No_mutation] \n", "C3L-00032 Single_mutation [Wildtype_Tumor] [No_mutation] \n", "C3L-00084 Wildtype_Tumor [Wildtype_Tumor] [No_mutation] \n", "C3L-00090 Single_mutation [Wildtype_Tumor] [No_mutation] \n", "... ... ... ... \n", "C3N-01520 Multiple_mutation [Wildtype_Tumor] [No_mutation] \n", "C3N-01521 Wildtype_Tumor [Missense_Mutation] [p.H193L] \n", "C3N-01537 Wildtype_Tumor [Wildtype_Tumor] [No_mutation] \n", "C3N-01802 Wildtype_Tumor [Missense_Mutation] [p.P27S] \n", "C3N-01825 Wildtype_Tumor [Missense_Mutation] [p.R175H] \n", "\n", "Name TP53_Mutation_Status Sample_Status \n", "Patient_ID \n", "C3L-00006 Single_mutation Tumor \n", "C3L-00008 Wildtype_Tumor Tumor \n", "C3L-00032 Wildtype_Tumor Tumor \n", "C3L-00084 Wildtype_Tumor Tumor \n", "C3L-00090 Wildtype_Tumor Tumor \n", "... ... ... 
\n", "C3N-01520 Wildtype_Tumor Tumor \n", "C3N-01521 Single_mutation Tumor \n", "C3N-01537 Wildtype_Tumor Tumor \n", "C3N-01802 Single_mutation Tumor \n", "C3N-01825 Single_mutation Tumor \n", "\n", "[103 rows x 13 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "en.multi_join({\"mssm clinical\": [\"age\", \"sex\", \"race\"],\n", " \"harmonized somatic_mutation\": [\"SHANK2\", \"PTEN\", \"TP53\"]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is an example of joining clinical data with mutations while using the `mutations_filter` parameter to choose which mutation is reported when a sample has several mutations in the same gene:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "jupyter": { "outputs_hidden": true } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 525)\n", "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 92 samples for the SHANK2 gene, 28 samples for the PTEN gene, 80 samples for the TP53 gene (C:\\Users\\sabme\\AppData\\Local\\Temp\\ipykernel_2264\\3101478147.py, line 1)\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>Name</th><th>age</th><th>sex</th><th>race</th><th>SHANK2_Mutation</th><th>SHANK2_Location</th><th>SHANK2_Mutation_Status</th><th>PTEN_Mutation</th><th>PTEN_Location</th><th>PTEN_Mutation_Status</th><th>TP53_Mutation</th><th>TP53_Location</th><th>TP53_Mutation_Status</th><th>Sample_Status</th></tr>\n",
"    <tr><th>Patient_ID</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>C3L-00006</th><td>64</td><td>Female</td><td>White</td><td>Missense_Mutation</td><td>p.S1692R</td><td>Single_mutation</td><td>Missense_Mutation</td><td>p.R130Q</td><td>Multiple_mutation</td><td>Missense_Mutation</td><td>p.R248W</td><td>Single_mutation</td><td>Tumor</td></tr>\n",
"    <tr><th>C3L-00008</th><td>58</td><td>Female</td><td>White</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Missense_Mutation</td><td>p.G127R</td><td>Single_mutation</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Tumor</td></tr>\n",
"    <tr><th>C3L-00032</th><td>50</td><td>Female</td><td>White</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Nonsense_Mutation</td><td>p.W111*</td><td>Single_mutation</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Tumor</td></tr>\n",
"    <tr><th>C3L-00084</th><td>74</td><td>Female</td><td>White</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Tumor</td></tr>\n",
"    <tr><th>C3L-00090</th><td>75</td><td>Female</td><td>White</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Missense_Mutation</td><td>p.R130G</td><td>Single_mutation</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Tumor</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>C3N-01520</th><td>69</td><td>Female</td><td>Unknown</td><td>Missense_Mutation</td><td>p.P1586S</td><td>Single_mutation</td><td>Frame_Shift_Ins</td><td>p.D268fs</td><td>Multiple_mutation</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Tumor</td></tr>\n",
"    <tr><th>C3N-01521</th><td>75</td><td>Female</td><td>Unknown</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Missense_Mutation</td><td>p.H193L</td><td>Single_mutation</td><td>Tumor</td></tr>\n",
"    <tr><th>C3N-01537</th><td>74</td><td>Female</td><td>Unknown</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Tumor</td></tr>\n",
"    <tr><th>C3N-01802</th><td>85</td><td>Female</td><td>Black or African American</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Missense_Mutation</td><td>p.P27S</td><td>Single_mutation</td><td>Tumor</td></tr>\n",
"    <tr><th>C3N-01825</th><td>70</td><td>Female</td><td>Unknown</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Wildtype_Tumor</td><td>No_mutation</td><td>Wildtype_Tumor</td><td>Missense_Mutation</td><td>p.R175H</td><td>Single_mutation</td><td>Tumor</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>103 rows × 13 columns</p>\n",
"</div>
" ], "text/plain": [ "Name age sex race SHANK2_Mutation \\\n", "Patient_ID \n", "C3L-00006 64 Female White Missense_Mutation \n", "C3L-00008 58 Female White Wildtype_Tumor \n", "C3L-00032 50 Female White Wildtype_Tumor \n", "C3L-00084 74 Female White Wildtype_Tumor \n", "C3L-00090 75 Female White Wildtype_Tumor \n", "... .. ... ... ... \n", "C3N-01520 69 Female Unknown Missense_Mutation \n", "C3N-01521 75 Female Unknown Wildtype_Tumor \n", "C3N-01537 74 Female Unknown Wildtype_Tumor \n", "C3N-01802 85 Female Black or African American Wildtype_Tumor \n", "C3N-01825 70 Female Unknown Wildtype_Tumor \n", "\n", "Name SHANK2_Location SHANK2_Mutation_Status PTEN_Mutation \\\n", "Patient_ID \n", "C3L-00006 p.S1692R Single_mutation Missense_Mutation \n", "C3L-00008 No_mutation Wildtype_Tumor Missense_Mutation \n", "C3L-00032 No_mutation Wildtype_Tumor Nonsense_Mutation \n", "C3L-00084 No_mutation Wildtype_Tumor Wildtype_Tumor \n", "C3L-00090 No_mutation Wildtype_Tumor Missense_Mutation \n", "... ... ... ... \n", "C3N-01520 p.P1586S Single_mutation Frame_Shift_Ins \n", "C3N-01521 No_mutation Wildtype_Tumor Wildtype_Tumor \n", "C3N-01537 No_mutation Wildtype_Tumor Wildtype_Tumor \n", "C3N-01802 No_mutation Wildtype_Tumor Wildtype_Tumor \n", "C3N-01825 No_mutation Wildtype_Tumor Wildtype_Tumor \n", "\n", "Name PTEN_Location PTEN_Mutation_Status TP53_Mutation \\\n", "Patient_ID \n", "C3L-00006 p.R130Q Multiple_mutation Missense_Mutation \n", "C3L-00008 p.G127R Single_mutation Wildtype_Tumor \n", "C3L-00032 p.W111* Single_mutation Wildtype_Tumor \n", "C3L-00084 No_mutation Wildtype_Tumor Wildtype_Tumor \n", "C3L-00090 p.R130G Single_mutation Wildtype_Tumor \n", "... ... ... ... 
\n", "C3N-01520 p.D268fs Multiple_mutation Wildtype_Tumor \n", "C3N-01521 No_mutation Wildtype_Tumor Missense_Mutation \n", "C3N-01537 No_mutation Wildtype_Tumor Wildtype_Tumor \n", "C3N-01802 No_mutation Wildtype_Tumor Missense_Mutation \n", "C3N-01825 No_mutation Wildtype_Tumor Missense_Mutation \n", "\n", "Name TP53_Location TP53_Mutation_Status Sample_Status \n", "Patient_ID \n", "C3L-00006 p.R248W Single_mutation Tumor \n", "C3L-00008 No_mutation Wildtype_Tumor Tumor \n", "C3L-00032 No_mutation Wildtype_Tumor Tumor \n", "C3L-00084 No_mutation Wildtype_Tumor Tumor \n", "C3L-00090 No_mutation Wildtype_Tumor Tumor \n", "... ... ... ... \n", "C3N-01520 No_mutation Wildtype_Tumor Tumor \n", "C3N-01521 p.H193L Single_mutation Tumor \n", "C3N-01537 No_mutation Wildtype_Tumor Tumor \n", "C3N-01802 p.P27S Single_mutation Tumor \n", "C3N-01825 p.R175H Single_mutation Tumor \n", "\n", "[103 rows x 13 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "survival_and_SHANK2 = en.multi_join({\"mssm clinical\": [\"age\", \"sex\", \"race\"],\n", " \"harmonized somatic_mutation\": [\"SHANK2\", \"PTEN\", \"TP53\"]}, \n", " mutations_filter=[\"Missense_Mutation\"])\n", "\n", "survival_and_SHANK2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that the `mutations_filter` parameter takes a list of mutation types to prioritize. It does not remove other mutation types from the table; it decides which single mutation is reported when a sample has several mutations in the same gene. In this example, \"Missense_Mutation\" is prioritized for all specified genes, so for sample C3L-00006 the PTEN column reports the missense mutation p.R130Q rather than the nonsense mutation p.R233*. Samples whose only mutations are of other types still show them, such as the PTEN nonsense mutation p.W111* in C3L-00032. Each cell now holds a single value instead of a list, while the `_Mutation_Status` columns still record whether the sample had multiple mutations." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exporting dataframes\n", "\n", "To export a dataframe to a file, call the dataframe's `to_csv` method, passing the path to save the file to and the value separator you want. Here we use a tab separator to write a TSV file:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Write the joined clinical and mutation table to a tab-separated file\n", "survival_and_SHANK2.to_csv(path_or_buf=\"clinical_and_mutations.tsv\", sep='\\t')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }