{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial 3: Joining dataframes with `cptac`\n", "\n", "In this tutorial, we provide several examples of how to use the built-in `cptac` functions for joining different dataframes.\n", "\n", "We use the endometrial carcinoma dataset throughout. First we import the package and create an endometrial data object, which we call `en`; listing its data sources then shows which data types are available and from which sources." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Data type Available sources\n", "0 CNV awg, washu\n", "1 CNV_gistic awgconf\n", "2 CNV_log2ratio awgconf\n", "3 acetylproteomics awg, awgconf, pdc\n", "4 acetylproteomics_gene awgconf\n", "5 circular_RNA awg, awgconf, bcm\n", "6 clinical awg, awgconf, mssm, pdc\n", "7 deconvolution_cibersort washu\n", "8 deconvolution_xcell washu\n", "9 derived_molecular awg\n", "10 experimental_design awg\n", "11 followup awg\n", "12 gene_fusion awgconf\n", "13 methylation awgconf\n", "14 miRNA awg, awgconf, washu\n", "15 phosphoproteomics awg, awgconf, pdc, umich\n", "16 phosphoproteomics_gene awgconf\n", "17 proteomics awg, awgconf, pdc, umich\n", "18 somatic_mutation awg, awgconf, harmonized, washu\n", "19 somatic_mutation_binary awg, awgconf\n", "20 targeted_phosphoproteomics awgconf\n", "21 targeted_proteomics awgconf\n", "22 transcriptomics awg, awgconf, bcm, broad, washu\n", "23 tumor_purity washu" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import cptac\n", "en = cptac.Ucec()\n", "en.list_data_sources()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## General format\n", "\n", "cptac has a helpful function called `multi_join`. It allows data from several different cptac dataframes to be joined at the same time.\n", "\n", "To use `multi_join`, you specify the dataframes you want to join by passing a dictionary of their names to the function call. The function will automatically check that the dataframes whose names you provided are valid for the join function, and print an error message if they aren't.\n", "\n", "Whenever a column from an -omics dataframe is included in a joined table, the name of the -omics dataframe it came from is joined to the column header, to avoid confusion.\n", "\n", "If you wish to only include particular columns in the join, include them as values in the dictionary. All values will accept either a single column name as a string, or a list of column name strings. 
In this tutorial, we usually select only specific columns for readability, but you could select the whole dataframe in every case except the mutations dataframe.\n", "\n", "Judging by the outputs below, the join keeps patients that are missing from one of the tables and fills their values with NaN (see C3L-00084 in the clinical joins), so the logic is closer to an SQL FULL OUTER JOIN than to an INNER JOIN." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join dictionary\n", "\n", "The main parameter for the `multi_join` function is a dictionary with the source and datatype as the key, and the columns to select as the value. Because a datatype can be available from several sources, the desired source must be included in the key. This can be done in two ways. The first is a string that contains the source, a space, and then the datatype. The second is a tuple formatted as (source, datatype). For example, using:\n", "\n", "`{('awg', 'proteomics'): ''}`\n", "\n", "or\n", "\n", "`{\"awg proteomics\": ''}`\n", "\n", "as the join dictionary would each result in `multi_join` returning a dataframe containing only awg proteomics data.\n", "\n", "You'll notice the value in the key:value pair is an empty string. Because a dictionary needs a value for each key, an empty string or an empty list means we want everything from the specified dataframe. If a string or list of strings is specified, the joined dataframe will contain only the specified columns. See below for more examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join omics to omics\n", "\n", "`multi_join` can join two -omics dataframes to each other. Types of -omics data valid for use with this function are acetylproteomics, CNV, phosphoproteomics, phosphoproteomics_gene, proteomics, and transcriptomics." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "Name A1BG_awg_proteomics A2M_awg_proteomics A2ML1_awg_proteomics \\\n", "Site \n", "Patient_ID \n", "C3L-00006 -1.180 -0.8630 -0.802 \n", "C3L-00008 -0.685 -1.0700 -0.684 \n", "C3L-00032 -0.528 -1.3200 0.435 \n", "C3L-00090 -1.670 -1.1900 -0.443 \n", "C3L-00098 -0.374 -0.0206 -0.537 \n", "\n", "Name A4GALT_awg_proteomics AAAS_awg_proteomics AACS_awg_proteomics \\\n", "Site \n", "Patient_ID \n", "C3L-00006 0.222 0.2560 0.6650 \n", "C3L-00008 0.984 0.1350 0.3340 \n", "C3L-00032 NaN -0.2400 1.0400 \n", "C3L-00090 0.243 -0.0993 0.7570 \n", "C3L-00098 0.311 0.3750 0.0131 \n", "\n", "Name AADAT_awg_proteomics AAED1_awg_proteomics AAGAB_awg_proteomics \\\n", "Site \n", "Patient_ID \n", "C3L-00006 1.2800 -0.3390 0.412 \n", "C3L-00008 1.3000 0.1390 1.330 \n", "C3L-00032 -0.0213 -0.0479 0.419 \n", "C3L-00090 0.7400 -0.9290 0.229 \n", "C3L-00098 -1.1000 NaN 0.565 \n", "\n", "Name AAK1_awg_proteomics ... ZZZ3_awg_phosphoproteomics \\\n", "Site ... S397 S411 S420 \n", "Patient_ID ... \n", "C3L-00006 -0.664 ... 0.18400 NaN NaN \n", "C3L-00008 -0.367 ... -0.17100 NaN NaN \n", "C3L-00032 -0.500 ... NaN NaN NaN \n", "C3L-00090 -0.223 ... 0.13970 NaN NaN \n", "C3L-00098 -0.101 ... 
-0.15875 NaN NaN \n", "\n", "Name \n", "Site S424 S426 S468 S89 T415 T418 Y399 \n", "Patient_ID \n", "C3L-00006 NaN -0.20500 NaN NaN NaN NaN NaN \n", "C3L-00008 -0.393 -0.17100 NaN 0.29 NaN 0.1605 -0.0635 \n", "C3L-00032 NaN NaN NaN NaN NaN NaN NaN \n", "C3L-00090 NaN -0.55900 NaN NaN NaN NaN 0.2980 \n", "C3L-00098 0.196 0.06175 NaN NaN NaN NaN -0.2900 \n", "\n", "[5 rows x 84211 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prot_and_phos = en.multi_join({\"awg proteomics\":'', \"awg phosphoproteomics\":''})\n", "prot_and_phos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Joining only specific columns.\n", "(Note that when a gene is selected from the phosphoproteomics dataframe, data for all sites of the gene are selected. The same is done for acetylproteomics data.)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Name A1BG_awg_proteomics PIK3CA_awg_phosphoproteomics \n", "Site S312 T313\n", "Patient_ID \n", "C3L-00006 -1.180 -0.00615 0.0731\n", "C3L-00008 -0.685 -0.02220 NaN\n", "C3L-00032 -0.528 NaN 0.0830\n", "C3L-00090 -1.670 NaN -0.8460\n", "C3L-00098 -0.374 0.43600 NaN" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prot_and_phos_selected = en.multi_join({\"awg proteomics\":'A1BG', \"awg phosphoproteomics\":'PIK3CA'})\n", "prot_and_phos_selected.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join metadata to omics\n", "\n", "The `multi_join` function can also join a metadata dataframe (e.g. clinical or derived_molecular) with an -omics dataframe:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "Name Sample_ID Sample_Tumor_Normal Proteomics_Tumor_Normal \\\n", "Patient_ID \n", "C3L-00006 S001 Tumor Tumor \n", "C3L-00008 S002 Tumor Tumor \n", "C3L-00032 S003 Tumor Tumor \n", "C3L-00084 S004 Tumor Tumor \n", "C3L-00090 S005 Tumor Tumor \n", "\n", "Name Country Histologic_Grade_FIGO Myometrial_invasion_Specify \\\n", "Patient_ID \n", "C3L-00006 United States FIGO grade 1 under 50 % \n", "C3L-00008 United States FIGO grade 1 under 50 % \n", "C3L-00032 United States FIGO grade 2 under 50 % \n", "C3L-00084 NaN NaN NaN \n", "C3L-00090 United States FIGO grade 2 under 50 % \n", "\n", "Name Histologic_type Treatment_naive Tumor_purity \\\n", "Patient_ID \n", "C3L-00006 Endometrioid YES Normal \n", "C3L-00008 Endometrioid YES Normal \n", "C3L-00032 Endometrioid YES Normal \n", "C3L-00084 Carcinosarcoma YES Normal \n", "C3L-00090 Endometrioid YES Normal \n", "\n", "Name Path_Stage_Primary_Tumor-pT ... ZWILCH_awg_transcriptomics \\\n", "Patient_ID ... \n", "C3L-00006 pT1a (FIGO IA) ... 11.06 \n", "C3L-00008 pT1a (FIGO IA) ... 10.87 \n", "C3L-00032 pT1a (FIGO IA) ... 10.06 \n", "C3L-00084 NaN ... NaN \n", "C3L-00090 pT1a (FIGO IA) ... 
10.29 \n", "\n", "Name ZWINT_awg_transcriptomics ZXDA_awg_transcriptomics \\\n", "Patient_ID \n", "C3L-00006 10.73 8.40 \n", "C3L-00008 11.43 8.39 \n", "C3L-00032 10.13 8.35 \n", "C3L-00084 NaN NaN \n", "C3L-00090 10.41 9.10 \n", "\n", "Name ZXDB_awg_transcriptomics ZXDC_awg_transcriptomics \\\n", "Patient_ID \n", "C3L-00006 9.78 10.88 \n", "C3L-00008 9.14 10.38 \n", "C3L-00032 9.27 10.46 \n", "C3L-00084 NaN NaN \n", "C3L-00090 9.59 10.15 \n", "\n", "Name ZYG11A_awg_transcriptomics ZYG11B_awg_transcriptomics \\\n", "Patient_ID \n", "C3L-00006 5.93 11.52 \n", "C3L-00008 7.25 11.64 \n", "C3L-00032 6.85 11.60 \n", "C3L-00084 NaN NaN \n", "C3L-00090 7.89 11.90 \n", "\n", "Name ZYX_awg_transcriptomics ZZEF1_awg_transcriptomics \\\n", "Patient_ID \n", "C3L-00006 10.23 11.50 \n", "C3L-00008 10.64 11.26 \n", "C3L-00032 10.21 11.51 \n", "C3L-00084 NaN NaN \n", "C3L-00090 10.21 11.34 \n", "\n", "Name ZZZ3_awg_transcriptomics \n", "Patient_ID \n", "C3L-00006 11.47 \n", "C3L-00008 11.57 \n", "C3L-00032 11.09 \n", "C3L-00084 NaN \n", "C3L-00090 11.51 \n", "\n", "[5 rows x 28084 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clin_and_tran = en.multi_join({\"awg clinical\":'', \"awg transcriptomics\":''})\n", "clin_and_tran.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Joining only specific columns:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Name Age Histologic_type ZZZ3_awg_transcriptomics\n", "Patient_ID \n", "C3L-00006 64.0 Endometrioid 11.47\n", "C3L-00008 58.0 Endometrioid 11.57\n", "C3L-00032 50.0 Endometrioid 11.09\n", "C3L-00084 NaN Carcinosarcoma NaN\n", "C3L-00090 75.0 Endometrioid 11.51" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clin_and_tran = en.multi_join({\"awg clinical\": [\"Age\", \"Histologic_type\"], \"awg transcriptomics\": \"ZZZ3\"})\n", "clin_and_tran.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join metadata to metadata\n", "\n", "Two metadata dataframes (e.g. clinical and derived_molecular) can also be joined together. In the example below, we pass a column name to select from the clinical dataframe, while passing an empty string `''` (or an empty list `[]`) as the value for the derived_molecular dataframe selects the entire dataframe." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Name Histologic_type Estrogen_Receptor Estrogen_Receptor_% \\\n", "Patient_ID \n", "C3L-00006 Endometrioid Cannot be determined NaN \n", "C3L-00008 Endometrioid Cannot be determined NaN \n", "C3L-00032 Endometrioid Cannot be determined NaN \n", "C3L-00084 Carcinosarcoma NaN NaN \n", "C3L-00090 Endometrioid Cannot be determined NaN \n", "\n", "Name Progesterone_Receptor Progesterone_Receptor_% \\\n", "Patient_ID \n", "C3L-00006 Cannot be determined NaN \n", "C3L-00008 Cannot be determined NaN \n", "C3L-00032 Cannot be determined NaN \n", "C3L-00084 NaN NaN \n", "C3L-00090 Cannot be determined NaN \n", "\n", "Name MLH1 MLH2 \\\n", "Patient_ID \n", "C3L-00006 Intact nuclear expression Intact nuclear expression \n", "C3L-00008 Intact nuclear expression Intact nuclear expression \n", "C3L-00032 Intact nuclear expression Intact nuclear expression \n", "C3L-00084 NaN NaN \n", "C3L-00090 Intact nuclear expression Intact nuclear expression \n", "\n", "Name MSH6 PMS2 \\\n", "Patient_ID \n", "C3L-00006 Loss of nuclear expression Intact nuclear expression \n", "C3L-00008 Intact nuclear expression Loss of nuclear expression \n", "C3L-00032 Intact nuclear expression Intact nuclear expression \n", "C3L-00084 NaN NaN \n", "C3L-00090 Intact nuclear expression Intact nuclear expression \n", "\n", "Name p53 ... Log2_variant_total Log2_SNP_total \\\n", "Patient_ID ... \n", "C3L-00006 Cannot be determined ... 10.062046 9.984418 \n", "C3L-00008 Cannot be determined ... 8.861087 8.330917 \n", "C3L-00032 Cannot be determined ... 5.321928 5.000000 \n", "C3L-00084 NaN ... NaN NaN \n", "C3L-00090 Cannot be determined ... 
5.672425 5.523562 \n", "\n", "Name Log2_INDEL_total Genomics_subtype Mutation_signature_C>A \\\n", "Patient_ID \n", "C3L-00006 5.832890 MSI-H 8.300395 \n", "C3L-00008 7.169925 MSI-H 14.641745 \n", "C3L-00032 3.169925 CNV_low 16.129032 \n", "C3L-00084 NaN NaN NaN \n", "C3L-00090 2.584963 CNV_low 17.777778 \n", "\n", "Name Mutation_signature_C>G Mutation_signature_C>T \\\n", "Patient_ID \n", "C3L-00006 1.482213 72.529644 \n", "C3L-00008 2.803738 64.485981 \n", "C3L-00032 3.225806 70.967742 \n", "C3L-00084 NaN NaN \n", "C3L-00090 8.888889 62.222222 \n", "\n", "Name Mutation_signature_T>C Mutation_signature_T>A \\\n", "Patient_ID \n", "C3L-00006 14.426877 1.383399 \n", "C3L-00008 15.264798 0.934579 \n", "C3L-00032 3.225806 3.225806 \n", "C3L-00084 NaN NaN \n", "C3L-00090 8.888889 2.222222 \n", "\n", "Name Mutation_signature_T>G \n", "Patient_ID \n", "C3L-00006 1.877470 \n", "C3L-00008 1.869159 \n", "C3L-00032 3.225806 \n", "C3L-00084 NaN \n", "C3L-00090 0.000000 \n", "\n", "[5 rows x 126 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hist_and_derived_molecular = en.multi_join({\n", " \"awg clinical\": \"Histologic_type\",\n", " \"awg derived_molecular\": '' # Note that by using an empty string or list as the value, we join the entire dataframe\n", "})\n", "\n", "hist_and_derived_molecular.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join many datatypes together\n", "\n", "If you need data from three or more dataframes, simply add them all to the joining dictionary; `multi_join` does not limit how many dataframes the dictionary can name. If a requested column is missing from a dataframe, it is still inserted into the joined table, filled with NaN, and `cptac` prints a warning, as happens for AURKA in the phosphoproteomics dataframe below." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cptac warning: The following columns were not found in the awg phosphoproteomics dataframe, so they were inserted into joined table, but filled with NaN: AURKA (, line 2)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " \r" ] }, { "name": "stderr", "output_type": "stream", "text": [ "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 78 samples for the PTEN gene (, line 2)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "Name AURKA_awg_proteomics TP53_awg_proteomics \\\n", "Site \n", "Patient_ID \n", "C3L-00006 NaN 0.295 \n", "C3L-00008 0.311 0.277 \n", "C3L-00032 NaN -0.871 \n", "C3L-00084 NaN NaN \n", "C3L-00090 -0.798 -0.343 \n", "\n", "Name AURKA_awg_phosphoproteomics TP53_awg_phosphoproteomics \\\n", "Site NaN S315 T150 \n", "Patient_ID \n", "C3L-00006 NaN NaN NaN \n", "C3L-00008 NaN 0.646 NaN \n", "C3L-00032 NaN -0.800 NaN \n", "C3L-00084 NaN NaN NaN \n", "C3L-00090 NaN NaN NaN \n", "\n", "Name Sample_ID Sample_Tumor_Normal Proteomics_Tumor_Normal \\\n", "Site \n", "Patient_ID \n", "C3L-00006 S001 Tumor Tumor \n", "C3L-00008 S002 Tumor Tumor \n", "C3L-00032 S003 Tumor Tumor \n", "C3L-00084 S004 Tumor Tumor \n", "C3L-00090 S005 Tumor Tumor \n", "\n", "Name Country Histologic_Grade_FIGO ... Gender \\\n", "Site ... \n", "Patient_ID ... \n", "C3L-00006 United States FIGO grade 1 ... Female \n", "C3L-00008 United States FIGO grade 1 ... Female \n", "C3L-00032 United States FIGO grade 2 ... Female \n", "C3L-00084 NaN NaN ... NaN \n", "C3L-00090 United States FIGO grade 2 ... 
Female \n", "\n", "Name Tumor_Site Tumor_Site_Other \\\n", "Site \n", "Patient_ID \n", "C3L-00006 Anterior endometrium NaN \n", "C3L-00008 Posterior endometrium NaN \n", "C3L-00032 Other, specify Anterior and Posterior endometrium \n", "C3L-00084 NaN NaN \n", "C3L-00090 Other, specify Anterior and Posterior endometrium \n", "\n", "Name Tumor_Focality Tumor_Size_cm Num_full_term_pregnancies \\\n", "Site \n", "Patient_ID \n", "C3L-00006 Unifocal 2.9 1 \n", "C3L-00008 Unifocal 3.5 1 \n", "C3L-00032 Unifocal 4.5 4 or more \n", "C3L-00084 NaN NaN NaN \n", "C3L-00090 Unifocal 3.5 4 or more \n", "\n", "Name PTEN_Mutation PTEN_Location \\\n", "Site \n", "Patient_ID \n", "C3L-00006 [Missense_Mutation, Nonsense_Mutation] [p.R130Q, p.R233*] \n", "C3L-00008 [Missense_Mutation] [p.G127R] \n", "C3L-00032 [Nonsense_Mutation] [p.W111*] \n", "C3L-00084 [Wildtype_Tumor] [No_mutation] \n", "C3L-00090 [Missense_Mutation] [p.R130G] \n", "\n", "Name PTEN_Mutation_Status Sample_Status \n", "Site \n", "Patient_ID \n", "C3L-00006 Multiple_mutation Tumor \n", "C3L-00008 Single_mutation Tumor \n", "C3L-00032 Single_mutation Tumor \n", "C3L-00084 Wildtype_Tumor Tumor \n", "C3L-00090 Single_mutation Tumor \n", "\n", "[5 rows x 36 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joining_dictionary = {\"awg proteomics\": [\"AURKA\", \"TP53\"], \"awg phosphoproteomics\": [\"AURKA\", \"TP53\"], \"awg clinical\": [], \"awg somatic_mutation\": \"PTEN\"}\n", "en.multi_join(joining_dictionary).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`multi_join` does not necessarily need to join different dataframes. If you just want a small amount of information from a dataframe, this function is useful for that as well." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Name Histologic_type Histologic_Grade_FIGO\n", "Patient_ID \n", "C3L-00006 Endometrioid FIGO grade 1\n", "C3L-00008 Endometrioid FIGO grade 1\n", "C3L-00032 Endometrioid FIGO grade 2\n", "C3L-00084 Carcinosarcoma NaN\n", "C3L-00090 Endometrioid FIGO grade 2" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "histologic_type_and_grade = en.multi_join({\"awg clinical\": ['Histologic_type', 'Histologic_Grade_FIGO']})\n", "histologic_type_and_grade.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join omics to mutations\n", "\n", "Joining an -omics dataframe with the mutation data for a specified gene or genes is slightly different than other types of joins using `multi_join`. Because there may be multiple mutations for one gene in a single sample, the mutation type and location data are returned in lists by default, even if there is only one mutation. If there is no mutation for the gene in a particular sample, the list contains either \"Wildtype_Tumor\" or \"Wildtype_Normal\", depending on whether it's a tumor or normal sample. The mutation status column contains either \"Single_mutation\", \"Multiple_mutation\", \"Wildtype_Tumor\", or \"Wildtype_Normal\", for help with parsing." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (, line 1)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ "Name AURKA_awg_proteomics TP53_awg_proteomics \\\n", "Patient_ID \n", "C3L-00006 NaN 0.2950 \n", "C3L-00008 0.31100 0.2770 \n", "C3L-00032 NaN -0.8710 \n", "C3L-00090 -0.79800 -0.3430 \n", "C3L-00098 3.11000 3.0100 \n", "C3L-00136 -1.65000 -0.1480 \n", "C3L-00137 NaN 0.4410 \n", "C3L-00139 0.84800 -1.2200 \n", "C3L-00143 -1.73000 -0.0825 \n", "C3L-00145 -0.00513 -0.1810 \n", "\n", "Name PTEN_Mutation PTEN_Location \\\n", "Patient_ID \n", "C3L-00006 [Missense_Mutation, Nonsense_Mutation] [p.R130Q, p.R233*] \n", "C3L-00008 [Missense_Mutation] [p.G127R] \n", "C3L-00032 [Nonsense_Mutation] [p.W111*] \n", "C3L-00090 [Missense_Mutation] [p.R130G] \n", "C3L-00098 [Wildtype_Tumor] [No_mutation] \n", "C3L-00136 [Missense_Mutation, Missense_Mutation] [p.Y68C, p.R130G] \n", "C3L-00137 [Frame_Shift_Ins, Nonsense_Mutation] [p.H118Qfs*8, p.Y180*] \n", "C3L-00139 [Wildtype_Tumor] [No_mutation] \n", "C3L-00143 [Missense_Mutation] [p.R130G] \n", "C3L-00145 [Missense_Mutation, Frame_Shift_Ins] [p.H93R, p.E242*] \n", "\n", "Name PTEN_Mutation_Status Sample_Status \n", "Patient_ID \n", "C3L-00006 Multiple_mutation Tumor \n", "C3L-00008 Single_mutation Tumor \n", "C3L-00032 Single_mutation Tumor \n", "C3L-00090 Single_mutation Tumor \n", "C3L-00098 Wildtype_Tumor Tumor \n", "C3L-00136 Multiple_mutation Tumor \n", "C3L-00137 Multiple_mutation Tumor \n", "C3L-00139 Wildtype_Tumor Tumor \n", "C3L-00143 Single_mutation Tumor \n", "C3L-00145 Multiple_mutation Tumor " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "selected_acet_and_PTEN_mut_mult = en.multi_join({\"awg proteomics\": [\"AURKA\", \"TP53\"], \"awg somatic_mutation\": \"PTEN\"})\n", "selected_acet_and_PTEN_mut_mult.head(10)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/robertoldroyd/opt/anaconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3437: 
UserWarning: No source specified for proteomics data. Source awg used, pass a source to the omics_source parameter to prevent this warning\n", " exec(code_obj, self.user_global_ns, self.user_ns)\n", "/Users/robertoldroyd/opt/anaconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3437: UserWarning: No source specified for mutations data. Source awg used, pass a source to the mutations_source parameter to prevent this warning\n", " exec(code_obj, self.user_global_ns, self.user_ns)\n", "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (/Users/robertoldroyd/opt/anaconda3/lib/python3.8/site-packages/cptac/cancers/cancer.py, line 387)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ "Name AURKA_awg_proteomics TP53_awg_proteomics \\\n", "Patient_ID \n", "C3L-00006 NaN 0.2950 \n", "C3L-00008 0.31100 0.2770 \n", "C3L-00032 NaN -0.8710 \n", "C3L-00090 -0.79800 -0.3430 \n", "C3L-00098 3.11000 3.0100 \n", "C3L-00136 -1.65000 -0.1480 \n", "C3L-00137 NaN 0.4410 \n", "C3L-00139 0.84800 -1.2200 \n", "C3L-00143 -1.73000 -0.0825 \n", "C3L-00145 -0.00513 -0.1810 \n", "\n", "Name PTEN_Mutation PTEN_Location \\\n", "Patient_ID \n", "C3L-00006 [Missense_Mutation, Nonsense_Mutation] [p.R130Q, p.R233*] \n", "C3L-00008 [Missense_Mutation] [p.G127R] \n", "C3L-00032 [Nonsense_Mutation] [p.W111*] \n", "C3L-00090 [Missense_Mutation] [p.R130G] \n", "C3L-00098 [Wildtype_Tumor] [No_mutation] \n", "C3L-00136 [Missense_Mutation, Missense_Mutation] [p.Y68C, p.R130G] \n", "C3L-00137 [Frame_Shift_Ins, Nonsense_Mutation] [p.H118Qfs*8, p.Y180*] \n", "C3L-00139 [Wildtype_Tumor] [No_mutation] \n", "C3L-00143 [Missense_Mutation] [p.R130G] \n", "C3L-00145 [Missense_Mutation, Frame_Shift_Ins] [p.H93R, p.E242*] \n", "\n", "Name PTEN_Mutation_Status Sample_Status \n", "Patient_ID \n", "C3L-00006 Multiple_mutation Tumor \n", "C3L-00008 Single_mutation Tumor \n", "C3L-00032 Single_mutation Tumor \n", "C3L-00090 Single_mutation Tumor \n", "C3L-00098 Wildtype_Tumor Tumor \n", "C3L-00136 Multiple_mutation Tumor \n", "C3L-00137 Multiple_mutation Tumor \n", "C3L-00139 Wildtype_Tumor Tumor \n", "C3L-00143 Single_mutation Tumor \n", "C3L-00145 Multiple_mutation Tumor " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "selected_acet_and_PTEN_mut = en.join_omics_to_mutations(\n", " omics_name=\"proteomics\",\n", " mutations_genes=\"PTEN\", \n", " omics_genes=[\"AURKA\", \"TP53\"])\n", "\n", "selected_acet_and_PTEN_mut.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering multiple mutations\n", "\n", "The function has the ability to filter multiple mutations down to just one mutation. 
It allows you to specify particular mutation types or locations to prioritize, and also provides a default sorting hierarchy for all other mutations. The default hierarchy ranks truncation mutations above missense mutations, and silent mutations last of all. If there are multiple mutations of the same type, it chooses the mutation occurring earlier in the sequence. \n", "\n", "To filter all mutations based on this default hierarchy, simply pass an empty list to the optional `mutations_filter` parameter. Notice how in sample C3L-00006, the nonsense mutation was chosen over the missense mutation, because it's a type of truncation mutation, even though the missense mutation occurs earlier in the peptide sequence. In sample C3L-00137, both mutations were types of truncation mutations, so the function just chose the earlier one." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (, line 1)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ "Name AURKA_awg_proteomics TP53_awg_proteomics PTEN_Mutation \\\n", "Patient_ID \n", "C3L-00006 NaN 0.295 Nonsense_Mutation \n", "C3L-00137 NaN 0.441 Frame_Shift_Ins \n", "\n", "Name PTEN_Location PTEN_Mutation_Status Sample_Status \n", "Patient_ID \n", "C3L-00006 p.R233* Multiple_mutation Tumor \n", "C3L-00137 p.H118Qfs*8 Multiple_mutation Tumor " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PTEN_default_filter = en.multi_join({\"awg proteomics\": [\"AURKA\", \"TP53\"],\n", " \"awg somatic_mutation\": \"PTEN\"},\n", " mutations_filter=[])\n", "PTEN_default_filter.loc[[\"C3L-00006\", \"C3L-00137\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To prioritize a particular type of mutation, or a particular location, include it in the `mutations_filter` list. Below, we tell the function to prioritize nonsense mutations over all other mutations. Notice how in sample S008, the nonsense mutation is now selected instead of the frameshift insertion, even though the nonsense mutation occurs later in the peptide sequence." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (, line 1)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ "Name AURKA_awg_proteomics TP53_awg_proteomics PTEN_Mutation \\\n", "Patient_ID \n", "C3L-00006 NaN 0.295 Nonsense_Mutation \n", "C3L-00137 NaN 0.441 Nonsense_Mutation \n", "\n", "Name PTEN_Location PTEN_Mutation_Status Sample_Status \n", "Patient_ID \n", "C3L-00006 p.R233* Multiple_mutation Tumor \n", "C3L-00137 p.Y180* Multiple_mutation Tumor " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PTEN_simple_filter = en.multi_join({\"awg proteomics\": [\"AURKA\", \"TP53\"],\n", " \"awg somatic_mutation\": \"PTEN\"},\n", " mutations_filter=[\"Nonsense_Mutation\"])\n", "PTEN_simple_filter.loc[[\"C3L-00006\", \"C3L-00137\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can include multiple mutation types and/or locations in the `mutations_filter` list. Values earlier in the list will be prioritized over values later in the list. For example, with the filter we specify below, the function first selects sample S001's missense mutation over its nonsense mutation, because we put the location of S001's missense mutation as the first value in our filter list. We still included Nonsense_Mutation in the filter list, but it comes after the location of S001's missense mutation, which is why S001's missense mutation is still prioritized. However, on all other samples, unless they also have a mutation at that same location, the function will continue prioritizing nonsense mutations, as we see in sample S008." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (, line 1)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ "Name AURKA_awg_proteomics TP53_awg_proteomics PTEN_Mutation \\\n", "Patient_ID \n", "C3L-00006 NaN 0.295 Missense_Mutation \n", "C3L-00137 NaN 0.441 Nonsense_Mutation \n", "\n", "Name PTEN_Location PTEN_Mutation_Status Sample_Status \n", "Patient_ID \n", "C3L-00006 p.R130Q Multiple_mutation Tumor \n", "C3L-00137 p.Y180* Multiple_mutation Tumor " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PTEN_complex_filter = en.multi_join({\"awg proteomics\": [\"AURKA\", \"TP53\"],\n", " \"awg somatic_mutation\": \"PTEN\"}, \n", " mutations_filter=[\"p.R130Q\", \"Nonsense_Mutation\"])\n", "PTEN_complex_filter.loc[[\"C3L-00006\", \"C3L-00137\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join metadata to mutations\n", "\n", "Joining metadata to mutation data works exactly like joining other datatypes. Just like any time you are using somatic_mutation data, you can filter multiple mutations with the `mutations_filter` parameter. Here are some examples:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 78 samples for the PTEN gene (, line 1)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ "Name Histologic_type PTEN_Mutation \\\n", "Patient_ID \n", "C3L-00006 Endometrioid [Missense_Mutation, Nonsense_Mutation] \n", "C3L-00008 Endometrioid [Missense_Mutation] \n", "C3L-00032 Endometrioid [Nonsense_Mutation] \n", "C3L-00084 Carcinosarcoma [Wildtype_Tumor] \n", "C3L-00090 Endometrioid [Missense_Mutation] \n", "\n", "Name PTEN_Location PTEN_Mutation_Status Sample_Status \n", "Patient_ID \n", "C3L-00006 [p.R130Q, p.R233*] Multiple_mutation Tumor \n", "C3L-00008 [p.G127R] Single_mutation Tumor \n", "C3L-00032 [p.W111*] Single_mutation Tumor \n", "C3L-00084 [No_mutation] Wildtype_Tumor Tumor \n", "C3L-00090 [p.R130G] Single_mutation Tumor " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hist_and_PTEN = en.multi_join(\n", " {\"awg clinical\": 'Histologic_type',\n", " \"awg somatic_mutation\": \"PTEN\"})\n", "\n", "hist_and_PTEN.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With multiple mutations filtered:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 78 samples for the PTEN gene (, line 1)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ "Name Histologic_type PTEN_Mutation PTEN_Location \\\n", "Patient_ID \n", "C3L-00006 Endometrioid Nonsense_Mutation p.R233* \n", "C3L-00008 Endometrioid Missense_Mutation p.G127R \n", "C3L-00032 Endometrioid Nonsense_Mutation p.W111* \n", "C3L-00084 Carcinosarcoma Wildtype_Tumor No_mutation \n", "C3L-00090 Endometrioid Missense_Mutation p.R130G \n", "\n", "Name PTEN_Mutation_Status Sample_Status \n", "Patient_ID \n", "C3L-00006 Multiple_mutation Tumor \n", "C3L-00008 Single_mutation Tumor \n", "C3L-00032 Single_mutation Tumor \n", "C3L-00084 Wildtype_Tumor Tumor \n", "C3L-00090 Single_mutation Tumor " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hist_and_PTEN = en.multi_join(\n", " {\"awg clinical\": \"Histologic_type\",\n", " \"awg somatic_mutation\": \"PTEN\"},\n", " mutations_filter=[\"Nonsense_Mutation\"])\n", "\n", "hist_and_PTEN.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exporting dataframes\n", "\n", "If you wish to export a dataframe to a file, simply call the dataframe's to_csv method, passing the path you wish to save the file to, and the value separator you want:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "hist_and_PTEN.to_csv(path_or_buf=\"histologic_type_and_PTEN_mutation.tsv\", sep='\\t')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }