{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial 1: CPTAC Data Introduction\n", "\n", "The National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC generates comprehensive proteomics and genomics data from clinical cohorts, typically with ~100 samples per tumor type. The graphic below summarizes the structure of each CPTAC dataset. For more information, visit the [NIH website](https://proteomics.cancer.gov/programs/cptac). \n", "\n", "[Figure: structure of a CPTAC dataset]\n", "\n", "This Python package makes accessing CPTAC data easy with Python code and Jupyter notebooks. The package contains several tutorials which demonstrate data access and usage. This first tutorial serves as an introduction to the data to help users become familiar with what is included and how it is presented." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Overview\n", "\n", "Our package provides data access in a Python programming environment. If you have not installed Python or have not installed the package, see our installation documentation [here](https://paynelab.github.io/cptac/#installation).\n", "\n", "Once we have the package installed and we're in our Python environment, we begin by importing the package with a standard Python import statement:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import cptac" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To view the available datasets, call the `cptac.list_datasets()` function:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DescriptionData reuse statusPublication link
Dataset name
Brcabreast cancerno restrictionshttps://pubmed.ncbi.nlm.nih.gov/33212010/
Ccrccclear cell renal cell carcinoma (kidney)no restrictionshttps://pubmed.ncbi.nlm.nih.gov/31675502/
Coloncolorectal cancerno restrictionshttps://pubmed.ncbi.nlm.nih.gov/31031003/
Endometrialendometrial carcinoma (uterine)no restrictionshttps://pubmed.ncbi.nlm.nih.gov/32059776/
Gbmglioblastomano restrictionshttps://pubmed.ncbi.nlm.nih.gov/33577785/
Hnscchead and neck squamous cell carcinomano restrictionshttps://pubmed.ncbi.nlm.nih.gov/33417831/
Lscclung squamous cell carcinomano restrictionshttps://pubmed.ncbi.nlm.nih.gov/34358469/
Luadlung adenocarcinomano restrictionshttps://pubmed.ncbi.nlm.nih.gov/32649874/
Ovarianhigh grade serous ovarian cancerno restrictionshttps://pubmed.ncbi.nlm.nih.gov/27372738/
Pdacpancreatic ductal adenocarcinomano restrictionshttps://pubmed.ncbi.nlm.nih.gov/34534465/
UcecConfendometrial confirmatory carcinomapassword access onlyunpublished
GbmConfglioblastoma confirmatorypassword access onlyunpublished
\n", "
" ], "text/plain": [ " Description Data reuse status \\\n", "Dataset name \n", "Brca breast cancer no restrictions \n", "Ccrcc clear cell renal cell carcinoma (kidney) no restrictions \n", "Colon colorectal cancer no restrictions \n", "Endometrial endometrial carcinoma (uterine) no restrictions \n", "Gbm glioblastoma no restrictions \n", "Hnscc head and neck squamous cell carcinoma no restrictions \n", "Lscc lung squamous cell carcinoma no restrictions \n", "Luad lung adenocarcinoma no restrictions \n", "Ovarian high grade serous ovarian cancer no restrictions \n", "Pdac pancreatic ductal adenocarcinoma no restrictions \n", "UcecConf endometrial confirmatory carcinoma password access only \n", "GbmConf glioblastoma confirmatory password access only \n", "\n", " Publication link \n", "Dataset name \n", "Brca https://pubmed.ncbi.nlm.nih.gov/33212010/ \n", "Ccrcc https://pubmed.ncbi.nlm.nih.gov/31675502/ \n", "Colon https://pubmed.ncbi.nlm.nih.gov/31031003/ \n", "Endometrial https://pubmed.ncbi.nlm.nih.gov/32059776/ \n", "Gbm https://pubmed.ncbi.nlm.nih.gov/33577785/ \n", "Hnscc https://pubmed.ncbi.nlm.nih.gov/33417831/ \n", "Lscc https://pubmed.ncbi.nlm.nih.gov/34358469/ \n", "Luad https://pubmed.ncbi.nlm.nih.gov/32649874/ \n", "Ovarian https://pubmed.ncbi.nlm.nih.gov/27372738/ \n", "Pdac https://pubmed.ncbi.nlm.nih.gov/34534465/ \n", "UcecConf unpublished \n", "GbmConf unpublished " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cptac.list_datasets()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Availability\n", "The goals of CPTAC as a consortium include the broad and open dissemination of cancer proteogenomic data. The timing of the a dataset's public release generally follows three stages: internal release to CPTAC investigators, public release with a publication embargo, and full public release. 
Each of the cancer types may be at a different data availability stage, depending on the date of data creation. In the Python `cptac` package, these three stages are handled as follows:\n", "\n", "**Internally released data** requires a password to download.\n", "\n", "**Embargoed release data** is publicly available, but the package prints an embargo statement every time you interact with the data.\n", "\n", "**Public data** is fully released without restrictions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Downloading data\n", "\n", "The cptac package stores the data files for each dataset on a remote server. When you first install cptac, you will have no data files. To download the latest version of the data files for a particular dataset, call the `cptac.download` function, passing the name of the dataset as the `dataset` parameter:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cptac.download(dataset=\"endometrial\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the data\n", "\n", "Once you've downloaded a dataset, `cptac` allows you to load it into a Python variable, which you can then use to access and work with the data. To load a particular dataset into a variable, type the name you want to give the variable, followed by `=`, and then type `cptac.` and the name of the dataset in [UpperCamelCase](https://en.wikipedia.org/wiki/Camel_case) followed by a pair of parentheses, e.g. 
`cptac.Endometrial()` or `cptac.Ccrcc()`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "en = cptac.Endometrial()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see what data is available, use the `en.list_data()` function. This displays the different types of data included in the dataset for this particular cancer type, each stored in a [pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe). It also prints the dimensions of each dataframe." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Below are the dataframes contained in this dataset and their dimensions:\n", "\n", "acetylproteomics\n", "\t144 rows\n", "\t10862 columns\n", "circular_RNA\n", "\t109 rows\n", "\t4945 columns\n", "clinical\n", "\t144 rows\n", "\t27 columns\n", "CNV\n", "\t95 rows\n", "\t28057 columns\n", "derived_molecular\n", "\t144 rows\n", "\t125 columns\n", "experimental_design\n", "\t144 rows\n", "\t26 columns\n", "followup\n", "\t396 rows\n", "\t49 columns\n", "miRNA\n", "\t99 rows\n", "\t2337 columns\n", "phosphoproteomics\n", "\t144 rows\n", "\t73212 columns\n", "proteomics\n", "\t144 rows\n", "\t10999 columns\n", "somatic_mutation\n", "\t52560 rows\n", "\t3 columns\n", "somatic_mutation_binary\n", "\t95 rows\n", "\t51559 columns\n", "transcriptomics\n", "\t109 rows\n", "\t28057 columns\n" ] } ], "source": [ "en.list_data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Molecular Omics\n", "\n", "Data can be accessed through several \"get\" functions. For example, we can look at the proteomics data by using `en.get_proteomics()`. This returns a [pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe) containing the proteomic data. 
Each column in the proteomics dataframe holds the quantitative measurements for a particular protein. Each row is a sample, either tumor or non-tumor tissue, from a cancer patient." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Samples: ['C3L-00006', 'C3L-00008', 'C3L-00032', 'C3L-00090', 'C3L-00098', 'C3L-00136', 'C3L-00137', 'C3L-00139', 'C3L-00143', 'C3L-00145', 'C3L-00156', 'C3L-00161', 'C3L-00358', 'C3L-00361', 'C3L-00362', 'C3L-00413', 'C3L-00449', 'C3L-00563', 'C3L-00586', 'C3L-00601']\n", "Proteins: ['A1BG', 'A2M', 'A2ML1', 'A4GALT', 'AAAS', 'AACS', 'AADAT', 'AAED1', 'AAGAB', 'AAK1', 'AAMDC', 'AAMP', 'AAR2', 'AARS', 'AARS2', 'AARSD1', 'AASDHPPT', 'AASS', 'AATF', 'ABAT']\n" ] } ], "source": [ "proteomics = en.get_proteomics()\n", "samples = proteomics.index  # row labels: one per sample\n", "proteins = proteomics.columns  # column labels: one per protein\n", "print(\"Samples:\", samples[0:20].tolist())  # the first twenty samples\n", "print(\"Proteins:\", proteins[0:20].tolist())  # the first twenty proteins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataframe values\n", "\n", "Values in the dataframe are protein abundance values. A value of \"NaN\" means that no data was obtained for that protein in that sample. For the endometrial CPTAC proteomics data, a TMT-reference channel strategy was used. A detailed description of this strategy can be found at [Nature Protocols](https://www.nature.com/articles/s41596-018-0006-9) and also on [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/?term=29988108). This strategy takes the ratio of each sample's abundance to a pooled reference, and the ratio is then log transformed. Therefore, positive values indicate a measurement higher than the pooled reference, and negative values indicate a measurement lower than the pooled reference." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameA1BGA2MA2ML1A4GALTAAASAACSAADATAAED1AAGABAAK1...ZSWIM8ZSWIM9ZW10ZWILCHZWINTZXDCZYG11BZYXZZEF1ZZZ3
Patient_ID
C3L-00006-1.180-0.8630-0.8020.2220.25600.66501.2800-0.33900.412-0.664...-0.08770NaN0.02290.1090NaN-0.332-0.43300-1.020-0.1230-0.0859
C3L-00008-0.685-1.0700-0.6840.9840.13500.33401.30000.13901.330-0.367...-0.03560NaN0.36301.07000.737-0.564-0.00461-1.130-0.0757-0.4730
C3L-00032-0.528-1.32000.435NaN-0.24001.0400-0.0213-0.04790.419-0.500...0.00112-0.14500.0105-0.1160NaN0.151-0.07400-0.5400.3200-0.4190
C3L-00090-1.670-1.1900-0.4430.243-0.09930.75700.7400-0.92900.229-0.223...0.07250-0.0552-0.07140.09330.156-0.398-0.07520-0.797-0.0301-0.4670
C3L-00098-0.374-0.0206-0.5370.3110.37500.0131-1.1000NaN0.565-0.101...-0.17600NaN-1.2200-0.56200.937-0.6460.20700-1.850-0.17600.0513
\n", "

5 rows × 10999 columns

\n", "
" ], "text/plain": [ "Name A1BG A2M A2ML1 A4GALT AAAS AACS AADAT AAED1 \\\n", "Patient_ID \n", "C3L-00006 -1.180 -0.8630 -0.802 0.222 0.2560 0.6650 1.2800 -0.3390 \n", "C3L-00008 -0.685 -1.0700 -0.684 0.984 0.1350 0.3340 1.3000 0.1390 \n", "C3L-00032 -0.528 -1.3200 0.435 NaN -0.2400 1.0400 -0.0213 -0.0479 \n", "C3L-00090 -1.670 -1.1900 -0.443 0.243 -0.0993 0.7570 0.7400 -0.9290 \n", "C3L-00098 -0.374 -0.0206 -0.537 0.311 0.3750 0.0131 -1.1000 NaN \n", "\n", "Name AAGAB AAK1 ... ZSWIM8 ZSWIM9 ZW10 ZWILCH ZWINT ZXDC \\\n", "Patient_ID ... \n", "C3L-00006 0.412 -0.664 ... -0.08770 NaN 0.0229 0.1090 NaN -0.332 \n", "C3L-00008 1.330 -0.367 ... -0.03560 NaN 0.3630 1.0700 0.737 -0.564 \n", "C3L-00032 0.419 -0.500 ... 0.00112 -0.1450 0.0105 -0.1160 NaN 0.151 \n", "C3L-00090 0.229 -0.223 ... 0.07250 -0.0552 -0.0714 0.0933 0.156 -0.398 \n", "C3L-00098 0.565 -0.101 ... -0.17600 NaN -1.2200 -0.5620 0.937 -0.646 \n", "\n", "Name ZYG11B ZYX ZZEF1 ZZZ3 \n", "Patient_ID \n", "C3L-00006 -0.43300 -1.020 -0.1230 -0.0859 \n", "C3L-00008 -0.00461 -1.130 -0.0757 -0.4730 \n", "C3L-00032 -0.07400 -0.540 0.3200 -0.4190 \n", "C3L-00090 -0.07520 -0.797 -0.0301 -0.4670 \n", "C3L-00098 0.20700 -1.850 -0.1760 0.0513 \n", "\n", "[5 rows x 10999 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "proteomics.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As seen in `en.list_data()`, other omics data are also available (e.g. transcriptomics, copy number variation, phoshoproteomics).\n", "\n", "The transcriptomics looks almost identical to the proteomics data, available in a pandas dataframe with the same convention. Each set of samples is consitent, meaning samples found in the endometrial proteomics data will be the same samples in all other endometrial dataframes." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameA1BGA1BG-AS1A1CFA2MA2M-AS1A2ML1A2MP1A3GALT2A4GALTA4GNT...ZWILCHZWINTZXDAZXDBZXDCZYG11AZYG11BZYXZZEF1ZZZ3
Patient_ID
C3L-000064.022.163.2713.395.886.791.550.9710.341.96...11.0610.738.409.7810.885.9311.5210.2311.5011.47
C3L-000084.812.214.8613.245.936.330.930.0010.830.00...10.8711.438.399.1410.387.2511.6410.6411.2611.57
C3L-000326.246.433.6814.326.539.422.790.0010.982.13...10.0610.138.359.2710.466.8511.6010.2111.5111.09
C3L-000905.314.875.5913.776.354.222.970.008.681.98...10.2910.419.109.5910.157.8911.9010.2111.3411.51
C3L-000989.848.837.0013.126.496.831.800.0011.423.28...10.3611.248.609.4411.809.3211.979.7711.3712.35
\n", "

5 rows × 28057 columns

\n", "
" ], "text/plain": [ "Name A1BG A1BG-AS1 A1CF A2M A2M-AS1 A2ML1 A2MP1 A3GALT2 \\\n", "Patient_ID \n", "C3L-00006 4.02 2.16 3.27 13.39 5.88 6.79 1.55 0.97 \n", "C3L-00008 4.81 2.21 4.86 13.24 5.93 6.33 0.93 0.00 \n", "C3L-00032 6.24 6.43 3.68 14.32 6.53 9.42 2.79 0.00 \n", "C3L-00090 5.31 4.87 5.59 13.77 6.35 4.22 2.97 0.00 \n", "C3L-00098 9.84 8.83 7.00 13.12 6.49 6.83 1.80 0.00 \n", "\n", "Name A4GALT A4GNT ... ZWILCH ZWINT ZXDA ZXDB ZXDC ZYG11A \\\n", "Patient_ID ... \n", "C3L-00006 10.34 1.96 ... 11.06 10.73 8.40 9.78 10.88 5.93 \n", "C3L-00008 10.83 0.00 ... 10.87 11.43 8.39 9.14 10.38 7.25 \n", "C3L-00032 10.98 2.13 ... 10.06 10.13 8.35 9.27 10.46 6.85 \n", "C3L-00090 8.68 1.98 ... 10.29 10.41 9.10 9.59 10.15 7.89 \n", "C3L-00098 11.42 3.28 ... 10.36 11.24 8.60 9.44 11.80 9.32 \n", "\n", "Name ZYG11B ZYX ZZEF1 ZZZ3 \n", "Patient_ID \n", "C3L-00006 11.52 10.23 11.50 11.47 \n", "C3L-00008 11.64 10.64 11.26 11.57 \n", "C3L-00032 11.60 10.21 11.51 11.09 \n", "C3L-00090 11.90 10.21 11.34 11.51 \n", "C3L-00098 11.97 9.77 11.37 12.35 \n", "\n", "[5 rows x 28057 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transcriptomics = en.get_transcriptomics()\n", "transcriptomics.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Clinical Data\n", "\n", "The clinical dataframe lists clinical information for the patient associated with each sample (e.g. age, race, diabetes status, tumor size). " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSample_IDSample_Tumor_NormalProteomics_Tumor_NormalCountryHistologic_Grade_FIGOMyometrial_invasion_SpecifyHistologic_typeTreatment_naiveTumor_purityPath_Stage_Primary_Tumor-pT...AgeDiabetesRaceEthnicityGenderTumor_SiteTumor_Site_OtherTumor_FocalityTumor_Size_cmNum_full_term_pregnancies
Patient_ID
C3L-00006S001TumorTumorUnited StatesFIGO grade 1under 50 %EndometrioidYESNormalpT1a (FIGO IA)...64.0NoWhiteNot-Hispanic or LatinoFemaleAnterior endometriumNaNUnifocal2.91
C3L-00008S002TumorTumorUnited StatesFIGO grade 1under 50 %EndometrioidYESNormalpT1a (FIGO IA)...58.0NoWhiteNot-Hispanic or LatinoFemalePosterior endometriumNaNUnifocal3.51
C3L-00032S003TumorTumorUnited StatesFIGO grade 2under 50 %EndometrioidYESNormalpT1a (FIGO IA)...50.0YesWhiteNot-Hispanic or LatinoFemaleOther, specifyAnterior and Posterior endometriumUnifocal4.54 or more
C3L-00090S005TumorTumorUnited StatesFIGO grade 2under 50 %EndometrioidYESNormalpT1a (FIGO IA)...75.0NoWhiteNot-Hispanic or LatinoFemaleOther, specifyAnterior and Posterior endometriumUnifocal3.54 or more
C3L-00098S006TumorTumorUnited StatesNaNunder 50 %SerousYESNormalpT1a (FIGO IA)...63.0NoWhiteNot-Hispanic or LatinoFemaleOther, specifyAnterior and Posterior endometriumUnifocal6.02
\n", "

5 rows × 27 columns

\n", "
" ], "text/plain": [ "Name Sample_ID Sample_Tumor_Normal Proteomics_Tumor_Normal \\\n", "Patient_ID \n", "C3L-00006 S001 Tumor Tumor \n", "C3L-00008 S002 Tumor Tumor \n", "C3L-00032 S003 Tumor Tumor \n", "C3L-00090 S005 Tumor Tumor \n", "C3L-00098 S006 Tumor Tumor \n", "\n", "Name Country Histologic_Grade_FIGO Myometrial_invasion_Specify \\\n", "Patient_ID \n", "C3L-00006 United States FIGO grade 1 under 50 % \n", "C3L-00008 United States FIGO grade 1 under 50 % \n", "C3L-00032 United States FIGO grade 2 under 50 % \n", "C3L-00090 United States FIGO grade 2 under 50 % \n", "C3L-00098 United States NaN under 50 % \n", "\n", "Name Histologic_type Treatment_naive Tumor_purity \\\n", "Patient_ID \n", "C3L-00006 Endometrioid YES Normal \n", "C3L-00008 Endometrioid YES Normal \n", "C3L-00032 Endometrioid YES Normal \n", "C3L-00090 Endometrioid YES Normal \n", "C3L-00098 Serous YES Normal \n", "\n", "Name Path_Stage_Primary_Tumor-pT ... Age Diabetes Race \\\n", "Patient_ID ... \n", "C3L-00006 pT1a (FIGO IA) ... 64.0 No White \n", "C3L-00008 pT1a (FIGO IA) ... 58.0 No White \n", "C3L-00032 pT1a (FIGO IA) ... 50.0 Yes White \n", "C3L-00090 pT1a (FIGO IA) ... 75.0 No White \n", "C3L-00098 pT1a (FIGO IA) ... 
63.0 No White \n", "\n", "Name Ethnicity Gender Tumor_Site \\\n", "Patient_ID \n", "C3L-00006 Not-Hispanic or Latino Female Anterior endometrium \n", "C3L-00008 Not-Hispanic or Latino Female Posterior endometrium \n", "C3L-00032 Not-Hispanic or Latino Female Other, specify \n", "C3L-00090 Not-Hispanic or Latino Female Other, specify \n", "C3L-00098 Not-Hispanic or Latino Female Other, specify \n", "\n", "Name Tumor_Site_Other Tumor_Focality Tumor_Size_cm \\\n", "Patient_ID \n", "C3L-00006 NaN Unifocal 2.9 \n", "C3L-00008 NaN Unifocal 3.5 \n", "C3L-00032 Anterior and Posterior endometrium Unifocal 4.5 \n", "C3L-00090 Anterior and Posterior endometrium Unifocal 3.5 \n", "C3L-00098 Anterior and Posterior endometrium Unifocal 6.0 \n", "\n", "Name Num_full_term_pregnancies \n", "Patient_ID \n", "C3L-00006 1 \n", "C3L-00008 1 \n", "C3L-00032 4 or more \n", "C3L-00090 4 or more \n", "C3L-00098 2 \n", "\n", "[5 rows x 27 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clinical = en.get_clinical()\n", "clinical.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to donating a tumor sample, some patients also had a normal sample taken for control and comparison. We can identify these samples by looking for samples marked \"Normal\" in the \"Sample_Tumor_Normal\" column, and whose Patient IDs are the same as the Patient IDs of tumor samples, but with a \".N\" appended to the ID. For example, patient C3L-00006 provided both a tumor sample (marked C3L-00006) and a normal sample (marked C3L-00006.N). Note that the normal samples do not have many values in the clinical columns, because much of the information does not apply to non-tumor samples. Additionally, in cases where a column would have identical values for tumor and normal samples from the same patient (e.g., patient age and gender), the information is recorded only for the tumor sample." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSample_IDSample_Tumor_NormalProteomics_Tumor_NormalCountryHistologic_Grade_FIGOMyometrial_invasion_SpecifyHistologic_typeTreatment_naiveTumor_purityPath_Stage_Primary_Tumor-pT...AgeDiabetesRaceEthnicityGenderTumor_SiteTumor_Site_OtherTumor_FocalityTumor_Size_cmNum_full_term_pregnancies
Patient_ID
C3L-00006S001TumorTumorUnited StatesFIGO grade 1under 50 %EndometrioidYESNormalpT1a (FIGO IA)...64.0NoWhiteNot-Hispanic or LatinoFemaleAnterior endometriumNaNUnifocal2.91
C3L-00361S017TumorTumorUnited StatesFIGO grade 1Not identifiedEndometrioidYESNormalpT1a (FIGO IA)...64.0YesWhiteNot-Hispanic or LatinoFemaleAnterior endometriumNaNUnifocal2.7None
C3L-01246S042TumorTumorOther_specifyNaNunder 50 %SerousYESNormalpT1a (FIGO IA)...62.0NoWhiteNot reportedFemalePosterior endometriumNaNUnifocal2.31
C3L-00006.NS105NormalAdjacent_normalNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
C3L-00361.NS106NormalAdjacent_normalNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
C3L-01246.NS114NormalAdjacent_normalNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

6 rows × 27 columns

\n", "
" ], "text/plain": [ "Name Sample_ID Sample_Tumor_Normal Proteomics_Tumor_Normal \\\n", "Patient_ID \n", "C3L-00006 S001 Tumor Tumor \n", "C3L-00361 S017 Tumor Tumor \n", "C3L-01246 S042 Tumor Tumor \n", "C3L-00006.N S105 Normal Adjacent_normal \n", "C3L-00361.N S106 Normal Adjacent_normal \n", "C3L-01246.N S114 Normal Adjacent_normal \n", "\n", "Name Country Histologic_Grade_FIGO Myometrial_invasion_Specify \\\n", "Patient_ID \n", "C3L-00006 United States FIGO grade 1 under 50 % \n", "C3L-00361 United States FIGO grade 1 Not identified \n", "C3L-01246 Other_specify NaN under 50 % \n", "C3L-00006.N NaN NaN NaN \n", "C3L-00361.N NaN NaN NaN \n", "C3L-01246.N NaN NaN NaN \n", "\n", "Name Histologic_type Treatment_naive Tumor_purity \\\n", "Patient_ID \n", "C3L-00006 Endometrioid YES Normal \n", "C3L-00361 Endometrioid YES Normal \n", "C3L-01246 Serous YES Normal \n", "C3L-00006.N NaN NaN NaN \n", "C3L-00361.N NaN NaN NaN \n", "C3L-01246.N NaN NaN NaN \n", "\n", "Name Path_Stage_Primary_Tumor-pT ... Age Diabetes Race \\\n", "Patient_ID ... \n", "C3L-00006 pT1a (FIGO IA) ... 64.0 No White \n", "C3L-00361 pT1a (FIGO IA) ... 64.0 Yes White \n", "C3L-01246 pT1a (FIGO IA) ... 62.0 No White \n", "C3L-00006.N NaN ... NaN NaN NaN \n", "C3L-00361.N NaN ... NaN NaN NaN \n", "C3L-01246.N NaN ... 
NaN NaN NaN \n", "\n", "Name Ethnicity Gender Tumor_Site \\\n", "Patient_ID \n", "C3L-00006 Not-Hispanic or Latino Female Anterior endometrium \n", "C3L-00361 Not-Hispanic or Latino Female Anterior endometrium \n", "C3L-01246 Not reported Female Posterior endometrium \n", "C3L-00006.N NaN NaN NaN \n", "C3L-00361.N NaN NaN NaN \n", "C3L-01246.N NaN NaN NaN \n", "\n", "Name Tumor_Site_Other Tumor_Focality Tumor_Size_cm \\\n", "Patient_ID \n", "C3L-00006 NaN Unifocal 2.9 \n", "C3L-00361 NaN Unifocal 2.7 \n", "C3L-01246 NaN Unifocal 2.3 \n", "C3L-00006.N NaN NaN NaN \n", "C3L-00361.N NaN NaN NaN \n", "C3L-01246.N NaN NaN NaN \n", "\n", "Name Num_full_term_pregnancies \n", "Patient_ID \n", "C3L-00006 1 \n", "C3L-00361 None \n", "C3L-01246 1 \n", "C3L-00006.N NaN \n", "C3L-00361.N NaN \n", "C3L-01246.N NaN \n", "\n", "[6 rows x 27 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clinical.loc[[\"C3L-00006\",\"C3L-00361\",\"C3L-01246\", \"C3L-00006.N\",\"C3L-00361.N\",\"C3L-01246.N\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Mutation data\n", "\n", "Each cancer dataset contains mutation data for the cohort. The data consists of all somatic mutations found for each sample (meaning there will be many lines for each sample). Each row lists the specific gene that was mutated, the type of mutation, and the location of the mutation. This data is a direct import of a MAF file." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameGeneMutationLocation
Patient_ID
C3L-00006AAK1Missense_Mutationp.A592V
C3L-00006AANATMissense_Mutationp.R176W
C3L-00006ABCA12Frame_Shift_Delp.N1671Ifs*4
C3L-00006ABCC4Missense_Mutationp.R691H
C3L-00006ABL1Missense_Mutationp.G273R
\n", "
" ], "text/plain": [ "Name Gene Mutation Location\n", "Patient_ID \n", "C3L-00006 AAK1 Missense_Mutation p.A592V\n", "C3L-00006 AANAT Missense_Mutation p.R176W\n", "C3L-00006 ABCA12 Frame_Shift_Del p.N1671Ifs*4\n", "C3L-00006 ABCC4 Missense_Mutation p.R691H\n", "C3L-00006 ABL1 Missense_Mutation p.G273R" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "somatic_mutations = en.get_somatic_mutation()\n", "somatic_mutations.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exporting dataframes\n", "\n", "If you wish to export a dataframe to a file, simply call the dataframe's `to_csv` method, passing the path you wish to save the file to, and the value separator you want:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "clinical = en.get_clinical()\n", "clinical.to_csv(path_or_buf=\"clinical_dataframe.tsv\", sep='\\t')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting help with a dataset or function\n", "\n", "To view the documentation for a dataset, pass it to the Python `help` function, e.g. `help(en)`. You can also view the documentation for just a specific function: `help(en.join_omics_to_omics)`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on method join_omics_to_omics in module cptac.dataset:\n", "\n", "join_omics_to_omics(df1_name, df2_name, genes1=None, genes2=None, how='outer', quiet=False, tissue_type='both') method of cptac.endometrial.Endometrial instance\n", " Take specified column(s) from one omics dataframe, and join to specified columns(s) from another omics dataframe. 
Intersection (inner join) of indices is used.\n", " \n", " Parameters:\n", " df1_name (str): Name of first omics dataframe to select columns from.\n", " df2_name (str): Name of second omics dataframe to select columns from.\n", " genes1 (str, or list or array-like of str, optional): Gene(s) for column(s) to select from df1_name. str if one key, list or array-like of str if multiple. Default of None will select entire dataframe.\n", " genes2 (str, or list or array-like of str, optional): Gene(s) for Column(s) to select from df2_name. str if one key, list or array-like of str if multiple. Default of None will select entire dataframe.\n", " how (str, optional): How to perform the join, acceptable values are from ['outer', 'inner', 'left', 'right']. Defaults to 'outer'.\n", " quiet (bool, optional): Whether to warn when inserting NaNs. Defaults to False.\n", " tissue_type (str): Acceptable values in [\"tumor\",\"normal\",\"both\"]. Specifies the desired tissue type desired in the dataframe. Defaults to \"both\".\n", " \n", " Returns:\n", " pandas.DataFrame: The selected columns from the two omics dataframes, joined into one dataframe.\n", "\n" ] } ], "source": [ "help(en.join_omics_to_omics)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }