{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial 1: CPTAC Data Introduction\n",
    "\n",
    "The National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC generates comprehensive proteomics and genomics data from clinical cohorts, typically with ~100 samples per tumor type. The graphic below summarizes the structure of each CPTAC dataset. For more information, visit the [NIH website](https://proteomics.cancer.gov/programs/cptac). \n",
    "\n",
    "<img src=\"img/Graphical_Abstract.png\" alt=\"CPTAC cohort\" width=\"700\"/>\n",
    "\n",
    "This Python package makes accessing CPTAC data easy with Python code and Jupyter notebooks. The package contains several tutorials which demonstrate data access and usage. This first tutorial serves as an introduction to the data to help users become familiar with what is included and how it is presented."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Overview\n",
    "\n",
    "Our package provides data access in a Python programming environment. If you have not installed Python or have not installed the package, see our installation documentation [here](https://paynelab.github.io/cptac/#installation).\n",
    "\n",
    "Once we have the package installed and we're in our Python environment, we begin by importing the package with a standard Python import statement:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cptac"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To view the available datasets, call the `cptac.list_datasets()` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Description</th>\n",
       "      <th>Data reuse status</th>\n",
       "      <th>Publication link</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Dataset name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Brca</th>\n",
       "      <td>breast cancer</td>\n",
       "      <td>no restrictions</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/33212010/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Ccrcc</th>\n",
       "      <td>clear cell renal cell carcinoma (kidney)</td>\n",
       "      <td>no restrictions</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/31675502/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Colon</th>\n",
       "      <td>colorectal cancer</td>\n",
       "      <td>no restrictions</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/31031003/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Endometrial</th>\n",
       "      <td>endometrial carcinoma (uterine)</td>\n",
       "      <td>no restrictions</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/32059776/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Gbm</th>\n",
       "      <td>glioblastoma</td>\n",
       "      <td>no restrictions</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/33577785/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Hnscc</th>\n",
       "      <td>head and neck squamous cell carcinoma</td>\n",
       "      <td>no restrictions</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/33417831/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Lscc</th>\n",
       "      <td>lung squamous cell carcinoma</td>\n",
       "      <td>no restrictions</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/34358469/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Luad</th>\n",
       "      <td>lung adenocarcinoma</td>\n",
       "      <td>no restrictions</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/32649874/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Ovarian</th>\n",
       "      <td>high grade serous ovarian cancer</td>\n",
       "      <td>no restrictions</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/27372738/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pdac</th>\n",
       "      <td>pancreatic ductal adenocarcinoma</td>\n",
       "      <td>no restrictions</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/34534465/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>UcecConf</th>\n",
       "      <td>endometrial confirmatory carcinoma</td>\n",
       "      <td>password access only</td>\n",
       "      <td>unpublished</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>GbmConf</th>\n",
       "      <td>glioblastoma confirmatory</td>\n",
       "      <td>password access only</td>\n",
       "      <td>unpublished</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           Description     Data reuse status  \\\n",
       "Dataset name                                                                   \n",
       "Brca                                     breast cancer       no restrictions   \n",
       "Ccrcc         clear cell renal cell carcinoma (kidney)       no restrictions   \n",
       "Colon                                colorectal cancer       no restrictions   \n",
       "Endometrial            endometrial carcinoma (uterine)       no restrictions   \n",
       "Gbm                                       glioblastoma       no restrictions   \n",
       "Hnscc            head and neck squamous cell carcinoma       no restrictions   \n",
       "Lscc                      lung squamous cell carcinoma       no restrictions   \n",
       "Luad                               lung adenocarcinoma       no restrictions   \n",
       "Ovarian               high grade serous ovarian cancer       no restrictions   \n",
       "Pdac                  pancreatic ductal adenocarcinoma       no restrictions   \n",
       "UcecConf            endometrial confirmatory carcinoma  password access only   \n",
       "GbmConf                      glioblastoma confirmatory  password access only   \n",
       "\n",
       "                                       Publication link  \n",
       "Dataset name                                             \n",
       "Brca          https://pubmed.ncbi.nlm.nih.gov/33212010/  \n",
       "Ccrcc         https://pubmed.ncbi.nlm.nih.gov/31675502/  \n",
       "Colon         https://pubmed.ncbi.nlm.nih.gov/31031003/  \n",
       "Endometrial   https://pubmed.ncbi.nlm.nih.gov/32059776/  \n",
       "Gbm           https://pubmed.ncbi.nlm.nih.gov/33577785/  \n",
       "Hnscc         https://pubmed.ncbi.nlm.nih.gov/33417831/  \n",
       "Lscc          https://pubmed.ncbi.nlm.nih.gov/34358469/  \n",
       "Luad          https://pubmed.ncbi.nlm.nih.gov/32649874/  \n",
       "Ovarian       https://pubmed.ncbi.nlm.nih.gov/27372738/  \n",
       "Pdac          https://pubmed.ncbi.nlm.nih.gov/34534465/  \n",
       "UcecConf                                    unpublished  \n",
       "GbmConf                                     unpublished  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cptac.list_datasets()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Availability\n",
    "The goals of CPTAC as a consortium include the broad and open dissemination of cancer proteogenomic data. The timing of a dataset's public release generally follows three stages: internal release to CPTAC investigators, public release with a publication embargo, and full public release. Each cancer type may be at a different data availability stage, depending on when its data was created. In the Python `cptac` package, these three stages are handled as follows:\n",
    "\n",
    "**Internally released data** requires a password to download.\n",
    "\n",
    "**Embargoed release data** is publicly available, but the package prints an embargo statement every time you interact with the data.\n",
    "\n",
    "**Public data** is fully released without restrictions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Downloading data\n",
    "\n",
    "The cptac package stores the data files for each dataset on a remote server. When you first install cptac, you will have no data files. To install the latest version of the data files for a particular dataset, simply call the `cptac.download` function, passing the name of your desired dataset for the `dataset` parameter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                                                \r"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cptac.download(dataset=\"endometrial\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploring the data\n",
    "\n",
    "Once you've downloaded a dataset, `cptac` allows you to load the dataset into a Python variable, and you can use that variable to access and work with the data. To load a particular dataset into a variable, type the name you want to give the variable, followed by `=`, and then type `cptac.` and the name of the dataset in [UpperCamelCase](https://en.wikipedia.org/wiki/Camel_case) followed by two parentheses, e.g. `cptac.Endometrial()` or `cptac.Ccrcc()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                                                \r"
     ]
    }
   ],
   "source": [
    "en = cptac.Endometrial()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what data is available, use the `en.list_data()` function. This displays the different types of data included in the dataset for this particular cancer type, each stored in a [pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe). It also prints the dimensions of each dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Below are the dataframes contained in this dataset and their dimensions:\n",
      "\n",
      "acetylproteomics\n",
      "\t144 rows\n",
      "\t10862 columns\n",
      "circular_RNA\n",
      "\t109 rows\n",
      "\t4945 columns\n",
      "clinical\n",
      "\t144 rows\n",
      "\t27 columns\n",
      "CNV\n",
      "\t95 rows\n",
      "\t28057 columns\n",
      "derived_molecular\n",
      "\t144 rows\n",
      "\t125 columns\n",
      "experimental_design\n",
      "\t144 rows\n",
      "\t26 columns\n",
      "followup\n",
      "\t396 rows\n",
      "\t49 columns\n",
      "miRNA\n",
      "\t99 rows\n",
      "\t2337 columns\n",
      "phosphoproteomics\n",
      "\t144 rows\n",
      "\t73212 columns\n",
      "proteomics\n",
      "\t144 rows\n",
      "\t10999 columns\n",
      "somatic_mutation\n",
      "\t52560 rows\n",
      "\t3 columns\n",
      "somatic_mutation_binary\n",
      "\t95 rows\n",
      "\t51559 columns\n",
      "transcriptomics\n",
      "\t109 rows\n",
      "\t28057 columns\n"
     ]
    }
   ],
   "source": [
    "en.list_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Molecular Omics\n",
    "\n",
    "Data can be accessed through several \"get\" functions. For example, we can look at the proteomics data by using `en.get_proteomics()`. This returns a [pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe) containing the proteomic data. Each column in the proteomics dataframe is the quantitative measurement for a particular protein. Each row is a sample of either tumor or non-tumor tissue from a cancer patient."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Samples: ['C3L-00006', 'C3L-00008', 'C3L-00032', 'C3L-00090', 'C3L-00098', 'C3L-00136', 'C3L-00137', 'C3L-00139', 'C3L-00143', 'C3L-00145', 'C3L-00156', 'C3L-00161', 'C3L-00358', 'C3L-00361', 'C3L-00362', 'C3L-00413', 'C3L-00449', 'C3L-00563', 'C3L-00586', 'C3L-00601']\n",
      "Proteins: ['A1BG', 'A2M', 'A2ML1', 'A4GALT', 'AAAS', 'AACS', 'AADAT', 'AAED1', 'AAGAB', 'AAK1', 'AAMDC', 'AAMP', 'AAR2', 'AARS', 'AARS2', 'AARSD1', 'AASDHPPT', 'AASS', 'AATF', 'ABAT']\n"
     ]
    }
   ],
   "source": [
    "proteomics = en.get_proteomics()\n",
    "samples = proteomics.index\n",
    "proteins = proteomics.columns\n",
    "print(\"Samples:\", samples[0:20].tolist())  # the first twenty samples\n",
    "print(\"Proteins:\", proteins[0:20].tolist())  # the first twenty proteins"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataframe values\n",
    "\n",
    "Values in the dataframe are protein abundance values. A value of \"NaN\" means that sample had no measurement for that protein. For the endometrial CPTAC proteomics data, a TMT-reference channel strategy was used. A detailed description of this strategy can be found at [Nature Protocols](https://www.nature.com/articles/s41596-018-0006-9) and also at [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/?term=29988108). This strategy takes the ratio of each sample's abundance to a pooled reference. The ratio is then log transformed. Therefore, positive values indicate a measurement higher than the pooled reference, and negative values indicate a measurement lower than the pooled reference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Name</th>\n",
       "      <th>A1BG</th>\n",
       "      <th>A2M</th>\n",
       "      <th>A2ML1</th>\n",
       "      <th>A4GALT</th>\n",
       "      <th>AAAS</th>\n",
       "      <th>AACS</th>\n",
       "      <th>AADAT</th>\n",
       "      <th>AAED1</th>\n",
       "      <th>AAGAB</th>\n",
       "      <th>AAK1</th>\n",
       "      <th>...</th>\n",
       "      <th>ZSWIM8</th>\n",
       "      <th>ZSWIM9</th>\n",
       "      <th>ZW10</th>\n",
       "      <th>ZWILCH</th>\n",
       "      <th>ZWINT</th>\n",
       "      <th>ZXDC</th>\n",
       "      <th>ZYG11B</th>\n",
       "      <th>ZYX</th>\n",
       "      <th>ZZEF1</th>\n",
       "      <th>ZZZ3</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>-1.180</td>\n",
       "      <td>-0.8630</td>\n",
       "      <td>-0.802</td>\n",
       "      <td>0.222</td>\n",
       "      <td>0.2560</td>\n",
       "      <td>0.6650</td>\n",
       "      <td>1.2800</td>\n",
       "      <td>-0.3390</td>\n",
       "      <td>0.412</td>\n",
       "      <td>-0.664</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.08770</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0229</td>\n",
       "      <td>0.1090</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-0.332</td>\n",
       "      <td>-0.43300</td>\n",
       "      <td>-1.020</td>\n",
       "      <td>-0.1230</td>\n",
       "      <td>-0.0859</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>-0.685</td>\n",
       "      <td>-1.0700</td>\n",
       "      <td>-0.684</td>\n",
       "      <td>0.984</td>\n",
       "      <td>0.1350</td>\n",
       "      <td>0.3340</td>\n",
       "      <td>1.3000</td>\n",
       "      <td>0.1390</td>\n",
       "      <td>1.330</td>\n",
       "      <td>-0.367</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.03560</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.3630</td>\n",
       "      <td>1.0700</td>\n",
       "      <td>0.737</td>\n",
       "      <td>-0.564</td>\n",
       "      <td>-0.00461</td>\n",
       "      <td>-1.130</td>\n",
       "      <td>-0.0757</td>\n",
       "      <td>-0.4730</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>-0.528</td>\n",
       "      <td>-1.3200</td>\n",
       "      <td>0.435</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-0.2400</td>\n",
       "      <td>1.0400</td>\n",
       "      <td>-0.0213</td>\n",
       "      <td>-0.0479</td>\n",
       "      <td>0.419</td>\n",
       "      <td>-0.500</td>\n",
       "      <td>...</td>\n",
       "      <td>0.00112</td>\n",
       "      <td>-0.1450</td>\n",
       "      <td>0.0105</td>\n",
       "      <td>-0.1160</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.151</td>\n",
       "      <td>-0.07400</td>\n",
       "      <td>-0.540</td>\n",
       "      <td>0.3200</td>\n",
       "      <td>-0.4190</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>-1.670</td>\n",
       "      <td>-1.1900</td>\n",
       "      <td>-0.443</td>\n",
       "      <td>0.243</td>\n",
       "      <td>-0.0993</td>\n",
       "      <td>0.7570</td>\n",
       "      <td>0.7400</td>\n",
       "      <td>-0.9290</td>\n",
       "      <td>0.229</td>\n",
       "      <td>-0.223</td>\n",
       "      <td>...</td>\n",
       "      <td>0.07250</td>\n",
       "      <td>-0.0552</td>\n",
       "      <td>-0.0714</td>\n",
       "      <td>0.0933</td>\n",
       "      <td>0.156</td>\n",
       "      <td>-0.398</td>\n",
       "      <td>-0.07520</td>\n",
       "      <td>-0.797</td>\n",
       "      <td>-0.0301</td>\n",
       "      <td>-0.4670</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00098</th>\n",
       "      <td>-0.374</td>\n",
       "      <td>-0.0206</td>\n",
       "      <td>-0.537</td>\n",
       "      <td>0.311</td>\n",
       "      <td>0.3750</td>\n",
       "      <td>0.0131</td>\n",
       "      <td>-1.1000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.565</td>\n",
       "      <td>-0.101</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.17600</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1.2200</td>\n",
       "      <td>-0.5620</td>\n",
       "      <td>0.937</td>\n",
       "      <td>-0.646</td>\n",
       "      <td>0.20700</td>\n",
       "      <td>-1.850</td>\n",
       "      <td>-0.1760</td>\n",
       "      <td>0.0513</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 10999 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Name         A1BG     A2M  A2ML1  A4GALT    AAAS    AACS   AADAT   AAED1  \\\n",
       "Patient_ID                                                                 \n",
       "C3L-00006  -1.180 -0.8630 -0.802   0.222  0.2560  0.6650  1.2800 -0.3390   \n",
       "C3L-00008  -0.685 -1.0700 -0.684   0.984  0.1350  0.3340  1.3000  0.1390   \n",
       "C3L-00032  -0.528 -1.3200  0.435     NaN -0.2400  1.0400 -0.0213 -0.0479   \n",
       "C3L-00090  -1.670 -1.1900 -0.443   0.243 -0.0993  0.7570  0.7400 -0.9290   \n",
       "C3L-00098  -0.374 -0.0206 -0.537   0.311  0.3750  0.0131 -1.1000     NaN   \n",
       "\n",
       "Name        AAGAB   AAK1  ...   ZSWIM8  ZSWIM9    ZW10  ZWILCH  ZWINT   ZXDC  \\\n",
       "Patient_ID                ...                                                  \n",
       "C3L-00006   0.412 -0.664  ... -0.08770     NaN  0.0229  0.1090    NaN -0.332   \n",
       "C3L-00008   1.330 -0.367  ... -0.03560     NaN  0.3630  1.0700  0.737 -0.564   \n",
       "C3L-00032   0.419 -0.500  ...  0.00112 -0.1450  0.0105 -0.1160    NaN  0.151   \n",
       "C3L-00090   0.229 -0.223  ...  0.07250 -0.0552 -0.0714  0.0933  0.156 -0.398   \n",
       "C3L-00098   0.565 -0.101  ... -0.17600     NaN -1.2200 -0.5620  0.937 -0.646   \n",
       "\n",
       "Name         ZYG11B    ZYX   ZZEF1    ZZZ3  \n",
       "Patient_ID                                  \n",
       "C3L-00006  -0.43300 -1.020 -0.1230 -0.0859  \n",
       "C3L-00008  -0.00461 -1.130 -0.0757 -0.4730  \n",
       "C3L-00032  -0.07400 -0.540  0.3200 -0.4190  \n",
       "C3L-00090  -0.07520 -0.797 -0.0301 -0.4670  \n",
       "C3L-00098   0.20700 -1.850 -0.1760  0.0513  \n",
       "\n",
       "[5 rows x 10999 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "proteomics.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As seen in `en.list_data()`, other omics data are also available (e.g. transcriptomics, copy number variation, phosphoproteomics).\n",
    "\n",
    "The transcriptomics data looks almost identical to the proteomics data: it is available in a pandas dataframe with the same convention. Sample identifiers are consistent across dataframes, meaning a sample found in the endometrial proteomics data has the same `Patient_ID` in every other endometrial dataframe that includes it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Name</th>\n",
       "      <th>A1BG</th>\n",
       "      <th>A1BG-AS1</th>\n",
       "      <th>A1CF</th>\n",
       "      <th>A2M</th>\n",
       "      <th>A2M-AS1</th>\n",
       "      <th>A2ML1</th>\n",
       "      <th>A2MP1</th>\n",
       "      <th>A3GALT2</th>\n",
       "      <th>A4GALT</th>\n",
       "      <th>A4GNT</th>\n",
       "      <th>...</th>\n",
       "      <th>ZWILCH</th>\n",
       "      <th>ZWINT</th>\n",
       "      <th>ZXDA</th>\n",
       "      <th>ZXDB</th>\n",
       "      <th>ZXDC</th>\n",
       "      <th>ZYG11A</th>\n",
       "      <th>ZYG11B</th>\n",
       "      <th>ZYX</th>\n",
       "      <th>ZZEF1</th>\n",
       "      <th>ZZZ3</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>4.02</td>\n",
       "      <td>2.16</td>\n",
       "      <td>3.27</td>\n",
       "      <td>13.39</td>\n",
       "      <td>5.88</td>\n",
       "      <td>6.79</td>\n",
       "      <td>1.55</td>\n",
       "      <td>0.97</td>\n",
       "      <td>10.34</td>\n",
       "      <td>1.96</td>\n",
       "      <td>...</td>\n",
       "      <td>11.06</td>\n",
       "      <td>10.73</td>\n",
       "      <td>8.40</td>\n",
       "      <td>9.78</td>\n",
       "      <td>10.88</td>\n",
       "      <td>5.93</td>\n",
       "      <td>11.52</td>\n",
       "      <td>10.23</td>\n",
       "      <td>11.50</td>\n",
       "      <td>11.47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>4.81</td>\n",
       "      <td>2.21</td>\n",
       "      <td>4.86</td>\n",
       "      <td>13.24</td>\n",
       "      <td>5.93</td>\n",
       "      <td>6.33</td>\n",
       "      <td>0.93</td>\n",
       "      <td>0.00</td>\n",
       "      <td>10.83</td>\n",
       "      <td>0.00</td>\n",
       "      <td>...</td>\n",
       "      <td>10.87</td>\n",
       "      <td>11.43</td>\n",
       "      <td>8.39</td>\n",
       "      <td>9.14</td>\n",
       "      <td>10.38</td>\n",
       "      <td>7.25</td>\n",
       "      <td>11.64</td>\n",
       "      <td>10.64</td>\n",
       "      <td>11.26</td>\n",
       "      <td>11.57</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>6.24</td>\n",
       "      <td>6.43</td>\n",
       "      <td>3.68</td>\n",
       "      <td>14.32</td>\n",
       "      <td>6.53</td>\n",
       "      <td>9.42</td>\n",
       "      <td>2.79</td>\n",
       "      <td>0.00</td>\n",
       "      <td>10.98</td>\n",
       "      <td>2.13</td>\n",
       "      <td>...</td>\n",
       "      <td>10.06</td>\n",
       "      <td>10.13</td>\n",
       "      <td>8.35</td>\n",
       "      <td>9.27</td>\n",
       "      <td>10.46</td>\n",
       "      <td>6.85</td>\n",
       "      <td>11.60</td>\n",
       "      <td>10.21</td>\n",
       "      <td>11.51</td>\n",
       "      <td>11.09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>5.31</td>\n",
       "      <td>4.87</td>\n",
       "      <td>5.59</td>\n",
       "      <td>13.77</td>\n",
       "      <td>6.35</td>\n",
       "      <td>4.22</td>\n",
       "      <td>2.97</td>\n",
       "      <td>0.00</td>\n",
       "      <td>8.68</td>\n",
       "      <td>1.98</td>\n",
       "      <td>...</td>\n",
       "      <td>10.29</td>\n",
       "      <td>10.41</td>\n",
       "      <td>9.10</td>\n",
       "      <td>9.59</td>\n",
       "      <td>10.15</td>\n",
       "      <td>7.89</td>\n",
       "      <td>11.90</td>\n",
       "      <td>10.21</td>\n",
       "      <td>11.34</td>\n",
       "      <td>11.51</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00098</th>\n",
       "      <td>9.84</td>\n",
       "      <td>8.83</td>\n",
       "      <td>7.00</td>\n",
       "      <td>13.12</td>\n",
       "      <td>6.49</td>\n",
       "      <td>6.83</td>\n",
       "      <td>1.80</td>\n",
       "      <td>0.00</td>\n",
       "      <td>11.42</td>\n",
       "      <td>3.28</td>\n",
       "      <td>...</td>\n",
       "      <td>10.36</td>\n",
       "      <td>11.24</td>\n",
       "      <td>8.60</td>\n",
       "      <td>9.44</td>\n",
       "      <td>11.80</td>\n",
       "      <td>9.32</td>\n",
       "      <td>11.97</td>\n",
       "      <td>9.77</td>\n",
       "      <td>11.37</td>\n",
       "      <td>12.35</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 28057 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Name        A1BG  A1BG-AS1  A1CF    A2M  A2M-AS1  A2ML1  A2MP1  A3GALT2  \\\n",
       "Patient_ID                                                                \n",
       "C3L-00006   4.02      2.16  3.27  13.39     5.88   6.79   1.55     0.97   \n",
       "C3L-00008   4.81      2.21  4.86  13.24     5.93   6.33   0.93     0.00   \n",
       "C3L-00032   6.24      6.43  3.68  14.32     6.53   9.42   2.79     0.00   \n",
       "C3L-00090   5.31      4.87  5.59  13.77     6.35   4.22   2.97     0.00   \n",
       "C3L-00098   9.84      8.83  7.00  13.12     6.49   6.83   1.80     0.00   \n",
       "\n",
       "Name        A4GALT  A4GNT  ...  ZWILCH  ZWINT  ZXDA  ZXDB   ZXDC  ZYG11A  \\\n",
       "Patient_ID                 ...                                             \n",
       "C3L-00006    10.34   1.96  ...   11.06  10.73  8.40  9.78  10.88    5.93   \n",
       "C3L-00008    10.83   0.00  ...   10.87  11.43  8.39  9.14  10.38    7.25   \n",
       "C3L-00032    10.98   2.13  ...   10.06  10.13  8.35  9.27  10.46    6.85   \n",
       "C3L-00090     8.68   1.98  ...   10.29  10.41  9.10  9.59  10.15    7.89   \n",
       "C3L-00098    11.42   3.28  ...   10.36  11.24  8.60  9.44  11.80    9.32   \n",
       "\n",
       "Name        ZYG11B    ZYX  ZZEF1   ZZZ3  \n",
       "Patient_ID                               \n",
       "C3L-00006    11.52  10.23  11.50  11.47  \n",
       "C3L-00008    11.64  10.64  11.26  11.57  \n",
       "C3L-00032    11.60  10.21  11.51  11.09  \n",
       "C3L-00090    11.90  10.21  11.34  11.51  \n",
       "C3L-00098    11.97   9.77  11.37  12.35  \n",
       "\n",
       "[5 rows x 28057 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "transcriptomics = en.get_transcriptomics()\n",
    "transcriptomics.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Clinical Data\n",
    "\n",
    "The clinical dataframe lists clinical information for the patient associated with each sample (e.g. age, race, diabetes status, tumor size). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Name</th>\n",
       "      <th>Sample_ID</th>\n",
       "      <th>Sample_Tumor_Normal</th>\n",
       "      <th>Proteomics_Tumor_Normal</th>\n",
       "      <th>Country</th>\n",
       "      <th>Histologic_Grade_FIGO</th>\n",
       "      <th>Myometrial_invasion_Specify</th>\n",
       "      <th>Histologic_type</th>\n",
       "      <th>Treatment_naive</th>\n",
       "      <th>Tumor_purity</th>\n",
       "      <th>Path_Stage_Primary_Tumor-pT</th>\n",
       "      <th>...</th>\n",
       "      <th>Age</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Race</th>\n",
       "      <th>Ethnicity</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Tumor_Site</th>\n",
       "      <th>Tumor_Site_Other</th>\n",
       "      <th>Tumor_Focality</th>\n",
       "      <th>Tumor_Size_cm</th>\n",
       "      <th>Num_full_term_pregnancies</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>S001</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>United States</td>\n",
       "      <td>FIGO grade 1</td>\n",
       "      <td>under 50 %</td>\n",
       "      <td>Endometrioid</td>\n",
       "      <td>YES</td>\n",
       "      <td>Normal</td>\n",
       "      <td>pT1a (FIGO IA)</td>\n",
       "      <td>...</td>\n",
       "      <td>64.0</td>\n",
       "      <td>No</td>\n",
       "      <td>White</td>\n",
       "      <td>Not-Hispanic or Latino</td>\n",
       "      <td>Female</td>\n",
       "      <td>Anterior endometrium</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Unifocal</td>\n",
       "      <td>2.9</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>S002</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>United States</td>\n",
       "      <td>FIGO grade 1</td>\n",
       "      <td>under 50 %</td>\n",
       "      <td>Endometrioid</td>\n",
       "      <td>YES</td>\n",
       "      <td>Normal</td>\n",
       "      <td>pT1a (FIGO IA)</td>\n",
       "      <td>...</td>\n",
       "      <td>58.0</td>\n",
       "      <td>No</td>\n",
       "      <td>White</td>\n",
       "      <td>Not-Hispanic or Latino</td>\n",
       "      <td>Female</td>\n",
       "      <td>Posterior endometrium</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Unifocal</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>S003</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>United States</td>\n",
       "      <td>FIGO grade 2</td>\n",
       "      <td>under 50 %</td>\n",
       "      <td>Endometrioid</td>\n",
       "      <td>YES</td>\n",
       "      <td>Normal</td>\n",
       "      <td>pT1a (FIGO IA)</td>\n",
       "      <td>...</td>\n",
       "      <td>50.0</td>\n",
       "      <td>Yes</td>\n",
       "      <td>White</td>\n",
       "      <td>Not-Hispanic or Latino</td>\n",
       "      <td>Female</td>\n",
       "      <td>Other, specify</td>\n",
       "      <td>Anterior and Posterior endometrium</td>\n",
       "      <td>Unifocal</td>\n",
       "      <td>4.5</td>\n",
       "      <td>4 or more</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>S005</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>United States</td>\n",
       "      <td>FIGO grade 2</td>\n",
       "      <td>under 50 %</td>\n",
       "      <td>Endometrioid</td>\n",
       "      <td>YES</td>\n",
       "      <td>Normal</td>\n",
       "      <td>pT1a (FIGO IA)</td>\n",
       "      <td>...</td>\n",
       "      <td>75.0</td>\n",
       "      <td>No</td>\n",
       "      <td>White</td>\n",
       "      <td>Not-Hispanic or Latino</td>\n",
       "      <td>Female</td>\n",
       "      <td>Other, specify</td>\n",
       "      <td>Anterior and Posterior endometrium</td>\n",
       "      <td>Unifocal</td>\n",
       "      <td>3.5</td>\n",
       "      <td>4 or more</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00098</th>\n",
       "      <td>S006</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>United States</td>\n",
       "      <td>NaN</td>\n",
       "      <td>under 50 %</td>\n",
       "      <td>Serous</td>\n",
       "      <td>YES</td>\n",
       "      <td>Normal</td>\n",
       "      <td>pT1a (FIGO IA)</td>\n",
       "      <td>...</td>\n",
       "      <td>63.0</td>\n",
       "      <td>No</td>\n",
       "      <td>White</td>\n",
       "      <td>Not-Hispanic or Latino</td>\n",
       "      <td>Female</td>\n",
       "      <td>Other, specify</td>\n",
       "      <td>Anterior  and Posterior endometrium</td>\n",
       "      <td>Unifocal</td>\n",
       "      <td>6.0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 27 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Name       Sample_ID Sample_Tumor_Normal Proteomics_Tumor_Normal  \\\n",
       "Patient_ID                                                         \n",
       "C3L-00006       S001               Tumor                   Tumor   \n",
       "C3L-00008       S002               Tumor                   Tumor   \n",
       "C3L-00032       S003               Tumor                   Tumor   \n",
       "C3L-00090       S005               Tumor                   Tumor   \n",
       "C3L-00098       S006               Tumor                   Tumor   \n",
       "\n",
       "Name              Country Histologic_Grade_FIGO Myometrial_invasion_Specify  \\\n",
       "Patient_ID                                                                    \n",
       "C3L-00006   United States          FIGO grade 1                  under 50 %   \n",
       "C3L-00008   United States          FIGO grade 1                  under 50 %   \n",
       "C3L-00032   United States          FIGO grade 2                  under 50 %   \n",
       "C3L-00090   United States          FIGO grade 2                  under 50 %   \n",
       "C3L-00098   United States                   NaN                  under 50 %   \n",
       "\n",
       "Name       Histologic_type Treatment_naive Tumor_purity  \\\n",
       "Patient_ID                                                \n",
       "C3L-00006     Endometrioid             YES       Normal   \n",
       "C3L-00008     Endometrioid             YES       Normal   \n",
       "C3L-00032     Endometrioid             YES       Normal   \n",
       "C3L-00090     Endometrioid             YES       Normal   \n",
       "C3L-00098           Serous             YES       Normal   \n",
       "\n",
       "Name       Path_Stage_Primary_Tumor-pT  ...   Age Diabetes   Race  \\\n",
       "Patient_ID                              ...                         \n",
       "C3L-00006               pT1a (FIGO IA)  ...  64.0       No  White   \n",
       "C3L-00008               pT1a (FIGO IA)  ...  58.0       No  White   \n",
       "C3L-00032               pT1a (FIGO IA)  ...  50.0      Yes  White   \n",
       "C3L-00090               pT1a (FIGO IA)  ...  75.0       No  White   \n",
       "C3L-00098               pT1a (FIGO IA)  ...  63.0       No  White   \n",
       "\n",
       "Name                     Ethnicity  Gender             Tumor_Site  \\\n",
       "Patient_ID                                                          \n",
       "C3L-00006   Not-Hispanic or Latino  Female   Anterior endometrium   \n",
       "C3L-00008   Not-Hispanic or Latino  Female  Posterior endometrium   \n",
       "C3L-00032   Not-Hispanic or Latino  Female         Other, specify   \n",
       "C3L-00090   Not-Hispanic or Latino  Female         Other, specify   \n",
       "C3L-00098   Not-Hispanic or Latino  Female         Other, specify   \n",
       "\n",
       "Name                           Tumor_Site_Other  Tumor_Focality Tumor_Size_cm  \\\n",
       "Patient_ID                                                                      \n",
       "C3L-00006                                   NaN        Unifocal           2.9   \n",
       "C3L-00008                                   NaN        Unifocal           3.5   \n",
       "C3L-00032    Anterior and Posterior endometrium        Unifocal           4.5   \n",
       "C3L-00090    Anterior and Posterior endometrium        Unifocal           3.5   \n",
       "C3L-00098   Anterior  and Posterior endometrium        Unifocal           6.0   \n",
       "\n",
       "Name       Num_full_term_pregnancies  \n",
       "Patient_ID                            \n",
       "C3L-00006                          1  \n",
       "C3L-00008                          1  \n",
       "C3L-00032                  4 or more  \n",
       "C3L-00090                  4 or more  \n",
       "C3L-00098                          2  \n",
       "\n",
       "[5 rows x 27 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clinical = en.get_clinical()\n",
    "clinical.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition to donating a tumor sample, some patients also provided a normal sample for control and comparison. These samples are marked \"Normal\" in the \"Sample_Tumor_Normal\" column, and their Patient IDs match those of the corresponding tumor samples with \".N\" appended. For example, patient C3L-00006 provided both a tumor sample (C3L-00006) and a normal sample (C3L-00006.N). Note that normal samples have few values in the clinical columns, because much of the information does not apply to non-tumor samples. In addition, where a column would hold identical values for the tumor and normal samples from the same patient (e.g., patient age and gender), the information is recorded only for the tumor sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Name</th>\n",
       "      <th>Sample_ID</th>\n",
       "      <th>Sample_Tumor_Normal</th>\n",
       "      <th>Proteomics_Tumor_Normal</th>\n",
       "      <th>Country</th>\n",
       "      <th>Histologic_Grade_FIGO</th>\n",
       "      <th>Myometrial_invasion_Specify</th>\n",
       "      <th>Histologic_type</th>\n",
       "      <th>Treatment_naive</th>\n",
       "      <th>Tumor_purity</th>\n",
       "      <th>Path_Stage_Primary_Tumor-pT</th>\n",
       "      <th>...</th>\n",
       "      <th>Age</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Race</th>\n",
       "      <th>Ethnicity</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Tumor_Site</th>\n",
       "      <th>Tumor_Site_Other</th>\n",
       "      <th>Tumor_Focality</th>\n",
       "      <th>Tumor_Size_cm</th>\n",
       "      <th>Num_full_term_pregnancies</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>S001</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>United States</td>\n",
       "      <td>FIGO grade 1</td>\n",
       "      <td>under 50 %</td>\n",
       "      <td>Endometrioid</td>\n",
       "      <td>YES</td>\n",
       "      <td>Normal</td>\n",
       "      <td>pT1a (FIGO IA)</td>\n",
       "      <td>...</td>\n",
       "      <td>64.0</td>\n",
       "      <td>No</td>\n",
       "      <td>White</td>\n",
       "      <td>Not-Hispanic or Latino</td>\n",
       "      <td>Female</td>\n",
       "      <td>Anterior endometrium</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Unifocal</td>\n",
       "      <td>2.9</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00361</th>\n",
       "      <td>S017</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>United States</td>\n",
       "      <td>FIGO grade 1</td>\n",
       "      <td>Not identified</td>\n",
       "      <td>Endometrioid</td>\n",
       "      <td>YES</td>\n",
       "      <td>Normal</td>\n",
       "      <td>pT1a (FIGO IA)</td>\n",
       "      <td>...</td>\n",
       "      <td>64.0</td>\n",
       "      <td>Yes</td>\n",
       "      <td>White</td>\n",
       "      <td>Not-Hispanic or Latino</td>\n",
       "      <td>Female</td>\n",
       "      <td>Anterior endometrium</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Unifocal</td>\n",
       "      <td>2.7</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-01246</th>\n",
       "      <td>S042</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>Other_specify</td>\n",
       "      <td>NaN</td>\n",
       "      <td>under 50 %</td>\n",
       "      <td>Serous</td>\n",
       "      <td>YES</td>\n",
       "      <td>Normal</td>\n",
       "      <td>pT1a (FIGO IA)</td>\n",
       "      <td>...</td>\n",
       "      <td>62.0</td>\n",
       "      <td>No</td>\n",
       "      <td>White</td>\n",
       "      <td>Not reported</td>\n",
       "      <td>Female</td>\n",
       "      <td>Posterior endometrium</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Unifocal</td>\n",
       "      <td>2.3</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00006.N</th>\n",
       "      <td>S105</td>\n",
       "      <td>Normal</td>\n",
       "      <td>Adjacent_normal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00361.N</th>\n",
       "      <td>S106</td>\n",
       "      <td>Normal</td>\n",
       "      <td>Adjacent_normal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-01246.N</th>\n",
       "      <td>S114</td>\n",
       "      <td>Normal</td>\n",
       "      <td>Adjacent_normal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6 rows × 27 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Name        Sample_ID Sample_Tumor_Normal Proteomics_Tumor_Normal  \\\n",
       "Patient_ID                                                          \n",
       "C3L-00006        S001               Tumor                   Tumor   \n",
       "C3L-00361        S017               Tumor                   Tumor   \n",
       "C3L-01246        S042               Tumor                   Tumor   \n",
       "C3L-00006.N      S105              Normal         Adjacent_normal   \n",
       "C3L-00361.N      S106              Normal         Adjacent_normal   \n",
       "C3L-01246.N      S114              Normal         Adjacent_normal   \n",
       "\n",
       "Name               Country Histologic_Grade_FIGO Myometrial_invasion_Specify  \\\n",
       "Patient_ID                                                                     \n",
       "C3L-00006    United States          FIGO grade 1                  under 50 %   \n",
       "C3L-00361    United States          FIGO grade 1              Not identified   \n",
       "C3L-01246    Other_specify                   NaN                  under 50 %   \n",
       "C3L-00006.N            NaN                   NaN                         NaN   \n",
       "C3L-00361.N            NaN                   NaN                         NaN   \n",
       "C3L-01246.N            NaN                   NaN                         NaN   \n",
       "\n",
       "Name        Histologic_type Treatment_naive Tumor_purity  \\\n",
       "Patient_ID                                                 \n",
       "C3L-00006      Endometrioid             YES       Normal   \n",
       "C3L-00361      Endometrioid             YES       Normal   \n",
       "C3L-01246            Serous             YES       Normal   \n",
       "C3L-00006.N             NaN             NaN          NaN   \n",
       "C3L-00361.N             NaN             NaN          NaN   \n",
       "C3L-01246.N             NaN             NaN          NaN   \n",
       "\n",
       "Name        Path_Stage_Primary_Tumor-pT  ...   Age Diabetes   Race  \\\n",
       "Patient_ID                               ...                         \n",
       "C3L-00006                pT1a (FIGO IA)  ...  64.0       No  White   \n",
       "C3L-00361                pT1a (FIGO IA)  ...  64.0      Yes  White   \n",
       "C3L-01246                pT1a (FIGO IA)  ...  62.0       No  White   \n",
       "C3L-00006.N                         NaN  ...   NaN      NaN    NaN   \n",
       "C3L-00361.N                         NaN  ...   NaN      NaN    NaN   \n",
       "C3L-01246.N                         NaN  ...   NaN      NaN    NaN   \n",
       "\n",
       "Name                      Ethnicity  Gender             Tumor_Site  \\\n",
       "Patient_ID                                                           \n",
       "C3L-00006    Not-Hispanic or Latino  Female   Anterior endometrium   \n",
       "C3L-00361    Not-Hispanic or Latino  Female   Anterior endometrium   \n",
       "C3L-01246              Not reported  Female  Posterior endometrium   \n",
       "C3L-00006.N                     NaN     NaN                    NaN   \n",
       "C3L-00361.N                     NaN     NaN                    NaN   \n",
       "C3L-01246.N                     NaN     NaN                    NaN   \n",
       "\n",
       "Name         Tumor_Site_Other  Tumor_Focality Tumor_Size_cm  \\\n",
       "Patient_ID                                                    \n",
       "C3L-00006                 NaN        Unifocal           2.9   \n",
       "C3L-00361                 NaN        Unifocal           2.7   \n",
       "C3L-01246                 NaN        Unifocal           2.3   \n",
       "C3L-00006.N               NaN             NaN           NaN   \n",
       "C3L-00361.N               NaN             NaN           NaN   \n",
       "C3L-01246.N               NaN             NaN           NaN   \n",
       "\n",
       "Name        Num_full_term_pregnancies  \n",
       "Patient_ID                             \n",
       "C3L-00006                           1  \n",
       "C3L-00361                        None  \n",
       "C3L-01246                           1  \n",
       "C3L-00006.N                       NaN  \n",
       "C3L-00361.N                       NaN  \n",
       "C3L-01246.N                       NaN  \n",
       "\n",
       "[6 rows x 27 columns]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clinical.loc[[\"C3L-00006\",\"C3L-00361\",\"C3L-01246\", \"C3L-00006.N\",\"C3L-00361.N\",\"C3L-01246.N\"]]"
   ]
  },
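  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rather than listing IDs by hand, the tumor and normal rows can also be separated by filtering on the \"Sample_Tumor_Normal\" column. The sketch below is only an illustration: it uses a small hand-built stand-in dataframe (`toy_clinical`, a hypothetical name with illustrative values, not loaded from `cptac`) that mimics the index and two of the columns shown above, so it runs without downloading any data.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy stand-in for the clinical dataframe (assumed layout, illustrative values only)\n",
    "toy_clinical = pd.DataFrame(\n",
    "    {\n",
    "        'Sample_ID': ['S001', 'S017', 'S105', 'S106'],\n",
    "        'Sample_Tumor_Normal': ['Tumor', 'Tumor', 'Normal', 'Normal'],\n",
    "    },\n",
    "    index=pd.Index(\n",
    "        ['C3L-00006', 'C3L-00361', 'C3L-00006.N', 'C3L-00361.N'],\n",
    "        name='Patient_ID',\n",
    "    ),\n",
    ")\n",
    "\n",
    "# Boolean mask that is True for rows marked as normal samples\n",
    "is_normal = toy_clinical['Sample_Tumor_Normal'] == 'Normal'\n",
    "normal_samples = toy_clinical[is_normal]\n",
    "tumor_samples = toy_clinical[~is_normal]\n",
    "\n",
    "# Recover the matching tumor Patient ID by dropping the trailing two characters ('.N')\n",
    "matched_tumor_ids = normal_samples.index.str[:-2]\n",
    "\n",
    "print(normal_samples)\n",
    "print(list(matched_tumor_ids))\n",
    "```\n",
    "\n",
    "With the real data, the same boolean mask should apply to the dataframe returned by `en.get_clinical()`, assuming the column values match those shown in the table above."
   ]
  },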
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mutation data\n",
    "\n",
    "Each cancer dataset contains mutation data for the cohort. The dataframe lists every somatic mutation found in each sample, so a single sample usually spans many rows. Each row gives the gene that was mutated, the type of mutation, and the location of the mutation. The data is imported directly from a Mutation Annotation Format (MAF) file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Name</th>\n",
       "      <th>Gene</th>\n",
       "      <th>Mutation</th>\n",
       "      <th>Location</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>AAK1</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.A592V</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>AANAT</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.R176W</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>ABCA12</td>\n",
       "      <td>Frame_Shift_Del</td>\n",
       "      <td>p.N1671Ifs*4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>ABCC4</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.R691H</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>ABL1</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.G273R</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Name          Gene           Mutation      Location\n",
       "Patient_ID                                         \n",
       "C3L-00006     AAK1  Missense_Mutation       p.A592V\n",
       "C3L-00006    AANAT  Missense_Mutation       p.R176W\n",
       "C3L-00006   ABCA12    Frame_Shift_Del  p.N1671Ifs*4\n",
       "C3L-00006    ABCC4  Missense_Mutation       p.R691H\n",
       "C3L-00006     ABL1  Missense_Mutation       p.G273R"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "somatic_mutations = en.get_somatic_mutation()\n",
    "somatic_mutations.head()"
   ]
  },
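  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because each sample spans many rows, a common first step is to summarize the table. The sketch below is an illustration under stated assumptions: `toy_mutations` is a hypothetical hand-built stand-in with the same index name and the \"Gene\" and \"Mutation\" columns shown above. The gene names are borrowed from the output above, but their assignment to the second patient is made up for the example.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy stand-in for the somatic mutation dataframe (illustrative values only)\n",
    "toy_mutations = pd.DataFrame(\n",
    "    {\n",
    "        'Gene': ['AAK1', 'AANAT', 'ABCA12', 'ABCC4', 'ABL1'],\n",
    "        'Mutation': [\n",
    "            'Missense_Mutation',\n",
    "            'Missense_Mutation',\n",
    "            'Frame_Shift_Del',\n",
    "            'Missense_Mutation',\n",
    "            'Missense_Mutation',\n",
    "        ],\n",
    "    },\n",
    "    index=pd.Index(\n",
    "        ['C3L-00006', 'C3L-00006', 'C3L-00006', 'C3L-00008', 'C3L-00008'],\n",
    "        name='Patient_ID',\n",
    "    ),\n",
    ")\n",
    "\n",
    "# Number of mutation rows recorded for each patient\n",
    "mutations_per_patient = toy_mutations.groupby(level='Patient_ID').size()\n",
    "\n",
    "# Number of rows for each mutation type across all patients\n",
    "mutation_type_counts = toy_mutations['Mutation'].value_counts()\n",
    "\n",
    "print(mutations_per_patient)\n",
    "print(mutation_type_counts)\n",
    "```\n",
    "\n",
    "The same two calls should work on the dataframe returned by `en.get_somatic_mutation()`, assuming it keeps the \"Patient_ID\" index name shown above."
   ]
  },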
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exporting dataframes\n",
    "\n",
    "To export a dataframe to a file, call the dataframe's `to_csv` method, passing the path where the file should be saved and the value separator to use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "clinical = en.get_clinical()\n",
    "# Write the dataframe to a tab-separated file in the current working directory\n",
    "clinical.to_csv(path_or_buf=\"clinical_dataframe.tsv\", sep='\\t')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting help with a dataset or function\n",
    "\n",
    "To view the documentation for a dataset, pass the dataset object to Python's built-in `help` function, e.g., `help(en)`. To view the documentation for a single method, pass that method instead, e.g., `help(en.join_omics_to_omics)`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on method join_omics_to_omics in module cptac.dataset:\n",
      "\n",
      "join_omics_to_omics(df1_name, df2_name, genes1=None, genes2=None, how='outer', quiet=False, tissue_type='both') method of cptac.endometrial.Endometrial instance\n",
      "    Take specified column(s) from one omics dataframe, and join to specified columns(s) from another omics dataframe. Intersection (inner join) of indices is used.\n",
      "    \n",
      "    Parameters:\n",
      "    df1_name (str): Name of first omics dataframe to select columns from.\n",
      "    df2_name (str): Name of second omics dataframe to select columns from.\n",
      "    genes1 (str, or list or array-like of str, optional): Gene(s) for column(s) to select from df1_name. str if one key, list or array-like of str if multiple. Default of None will select entire dataframe.\n",
      "    genes2 (str, or list or array-like of str, optional): Gene(s) for Column(s) to select from df2_name. str if one key, list or array-like of str if multiple. Default of None will select entire dataframe.\n",
      "    how (str, optional): How to perform the join, acceptable values are from ['outer', 'inner', 'left', 'right']. Defaults to 'outer'.\n",
      "    quiet (bool, optional): Whether to warn when inserting NaNs. Defaults to False.\n",
      "    tissue_type (str): Acceptable values in [\"tumor\",\"normal\",\"both\"]. Specifies the desired tissue type desired in the dataframe. Defaults to \"both\".\n",
      "    \n",
      "    Returns:\n",
      "    pandas.DataFrame: The selected columns from the two omics dataframes, joined into one dataframe.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "help(en.join_omics_to_omics)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}