{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial 3: Joining dataframes with `cptac`\n",
    "\n",
    "In this tutorial, we provide several examples of how to use the built-in `cptac` functions for joining different dataframes.\n",
    "\n",
    "We will work with endometrial carcinoma data. First, we import the package and create an endometrial data object, which we call `en`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Data type</th>\n",
       "      <th>Available sources</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CNV</td>\n",
       "      <td>[bcm, washu]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>circular_RNA</td>\n",
       "      <td>[bcm]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>miRNA</td>\n",
       "      <td>[bcm, washu]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>proteomics</td>\n",
       "      <td>[bcm, umich]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>transcriptomics</td>\n",
       "      <td>[bcm, broad, washu]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>ancestry_prediction</td>\n",
       "      <td>[harmonized]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>somatic_mutation</td>\n",
       "      <td>[harmonized, washu]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>clinical</td>\n",
       "      <td>[mssm]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>follow-up</td>\n",
       "      <td>[mssm]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>medical_history</td>\n",
       "      <td>[mssm]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>acetylproteomics</td>\n",
       "      <td>[umich]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>phosphoproteomics</td>\n",
       "      <td>[umich]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>cibersort</td>\n",
       "      <td>[washu]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>hla_typing</td>\n",
       "      <td>[washu]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>tumor_purity</td>\n",
       "      <td>[washu]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>xcell</td>\n",
       "      <td>[washu]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Data type    Available sources\n",
       "0                   CNV         [bcm, washu]\n",
       "1          circular_RNA                [bcm]\n",
       "2                 miRNA         [bcm, washu]\n",
       "3            proteomics         [bcm, umich]\n",
       "4       transcriptomics  [bcm, broad, washu]\n",
       "5   ancestry_prediction         [harmonized]\n",
       "6      somatic_mutation  [harmonized, washu]\n",
       "7              clinical               [mssm]\n",
       "8             follow-up               [mssm]\n",
       "9       medical_history               [mssm]\n",
       "10     acetylproteomics              [umich]\n",
       "11    phosphoproteomics              [umich]\n",
       "12            cibersort              [washu]\n",
       "13           hla_typing              [washu]\n",
       "14         tumor_purity              [washu]\n",
       "15                xcell              [washu]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Start by importing the cptac package\n",
    "import cptac\n",
    "\n",
    "# Create an endometrial data object, named 'en'\n",
    "en = cptac.Ucec()\n",
    "\n",
    "# List the available data sources\n",
    "en.list_data_sources()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`en.list_data_sources()` shows the types of data available in the dataset and their respective sources. For example, proteomics data is available from bcm and umich, transcriptomics data from bcm, broad, and washu, and so forth."
   ]
  },
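  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the listing above displays like a pandas dataframe, it can be filtered like one. The following is a minimal sketch on a hand-built stand-in table (only four of the rows shown above), assuming the `Available sources` column holds Python lists of source names. That assumption is based on how the output prints, not on anything the tutorial states.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hand-built stand-in for part of the table above (assumed structure)\n",
    "sources = pd.DataFrame({\n",
    "    'Data type': ['CNV', 'proteomics', 'transcriptomics', 'clinical'],\n",
    "    'Available sources': [\n",
    "        ['bcm', 'washu'],\n",
    "        ['bcm', 'umich'],\n",
    "        ['bcm', 'broad', 'washu'],\n",
    "        ['mssm'],\n",
    "    ],\n",
    "})\n",
    "\n",
    "# Keep only the data types that list 'umich' among their sources\n",
    "has_umich = sources['Available sources'].apply(lambda s: 'umich' in s)\n",
    "print(sources.loc[has_umich, 'Data type'].tolist())\n",
    "```\n",
    "\n",
    "With the real table, the same filter would be applied to the return value of `en.list_data_sources()` instead of the stand-in."
   ]
  },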
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th>Name</th>\n",
       "      <th>A1BG</th>\n",
       "      <th>A1BG-AS1</th>\n",
       "      <th>A1CF</th>\n",
       "      <th>A2M</th>\n",
       "      <th>A2M-AS1</th>\n",
       "      <th>A2ML1</th>\n",
       "      <th>A2ML1-AS1</th>\n",
       "      <th>A2ML1-AS2</th>\n",
       "      <th>A2MP1</th>\n",
       "      <th>A3GALT2</th>\n",
       "      <th>...</th>\n",
       "      <th>ZXDB</th>\n",
       "      <th>ZXDC</th>\n",
       "      <th>ZYG11A</th>\n",
       "      <th>ZYG11AP1</th>\n",
       "      <th>ZYG11B</th>\n",
       "      <th>ZYX</th>\n",
       "      <th>ZYXP1</th>\n",
       "      <th>ZZEF1</th>\n",
       "      <th>hsa-mir-1253</th>\n",
       "      <th>hsa-mir-423</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Database_ID</th>\n",
       "      <th>ENSG00000121410.12</th>\n",
       "      <th>ENSG00000268895.6</th>\n",
       "      <th>ENSG00000148584.15</th>\n",
       "      <th>ENSG00000175899.15</th>\n",
       "      <th>ENSG00000245105.4</th>\n",
       "      <th>ENSG00000166535.20</th>\n",
       "      <th>ENSG00000256661.1</th>\n",
       "      <th>ENSG00000256904.1</th>\n",
       "      <th>ENSG00000256069.7</th>\n",
       "      <th>ENSG00000184389.9</th>\n",
       "      <th>...</th>\n",
       "      <th>ENSG00000198455.4</th>\n",
       "      <th>ENSG00000070476.15</th>\n",
       "      <th>ENSG00000203995.10</th>\n",
       "      <th>ENSG00000232242.2</th>\n",
       "      <th>ENSG00000162378.13</th>\n",
       "      <th>ENSG00000159840.16</th>\n",
       "      <th>ENSG00000274572.1</th>\n",
       "      <th>ENSG00000074755.15</th>\n",
       "      <th>ENSG00000272920.1</th>\n",
       "      <th>ENSG00000266919.3</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>2.54</td>\n",
       "      <td>5.11</td>\n",
       "      <td>3.60</td>\n",
       "      <td>13.75</td>\n",
       "      <td>6.45</td>\n",
       "      <td>7.08</td>\n",
       "      <td>1.80</td>\n",
       "      <td>0.00</td>\n",
       "      <td>2.60</td>\n",
       "      <td>1.16</td>\n",
       "      <td>...</td>\n",
       "      <td>10.17</td>\n",
       "      <td>10.61</td>\n",
       "      <td>5.54</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.85</td>\n",
       "      <td>10.60</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.87</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>4.40</td>\n",
       "      <td>4.63</td>\n",
       "      <td>5.49</td>\n",
       "      <td>13.89</td>\n",
       "      <td>6.61</td>\n",
       "      <td>6.97</td>\n",
       "      <td>0.00</td>\n",
       "      <td>2.74</td>\n",
       "      <td>3.25</td>\n",
       "      <td>0.00</td>\n",
       "      <td>...</td>\n",
       "      <td>9.79</td>\n",
       "      <td>10.48</td>\n",
       "      <td>7.79</td>\n",
       "      <td>0.0</td>\n",
       "      <td>12.28</td>\n",
       "      <td>11.28</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.93</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>4.83</td>\n",
       "      <td>7.26</td>\n",
       "      <td>3.73</td>\n",
       "      <td>14.48</td>\n",
       "      <td>6.91</td>\n",
       "      <td>9.56</td>\n",
       "      <td>0.98</td>\n",
       "      <td>0.00</td>\n",
       "      <td>3.26</td>\n",
       "      <td>0.00</td>\n",
       "      <td>...</td>\n",
       "      <td>9.43</td>\n",
       "      <td>9.97</td>\n",
       "      <td>6.48</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.72</td>\n",
       "      <td>10.37</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.70</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00084</th>\n",
       "      <td>4.73</td>\n",
       "      <td>6.01</td>\n",
       "      <td>5.37</td>\n",
       "      <td>15.17</td>\n",
       "      <td>7.93</td>\n",
       "      <td>3.86</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>3.73</td>\n",
       "      <td>1.15</td>\n",
       "      <td>...</td>\n",
       "      <td>9.23</td>\n",
       "      <td>10.37</td>\n",
       "      <td>7.47</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.86</td>\n",
       "      <td>10.13</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.19</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>4.14</td>\n",
       "      <td>6.24</td>\n",
       "      <td>5.69</td>\n",
       "      <td>13.87</td>\n",
       "      <td>6.79</td>\n",
       "      <td>4.32</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>3.23</td>\n",
       "      <td>0.00</td>\n",
       "      <td>...</td>\n",
       "      <td>9.69</td>\n",
       "      <td>9.64</td>\n",
       "      <td>7.60</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.98</td>\n",
       "      <td>10.31</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.45</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 59286 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Name                      A1BG          A1BG-AS1               A1CF  \\\n",
       "Database_ID ENSG00000121410.12 ENSG00000268895.6 ENSG00000148584.15   \n",
       "Patient_ID                                                            \n",
       "C3L-00006                 2.54              5.11               3.60   \n",
       "C3L-00008                 4.40              4.63               5.49   \n",
       "C3L-00032                 4.83              7.26               3.73   \n",
       "C3L-00084                 4.73              6.01               5.37   \n",
       "C3L-00090                 4.14              6.24               5.69   \n",
       "\n",
       "Name                       A2M           A2M-AS1              A2ML1  \\\n",
       "Database_ID ENSG00000175899.15 ENSG00000245105.4 ENSG00000166535.20   \n",
       "Patient_ID                                                            \n",
       "C3L-00006                13.75              6.45               7.08   \n",
       "C3L-00008                13.89              6.61               6.97   \n",
       "C3L-00032                14.48              6.91               9.56   \n",
       "C3L-00084                15.17              7.93               3.86   \n",
       "C3L-00090                13.87              6.79               4.32   \n",
       "\n",
       "Name                A2ML1-AS1         A2ML1-AS2             A2MP1  \\\n",
       "Database_ID ENSG00000256661.1 ENSG00000256904.1 ENSG00000256069.7   \n",
       "Patient_ID                                                          \n",
       "C3L-00006                1.80              0.00              2.60   \n",
       "C3L-00008                0.00              2.74              3.25   \n",
       "C3L-00032                0.98              0.00              3.26   \n",
       "C3L-00084                0.00              0.00              3.73   \n",
       "C3L-00090                0.00              0.00              3.23   \n",
       "\n",
       "Name                  A3GALT2  ...              ZXDB               ZXDC  \\\n",
       "Database_ID ENSG00000184389.9  ... ENSG00000198455.4 ENSG00000070476.15   \n",
       "Patient_ID                     ...                                        \n",
       "C3L-00006                1.16  ...             10.17              10.61   \n",
       "C3L-00008                0.00  ...              9.79              10.48   \n",
       "C3L-00032                0.00  ...              9.43               9.97   \n",
       "C3L-00084                1.15  ...              9.23              10.37   \n",
       "C3L-00090                0.00  ...              9.69               9.64   \n",
       "\n",
       "Name                    ZYG11A          ZYG11AP1             ZYG11B  \\\n",
       "Database_ID ENSG00000203995.10 ENSG00000232242.2 ENSG00000162378.13   \n",
       "Patient_ID                                                            \n",
       "C3L-00006                 5.54               0.0              11.85   \n",
       "C3L-00008                 7.79               0.0              12.28   \n",
       "C3L-00032                 6.48               0.0              11.72   \n",
       "C3L-00084                 7.47               0.0              11.86   \n",
       "C3L-00090                 7.60               0.0              11.98   \n",
       "\n",
       "Name                       ZYX             ZYXP1              ZZEF1  \\\n",
       "Database_ID ENSG00000159840.16 ENSG00000274572.1 ENSG00000074755.15   \n",
       "Patient_ID                                                            \n",
       "C3L-00006                10.60               0.0              11.87   \n",
       "C3L-00008                11.28               0.0              11.93   \n",
       "C3L-00032                10.37               0.0              11.70   \n",
       "C3L-00084                10.13               0.0              11.19   \n",
       "C3L-00090                10.31               0.0              11.45   \n",
       "\n",
       "Name             hsa-mir-1253       hsa-mir-423  \n",
       "Database_ID ENSG00000272920.1 ENSG00000266919.3  \n",
       "Patient_ID                                       \n",
       "C3L-00006                 0.0               0.0  \n",
       "C3L-00008                 0.0               0.0  \n",
       "C3L-00032                 0.0               0.0  \n",
       "C3L-00084                 0.0               0.0  \n",
       "C3L-00090                 0.0               0.0  \n",
       "\n",
       "[5 rows x 59286 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Retrieve the transcriptomics data from bcm\n",
    "bcm_data = en.get_transcriptomics('bcm')\n",
    "\n",
    "# Display the first few rows of the dataframe\n",
    "bcm_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the code above, `get_transcriptomics('bcm')` retrieves the transcriptomics data from the bcm source. Each row represents a different patient, and each column corresponds to a different gene."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## General format\n",
    "\n",
    "`cptac` has a helpful function called `multi_join`. It allows data from several different `cptac` dataframes to be joined at the same time.\n",
    "\n",
    "To use `multi_join`, you specify the dataframes you want to join by passing a dictionary of their names to the function call. The function will automatically check that the dataframes whose names you provided are valid for the join function, and print an error message if they aren't.\n",
    "\n",
    "Whenever a column from an -omics dataframe is included in a joined table, the name of the -omics dataframe it came from is appended to the column header, to avoid confusion.\n",
    "\n",
    "If you wish to include only particular columns in the join, pass them as values in the dictionary. Each value can be either a single column name as a string, or a list of column name strings. In this tutorial, we will usually select only specific columns for readability, but you could select the whole dataframe in all these cases, except for the mutations dataframe.\n",
    "\n",
    "The join functions use logic analogous to an SQL INNER JOIN."
   ]
  },
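  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the two behaviors described above concrete (source names added to column headers, and inner-join logic), here is a minimal sketch in plain pandas. It is not how `cptac` implements `multi_join`. The two toy tables, the patient labels `P1` to `P4`, and all values are made up for illustration.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy stand-ins for two -omics tables indexed by patient (made-up values)\n",
    "prot = pd.DataFrame(\n",
    "    {'ARF5': [0.1, 0.2, 0.3]},\n",
    "    index=pd.Index(['P1', 'P2', 'P3'], name='Patient_ID'),\n",
    ")\n",
    "rna = pd.DataFrame(\n",
    "    {'ARF5': [10.0, 11.0, 12.0]},\n",
    "    index=pd.Index(['P2', 'P3', 'P4'], name='Patient_ID'),\n",
    ")\n",
    "\n",
    "# Label each column with its origin, then keep only the patients\n",
    "# that appear in both tables (inner join on the index)\n",
    "joined = prot.add_suffix('_umich_proteomics').join(\n",
    "    rna.add_suffix('_bcm_transcriptomics'), how='inner'\n",
    ")\n",
    "print(joined)\n",
    "```\n",
    "\n",
    "Only `P2` and `P3` survive the join, and the two `ARF5` columns stay distinguishable because each carries the name of the table it came from."
   ]
  },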
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Join dictionary\n",
    "\n",
    "The main parameter for the `multi_join` function is a dictionary with source and datatype as a key, and specific columns as a value. Because there are multiple sources for each datatype, the desired source needs to be included. This can be done in two different ways. The first is by using a string that contains the source, a space, and then the datatype. The second is by using a tuple formatted (source, datatype). For example, using:\n",
    "\n",
    "`{('umich', 'proteomics'): ''}`\n",
    "\n",
    "or\n",
    "\n",
    "`{\"umich proteomics\": ''}`\n",
    "\n",
    "as the join dictionary would each result in `multi_join` returning a dataframe containing only umich proteomics data.\n",
    "\n",
    "You'll notice the value in the key:value pair is an empty string. Because a dictionary needs to have a value for each key, the empty string or an empty list means we want everything from the specified dataframe. If a string or list of strings is specified, the joined dataframe will only contain the specified columns. See below for more examples."
   ]
  },
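  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two key forms and the meaning of an empty value can be summarized in a short sketch. The helpers `split_join_key` and `as_column_list` are hypothetical names written for this illustration. They are not part of `cptac` and only mirror the rules described above.\n",
    "\n",
    "```python\n",
    "def split_join_key(key):\n",
    "    # Hypothetical helper: return (source, datatype) for either key form\n",
    "    if isinstance(key, tuple):\n",
    "        source, datatype = key\n",
    "    else:\n",
    "        # 'umich proteomics' splits into 'umich' and 'proteomics'\n",
    "        source, datatype = key.split(' ', 1)\n",
    "    return source, datatype\n",
    "\n",
    "\n",
    "def as_column_list(value):\n",
    "    # Hypothetical helper: '' or [] means all columns (None here);\n",
    "    # a single column name is wrapped in a list\n",
    "    if not value:\n",
    "        return None\n",
    "    return [value] if isinstance(value, str) else list(value)\n",
    "\n",
    "\n",
    "# Both key forms name the same table\n",
    "print(split_join_key('umich proteomics') == split_join_key(('umich', 'proteomics')))\n",
    "\n",
    "# A string, a list, and an empty string as dictionary values\n",
    "print(as_column_list('A1BG'))\n",
    "print(as_column_list(['A1BG', 'A2M']))\n",
    "print(as_column_list(''))\n",
    "```"
   ]
  },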
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Join omics to omics\n",
    "\n",
    "`multi_join` can join two -omics dataframes to each other. Types of -omics data valid for use with this function are acetylproteomics, CNV, phosphoproteomics, phosphoproteomics_gene, proteomics, and transcriptomics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "cptac warning: Your version of cptac (1.5.1) is out-of-date. Latest is 1.5.0. Please run 'pip install --upgrade cptac' to update it. (C:\\Users\\sabme\\anaconda3\\lib\\threading.py, line 910)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th>Name</th>\n",
       "      <th>ARF5_umich_proteomics</th>\n",
       "      <th>M6PR_umich_proteomics</th>\n",
       "      <th>ESRRA_umich_proteomics</th>\n",
       "      <th>FKBP4_umich_proteomics</th>\n",
       "      <th>NDUFAF7_umich_proteomics</th>\n",
       "      <th>FUCA2_umich_proteomics</th>\n",
       "      <th>DBNDD1_umich_proteomics</th>\n",
       "      <th>SEMA3F_umich_proteomics</th>\n",
       "      <th>CFTR_umich_proteomics</th>\n",
       "      <th>CYP51A1_umich_proteomics</th>\n",
       "      <th>...</th>\n",
       "      <th>ZXDB_bcm_transcriptomics</th>\n",
       "      <th>ZXDC_bcm_transcriptomics</th>\n",
       "      <th>ZYG11A_bcm_transcriptomics</th>\n",
       "      <th>ZYG11AP1_bcm_transcriptomics</th>\n",
       "      <th>ZYG11B_bcm_transcriptomics</th>\n",
       "      <th>ZYX_bcm_transcriptomics</th>\n",
       "      <th>ZYXP1_bcm_transcriptomics</th>\n",
       "      <th>ZZEF1_bcm_transcriptomics</th>\n",
       "      <th>hsa-mir-1253_bcm_transcriptomics</th>\n",
       "      <th>hsa-mir-423_bcm_transcriptomics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Database_ID</th>\n",
       "      <th>ENSP00000000233.5</th>\n",
       "      <th>ENSP00000000412.3</th>\n",
       "      <th>ENSP00000000442.6</th>\n",
       "      <th>ENSP00000001008.4</th>\n",
       "      <th>ENSP00000002125.4</th>\n",
       "      <th>ENSP00000002165.5</th>\n",
       "      <th>ENSP00000002501.6</th>\n",
       "      <th>ENSP00000002829.3</th>\n",
       "      <th>ENSP00000003084.6</th>\n",
       "      <th>ENSP00000003100.8</th>\n",
       "      <th>...</th>\n",
       "      <th>ENSG00000198455.4</th>\n",
       "      <th>ENSG00000070476.15</th>\n",
       "      <th>ENSG00000203995.10</th>\n",
       "      <th>ENSG00000232242.2</th>\n",
       "      <th>ENSG00000162378.13</th>\n",
       "      <th>ENSG00000159840.16</th>\n",
       "      <th>ENSG00000274572.1</th>\n",
       "      <th>ENSG00000074755.15</th>\n",
       "      <th>ENSG00000272920.1</th>\n",
       "      <th>ENSG00000266919.3</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>-0.056513</td>\n",
       "      <td>0.016557</td>\n",
       "      <td>0.002569</td>\n",
       "      <td>0.389819</td>\n",
       "      <td>0.603610</td>\n",
       "      <td>-0.332543</td>\n",
       "      <td>-0.790426</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.822732</td>\n",
       "      <td>0.039134</td>\n",
       "      <td>...</td>\n",
       "      <td>10.17</td>\n",
       "      <td>10.61</td>\n",
       "      <td>5.54</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.85</td>\n",
       "      <td>10.60</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.87</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>0.549959</td>\n",
       "      <td>-0.206129</td>\n",
       "      <td>0.905784</td>\n",
       "      <td>-0.303631</td>\n",
       "      <td>0.018767</td>\n",
       "      <td>0.503513</td>\n",
       "      <td>0.950955</td>\n",
       "      <td>0.080142</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-0.063213</td>\n",
       "      <td>...</td>\n",
       "      <td>9.79</td>\n",
       "      <td>10.48</td>\n",
       "      <td>7.79</td>\n",
       "      <td>0.0</td>\n",
       "      <td>12.28</td>\n",
       "      <td>11.28</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.93</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>0.088681</td>\n",
       "      <td>-0.154447</td>\n",
       "      <td>-0.190515</td>\n",
       "      <td>0.170753</td>\n",
       "      <td>0.196356</td>\n",
       "      <td>0.544194</td>\n",
       "      <td>-0.179078</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.377405</td>\n",
       "      <td>...</td>\n",
       "      <td>9.43</td>\n",
       "      <td>9.97</td>\n",
       "      <td>6.48</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.72</td>\n",
       "      <td>10.37</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.70</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00084</th>\n",
       "      <td>-0.846555</td>\n",
       "      <td>0.027740</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.178700</td>\n",
       "      <td>0.264054</td>\n",
       "      <td>-0.183548</td>\n",
       "      <td>0.077215</td>\n",
       "      <td>-0.247164</td>\n",
       "      <td>0.152277</td>\n",
       "      <td>-0.279549</td>\n",
       "      <td>...</td>\n",
       "      <td>9.23</td>\n",
       "      <td>10.37</td>\n",
       "      <td>7.47</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.86</td>\n",
       "      <td>10.13</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.19</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>0.539019</td>\n",
       "      <td>0.956619</td>\n",
       "      <td>-0.039516</td>\n",
       "      <td>0.323656</td>\n",
       "      <td>0.064605</td>\n",
       "      <td>0.173433</td>\n",
       "      <td>-0.524325</td>\n",
       "      <td>-0.038590</td>\n",
       "      <td>-0.311486</td>\n",
       "      <td>0.309905</td>\n",
       "      <td>...</td>\n",
       "      <td>9.69</td>\n",
       "      <td>9.64</td>\n",
       "      <td>7.60</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.98</td>\n",
       "      <td>10.31</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.45</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 71948 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Name        ARF5_umich_proteomics M6PR_umich_proteomics  \\\n",
       "Database_ID     ENSP00000000233.5     ENSP00000000412.3   \n",
       "Patient_ID                                                \n",
       "C3L-00006               -0.056513              0.016557   \n",
       "C3L-00008                0.549959             -0.206129   \n",
       "C3L-00032                0.088681             -0.154447   \n",
       "C3L-00084               -0.846555              0.027740   \n",
       "C3L-00090                0.539019              0.956619   \n",
       "\n",
       "Name        ESRRA_umich_proteomics FKBP4_umich_proteomics  \\\n",
       "Database_ID      ENSP00000000442.6      ENSP00000001008.4   \n",
       "Patient_ID                                                  \n",
       "C3L-00006                 0.002569               0.389819   \n",
       "C3L-00008                 0.905784              -0.303631   \n",
       "C3L-00032                -0.190515               0.170753   \n",
       "C3L-00084                      NaN               0.178700   \n",
       "C3L-00090                -0.039516               0.323656   \n",
       "\n",
       "Name        NDUFAF7_umich_proteomics FUCA2_umich_proteomics  \\\n",
       "Database_ID        ENSP00000002125.4      ENSP00000002165.5   \n",
       "Patient_ID                                                    \n",
       "C3L-00006                   0.603610              -0.332543   \n",
       "C3L-00008                   0.018767               0.503513   \n",
       "C3L-00032                   0.196356               0.544194   \n",
       "C3L-00084                   0.264054              -0.183548   \n",
       "C3L-00090                   0.064605               0.173433   \n",
       "\n",
       "Name        DBNDD1_umich_proteomics SEMA3F_umich_proteomics  \\\n",
       "Database_ID       ENSP00000002501.6       ENSP00000002829.3   \n",
       "Patient_ID                                                    \n",
       "C3L-00006                 -0.790426                     NaN   \n",
       "C3L-00008                  0.950955                0.080142   \n",
       "C3L-00032                 -0.179078                     NaN   \n",
       "C3L-00084                  0.077215               -0.247164   \n",
       "C3L-00090                 -0.524325               -0.038590   \n",
       "\n",
       "Name        CFTR_umich_proteomics CYP51A1_umich_proteomics  ...  \\\n",
       "Database_ID     ENSP00000003084.6        ENSP00000003100.8  ...   \n",
       "Patient_ID                                                  ...   \n",
       "C3L-00006                0.822732                 0.039134  ...   \n",
       "C3L-00008                     NaN                -0.063213  ...   \n",
       "C3L-00032                     NaN                 0.377405  ...   \n",
       "C3L-00084                0.152277                -0.279549  ...   \n",
       "C3L-00090               -0.311486                 0.309905  ...   \n",
       "\n",
       "Name        ZXDB_bcm_transcriptomics ZXDC_bcm_transcriptomics  \\\n",
       "Database_ID        ENSG00000198455.4       ENSG00000070476.15   \n",
       "Patient_ID                                                      \n",
       "C3L-00006                      10.17                    10.61   \n",
       "C3L-00008                       9.79                    10.48   \n",
       "C3L-00032                       9.43                     9.97   \n",
       "C3L-00084                       9.23                    10.37   \n",
       "C3L-00090                       9.69                     9.64   \n",
       "\n",
       "Name        ZYG11A_bcm_transcriptomics ZYG11AP1_bcm_transcriptomics  \\\n",
       "Database_ID         ENSG00000203995.10            ENSG00000232242.2   \n",
       "Patient_ID                                                            \n",
       "C3L-00006                         5.54                          0.0   \n",
       "C3L-00008                         7.79                          0.0   \n",
       "C3L-00032                         6.48                          0.0   \n",
       "C3L-00084                         7.47                          0.0   \n",
       "C3L-00090                         7.60                          0.0   \n",
       "\n",
       "Name        ZYG11B_bcm_transcriptomics ZYX_bcm_transcriptomics  \\\n",
       "Database_ID         ENSG00000162378.13      ENSG00000159840.16   \n",
       "Patient_ID                                                       \n",
       "C3L-00006                        11.85                   10.60   \n",
       "C3L-00008                        12.28                   11.28   \n",
       "C3L-00032                        11.72                   10.37   \n",
       "C3L-00084                        11.86                   10.13   \n",
       "C3L-00090                        11.98                   10.31   \n",
       "\n",
       "Name        ZYXP1_bcm_transcriptomics ZZEF1_bcm_transcriptomics  \\\n",
       "Database_ID         ENSG00000274572.1        ENSG00000074755.15   \n",
       "Patient_ID                                                        \n",
       "C3L-00006                         0.0                     11.87   \n",
       "C3L-00008                         0.0                     11.93   \n",
       "C3L-00032                         0.0                     11.70   \n",
       "C3L-00084                         0.0                     11.19   \n",
       "C3L-00090                         0.0                     11.45   \n",
       "\n",
       "Name        hsa-mir-1253_bcm_transcriptomics hsa-mir-423_bcm_transcriptomics  \n",
       "Database_ID                ENSG00000272920.1               ENSG00000266919.3  \n",
       "Patient_ID                                                                    \n",
       "C3L-00006                                0.0                             0.0  \n",
       "C3L-00008                                0.0                             0.0  \n",
       "C3L-00032                                0.0                             0.0  \n",
       "C3L-00084                                0.0                             0.0  \n",
       "C3L-00090                                0.0                             0.0  \n",
       "\n",
       "[5 rows x 71948 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Join two -omics dataframes with multi_join; an empty string selects all columns\n",
    "prot_and_tran = en.multi_join({\"umich proteomics\":'', \"bcm transcriptomics\":''})\n",
    "prot_and_tran.head()"
   ]
  },
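  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, `multi_join` combines the umich proteomics data and the bcm transcriptomics data into a single dataframe. As the output shows, each column name carries a suffix naming its source and data type (e.g. `ZYX_bcm_transcriptomics`), so columns from the two dataframes remain distinguishable."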
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th>Name</th>\n",
       "      <th>ARF5_umich_proteomics</th>\n",
       "      <th>A1BG_bcm_transcriptomics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Database_ID</th>\n",
       "      <th>ENSP00000000233.5</th>\n",
       "      <th>ENSG00000121410.12</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>-0.056513</td>\n",
       "      <td>2.54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>0.549959</td>\n",
       "      <td>4.40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>0.088681</td>\n",
       "      <td>4.83</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00084</th>\n",
       "      <td>-0.846555</td>\n",
       "      <td>4.73</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>0.539019</td>\n",
       "      <td>4.14</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Name        ARF5_umich_proteomics A1BG_bcm_transcriptomics\n",
       "Database_ID     ENSP00000000233.5       ENSG00000121410.12\n",
       "Patient_ID                                                \n",
       "C3L-00006               -0.056513                     2.54\n",
       "C3L-00008                0.549959                     4.40\n",
       "C3L-00032                0.088681                     4.83\n",
       "C3L-00084               -0.846555                     4.73\n",
       "C3L-00090                0.539019                     4.14"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Using multi_join with specified columns\n",
    "prot_and_tran_selected = en.multi_join({\"umich proteomics\":'ARF5', \"bcm transcriptomics\":'A1BG'})\n",
    "prot_and_tran_selected.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here `multi_join` is used again, but the resulting dataframe contains only the `ARF5` column from the proteomics data and the `A1BG` column from the transcriptomics data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Join metadata to omics\n",
    "\n",
    "The `multi_join` function can also join a metadata dataframe (e.g. clinical or derived_molecular) with an -omics dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th>Name</th>\n",
       "      <th>tumor_code</th>\n",
       "      <th>discovery_study</th>\n",
       "      <th>type_of_analyzed_samples_mssm_clinical</th>\n",
       "      <th>confirmatory_study</th>\n",
       "      <th>type_of_analyzed_samples_mssm_clinical</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>race</th>\n",
       "      <th>ethnicity</th>\n",
       "      <th>ethnicity_race_ancestry_identified</th>\n",
       "      <th>...</th>\n",
       "      <th>ZXDB_bcm_transcriptomics</th>\n",
       "      <th>ZXDC_bcm_transcriptomics</th>\n",
       "      <th>ZYG11A_bcm_transcriptomics</th>\n",
       "      <th>ZYG11AP1_bcm_transcriptomics</th>\n",
       "      <th>ZYG11B_bcm_transcriptomics</th>\n",
       "      <th>ZYX_bcm_transcriptomics</th>\n",
       "      <th>ZYXP1_bcm_transcriptomics</th>\n",
       "      <th>ZZEF1_bcm_transcriptomics</th>\n",
       "      <th>hsa-mir-1253_bcm_transcriptomics</th>\n",
       "      <th>hsa-mir-423_bcm_transcriptomics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Database_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>...</th>\n",
       "      <th>ENSG00000198455.4</th>\n",
       "      <th>ENSG00000070476.15</th>\n",
       "      <th>ENSG00000203995.10</th>\n",
       "      <th>ENSG00000232242.2</th>\n",
       "      <th>ENSG00000162378.13</th>\n",
       "      <th>ENSG00000159840.16</th>\n",
       "      <th>ENSG00000274572.1</th>\n",
       "      <th>ENSG00000074755.15</th>\n",
       "      <th>ENSG00000272920.1</th>\n",
       "      <th>ENSG00000266919.3</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor_and_Normal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>64</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>10.17</td>\n",
       "      <td>10.61</td>\n",
       "      <td>5.54</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.85</td>\n",
       "      <td>10.60</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.87</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>58</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>9.79</td>\n",
       "      <td>10.48</td>\n",
       "      <td>7.79</td>\n",
       "      <td>0.0</td>\n",
       "      <td>12.28</td>\n",
       "      <td>11.28</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.93</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>50</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>9.43</td>\n",
       "      <td>9.97</td>\n",
       "      <td>6.48</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.72</td>\n",
       "      <td>10.37</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.70</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00084</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>74</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>9.23</td>\n",
       "      <td>10.37</td>\n",
       "      <td>7.47</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.86</td>\n",
       "      <td>10.13</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.19</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>75</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>9.69</td>\n",
       "      <td>9.64</td>\n",
       "      <td>7.60</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.98</td>\n",
       "      <td>10.31</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.45</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 59410 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Name        tumor_code discovery_study type_of_analyzed_samples_mssm_clinical  \\\n",
       "Database_ID                                                                     \n",
       "Patient_ID                                                                      \n",
       "C3L-00006         UCEC             Yes                       Tumor_and_Normal   \n",
       "C3L-00008         UCEC             Yes                                  Tumor   \n",
       "C3L-00032         UCEC             Yes                                  Tumor   \n",
       "C3L-00084         UCEC             Yes                                  Tumor   \n",
       "C3L-00090         UCEC             Yes                                  Tumor   \n",
       "\n",
       "Name        confirmatory_study type_of_analyzed_samples_mssm_clinical age  \\\n",
       "Database_ID                                                                 \n",
       "Patient_ID                                                                  \n",
       "C3L-00006                  NaN                                    NaN  64   \n",
       "C3L-00008                  NaN                                    NaN  58   \n",
       "C3L-00032                  NaN                                    NaN  50   \n",
       "C3L-00084                  NaN                                    NaN  74   \n",
       "C3L-00090                  NaN                                    NaN  75   \n",
       "\n",
       "Name            sex   race               ethnicity  \\\n",
       "Database_ID                                          \n",
       "Patient_ID                                           \n",
       "C3L-00006    Female  White  Not Hispanic or Latino   \n",
       "C3L-00008    Female  White  Not Hispanic or Latino   \n",
       "C3L-00032    Female  White  Not Hispanic or Latino   \n",
       "C3L-00084    Female  White  Not Hispanic or Latino   \n",
       "C3L-00090    Female  White  Not Hispanic or Latino   \n",
       "\n",
       "Name        ethnicity_race_ancestry_identified  ... ZXDB_bcm_transcriptomics  \\\n",
       "Database_ID                                     ...        ENSG00000198455.4   \n",
       "Patient_ID                                      ...                            \n",
       "C3L-00006                                White  ...                    10.17   \n",
       "C3L-00008                                White  ...                     9.79   \n",
       "C3L-00032                                White  ...                     9.43   \n",
       "C3L-00084                                White  ...                     9.23   \n",
       "C3L-00090                                White  ...                     9.69   \n",
       "\n",
       "Name        ZXDC_bcm_transcriptomics ZYG11A_bcm_transcriptomics  \\\n",
       "Database_ID       ENSG00000070476.15         ENSG00000203995.10   \n",
       "Patient_ID                                                        \n",
       "C3L-00006                      10.61                       5.54   \n",
       "C3L-00008                      10.48                       7.79   \n",
       "C3L-00032                       9.97                       6.48   \n",
       "C3L-00084                      10.37                       7.47   \n",
       "C3L-00090                       9.64                       7.60   \n",
       "\n",
       "Name        ZYG11AP1_bcm_transcriptomics ZYG11B_bcm_transcriptomics  \\\n",
       "Database_ID            ENSG00000232242.2         ENSG00000162378.13   \n",
       "Patient_ID                                                            \n",
       "C3L-00006                            0.0                      11.85   \n",
       "C3L-00008                            0.0                      12.28   \n",
       "C3L-00032                            0.0                      11.72   \n",
       "C3L-00084                            0.0                      11.86   \n",
       "C3L-00090                            0.0                      11.98   \n",
       "\n",
       "Name        ZYX_bcm_transcriptomics ZYXP1_bcm_transcriptomics  \\\n",
       "Database_ID      ENSG00000159840.16         ENSG00000274572.1   \n",
       "Patient_ID                                                      \n",
       "C3L-00006                     10.60                       0.0   \n",
       "C3L-00008                     11.28                       0.0   \n",
       "C3L-00032                     10.37                       0.0   \n",
       "C3L-00084                     10.13                       0.0   \n",
       "C3L-00090                     10.31                       0.0   \n",
       "\n",
       "Name        ZZEF1_bcm_transcriptomics hsa-mir-1253_bcm_transcriptomics  \\\n",
       "Database_ID        ENSG00000074755.15                ENSG00000272920.1   \n",
       "Patient_ID                                                               \n",
       "C3L-00006                       11.87                              0.0   \n",
       "C3L-00008                       11.93                              0.0   \n",
       "C3L-00032                       11.70                              0.0   \n",
       "C3L-00084                       11.19                              0.0   \n",
       "C3L-00090                       11.45                              0.0   \n",
       "\n",
       "Name        hsa-mir-423_bcm_transcriptomics  \n",
       "Database_ID               ENSG00000266919.3  \n",
       "Patient_ID                                   \n",
       "C3L-00006                               0.0  \n",
       "C3L-00008                               0.0  \n",
       "C3L-00032                               0.0  \n",
       "C3L-00084                               0.0  \n",
       "C3L-00090                               0.0  \n",
       "\n",
       "[5 rows x 59410 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Join a metadata dataframe with an -omics dataframe\n",
    "clin_and_tran = en.multi_join({\"mssm clinical\":'', \"bcm transcriptomics\":''})\n",
    "clin_and_tran.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To join only specific columns, pass a list of column names for each dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th>Name</th>\n",
       "      <th>age</th>\n",
       "      <th>Overall survival, days</th>\n",
       "      <th>ZYX_bcm_transcriptomics</th>\n",
       "      <th>ZZEF1_bcm_transcriptomics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Database_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>ENSG00000159840.16</th>\n",
       "      <th>ENSG00000074755.15</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>64</td>\n",
       "      <td>737.0</td>\n",
       "      <td>10.60</td>\n",
       "      <td>11.87</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>58</td>\n",
       "      <td>898.0</td>\n",
       "      <td>11.28</td>\n",
       "      <td>11.93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>50</td>\n",
       "      <td>1710.0</td>\n",
       "      <td>10.37</td>\n",
       "      <td>11.70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00084</th>\n",
       "      <td>74</td>\n",
       "      <td>335.0</td>\n",
       "      <td>10.13</td>\n",
       "      <td>11.19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>75</td>\n",
       "      <td>1281.0</td>\n",
       "      <td>10.31</td>\n",
       "      <td>11.45</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Name        age Overall survival, days ZYX_bcm_transcriptomics  \\\n",
       "Database_ID                                 ENSG00000159840.16   \n",
       "Patient_ID                                                       \n",
       "C3L-00006    64                  737.0                   10.60   \n",
       "C3L-00008    58                  898.0                   11.28   \n",
       "C3L-00032    50                 1710.0                   10.37   \n",
       "C3L-00084    74                  335.0                   10.13   \n",
       "C3L-00090    75                 1281.0                   10.31   \n",
       "\n",
       "Name        ZZEF1_bcm_transcriptomics  \n",
       "Database_ID        ENSG00000074755.15  \n",
       "Patient_ID                             \n",
       "C3L-00006                       11.87  \n",
       "C3L-00008                       11.93  \n",
       "C3L-00032                       11.70  \n",
       "C3L-00084                       11.19  \n",
       "C3L-00090                       11.45  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select specific columns by passing a list of column names for each dataframe\n",
    "clin_and_tran = en.multi_join({\"mssm clinical\": [\"age\", \"Overall survival, days\"], \"bcm transcriptomics\": [\"ZYX\", \"ZZEF1\"]})\n",
    "clin_and_tran.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Join metadata to metadata\n",
    "\n",
    "Two metadata dataframes (e.g. clinical or derived_molecular) can be joined in the same way. In the example below, we pass a column name to select from the clinical dataframe, while passing an empty string `''` or an empty list `[]` as the column parameter for the derived_molecular dataframe selects that entire dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th>Name</th>\n",
       "      <th>tumor_code</th>\n",
       "      <th>discovery_study</th>\n",
       "      <th>type_of_analyzed_samples_mssm_clinical</th>\n",
       "      <th>confirmatory_study</th>\n",
       "      <th>type_of_analyzed_samples_mssm_clinical</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>race</th>\n",
       "      <th>ethnicity</th>\n",
       "      <th>ethnicity_race_ancestry_identified</th>\n",
       "      <th>...</th>\n",
       "      <th>ZXDB_bcm_transcriptomics</th>\n",
       "      <th>ZXDC_bcm_transcriptomics</th>\n",
       "      <th>ZYG11A_bcm_transcriptomics</th>\n",
       "      <th>ZYG11AP1_bcm_transcriptomics</th>\n",
       "      <th>ZYG11B_bcm_transcriptomics</th>\n",
       "      <th>ZYX_bcm_transcriptomics</th>\n",
       "      <th>ZYXP1_bcm_transcriptomics</th>\n",
       "      <th>ZZEF1_bcm_transcriptomics</th>\n",
       "      <th>hsa-mir-1253_bcm_transcriptomics</th>\n",
       "      <th>hsa-mir-423_bcm_transcriptomics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Database_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>...</th>\n",
       "      <th>ENSG00000198455.4</th>\n",
       "      <th>ENSG00000070476.15</th>\n",
       "      <th>ENSG00000203995.10</th>\n",
       "      <th>ENSG00000232242.2</th>\n",
       "      <th>ENSG00000162378.13</th>\n",
       "      <th>ENSG00000159840.16</th>\n",
       "      <th>ENSG00000274572.1</th>\n",
       "      <th>ENSG00000074755.15</th>\n",
       "      <th>ENSG00000272920.1</th>\n",
       "      <th>ENSG00000266919.3</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor_and_Normal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>64</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>10.17</td>\n",
       "      <td>10.61</td>\n",
       "      <td>5.54</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.85</td>\n",
       "      <td>10.60</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.87</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>58</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>9.79</td>\n",
       "      <td>10.48</td>\n",
       "      <td>7.79</td>\n",
       "      <td>0.0</td>\n",
       "      <td>12.28</td>\n",
       "      <td>11.28</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.93</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>50</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>9.43</td>\n",
       "      <td>9.97</td>\n",
       "      <td>6.48</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.72</td>\n",
       "      <td>10.37</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.70</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00084</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>74</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>9.23</td>\n",
       "      <td>10.37</td>\n",
       "      <td>7.47</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.86</td>\n",
       "      <td>10.13</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.19</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>75</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>9.69</td>\n",
       "      <td>9.64</td>\n",
       "      <td>7.60</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.98</td>\n",
       "      <td>10.31</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.45</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 59410 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Name        tumor_code discovery_study type_of_analyzed_samples_mssm_clinical  \\\n",
       "Database_ID                                                                     \n",
       "Patient_ID                                                                      \n",
       "C3L-00006         UCEC             Yes                       Tumor_and_Normal   \n",
       "C3L-00008         UCEC             Yes                                  Tumor   \n",
       "C3L-00032         UCEC             Yes                                  Tumor   \n",
       "C3L-00084         UCEC             Yes                                  Tumor   \n",
       "C3L-00090         UCEC             Yes                                  Tumor   \n",
       "\n",
       "Name        confirmatory_study type_of_analyzed_samples_mssm_clinical age  \\\n",
       "Database_ID                                                                 \n",
       "Patient_ID                                                                  \n",
       "C3L-00006                  NaN                                    NaN  64   \n",
       "C3L-00008                  NaN                                    NaN  58   \n",
       "C3L-00032                  NaN                                    NaN  50   \n",
       "C3L-00084                  NaN                                    NaN  74   \n",
       "C3L-00090                  NaN                                    NaN  75   \n",
       "\n",
       "Name            sex   race               ethnicity  \\\n",
       "Database_ID                                          \n",
       "Patient_ID                                           \n",
       "C3L-00006    Female  White  Not Hispanic or Latino   \n",
       "C3L-00008    Female  White  Not Hispanic or Latino   \n",
       "C3L-00032    Female  White  Not Hispanic or Latino   \n",
       "C3L-00084    Female  White  Not Hispanic or Latino   \n",
       "C3L-00090    Female  White  Not Hispanic or Latino   \n",
       "\n",
       "Name        ethnicity_race_ancestry_identified  ... ZXDB_bcm_transcriptomics  \\\n",
       "Database_ID                                     ...        ENSG00000198455.4   \n",
       "Patient_ID                                      ...                            \n",
       "C3L-00006                                White  ...                    10.17   \n",
       "C3L-00008                                White  ...                     9.79   \n",
       "C3L-00032                                White  ...                     9.43   \n",
       "C3L-00084                                White  ...                     9.23   \n",
       "C3L-00090                                White  ...                     9.69   \n",
       "\n",
       "Name        ZXDC_bcm_transcriptomics ZYG11A_bcm_transcriptomics  \\\n",
       "Database_ID       ENSG00000070476.15         ENSG00000203995.10   \n",
       "Patient_ID                                                        \n",
       "C3L-00006                      10.61                       5.54   \n",
       "C3L-00008                      10.48                       7.79   \n",
       "C3L-00032                       9.97                       6.48   \n",
       "C3L-00084                      10.37                       7.47   \n",
       "C3L-00090                       9.64                       7.60   \n",
       "\n",
       "Name        ZYG11AP1_bcm_transcriptomics ZYG11B_bcm_transcriptomics  \\\n",
       "Database_ID            ENSG00000232242.2         ENSG00000162378.13   \n",
       "Patient_ID                                                            \n",
       "C3L-00006                            0.0                      11.85   \n",
       "C3L-00008                            0.0                      12.28   \n",
       "C3L-00032                            0.0                      11.72   \n",
       "C3L-00084                            0.0                      11.86   \n",
       "C3L-00090                            0.0                      11.98   \n",
       "\n",
       "Name        ZYX_bcm_transcriptomics ZYXP1_bcm_transcriptomics  \\\n",
       "Database_ID      ENSG00000159840.16         ENSG00000274572.1   \n",
       "Patient_ID                                                      \n",
       "C3L-00006                     10.60                       0.0   \n",
       "C3L-00008                     11.28                       0.0   \n",
       "C3L-00032                     10.37                       0.0   \n",
       "C3L-00084                     10.13                       0.0   \n",
       "C3L-00090                     10.31                       0.0   \n",
       "\n",
       "Name        ZZEF1_bcm_transcriptomics hsa-mir-1253_bcm_transcriptomics  \\\n",
       "Database_ID        ENSG00000074755.15                ENSG00000272920.1   \n",
       "Patient_ID                                                               \n",
       "C3L-00006                       11.87                              0.0   \n",
       "C3L-00008                       11.93                              0.0   \n",
       "C3L-00032                       11.70                              0.0   \n",
       "C3L-00084                       11.19                              0.0   \n",
       "C3L-00090                       11.45                              0.0   \n",
       "\n",
       "Name        hsa-mir-423_bcm_transcriptomics  \n",
       "Database_ID               ENSG00000266919.3  \n",
       "Patient_ID                                   \n",
       "C3L-00006                               0.0  \n",
       "C3L-00008                               0.0  \n",
       "C3L-00032                               0.0  \n",
       "C3L-00084                               0.0  \n",
       "C3L-00090                               0.0  \n",
       "\n",
       "[5 rows x 59410 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clin_and_tran = en.multi_join({\n",
    "    \"mssm clinical\": \"\",\n",
    "    \"bcm transcriptomics\": '' # Note that by using an empty string or list as the value, we join the entire dataframe\n",
    "})\n",
    "\n",
    "clin_and_tran.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Join many datatypes together\n",
    "\n",
    "If you need data from three or more dataframes, they can all simply be added to the joining dictionary. The only limit to the number of dataframes the joining dictionary parameter for `multi_join` can take is your imagination."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th>Name</th>\n",
       "      <th>ARF5_umich_proteomics</th>\n",
       "      <th>A1BG_bcm_transcriptomics</th>\n",
       "      <th>tumor_code</th>\n",
       "      <th>discovery_study</th>\n",
       "      <th>type_of_analyzed_samples_mssm_clinical</th>\n",
       "      <th>confirmatory_study</th>\n",
       "      <th>type_of_analyzed_samples_mssm_clinical</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>race</th>\n",
       "      <th>...</th>\n",
       "      <th>additional_treatment_immuno_for_new_tumor</th>\n",
       "      <th>number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional</th>\n",
       "      <th>number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis</th>\n",
       "      <th>Recurrence-free survival, days</th>\n",
       "      <th>Recurrence-free survival from collection, days</th>\n",
       "      <th>Recurrence status (1, yes; 0, no)</th>\n",
       "      <th>Overall survival, days</th>\n",
       "      <th>Overall survival from collection, days</th>\n",
       "      <th>Survival status (1, dead; 0, alive)</th>\n",
       "      <th>Sample_Status</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Database_ID</th>\n",
       "      <th>ENSP00000000233.5</th>\n",
       "      <th>ENSG00000121410.12</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>...</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>-0.056513</td>\n",
       "      <td>2.54</td>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor_and_Normal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>64</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>737.0</td>\n",
       "      <td>737.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>0.549959</td>\n",
       "      <td>4.40</td>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>58</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>898.0</td>\n",
       "      <td>898.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>0.088681</td>\n",
       "      <td>4.83</td>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>50</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1710.0</td>\n",
       "      <td>1710.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00084</th>\n",
       "      <td>-0.846555</td>\n",
       "      <td>4.73</td>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>74</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>335.0</td>\n",
       "      <td>335.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>0.539019</td>\n",
       "      <td>4.14</td>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>75</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>50.0</td>\n",
       "      <td>56.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1281.0</td>\n",
       "      <td>1287.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 127 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Name        ARF5_umich_proteomics A1BG_bcm_transcriptomics tumor_code  \\\n",
       "Database_ID     ENSP00000000233.5       ENSG00000121410.12              \n",
       "Patient_ID                                                              \n",
       "C3L-00006               -0.056513                     2.54       UCEC   \n",
       "C3L-00008                0.549959                     4.40       UCEC   \n",
       "C3L-00032                0.088681                     4.83       UCEC   \n",
       "C3L-00084               -0.846555                     4.73       UCEC   \n",
       "C3L-00090                0.539019                     4.14       UCEC   \n",
       "\n",
       "Name        discovery_study type_of_analyzed_samples_mssm_clinical  \\\n",
       "Database_ID                                                          \n",
       "Patient_ID                                                           \n",
       "C3L-00006               Yes                       Tumor_and_Normal   \n",
       "C3L-00008               Yes                                  Tumor   \n",
       "C3L-00032               Yes                                  Tumor   \n",
       "C3L-00084               Yes                                  Tumor   \n",
       "C3L-00090               Yes                                  Tumor   \n",
       "\n",
       "Name        confirmatory_study type_of_analyzed_samples_mssm_clinical age  \\\n",
       "Database_ID                                                                 \n",
       "Patient_ID                                                                  \n",
       "C3L-00006                  NaN                                    NaN  64   \n",
       "C3L-00008                  NaN                                    NaN  58   \n",
       "C3L-00032                  NaN                                    NaN  50   \n",
       "C3L-00084                  NaN                                    NaN  74   \n",
       "C3L-00090                  NaN                                    NaN  75   \n",
       "\n",
       "Name            sex   race  ... additional_treatment_immuno_for_new_tumor  \\\n",
       "Database_ID                 ...                                             \n",
       "Patient_ID                  ...                                             \n",
       "C3L-00006    Female  White  ...                                       NaN   \n",
       "C3L-00008    Female  White  ...                                       NaN   \n",
       "C3L-00032    Female  White  ...                                       NaN   \n",
       "C3L-00084    Female  White  ...                                       NaN   \n",
       "C3L-00090    Female  White  ...                                        No   \n",
       "\n",
       "Name        number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional  \\\n",
       "Database_ID                                                                                                                            \n",
       "Patient_ID                                                                                                                             \n",
       "C3L-00006                                                  NaN                                                                         \n",
       "C3L-00008                                                  NaN                                                                         \n",
       "C3L-00032                                                  NaN                                                                         \n",
       "C3L-00084                                                  NaN                                                                         \n",
       "C3L-00090                                                  NaN                                                                         \n",
       "\n",
       "Name        number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis  \\\n",
       "Database_ID                                                                                                                         \n",
       "Patient_ID                                                                                                                          \n",
       "C3L-00006                                                  NaN                                                                      \n",
       "C3L-00008                                                  NaN                                                                      \n",
       "C3L-00032                                                  NaN                                                                      \n",
       "C3L-00084                                                  NaN                                                                      \n",
       "C3L-00090                                                  NaN                                                                      \n",
       "\n",
       "Name        Recurrence-free survival, days  \\\n",
       "Database_ID                                  \n",
       "Patient_ID                                   \n",
       "C3L-00006                              NaN   \n",
       "C3L-00008                              NaN   \n",
       "C3L-00032                              NaN   \n",
       "C3L-00084                              NaN   \n",
       "C3L-00090                             50.0   \n",
       "\n",
       "Name        Recurrence-free survival from collection, days  \\\n",
       "Database_ID                                                  \n",
       "Patient_ID                                                   \n",
       "C3L-00006                                              NaN   \n",
       "C3L-00008                                              NaN   \n",
       "C3L-00032                                              NaN   \n",
       "C3L-00084                                              NaN   \n",
       "C3L-00090                                             56.0   \n",
       "\n",
       "Name        Recurrence status (1, yes; 0, no) Overall survival, days  \\\n",
       "Database_ID                                                            \n",
       "Patient_ID                                                             \n",
       "C3L-00006                                 0.0                  737.0   \n",
       "C3L-00008                                 0.0                  898.0   \n",
       "C3L-00032                                 0.0                 1710.0   \n",
       "C3L-00084                                 0.0                  335.0   \n",
       "C3L-00090                                 1.0                 1281.0   \n",
       "\n",
       "Name        Overall survival from collection, days  \\\n",
       "Database_ID                                          \n",
       "Patient_ID                                           \n",
       "C3L-00006                                    737.0   \n",
       "C3L-00008                                    898.0   \n",
       "C3L-00032                                   1710.0   \n",
       "C3L-00084                                    335.0   \n",
       "C3L-00090                                   1287.0   \n",
       "\n",
       "Name        Survival status (1, dead; 0, alive) Sample_Status  \n",
       "Database_ID                                                    \n",
       "Patient_ID                                                     \n",
       "C3L-00006                                   0.0         Tumor  \n",
       "C3L-00008                                   0.0         Tumor  \n",
       "C3L-00032                                   0.0         Tumor  \n",
       "C3L-00084                                   0.0         Tumor  \n",
       "C3L-00090                                   1.0         Tumor  \n",
       "\n",
       "[5 rows x 127 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "joining_dictionary = {\"umich proteomics\": \"ARF5\", \"bcm transcriptomics\": \"A1BG\", \"mssm clinical\": [], \"washu somatic_mutation\": []}\n",
    "en.multi_join(joining_dictionary).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`multi_join` does not necessarily need to join different dataframes. If you just want a small amount of information from a dataframe, this function is useful for that as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Name</th>\n",
       "      <th>type_of_analyzed_samples_mssm_clinical</th>\n",
       "      <th>type_of_analyzed_samples_mssm_clinical</th>\n",
       "      <th>discovery_study</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>Tumor_and_Normal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00084</th>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Yes</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Name       type_of_analyzed_samples_mssm_clinical  \\\n",
       "Patient_ID                                          \n",
       "C3L-00006                        Tumor_and_Normal   \n",
       "C3L-00008                                   Tumor   \n",
       "C3L-00032                                   Tumor   \n",
       "C3L-00084                                   Tumor   \n",
       "C3L-00090                                   Tumor   \n",
       "\n",
       "Name       type_of_analyzed_samples_mssm_clinical discovery_study  \n",
       "Patient_ID                                                         \n",
       "C3L-00006                                     NaN             Yes  \n",
       "C3L-00008                                     NaN             Yes  \n",
       "C3L-00032                                     NaN             Yes  \n",
       "C3L-00084                                     NaN             Yes  \n",
       "C3L-00090                                     NaN             Yes  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_type_and_discovery = en.multi_join({\"mssm clinical\": ['type_of_analyzed_samples', 'discovery_study']})\n",
    "sample_type_and_discovery.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Join omics to mutations\n",
    "\n",
    "Joining an -omics dataframe with the mutation data for a specified gene or genes involves specific steps. It's worth noting that because there might be multiple mutations for one gene in a single sample, the mutation type and location data are returned in lists by default, even if there is only one mutation.\n",
    "\n",
    "For samples with no mutation for a particular gene, the list will contain either \"Wildtype_Tumor\" or \"Wildtype_Normal\", depending on whether the sample is a tumor or normal one. The mutation status column will contain either \"Single_mutation\", \"Multiple_mutation\", \"Wildtype_Tumor\", or \"Wildtype_Normal\", which aids with parsing.\n",
    "\n",
    "Let's consider an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 325)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Name</th>\n",
       "      <th>ARF5_umich_proteomics</th>\n",
       "      <th>M6PR_umich_proteomics</th>\n",
       "      <th>SHANK2_Mutation</th>\n",
       "      <th>SHANK2_Location</th>\n",
       "      <th>SHANK2_Mutation_Status</th>\n",
       "      <th>Sample_Status</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>-0.056513</td>\n",
       "      <td>0.016557</td>\n",
       "      <td>[Missense_Mutation]</td>\n",
       "      <td>[p.S1692R]</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>0.549959</td>\n",
       "      <td>-0.206129</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>0.088681</td>\n",
       "      <td>-0.154447</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00084</th>\n",
       "      <td>-0.846555</td>\n",
       "      <td>0.027740</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>0.539019</td>\n",
       "      <td>0.956619</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00098</th>\n",
       "      <td>-0.017370</td>\n",
       "      <td>0.125574</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00136</th>\n",
       "      <td>0.230347</td>\n",
       "      <td>0.575436</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00137</th>\n",
       "      <td>0.191915</td>\n",
       "      <td>0.113577</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00139</th>\n",
       "      <td>-0.410142</td>\n",
       "      <td>0.381355</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00143</th>\n",
       "      <td>-0.170514</td>\n",
       "      <td>1.008577</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Name        ARF5_umich_proteomics  M6PR_umich_proteomics      SHANK2_Mutation  \\\n",
       "Patient_ID                                                                      \n",
       "C3L-00006               -0.056513               0.016557  [Missense_Mutation]   \n",
       "C3L-00008                0.549959              -0.206129     [Wildtype_Tumor]   \n",
       "C3L-00032                0.088681              -0.154447     [Wildtype_Tumor]   \n",
       "C3L-00084               -0.846555               0.027740     [Wildtype_Tumor]   \n",
       "C3L-00090                0.539019               0.956619     [Wildtype_Tumor]   \n",
       "C3L-00098               -0.017370               0.125574     [Wildtype_Tumor]   \n",
       "C3L-00136                0.230347               0.575436     [Wildtype_Tumor]   \n",
       "C3L-00137                0.191915               0.113577     [Wildtype_Tumor]   \n",
       "C3L-00139               -0.410142               0.381355     [Wildtype_Tumor]   \n",
       "C3L-00143               -0.170514               1.008577     [Wildtype_Tumor]   \n",
       "\n",
       "Name       SHANK2_Location SHANK2_Mutation_Status Sample_Status  \n",
       "Patient_ID                                                       \n",
       "C3L-00006       [p.S1692R]        Single_mutation         Tumor  \n",
       "C3L-00008    [No_mutation]         Wildtype_Tumor         Tumor  \n",
       "C3L-00032    [No_mutation]         Wildtype_Tumor         Tumor  \n",
       "C3L-00084    [No_mutation]         Wildtype_Tumor         Tumor  \n",
       "C3L-00090    [No_mutation]         Wildtype_Tumor         Tumor  \n",
       "C3L-00098    [No_mutation]         Wildtype_Tumor         Tumor  \n",
       "C3L-00136    [No_mutation]         Wildtype_Tumor         Tumor  \n",
       "C3L-00137    [No_mutation]         Wildtype_Tumor         Tumor  \n",
       "C3L-00139    [No_mutation]         Wildtype_Tumor         Tumor  \n",
       "C3L-00143    [No_mutation]         Wildtype_Tumor         Tumor  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "somatic_mutations = en.get_somatic_mutation('harmonized')\n",
    "selected_prot_and_som_mut = en.join_omics_to_mutations(\n",
    "    omics_name = \"proteomics\",\n",
    "    mutations_genes = \"SHANK2\",\n",
    "    omics_genes = [\"ARF5\", \"M6PR\"],\n",
    "    omics_source = 'umich',\n",
    "    mutations_source = 'harmonized')\n",
    "selected_prot_and_som_mut.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code above joins umich proteomics data for the genes ARF5 and M6PR to harmonized somatic mutation data for the gene SHANK2. Each row of the result is one sample: the two proteomics columns come first, followed by `SHANK2_Mutation`, `SHANK2_Location`, `SHANK2_Mutation_Status`, and `Sample_Status`."
   ]
  },
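  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The join above can be pictured with plain pandas. The sketch below is only an illustration under stated assumptions, not cptac's implementation: the toy tables, their numeric values, the sample IDs `P1` to `P3`, and the choice of a left join are all made up for the example. It shows two tables that share a `Patient_ID` index being combined, with samples that have no recorded mutation filled with the placeholder labels seen in the output above.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy stand-ins for the proteomics and mutation tables (values are made up).\n",
    "proteomics = pd.DataFrame(\n",
    "    {'ARF5_umich_proteomics': [-0.06, 0.55, 0.09]},\n",
    "    index=pd.Index(['P1', 'P2', 'P3'], name='Patient_ID'),\n",
    ")\n",
    "mutations = pd.DataFrame(\n",
    "    {'SHANK2_Mutation': ['Missense_Mutation'], 'SHANK2_Location': ['p.S1692R']},\n",
    "    index=pd.Index(['P1'], name='Patient_ID'),\n",
    ")\n",
    "\n",
    "# Assumption: a left join on the shared index keeps every proteomics sample.\n",
    "joined = proteomics.join(mutations, how='left')\n",
    "\n",
    "# Tumor samples without a recorded mutation get placeholder labels.\n",
    "joined['SHANK2_Mutation'] = joined['SHANK2_Mutation'].fillna('Wildtype_Tumor')\n",
    "joined['SHANK2_Location'] = joined['SHANK2_Location'].fillna('No_mutation')\n",
    "```\n",
    "\n",
    "In this toy table only `P1` carries a mutation, so `P2` and `P3` end up labeled `Wildtype_Tumor`, which mirrors the warning cptac prints about samples with no mutations found."
   ]
  },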
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Filtering multiple mutations\n",
    "\n",
    "When a sample has more than one mutation in a gene, you can use the `mutations_filter` parameter of `multi_join` to choose among them. The parameter lets you list mutation types or locations to prioritize, and a default sorting hierarchy is applied to all other mutations.\n",
    "\n",
    "Here are some examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 525)\n",
      "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\\Users\\sabme\\AppData\\Local\\Temp\\ipykernel_2264\\3972322211.py, line 1)\n",
      "cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 525)\n",
      "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\\Users\\sabme\\AppData\\Local\\Temp\\ipykernel_2264\\3972322211.py, line 5)\n",
      "cptac warning: Filter value p.R130Q does not exist in the mutations data for the SHANK2 gene, though it exists for other genes. (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 525)\n",
      "cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 525)\n",
      "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 141 samples for the SHANK2 gene (C:\\Users\\sabme\\AppData\\Local\\Temp\\ipykernel_2264\\3972322211.py, line 9)\n"
     ]
    }
   ],
   "source": [
    "SHANK2_default_filter = en.multi_join({\"umich proteomics\": [\"ARF5\", \"M6PR\"],\n",
    "                                     \"harmonized somatic_mutation\": \"SHANK2\"},\n",
    "                                    mutations_filter=[])\n",
    "\n",
    "SHANK2_simple_filter = en.multi_join({\"umich proteomics\": [\"ARF5\", \"M6PR\"],\n",
    "                                    \"harmonized somatic_mutation\": \"SHANK2\"},\n",
    "                                   mutations_filter=[\"Missense_Mutation\"])\n",
    "\n",
    "PTEN_complex_filter = en.multi_join({\"umich proteomics\": [\"ARF5\", \"M6PR\"],\n",
    "                                    \"harmonized somatic_mutation\": \"SHANK2\"},\n",
    "                                    mutations_filter=[\"p.R130Q\", \"Nonsense_Mutation\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `mutations_filter` parameter lists the mutation types or locations you want to prioritize. If you pass an empty list, the default hierarchy is used: truncation mutations are chosen over missense mutations, and silent mutations come last. If there are multiple mutations of the same type, the one occurring earlier in the sequence is chosen. In the third example, cptac warns that the filter value `p.R130Q` does not exist in the mutations data for the SHANK2 gene."
   ]
  },
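  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The priority rules described above can be sketched in plain pandas. This is a simplified illustration, not cptac's code: the helper names `pick_mutation` and `residue_number`, the `TYPE_RANK` table, the long-format toy table, and the assumption that filter values are honored in the order listed are all hypothetical choices made for the example.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Assumed ranking for illustration only (lower number = higher priority).\n",
    "# The type names follow common MAF labels; this is not cptac's internal table.\n",
    "TYPE_RANK = {\n",
    "    'Nonsense_Mutation': 0,\n",
    "    'Frame_Shift_Del': 0,\n",
    "    'Frame_Shift_Ins': 0,\n",
    "    'Missense_Mutation': 1,\n",
    "    'Silent': 2,\n",
    "}\n",
    "UNKNOWN_RANK = 3  # unrecognized types sort last\n",
    "\n",
    "\n",
    "def residue_number(location):\n",
    "    # Simplified: collect the digits from a label such as 'p.S1692R'.\n",
    "    digits = ''.join(ch for ch in location if ch.isdigit())\n",
    "    return int(digits) if digits else float('inf')\n",
    "\n",
    "\n",
    "def pick_mutation(mutations, mutations_filter=()):\n",
    "    # Hypothetical helper: keep one row per sample from a long-format table.\n",
    "    order = {value: rank for rank, value in enumerate(mutations_filter)}\n",
    "    worst = len(order)\n",
    "    df = mutations.copy()\n",
    "    # Values named in the filter win, in the order they were listed.\n",
    "    df['filter_rank'] = [\n",
    "        min(order.get(m, worst), order.get(loc, worst))\n",
    "        for m, loc in zip(df['Mutation'], df['Location'])\n",
    "    ]\n",
    "    # Everything else falls back to the type ranking, then to sequence position.\n",
    "    df['type_rank'] = df['Mutation'].map(TYPE_RANK).fillna(UNKNOWN_RANK)\n",
    "    df['residue'] = df['Location'].map(residue_number)\n",
    "    df = df.sort_values(['Patient_ID', 'filter_rank', 'type_rank', 'residue'])\n",
    "    return df.groupby('Patient_ID').first()[['Mutation', 'Location']]\n",
    "\n",
    "\n",
    "# Toy data: sample S1 has three mutations, sample S2 has two of the same type.\n",
    "toy = pd.DataFrame({\n",
    "    'Patient_ID': ['S1', 'S1', 'S1', 'S2', 'S2'],\n",
    "    'Mutation': ['Missense_Mutation', 'Nonsense_Mutation', 'Silent',\n",
    "                 'Missense_Mutation', 'Missense_Mutation'],\n",
    "    'Location': ['p.A10T', 'p.R200*', 'p.L5L', 'p.G300D', 'p.K50E'],\n",
    "})\n",
    "\n",
    "default_pick = pick_mutation(toy)\n",
    "missense_pick = pick_mutation(toy, ['Missense_Mutation'])\n",
    "```\n",
    "\n",
    "With no filter, the truncating mutation is kept for `S1` and the earlier of the two missense mutations for `S2`. With `['Missense_Mutation']` as the filter, the missense mutation is kept for `S1` instead."
   ]
  },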
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Join metadata to mutations\n",
    "\n",
    "Joining metadata to mutation data follows the same process as joining other datatypes, and the `mutations_filter` parameter can again be used to filter multiple mutations.\n",
    "\n",
    "First, use the `get_clinical` function to retrieve the clinical data, as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Name</th>\n",
       "      <th>tumor_code</th>\n",
       "      <th>discovery_study</th>\n",
       "      <th>type_of_analyzed_samples</th>\n",
       "      <th>confirmatory_study</th>\n",
       "      <th>type_of_analyzed_samples</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>race</th>\n",
       "      <th>ethnicity</th>\n",
       "      <th>ethnicity_race_ancestry_identified</th>\n",
       "      <th>...</th>\n",
       "      <th>additional_treatment_pharmaceutical_therapy_for_new_tumor</th>\n",
       "      <th>additional_treatment_immuno_for_new_tumor</th>\n",
       "      <th>number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional</th>\n",
       "      <th>number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis</th>\n",
       "      <th>Recurrence-free survival, days</th>\n",
       "      <th>Recurrence-free survival from collection, days</th>\n",
       "      <th>Recurrence status (1, yes; 0, no)</th>\n",
       "      <th>Overall survival, days</th>\n",
       "      <th>Overall survival from collection, days</th>\n",
       "      <th>Survival status (1, dead; 0, alive)</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor_and_Normal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>64</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>737.0</td>\n",
       "      <td>737.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>58</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>898.0</td>\n",
       "      <td>898.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>50</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1710.0</td>\n",
       "      <td>1710.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00084</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>74</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>335.0</td>\n",
       "      <td>335.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>75</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>White</td>\n",
       "      <td>...</td>\n",
       "      <td>Yes</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>50.0</td>\n",
       "      <td>56.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1281.0</td>\n",
       "      <td>1287.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01520</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>69</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Slavonic</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>287.0</td>\n",
       "      <td>278.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01521</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>75</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Slavonic</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>728.0</td>\n",
       "      <td>681.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01537</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>74</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Slavonic</td>\n",
       "      <td>...</td>\n",
       "      <td>Yes</td>\n",
       "      <td>No</td>\n",
       "      <td>62.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>58.0</td>\n",
       "      <td>31.0</td>\n",
       "      <td>1</td>\n",
       "      <td>698.0</td>\n",
       "      <td>671.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01802</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>85</td>\n",
       "      <td>Female</td>\n",
       "      <td>Black or African American</td>\n",
       "      <td>Not Hispanic or Latino</td>\n",
       "      <td>American</td>\n",
       "      <td>...</td>\n",
       "      <td>No</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>598.0</td>\n",
       "      <td>563.0</td>\n",
       "      <td>1</td>\n",
       "      <td>775.0</td>\n",
       "      <td>740.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01825</th>\n",
       "      <td>UCEC</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Tumor</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>70</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Slavonic</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>687.0</td>\n",
       "      <td>661.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>103 rows × 124 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Name       tumor_code discovery_study type_of_analyzed_samples  \\\n",
       "Patient_ID                                                       \n",
       "C3L-00006        UCEC             Yes         Tumor_and_Normal   \n",
       "C3L-00008        UCEC             Yes                    Tumor   \n",
       "C3L-00032        UCEC             Yes                    Tumor   \n",
       "C3L-00084        UCEC             Yes                    Tumor   \n",
       "C3L-00090        UCEC             Yes                    Tumor   \n",
       "...               ...             ...                      ...   \n",
       "C3N-01520        UCEC             Yes                    Tumor   \n",
       "C3N-01521        UCEC             Yes                    Tumor   \n",
       "C3N-01537        UCEC             Yes                    Tumor   \n",
       "C3N-01802        UCEC             Yes                    Tumor   \n",
       "C3N-01825        UCEC             Yes                    Tumor   \n",
       "\n",
       "Name       confirmatory_study type_of_analyzed_samples age     sex  \\\n",
       "Patient_ID                                                           \n",
       "C3L-00006                 NaN                      NaN  64  Female   \n",
       "C3L-00008                 NaN                      NaN  58  Female   \n",
       "C3L-00032                 NaN                      NaN  50  Female   \n",
       "C3L-00084                 NaN                      NaN  74  Female   \n",
       "C3L-00090                 NaN                      NaN  75  Female   \n",
       "...                       ...                      ...  ..     ...   \n",
       "C3N-01520                 NaN                      NaN  69  Female   \n",
       "C3N-01521                 NaN                      NaN  75  Female   \n",
       "C3N-01537                 NaN                      NaN  74  Female   \n",
       "C3N-01802                 NaN                      NaN  85  Female   \n",
       "C3N-01825                 NaN                      NaN  70  Female   \n",
       "\n",
       "Name                             race               ethnicity  \\\n",
       "Patient_ID                                                      \n",
       "C3L-00006                       White  Not Hispanic or Latino   \n",
       "C3L-00008                       White  Not Hispanic or Latino   \n",
       "C3L-00032                       White  Not Hispanic or Latino   \n",
       "C3L-00084                       White  Not Hispanic or Latino   \n",
       "C3L-00090                       White  Not Hispanic or Latino   \n",
       "...                               ...                     ...   \n",
       "C3N-01520                     Unknown                 Unknown   \n",
       "C3N-01521                     Unknown                 Unknown   \n",
       "C3N-01537                     Unknown                 Unknown   \n",
       "C3N-01802   Black or African American  Not Hispanic or Latino   \n",
       "C3N-01825                     Unknown                 Unknown   \n",
       "\n",
       "Name       ethnicity_race_ancestry_identified  ...  \\\n",
       "Patient_ID                                     ...   \n",
       "C3L-00006                               White  ...   \n",
       "C3L-00008                               White  ...   \n",
       "C3L-00032                               White  ...   \n",
       "C3L-00084                               White  ...   \n",
       "C3L-00090                               White  ...   \n",
       "...                                       ...  ...   \n",
       "C3N-01520                            Slavonic  ...   \n",
       "C3N-01521                            Slavonic  ...   \n",
       "C3N-01537                            Slavonic  ...   \n",
       "C3N-01802                            American  ...   \n",
       "C3N-01825                            Slavonic  ...   \n",
       "\n",
       "Name       additional_treatment_pharmaceutical_therapy_for_new_tumor  \\\n",
       "Patient_ID                                                             \n",
       "C3L-00006                                                 NaN          \n",
       "C3L-00008                                                 NaN          \n",
       "C3L-00032                                                 NaN          \n",
       "C3L-00084                                                 NaN          \n",
       "C3L-00090                                                 Yes          \n",
       "...                                                       ...          \n",
       "C3N-01520                                                 NaN          \n",
       "C3N-01521                                                 NaN          \n",
       "C3N-01537                                                 Yes          \n",
       "C3N-01802                                                  No          \n",
       "C3N-01825                                                 NaN          \n",
       "\n",
       "Name       additional_treatment_immuno_for_new_tumor  \\\n",
       "Patient_ID                                             \n",
       "C3L-00006                                        NaN   \n",
       "C3L-00008                                        NaN   \n",
       "C3L-00032                                        NaN   \n",
       "C3L-00084                                        NaN   \n",
       "C3L-00090                                         No   \n",
       "...                                              ...   \n",
       "C3N-01520                                        NaN   \n",
       "C3N-01521                                        NaN   \n",
       "C3N-01537                                         No   \n",
       "C3N-01802                                         No   \n",
       "C3N-01825                                        NaN   \n",
       "\n",
       "Name       number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional  \\\n",
       "Patient_ID                                                                                                                            \n",
       "C3L-00006                                                 NaN                                                                         \n",
       "C3L-00008                                                 NaN                                                                         \n",
       "C3L-00032                                                 NaN                                                                         \n",
       "C3L-00084                                                 NaN                                                                         \n",
       "C3L-00090                                                 NaN                                                                         \n",
       "...                                                       ...                                                                         \n",
       "C3N-01520                                                 NaN                                                                         \n",
       "C3N-01521                                                 NaN                                                                         \n",
       "C3N-01537                                                62.0                                                                         \n",
       "C3N-01802                                                 NaN                                                                         \n",
       "C3N-01825                                                 NaN                                                                         \n",
       "\n",
       "Name       number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis  \\\n",
       "Patient_ID                                                                                                                         \n",
       "C3L-00006                                                 NaN                                                                      \n",
       "C3L-00008                                                 NaN                                                                      \n",
       "C3L-00032                                                 NaN                                                                      \n",
       "C3L-00084                                                 NaN                                                                      \n",
       "C3L-00090                                                 NaN                                                                      \n",
       "...                                                       ...                                                                      \n",
       "C3N-01520                                                 NaN                                                                      \n",
       "C3N-01521                                                 NaN                                                                      \n",
       "C3N-01537                                                 NaN                                                                      \n",
       "C3N-01802                                                 NaN                                                                      \n",
       "C3N-01825                                                 NaN                                                                      \n",
       "\n",
       "Name       Recurrence-free survival, days  \\\n",
       "Patient_ID                                  \n",
       "C3L-00006                             NaN   \n",
       "C3L-00008                             NaN   \n",
       "C3L-00032                             NaN   \n",
       "C3L-00084                             NaN   \n",
       "C3L-00090                            50.0   \n",
       "...                                   ...   \n",
       "C3N-01520                             NaN   \n",
       "C3N-01521                             NaN   \n",
       "C3N-01537                            58.0   \n",
       "C3N-01802                           598.0   \n",
       "C3N-01825                             NaN   \n",
       "\n",
       "Name       Recurrence-free survival from collection, days  \\\n",
       "Patient_ID                                                  \n",
       "C3L-00006                                             NaN   \n",
       "C3L-00008                                             NaN   \n",
       "C3L-00032                                             NaN   \n",
       "C3L-00084                                             NaN   \n",
       "C3L-00090                                            56.0   \n",
       "...                                                   ...   \n",
       "C3N-01520                                             NaN   \n",
       "C3N-01521                                             NaN   \n",
       "C3N-01537                                            31.0   \n",
       "C3N-01802                                           563.0   \n",
       "C3N-01825                                             NaN   \n",
       "\n",
       "Name       Recurrence status (1, yes; 0, no) Overall survival, days  \\\n",
       "Patient_ID                                                            \n",
       "C3L-00006                                  0                  737.0   \n",
       "C3L-00008                                  0                  898.0   \n",
       "C3L-00032                                  0                 1710.0   \n",
       "C3L-00084                                  0                  335.0   \n",
       "C3L-00090                                  1                 1281.0   \n",
       "...                                      ...                    ...   \n",
       "C3N-01520                                  0                  287.0   \n",
       "C3N-01521                                  0                  728.0   \n",
       "C3N-01537                                  1                  698.0   \n",
       "C3N-01802                                  1                  775.0   \n",
       "C3N-01825                                  0                  687.0   \n",
       "\n",
       "Name       Overall survival from collection, days  \\\n",
       "Patient_ID                                          \n",
       "C3L-00006                                   737.0   \n",
       "C3L-00008                                   898.0   \n",
       "C3L-00032                                  1710.0   \n",
       "C3L-00084                                   335.0   \n",
       "C3L-00090                                  1287.0   \n",
       "...                                           ...   \n",
       "C3N-01520                                   278.0   \n",
       "C3N-01521                                   681.0   \n",
       "C3N-01537                                   671.0   \n",
       "C3N-01802                                   740.0   \n",
       "C3N-01825                                   661.0   \n",
       "\n",
       "Name       Survival status (1, dead; 0, alive)  \n",
       "Patient_ID                                      \n",
       "C3L-00006                                  0.0  \n",
       "C3L-00008                                  0.0  \n",
       "C3L-00032                                  0.0  \n",
       "C3L-00084                                  0.0  \n",
       "C3L-00090                                  1.0  \n",
       "...                                        ...  \n",
       "C3N-01520                                  1.0  \n",
       "C3N-01521                                  0.0  \n",
       "C3N-01537                                  0.0  \n",
       "C3N-01802                                  0.0  \n",
       "C3N-01825                                  0.0  \n",
       "\n",
       "[103 rows x 124 columns]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "en.get_clinical('mssm')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 525)\n",
      "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 92 samples for the SHANK2 gene (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 437)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>race</th>\n",
       "      <th>SHANK2_Mutation</th>\n",
       "      <th>SHANK2_Location</th>\n",
       "      <th>SHANK2_Mutation_Status</th>\n",
       "      <th>Sample_Status</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>64</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.S1692R</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>58</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>50</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00084</th>\n",
       "      <td>74</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>75</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01520</th>\n",
       "      <td>69</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.P1586S</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01521</th>\n",
       "      <td>75</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01537</th>\n",
       "      <td>74</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01802</th>\n",
       "      <td>85</td>\n",
       "      <td>Female</td>\n",
       "      <td>Black or African American</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01825</th>\n",
       "      <td>70</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>103 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Name       age     sex                       race    SHANK2_Mutation  \\\n",
       "Patient_ID                                                             \n",
       "C3L-00006   64  Female                      White  Missense_Mutation   \n",
       "C3L-00008   58  Female                      White     Wildtype_Tumor   \n",
       "C3L-00032   50  Female                      White     Wildtype_Tumor   \n",
       "C3L-00084   74  Female                      White     Wildtype_Tumor   \n",
       "C3L-00090   75  Female                      White     Wildtype_Tumor   \n",
       "...         ..     ...                        ...                ...   \n",
       "C3N-01520   69  Female                    Unknown  Missense_Mutation   \n",
       "C3N-01521   75  Female                    Unknown     Wildtype_Tumor   \n",
       "C3N-01537   74  Female                    Unknown     Wildtype_Tumor   \n",
       "C3N-01802   85  Female  Black or African American     Wildtype_Tumor   \n",
       "C3N-01825   70  Female                    Unknown     Wildtype_Tumor   \n",
       "\n",
       "Name       SHANK2_Location SHANK2_Mutation_Status Sample_Status  \n",
       "Patient_ID                                                       \n",
       "C3L-00006         p.S1692R        Single_mutation         Tumor  \n",
       "C3L-00008      No_mutation         Wildtype_Tumor         Tumor  \n",
       "C3L-00032      No_mutation         Wildtype_Tumor         Tumor  \n",
       "C3L-00084      No_mutation         Wildtype_Tumor         Tumor  \n",
       "C3L-00090      No_mutation         Wildtype_Tumor         Tumor  \n",
       "...                    ...                    ...           ...  \n",
       "C3N-01520         p.P1586S        Single_mutation         Tumor  \n",
       "C3N-01521      No_mutation         Wildtype_Tumor         Tumor  \n",
       "C3N-01537      No_mutation         Wildtype_Tumor         Tumor  \n",
       "C3N-01802      No_mutation         Wildtype_Tumor         Tumor  \n",
       "C3N-01825      No_mutation         Wildtype_Tumor         Tumor  \n",
       "\n",
       "[103 rows x 7 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "en.join_metadata_to_mutations(\n",
    "    metadata_name=\"clinical\",\n",
    "    metadata_source=\"mssm\",\n",
    "    metadata_cols=[\"age\", \"sex\", \"race\"],\n",
    "    mutations_source=\"harmonized\",\n",
    "    mutations_genes=\"SHANK2\",\n",
    "    mutations_filter=[\"Missense_Mutation\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
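    "This command joins the `age`, `sex`, and `race` columns of the clinical metadata with the mutation data for the SHANK2 gene. Listing `Missense_Mutation` in `mutations_filter` gives missense mutations priority whenever a sample has more than one mutation in the gene, so each cell holds a single value rather than a list.\n",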
    "\n",
    "If you need to join metadata to mutation data for several genes, the `multi_join` function can be useful. Below, we join the same metadata with the mutation data for the SHANK2, PTEN, and TP53 genes, this time without filtering mutations. By default, `multi_join` handles mutations the same way `join_metadata_to_mutations` does when no `mutations_filter` is given: every mutation for a sample is returned in a list in the output dataframe, even when the sample has only one mutation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 92 samples for the SHANK2 gene, 28 samples for the PTEN gene, 80 samples for the TP53 gene (C:\\Users\\sabme\\AppData\\Local\\Temp\\ipykernel_2264\\3189298179.py, line 1)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>race</th>\n",
       "      <th>SHANK2_Mutation</th>\n",
       "      <th>SHANK2_Location</th>\n",
       "      <th>SHANK2_Mutation_Status</th>\n",
       "      <th>PTEN_Mutation</th>\n",
       "      <th>PTEN_Location</th>\n",
       "      <th>PTEN_Mutation_Status</th>\n",
       "      <th>TP53_Mutation</th>\n",
       "      <th>TP53_Location</th>\n",
       "      <th>TP53_Mutation_Status</th>\n",
       "      <th>Sample_Status</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>64</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>[Missense_Mutation]</td>\n",
       "      <td>[p.S1692R]</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>[Missense_Mutation, Nonsense_Mutation]</td>\n",
       "      <td>[p.R130Q, p.R233*]</td>\n",
       "      <td>Multiple_mutation</td>\n",
       "      <td>[Missense_Mutation]</td>\n",
       "      <td>[p.R248W]</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>58</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>[Missense_Mutation]</td>\n",
       "      <td>[p.G127R]</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>50</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>[Nonsense_Mutation]</td>\n",
       "      <td>[p.W111*]</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00084</th>\n",
       "      <td>74</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>75</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>[Missense_Mutation]</td>\n",
       "      <td>[p.R130G]</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01520</th>\n",
       "      <td>69</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>[Missense_Mutation]</td>\n",
       "      <td>[p.P1586S]</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>[Frame_Shift_Del, Frame_Shift_Ins]</td>\n",
       "      <td>[p.N323fs, p.D268fs]</td>\n",
       "      <td>Multiple_mutation</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01521</th>\n",
       "      <td>75</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>[Missense_Mutation]</td>\n",
       "      <td>[p.H193L]</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01537</th>\n",
       "      <td>74</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01802</th>\n",
       "      <td>85</td>\n",
       "      <td>Female</td>\n",
       "      <td>Black or African American</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>[Missense_Mutation]</td>\n",
       "      <td>[p.P27S]</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01825</th>\n",
       "      <td>70</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>[Wildtype_Tumor]</td>\n",
       "      <td>[No_mutation]</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>[Missense_Mutation]</td>\n",
       "      <td>[p.R175H]</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>103 rows × 13 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Name       age     sex                       race      SHANK2_Mutation  \\\n",
       "Patient_ID                                                               \n",
       "C3L-00006   64  Female                      White  [Missense_Mutation]   \n",
       "C3L-00008   58  Female                      White     [Wildtype_Tumor]   \n",
       "C3L-00032   50  Female                      White     [Wildtype_Tumor]   \n",
       "C3L-00084   74  Female                      White     [Wildtype_Tumor]   \n",
       "C3L-00090   75  Female                      White     [Wildtype_Tumor]   \n",
       "...         ..     ...                        ...                  ...   \n",
       "C3N-01520   69  Female                    Unknown  [Missense_Mutation]   \n",
       "C3N-01521   75  Female                    Unknown     [Wildtype_Tumor]   \n",
       "C3N-01537   74  Female                    Unknown     [Wildtype_Tumor]   \n",
       "C3N-01802   85  Female  Black or African American     [Wildtype_Tumor]   \n",
       "C3N-01825   70  Female                    Unknown     [Wildtype_Tumor]   \n",
       "\n",
       "Name       SHANK2_Location SHANK2_Mutation_Status  \\\n",
       "Patient_ID                                          \n",
       "C3L-00006       [p.S1692R]        Single_mutation   \n",
       "C3L-00008    [No_mutation]         Wildtype_Tumor   \n",
       "C3L-00032    [No_mutation]         Wildtype_Tumor   \n",
       "C3L-00084    [No_mutation]         Wildtype_Tumor   \n",
       "C3L-00090    [No_mutation]         Wildtype_Tumor   \n",
       "...                    ...                    ...   \n",
       "C3N-01520       [p.P1586S]        Single_mutation   \n",
       "C3N-01521    [No_mutation]         Wildtype_Tumor   \n",
       "C3N-01537    [No_mutation]         Wildtype_Tumor   \n",
       "C3N-01802    [No_mutation]         Wildtype_Tumor   \n",
       "C3N-01825    [No_mutation]         Wildtype_Tumor   \n",
       "\n",
       "Name                                 PTEN_Mutation         PTEN_Location  \\\n",
       "Patient_ID                                                                 \n",
       "C3L-00006   [Missense_Mutation, Nonsense_Mutation]    [p.R130Q, p.R233*]   \n",
       "C3L-00008                      [Missense_Mutation]             [p.G127R]   \n",
       "C3L-00032                      [Nonsense_Mutation]             [p.W111*]   \n",
       "C3L-00084                         [Wildtype_Tumor]         [No_mutation]   \n",
       "C3L-00090                      [Missense_Mutation]             [p.R130G]   \n",
       "...                                            ...                   ...   \n",
       "C3N-01520       [Frame_Shift_Del, Frame_Shift_Ins]  [p.N323fs, p.D268fs]   \n",
       "C3N-01521                         [Wildtype_Tumor]         [No_mutation]   \n",
       "C3N-01537                         [Wildtype_Tumor]         [No_mutation]   \n",
       "C3N-01802                         [Wildtype_Tumor]         [No_mutation]   \n",
       "C3N-01825                         [Wildtype_Tumor]         [No_mutation]   \n",
       "\n",
       "Name       PTEN_Mutation_Status        TP53_Mutation  TP53_Location  \\\n",
       "Patient_ID                                                            \n",
       "C3L-00006     Multiple_mutation  [Missense_Mutation]      [p.R248W]   \n",
       "C3L-00008       Single_mutation     [Wildtype_Tumor]  [No_mutation]   \n",
       "C3L-00032       Single_mutation     [Wildtype_Tumor]  [No_mutation]   \n",
       "C3L-00084        Wildtype_Tumor     [Wildtype_Tumor]  [No_mutation]   \n",
       "C3L-00090       Single_mutation     [Wildtype_Tumor]  [No_mutation]   \n",
       "...                         ...                  ...            ...   \n",
       "C3N-01520     Multiple_mutation     [Wildtype_Tumor]  [No_mutation]   \n",
       "C3N-01521        Wildtype_Tumor  [Missense_Mutation]      [p.H193L]   \n",
       "C3N-01537        Wildtype_Tumor     [Wildtype_Tumor]  [No_mutation]   \n",
       "C3N-01802        Wildtype_Tumor  [Missense_Mutation]       [p.P27S]   \n",
       "C3N-01825        Wildtype_Tumor  [Missense_Mutation]      [p.R175H]   \n",
       "\n",
       "Name       TP53_Mutation_Status Sample_Status  \n",
       "Patient_ID                                     \n",
       "C3L-00006       Single_mutation         Tumor  \n",
       "C3L-00008        Wildtype_Tumor         Tumor  \n",
       "C3L-00032        Wildtype_Tumor         Tumor  \n",
       "C3L-00084        Wildtype_Tumor         Tumor  \n",
       "C3L-00090        Wildtype_Tumor         Tumor  \n",
       "...                         ...           ...  \n",
       "C3N-01520        Wildtype_Tumor         Tumor  \n",
       "C3N-01521       Single_mutation         Tumor  \n",
       "C3N-01537        Wildtype_Tumor         Tumor  \n",
       "C3N-01802       Single_mutation         Tumor  \n",
       "C3N-01825       Single_mutation         Tumor  \n",
       "\n",
       "[103 rows x 13 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "en.multi_join({\"mssm clinical\": [\"age\", \"sex\", \"race\"],\n",
    "               \"harmonized somatic_mutation\": [\"SHANK2\", \"PTEN\", \"TP53\"]})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is an example of joining clinical data with mutations while applying a mutation filter, so that each sample is reduced to a single mutation per gene:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "cptac warning: Unknown mutation type Intron. Assigned lowest priority in filtering. (C:\\Users\\sabme\\anaconda3\\lib\\site-packages\\cptac\\cancers\\cancer.py, line 525)\n",
      "cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 92 samples for the SHANK2 gene, 28 samples for the PTEN gene, 80 samples for the TP53 gene (C:\\Users\\sabme\\AppData\\Local\\Temp\\ipykernel_2264\\3101478147.py, line 1)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>race</th>\n",
       "      <th>SHANK2_Mutation</th>\n",
       "      <th>SHANK2_Location</th>\n",
       "      <th>SHANK2_Mutation_Status</th>\n",
       "      <th>PTEN_Mutation</th>\n",
       "      <th>PTEN_Location</th>\n",
       "      <th>PTEN_Mutation_Status</th>\n",
       "      <th>TP53_Mutation</th>\n",
       "      <th>TP53_Location</th>\n",
       "      <th>TP53_Mutation_Status</th>\n",
       "      <th>Sample_Status</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C3L-00006</th>\n",
       "      <td>64</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.S1692R</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.R130Q</td>\n",
       "      <td>Multiple_mutation</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.R248W</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00008</th>\n",
       "      <td>58</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.G127R</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00032</th>\n",
       "      <td>50</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Nonsense_Mutation</td>\n",
       "      <td>p.W111*</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00084</th>\n",
       "      <td>74</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3L-00090</th>\n",
       "      <td>75</td>\n",
       "      <td>Female</td>\n",
       "      <td>White</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.R130G</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01520</th>\n",
       "      <td>69</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.P1586S</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Frame_Shift_Ins</td>\n",
       "      <td>p.D268fs</td>\n",
       "      <td>Multiple_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01521</th>\n",
       "      <td>75</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.H193L</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01537</th>\n",
       "      <td>74</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01802</th>\n",
       "      <td>85</td>\n",
       "      <td>Female</td>\n",
       "      <td>Black or African American</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.P27S</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C3N-01825</th>\n",
       "      <td>70</td>\n",
       "      <td>Female</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>No_mutation</td>\n",
       "      <td>Wildtype_Tumor</td>\n",
       "      <td>Missense_Mutation</td>\n",
       "      <td>p.R175H</td>\n",
       "      <td>Single_mutation</td>\n",
       "      <td>Tumor</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>103 rows × 13 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Name       age     sex                       race    SHANK2_Mutation  \\\n",
       "Patient_ID                                                             \n",
       "C3L-00006   64  Female                      White  Missense_Mutation   \n",
       "C3L-00008   58  Female                      White     Wildtype_Tumor   \n",
       "C3L-00032   50  Female                      White     Wildtype_Tumor   \n",
       "C3L-00084   74  Female                      White     Wildtype_Tumor   \n",
       "C3L-00090   75  Female                      White     Wildtype_Tumor   \n",
       "...         ..     ...                        ...                ...   \n",
       "C3N-01520   69  Female                    Unknown  Missense_Mutation   \n",
       "C3N-01521   75  Female                    Unknown     Wildtype_Tumor   \n",
       "C3N-01537   74  Female                    Unknown     Wildtype_Tumor   \n",
       "C3N-01802   85  Female  Black or African American     Wildtype_Tumor   \n",
       "C3N-01825   70  Female                    Unknown     Wildtype_Tumor   \n",
       "\n",
       "Name       SHANK2_Location SHANK2_Mutation_Status      PTEN_Mutation  \\\n",
       "Patient_ID                                                             \n",
       "C3L-00006         p.S1692R        Single_mutation  Missense_Mutation   \n",
       "C3L-00008      No_mutation         Wildtype_Tumor  Missense_Mutation   \n",
       "C3L-00032      No_mutation         Wildtype_Tumor  Nonsense_Mutation   \n",
       "C3L-00084      No_mutation         Wildtype_Tumor     Wildtype_Tumor   \n",
       "C3L-00090      No_mutation         Wildtype_Tumor  Missense_Mutation   \n",
       "...                    ...                    ...                ...   \n",
       "C3N-01520         p.P1586S        Single_mutation    Frame_Shift_Ins   \n",
       "C3N-01521      No_mutation         Wildtype_Tumor     Wildtype_Tumor   \n",
       "C3N-01537      No_mutation         Wildtype_Tumor     Wildtype_Tumor   \n",
       "C3N-01802      No_mutation         Wildtype_Tumor     Wildtype_Tumor   \n",
       "C3N-01825      No_mutation         Wildtype_Tumor     Wildtype_Tumor   \n",
       "\n",
       "Name       PTEN_Location PTEN_Mutation_Status      TP53_Mutation  \\\n",
       "Patient_ID                                                         \n",
       "C3L-00006        p.R130Q    Multiple_mutation  Missense_Mutation   \n",
       "C3L-00008        p.G127R      Single_mutation     Wildtype_Tumor   \n",
       "C3L-00032        p.W111*      Single_mutation     Wildtype_Tumor   \n",
       "C3L-00084    No_mutation       Wildtype_Tumor     Wildtype_Tumor   \n",
       "C3L-00090        p.R130G      Single_mutation     Wildtype_Tumor   \n",
       "...                  ...                  ...                ...   \n",
       "C3N-01520       p.D268fs    Multiple_mutation     Wildtype_Tumor   \n",
       "C3N-01521    No_mutation       Wildtype_Tumor  Missense_Mutation   \n",
       "C3N-01537    No_mutation       Wildtype_Tumor     Wildtype_Tumor   \n",
       "C3N-01802    No_mutation       Wildtype_Tumor  Missense_Mutation   \n",
       "C3N-01825    No_mutation       Wildtype_Tumor  Missense_Mutation   \n",
       "\n",
       "Name       TP53_Location TP53_Mutation_Status Sample_Status  \n",
       "Patient_ID                                                   \n",
       "C3L-00006        p.R248W      Single_mutation         Tumor  \n",
       "C3L-00008    No_mutation       Wildtype_Tumor         Tumor  \n",
       "C3L-00032    No_mutation       Wildtype_Tumor         Tumor  \n",
       "C3L-00084    No_mutation       Wildtype_Tumor         Tumor  \n",
       "C3L-00090    No_mutation       Wildtype_Tumor         Tumor  \n",
       "...                  ...                  ...           ...  \n",
       "C3N-01520    No_mutation       Wildtype_Tumor         Tumor  \n",
       "C3N-01521        p.H193L      Single_mutation         Tumor  \n",
       "C3N-01537    No_mutation       Wildtype_Tumor         Tumor  \n",
       "C3N-01802         p.P27S      Single_mutation         Tumor  \n",
       "C3N-01825        p.R175H      Single_mutation         Tumor  \n",
       "\n",
       "[103 rows x 13 columns]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
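    "# Join three clinical columns with the somatic mutation data for three genes.\n",
    "# mutations_filter lists the mutation types to prioritize for each gene.\n",
    "survival_and_SHANK2 = en.multi_join(\n",
    "    {\"mssm clinical\": [\"age\", \"sex\", \"race\"],\n",
    "     \"harmonized somatic_mutation\": [\"SHANK2\", \"PTEN\", \"TP53\"]},\n",
    "    mutations_filter=[\"Missense_Mutation\"])\n",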
    "\n",
    "survival_and_SHANK2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
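    "Remember that the `mutations_filter` parameter takes a list. In this example, `Missense_Mutation` is prioritized for all genes specified: when a sample has more than one mutation in a gene, the missense mutation is the one reported. Other mutation types still appear for samples that have no missense mutation in that gene, such as the `Nonsense_Mutation` in PTEN for C3L-00032."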
   ]
  },
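  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check of what a joined table contains, the mutation types reported for a gene can be tallied with plain pandas. The sketch below is only an illustration: the small `joined` dataframe is a hypothetical stand-in built from five PTEN rows of the output above, and in this notebook you would use `survival_and_SHANK2` instead. It shows that types other than `Missense_Mutation` can still be present, and one way to keep only the missense rows afterwards.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-in: five rows copied from the PTEN columns shown above.\n",
    "# In the notebook, use the joined dataframe itself instead.\n",
    "joined = pd.DataFrame(\n",
    "    {\n",
    "        'PTEN_Mutation': [\n",
    "            'Missense_Mutation',\n",
    "            'Missense_Mutation',\n",
    "            'Nonsense_Mutation',\n",
    "            'Wildtype_Tumor',\n",
    "            'Missense_Mutation',\n",
    "        ],\n",
    "        'PTEN_Mutation_Status': [\n",
    "            'Multiple_mutation',\n",
    "            'Single_mutation',\n",
    "            'Single_mutation',\n",
    "            'Wildtype_Tumor',\n",
    "            'Single_mutation',\n",
    "        ],\n",
    "    },\n",
    "    index=pd.Index(\n",
    "        ['C3L-00006', 'C3L-00008', 'C3L-00032', 'C3L-00084', 'C3L-00090'],\n",
    "        name='Patient_ID',\n",
    "    ),\n",
    ")\n",
    "\n",
    "# Tally how often each mutation type is reported for the gene.\n",
    "pten_counts = joined['PTEN_Mutation'].value_counts()\n",
    "print(pten_counts)\n",
    "\n",
    "# Keep only the samples whose reported PTEN mutation is a missense mutation.\n",
    "pten_missense = joined[joined['PTEN_Mutation'] == 'Missense_Mutation']\n",
    "print(pten_missense.index.tolist())\n",
    "```\n",
    "\n",
    "On the real joined dataframe, the same two steps apply to any column whose name ends in `_Mutation`."
   ]
  },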
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exporting dataframes\n",
    "\n",
    "To export a dataframe to a file, call its `to_csv` method, passing the path to save the file to and the value separator to use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "survival_and_SHANK2.to_csv(path_or_buf=\"clinical_and_mutations.tsv\", sep='\\t')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}