{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Use Case 8: Outliers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When looking at data, we often want to identify outliers: extremely high or low data points. In this use case we will show you how to use the Blacksheep package to find them in the CPTAC data. For more detailed information about the Blacksheep package, see [this repository](https://github.com/ruggleslab/blackSheep/).\n",
    "\n",
    "In the CPTAC breast cancer study ([here](https://www.nature.com/articles/nature18003)), it was shown that tumors classified as HER-2 enriched are frequently outliers for high abundance of ERBB2 phosphorylation, protein, and mRNA (see [figure 4](https://www.nature.com/articles/nature18003/figures/4) of the manuscript). In this use case we will show the same phenomenon in an independent cohort of breast cancer tumors, whose data are included in the cptac package."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Importing packages and setting up your notebook\n",
    "\n",
    "Before we begin performing the analysis, we must import the packages we will be using. In this first code block, we import the standard set of data science packages.\n",
    "\n",
    "We will need an external package called blacksheep. To install it run the following on your command line:\n",
    "```\n",
    "pip install blksheep\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this next code block, we import the blacksheep and cptac packages and load the clinical and proteomics data for the breast cancer cohort."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                                         \r"
     ]
    }
   ],
   "source": [
    "import blacksheep\n",
    "import cptac\n",
    "brca = cptac.Brca()\n",
    "clinical = brca.get_clinical()\n",
    "proteomics = brca.get_proteomics()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Binarize Data\n",
    "\n",
    "The Blacksheep package requires that each annotation be a binary variable. The PAM50 classification assigns each tumor in our cohort to one of five categories: LumA, LumB, Basal, Her2, and Normal-like. We will use the binarize_annotations function to turn this single column into one binary column per category (for example, Her2 versus not-Her2). We will call the resulting table 'annotations'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PAM50_LumA</th>\n",
       "      <th>PAM50_Basal</th>\n",
       "      <th>PAM50_LumB</th>\n",
       "      <th>PAM50_Her2</th>\n",
       "      <th>PAM50_Normal-like</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>CPT000814</th>\n",
       "      <td>not-LumA</td>\n",
       "      <td>Basal</td>\n",
       "      <td>not-LumB</td>\n",
       "      <td>not-Her2</td>\n",
       "      <td>not-Normal-like</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CPT001846</th>\n",
       "      <td>not-LumA</td>\n",
       "      <td>Basal</td>\n",
       "      <td>not-LumB</td>\n",
       "      <td>not-Her2</td>\n",
       "      <td>not-Normal-like</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>X01BR001</th>\n",
       "      <td>not-LumA</td>\n",
       "      <td>Basal</td>\n",
       "      <td>not-LumB</td>\n",
       "      <td>not-Her2</td>\n",
       "      <td>not-Normal-like</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>X01BR008</th>\n",
       "      <td>not-LumA</td>\n",
       "      <td>Basal</td>\n",
       "      <td>not-LumB</td>\n",
       "      <td>not-Her2</td>\n",
       "      <td>not-Normal-like</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>X01BR009</th>\n",
       "      <td>not-LumA</td>\n",
       "      <td>Basal</td>\n",
       "      <td>not-LumB</td>\n",
       "      <td>not-Her2</td>\n",
       "      <td>not-Normal-like</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           PAM50_LumA PAM50_Basal PAM50_LumB PAM50_Her2 PAM50_Normal-like\n",
       "Patient_ID                                                               \n",
       "CPT000814    not-LumA       Basal   not-LumB   not-Her2   not-Normal-like\n",
       "CPT001846    not-LumA       Basal   not-LumB   not-Her2   not-Normal-like\n",
       "X01BR001     not-LumA       Basal   not-LumB   not-Her2   not-Normal-like\n",
       "X01BR008     not-LumA       Basal   not-LumB   not-Her2   not-Normal-like\n",
       "X01BR009     not-LumA       Basal   not-LumB   not-Her2   not-Normal-like"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "annotations = clinical[['PAM50']].copy()\n",
    "annotations = blacksheep.binarize_annotations(annotations)\n",
    "annotations.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Perform Outlier Analysis\n",
    "\n",
    "Now that our dataframes are correctly formatted, we will start looking for outliers.\n",
    "\n",
    "We will start by using the deva function found in the blacksheep package. This function takes the proteomics dataframe (which we transpose to fit the requirements of the function) and the annotations dataframe that contains the binarized columns. We also indicate that we want to look for up-regulated genes and that we do not want to aggregate the data. The function returns two things:\n",
    "1. A data object with a dataframe that states whether each sample is an outlier for a specific protein. In the code block below, we named this 'outliers'.\n",
    "2. A data object with a dataframe of q-values indicating whether a gene shows an enrichment of outliers in a specific subset of tumors, as defined in annotations. In the code block below, we named this 'qvalues'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "07/29/2021 16:11:25:WARNING:No rows tested for fisherFDR_PAM50_LumA_not-LumA\n",
      "07/29/2021 16:11:25:WARNING:No rows tested for fisherFDR_PAM50_LumA_LumA\n",
      "07/29/2021 16:11:25:WARNING:No rows tested for fisherFDR_PAM50_Basal_not-Basal\n",
      "07/29/2021 16:11:26:WARNING:No rows tested for fisherFDR_PAM50_LumB_not-LumB\n",
      "07/29/2021 16:11:26:WARNING:No rows tested for fisherFDR_PAM50_Her2_not-Her2\n",
      "07/29/2021 16:11:28:WARNING:No rows tested for fisherFDR_PAM50_Normal-like_not-Normal-like\n"
     ]
    }
   ],
   "source": [
    "outliers, qvalues = blacksheep.deva(proteomics.transpose(),\n",
    "                                      annotations,\n",
    "                                      up_or_down='up',\n",
    "                                      aggregate=False,\n",
    "                                      frac_filter=0.3)"
   ]
  },
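  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea of an up-regulated outlier concrete, here is a minimal sketch that flags values lying more than 1.5 interquartile ranges above the median. The median-plus-1.5-IQR threshold, the `flag_high_outliers` helper, and the toy abundance values are assumptions made for illustration. This is not blacksheep's own code, so consult the package documentation for the exact rule that `deva` applies.\n",
    "\n",
    "```python\n",
    "import statistics\n",
    "\n",
    "def flag_high_outliers(values, iqrs=1.5):\n",
    "    # Hypothetical helper: 1.0 marks a value above median + iqrs * IQR, 0.0 otherwise.\n",
    "    q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')\n",
    "    threshold = median + iqrs * (q3 - q1)\n",
    "    return [1.0 if v > threshold else 0.0 for v in values]\n",
    "\n",
    "# Toy abundance values for one protein across eight samples (made-up numbers).\n",
    "toy_abundance = [-0.4, -0.1, 0.0, 0.2, 0.3, 0.5, 0.6, 4.2]\n",
    "flags = flag_high_outliers(toy_abundance)\n",
    "print(flags)\n",
    "```\n",
    "\n",
    "Only the last toy sample sits far enough above the rest to be flagged, which mirrors the 0/1 encoding used in the outliers table."
   ]
  },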
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Inspect Results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the two returned tables are fairly complex, we will look at each one individually.\n",
    "\n",
    "The outliers table indicates whether each sample is an outlier for a particular gene. In this use case, we will focus on ERBB2. The first line below simplifies the index for each row by dropping the database ID and leaving the gene name. For brevity, we display only a portion of the table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CPT000814_outliers</th>\n",
       "      <th>CPT001846_outliers</th>\n",
       "      <th>X01BR001_outliers</th>\n",
       "      <th>X01BR008_outliers</th>\n",
       "      <th>X01BR009_outliers</th>\n",
       "      <th>X01BR010_outliers</th>\n",
       "      <th>X01BR015_outliers</th>\n",
       "      <th>X01BR017_outliers</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ERBB2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       CPT000814_outliers  CPT001846_outliers  X01BR001_outliers  \\\n",
       "Name                                                               \n",
       "ERBB2                 0.0                 0.0                0.0   \n",
       "\n",
       "       X01BR008_outliers  X01BR009_outliers  X01BR010_outliers  \\\n",
       "Name                                                             \n",
       "ERBB2                0.0                0.0                0.0   \n",
       "\n",
       "       X01BR015_outliers  X01BR017_outliers  \n",
       "Name                                         \n",
       "ERBB2                0.0                1.0  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
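    ],
   "source": [
    "outliers.df.index = outliers.df.index.droplevel('Database_ID')\n",
    "# Select the ERBB2 row by exact name; str.match('ERBB2') would also match any gene name that starts with ERBB2.\n",
    "erbb2_outliers = outliers.df[outliers.df.index == 'ERBB2']\n",
    "erbb2_outliers.iloc[:, :8]"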
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the table above you can see that most of the samples have a 0, indicating that the sample is not an outlier for ERBB2 protein abundance. X01BR017, however, has a 1, indicating that this particular sample is an outlier.\n",
    "\n",
    "The outliers table contains columns for both outliers and notOutliers. The notOutliers columns are redundant, so we will remove them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "erbb2_outliers = erbb2_outliers.loc[:,~erbb2_outliers.columns.str.endswith('_notOutliers')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now compile a list of all the samples that were considered to be outliers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['X01BR017_outliers', 'X05BR026_outliers', 'X09BR004_outliers', 'X09BR005_outliers', 'X11BR004_outliers', 'X11BR010_outliers', 'X11BR011_outliers', 'X11BR028_outliers', 'X11BR030_outliers', 'X11BR038_outliers', 'X11BR060_outliers', 'X18BR009_outliers', 'X21BR001_outliers', 'X22BR005_outliers']\n"
     ]
    }
   ],
   "source": [
    "outlier_list = erbb2_outliers.columns[erbb2_outliers.isin([1.0]).all()].tolist()\n",
    "print(outlier_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Visualizing Outliers\n",
    "To understand what this means, we will plot the proteomics data for the ERBB2 gene and label the outlier samples. Before we graph the result we will join the proteomics and clinical data, isolating the PAM50 subtype and ERBB2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "combined_data = brca.join_metadata_to_omics(metadata_df_name=\"clinical\", \n",
    "                                            omics_df_name=\"proteomics\", \n",
    "                                            metadata_cols=[\"PAM50\"],\n",
    "                                            omics_genes=['ERBB2'])\n",
    "combined_data.columns = combined_data.columns.droplevel(\"Database_ID\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now create the graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "\n",
      "text/plain": [
       "<Figure size 800x800 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
    ],
   "source": [
    "plt.figure(figsize=(8, 8))\n",
    "sns.set_palette('colorblind')\n",
    "ax = sns.boxplot(data=combined_data, showfliers=False, y='ERBB2_proteomics', color='lightgray')\n",
    "left = False\n",
    "# Label each outlier data point, alternating the label between the left and right of the point.\n",
    "for sample in outlier_list:\n",
    "    if left:\n",
    "        position = -0.08\n",
    "        left = False\n",
    "    else:\n",
    "        position = 0.01\n",
    "        left = True\n",
    "    # Strip the '_outliers' suffix to recover the sample ID.\n",
    "    sample = sample.split(\"_\")[0]\n",
    "    ax.annotate(sample, (position, combined_data.loc[sample, 'ERBB2_proteomics']))\n",
    "ax = sns.swarmplot(data=combined_data, y='ERBB2_proteomics')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see from this graph, the samples we labeled, which had a 1.0 in the outliers table, are all located at the top of the graph, indicating that ERBB2 protein abundance is unusually high in these samples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: Looking at the q-value table\n",
    "\n",
    "Let's now take a look at the qvalues table. Remember that this table reports, for each gene, a q-value (an FDR-adjusted p-value) for the enrichment of outliers in each category defined in our annotations dataframe. Smaller q-values indicate stronger evidence of enrichment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>fisherFDR_PAM50_Basal_Basal</th>\n",
       "      <th>fisherFDR_PAM50_LumB_LumB</th>\n",
       "      <th>fisherFDR_PAM50_Her2_Her2</th>\n",
       "      <th>fisherFDR_PAM50_Normal-like_Normal-like</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Name</th>\n",
       "      <th>Database_ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>A2ML1</th>\n",
       "      <th>NP_653271.2|NP_001269353.1</th>\n",
       "      <td>1.441146e-07</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ABCC11</th>\n",
       "      <th>NP_149163.2|NP_660187.1|NP_150229.2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.001545</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ABCC5</th>\n",
       "      <th>NP_005679.2|NP_001306961.1|NP_001018881.1</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.093281</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ACACB</th>\n",
       "      <th>NP_001084.3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.068994</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ACAD8</th>\n",
       "      <th>NP_055199.1</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.093281</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                  fisherFDR_PAM50_Basal_Basal  \\\n",
       "Name   Database_ID                                                              \n",
       "A2ML1  NP_653271.2|NP_001269353.1                                1.441146e-07   \n",
       "ABCC11 NP_149163.2|NP_660187.1|NP_150229.2                                NaN   \n",
       "ABCC5  NP_005679.2|NP_001306961.1|NP_001018881.1                          NaN   \n",
       "ACACB  NP_001084.3                                                        NaN   \n",
       "ACAD8  NP_055199.1                                                        NaN   \n",
       "\n",
       "                                                  fisherFDR_PAM50_LumB_LumB  \\\n",
       "Name   Database_ID                                                            \n",
       "A2ML1  NP_653271.2|NP_001269353.1                                       NaN   \n",
       "ABCC11 NP_149163.2|NP_660187.1|NP_150229.2                              NaN   \n",
       "ABCC5  NP_005679.2|NP_001306961.1|NP_001018881.1                        NaN   \n",
       "ACACB  NP_001084.3                                                      NaN   \n",
       "ACAD8  NP_055199.1                                                      NaN   \n",
       "\n",
       "                                                  fisherFDR_PAM50_Her2_Her2  \\\n",
       "Name   Database_ID                                                            \n",
       "A2ML1  NP_653271.2|NP_001269353.1                                       NaN   \n",
       "ABCC11 NP_149163.2|NP_660187.1|NP_150229.2                         0.001545   \n",
       "ABCC5  NP_005679.2|NP_001306961.1|NP_001018881.1                        NaN   \n",
       "ACACB  NP_001084.3                                                      NaN   \n",
       "ACAD8  NP_055199.1                                                      NaN   \n",
       "\n",
       "                                                  fisherFDR_PAM50_Normal-like_Normal-like  \n",
       "Name   Database_ID                                                                         \n",
       "A2ML1  NP_653271.2|NP_001269353.1                                                     NaN  \n",
       "ABCC11 NP_149163.2|NP_660187.1|NP_150229.2                                            NaN  \n",
       "ABCC5  NP_005679.2|NP_001306961.1|NP_001018881.1                                 0.093281  \n",
       "ACACB  NP_001084.3                                                               0.068994  \n",
       "ACAD8  NP_055199.1                                                               0.093281  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "qvalues.df.head()"
   ]
  },
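  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column names above begin with `fisherFDR`, which suggests that each value comes from a Fisher's exact test followed by a false discovery rate correction. As a rough illustration of the first half only, the sketch below computes a one-sided Fisher's exact p-value for a single 2x2 table of outlier counts. The `fisher_exact_greater` helper and the counts are hypothetical, and blacksheep's actual implementation may build the table and correct for multiple testing differently.\n",
    "\n",
    "```python\n",
    "from math import comb\n",
    "\n",
    "def fisher_exact_greater(a, b, c, d):\n",
    "    # One-sided p-value: probability of seeing at least `a` outliers in the group\n",
    "    # when the row and column totals of the 2x2 table are held fixed.\n",
    "    #   a = outliers in group,      b = non-outliers in group\n",
    "    #   c = outliers outside group, d = non-outliers outside group\n",
    "    group_size = a + b\n",
    "    total_outliers = a + c\n",
    "    n = a + b + c + d\n",
    "    max_a = min(group_size, total_outliers)\n",
    "    tail = sum(comb(total_outliers, k) * comb(n - total_outliers, group_size - k)\n",
    "               for k in range(a, max_a + 1))\n",
    "    return tail / comb(n, group_size)\n",
    "\n",
    "# Made-up counts: 10 of 14 samples in the group are outliers, versus 4 of 108 elsewhere.\n",
    "p_value = fisher_exact_greater(10, 4, 4, 104)\n",
    "print(f'{p_value:.3g}')\n",
    "```\n",
    "\n",
    "A small p-value here means that outliers are concentrated in the group more than the fixed totals would lead us to expect by chance."
   ]
  },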
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This table includes all the q-values. Before analyzing the table, we will remove any non-significant q-values. For our purposes, we will replace any q-value greater than 0.05 with NaN."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "for col in qvalues.df.columns:\n",
    "    qvalues.df.loc[qvalues.df[col] > 0.05, col] = np.nan"
   ]
  },
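  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same masking can be written without an explicit loop. The sketch below uses a small made-up table (the gene names are placeholders and the column names are shortened) to show `DataFrame.where`, which keeps entries that satisfy a condition and replaces the rest with NaN. It is offered as an alternative sketch rather than a change to the analysis.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Toy q-value table; the gene names and numbers are illustrative only.\n",
    "toy_qvalues = pd.DataFrame(\n",
    "    {'Her2': [0.001545, np.nan, 0.2], 'Normal-like': [np.nan, 0.093281, 0.01]},\n",
    "    index=['GENE_A', 'GENE_B', 'GENE_C'],\n",
    ")\n",
    "\n",
    "# Keep q-values at or below 0.05; everything else (including existing NaN) becomes NaN.\n",
    "significant = toy_qvalues.where(toy_qvalues <= 0.05)\n",
    "print(significant)\n",
    "```\n",
    "\n",
    "Unlike the loop above, this returns a new dataframe and leaves the original table untouched."
   ]
  },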
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now isolate the ERBB2 gene."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>fisherFDR_PAM50_Basal_Basal</th>\n",
       "      <th>fisherFDR_PAM50_LumB_LumB</th>\n",
       "      <th>fisherFDR_PAM50_Her2_Her2</th>\n",
       "      <th>fisherFDR_PAM50_Normal-like_Normal-like</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ERBB2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000366</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Name  fisherFDR_PAM50_Basal_Basal  fisherFDR_PAM50_LumB_LumB  \\\n",
       "0  ERBB2                          NaN                        NaN   \n",
       "\n",
       "   fisherFDR_PAM50_Her2_Her2  fisherFDR_PAM50_Normal-like_Normal-like  \n",
       "0                   0.000366                                      NaN  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
   "source": [
    "qvalues.df.index = qvalues.df.index.droplevel('Database_ID')\n",
    "# Keep only the ERBB2 row, then move the gene name from the index into a 'Name' column.\n",
    "qvalues = qvalues.df[qvalues.df.index == 'ERBB2'].reset_index()\n",
    "qvalues.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we see that Her2 is the only PAM50 subtype with a significant enrichment of ERBB2 outliers, which is exactly what we expect. To visualize this pattern, we will create a graph similar to the one above, but with each PAM50 category plotted separately and colored differently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "\n",
      "text/plain": [
       "<Figure size 800x800 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
    ],
   "source": [
    "plt.figure(figsize=(8, 8))\n",
    "sns.set_palette('colorblind')\n",
    "# Map each subtype to its x-axis position; passing the same order to both plots keeps the labels aligned.\n",
    "cols = {'Basal': 0, 'Her2': 1, 'LumA': 2, 'LumB': 3, 'Normal-like': 4}\n",
    "ax = sns.boxplot(data=combined_data, y='ERBB2_proteomics', x='PAM50', order=list(cols), color='lightgray')\n",
    "ax = sns.swarmplot(data=combined_data, y='ERBB2_proteomics', x='PAM50', order=list(cols), hue='PAM50')\n",
    "for sample in outlier_list:\n",
    "    sample = sample.split(\"_\")[0]\n",
    "    ax.annotate(sample, (cols[combined_data.loc[sample, 'PAM50']], combined_data.loc[sample, 'ERBB2_proteomics']))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the graph, you can see that the distribution of the Her2 category differs markedly from the distributions of the other categories. The median ERBB2 abundance in the Her2 category is much higher than in the other categories, with many more data points in the upper portion of the graph."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Additional Applications"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have just walked through one example of how you might use outlier analysis. Using this same approach, you can run the analysis on a number of different clinical attributes, cohorts, and omics data types. For example, you may look for outliers within the transcriptomics of the Endometrial cancer type using the clinical attribute Histological_type. You can also look at more than one clinical attribute at a time by appending more attributes to your annotations table, or you can look for down-regulated omics by changing the 'up_or_down' argument of the deva function."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}