{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualizing ALDEx2 feature differentials using Qurro\n",
    "\n",
    "In this example, we use transcriptomic data from [TCGA](https://portal.gdc.cancer.gov/repository). We downloaded \n",
    "100 gene expression files from lung squamous cell carcinoma (LUSC) primary tumors and 49 solid tissue normal expression files. We have pre-processed this data into a feature table for ease of use, but the gdc manifest file is also provided for your convenience as well as the script used to aggregate all the files.\n",
    "\n",
    "**Please note that this example is primarily intended to demonstrate how to use Qurro**, rather than to demonstrate \"best practices\" in using ALDEx2 or in analyzing transcriptomic data. We designed Qurro in the context of microbiome / metabolome data, but it should be applicable to arbitrary compositional datasets :)\n",
    "\n",
    "[1] Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R. M., Ozenberger, B. A., Ellrott, K., ... & Cancer Genome Atlas Rese to arch Network. (2013). The cancer genome atlas pan-cancer analysis project. *Nature Genetics, 45*(10), 1113.\n",
    "\n",
    "## Requirements:\n",
    "\n",
    "This notebook requires **qurro** and **mygene** to be installed on the Python side of things. The R package [ALDEx2](http://bioconductor.org/packages/release/bioc/html/ALDEx2.html) is also required for the `run_aldex.R` script.\n",
    "\n",
    "### 2022 note on running this notebook\n",
    "\n",
    "In previous years, we installed ALDEx2 through conda using the [`bioconductor-aldex2`](https://anaconda.org/bioconda/bioconductor-aldex2) conda package. However, it no longer seems easily possible to install this package (at least on my laptop). Therefore, we now recommend installing ALDEx2 directly through Bioconductor in R; see [ALDEx2's Bioconductor documentation](https://bioconductor.org/packages/release/bioc/html/ALDEx2.html) for details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Setting up\n",
    "In this section, we replace the output directory with an empty directory. This just lets us run this notebook multiple times, without any tools complaining about overwriting files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clear the output directory so we can write these files there\n",
    "!rm -rf output\n",
    "# Since git doesn't keep track of empty directories, create the output/ directory if it doesn't already exist\n",
    "# (if it does already exist, -p ensures that an error won't be thrown)\n",
    "!mkdir -p output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Processing the feature table\n",
    "\n",
    "The original feature table has over 60,000 features, which is too computationally expensive for a simple tutorial example like this. To speed things up, we'll filter this table to use the top 1000 genes by total abundance. (Again, this is just for demonstrative purposes -- in practice you should think carefully about when or when not to filter your data.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import re\n",
    "import mygene\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(60483, 149)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>TCGA-43-5670-11A</th>\n",
       "      <th>TCGA-77-8008-11A</th>\n",
       "      <th>TCGA-18-3410-01A</th>\n",
       "      <th>TCGA-43-6771-11A</th>\n",
       "      <th>TCGA-66-2758-01A</th>\n",
       "      <th>TCGA-90-7767-01A</th>\n",
       "      <th>TCGA-66-2795-01A</th>\n",
       "      <th>TCGA-77-7138-01A</th>\n",
       "      <th>TCGA-39-5019-01A</th>\n",
       "      <th>TCGA-90-6837-11A</th>\n",
       "      <th>...</th>\n",
       "      <th>TCGA-56-7823-01B</th>\n",
       "      <th>TCGA-77-7142-11A</th>\n",
       "      <th>TCGA-22-5483-11A</th>\n",
       "      <th>TCGA-77-8153-01A</th>\n",
       "      <th>TCGA-85-A4JC-01A</th>\n",
       "      <th>TCGA-39-5040-11A</th>\n",
       "      <th>TCGA-43-6143-11A</th>\n",
       "      <th>TCGA-77-7335-11A</th>\n",
       "      <th>TCGA-85-7698-01A</th>\n",
       "      <th>TCGA-43-2581-01A</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature-id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ENSG00000000003.13</th>\n",
       "      <td>4707</td>\n",
       "      <td>1016</td>\n",
       "      <td>2286</td>\n",
       "      <td>1238</td>\n",
       "      <td>2982</td>\n",
       "      <td>2795</td>\n",
       "      <td>1516</td>\n",
       "      <td>5507</td>\n",
       "      <td>5504</td>\n",
       "      <td>2079</td>\n",
       "      <td>...</td>\n",
       "      <td>1988</td>\n",
       "      <td>1059</td>\n",
       "      <td>1297</td>\n",
       "      <td>1970</td>\n",
       "      <td>2450</td>\n",
       "      <td>1432</td>\n",
       "      <td>1500</td>\n",
       "      <td>1790</td>\n",
       "      <td>3722</td>\n",
       "      <td>2471</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000000005.5</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000000419.11</th>\n",
       "      <td>2019</td>\n",
       "      <td>1731</td>\n",
       "      <td>2123</td>\n",
       "      <td>2121</td>\n",
       "      <td>3868</td>\n",
       "      <td>2722</td>\n",
       "      <td>2130</td>\n",
       "      <td>4834</td>\n",
       "      <td>4557</td>\n",
       "      <td>1264</td>\n",
       "      <td>...</td>\n",
       "      <td>1194</td>\n",
       "      <td>1449</td>\n",
       "      <td>1382</td>\n",
       "      <td>747</td>\n",
       "      <td>1741</td>\n",
       "      <td>3167</td>\n",
       "      <td>1454</td>\n",
       "      <td>1569</td>\n",
       "      <td>2477</td>\n",
       "      <td>1826</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000000457.12</th>\n",
       "      <td>1062</td>\n",
       "      <td>968</td>\n",
       "      <td>1883</td>\n",
       "      <td>655</td>\n",
       "      <td>800</td>\n",
       "      <td>1987</td>\n",
       "      <td>935</td>\n",
       "      <td>898</td>\n",
       "      <td>1014</td>\n",
       "      <td>840</td>\n",
       "      <td>...</td>\n",
       "      <td>449</td>\n",
       "      <td>883</td>\n",
       "      <td>614</td>\n",
       "      <td>678</td>\n",
       "      <td>305</td>\n",
       "      <td>705</td>\n",
       "      <td>917</td>\n",
       "      <td>658</td>\n",
       "      <td>961</td>\n",
       "      <td>470</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000000460.15</th>\n",
       "      <td>204</td>\n",
       "      <td>202</td>\n",
       "      <td>1923</td>\n",
       "      <td>205</td>\n",
       "      <td>706</td>\n",
       "      <td>2191</td>\n",
       "      <td>956</td>\n",
       "      <td>927</td>\n",
       "      <td>1347</td>\n",
       "      <td>264</td>\n",
       "      <td>...</td>\n",
       "      <td>748</td>\n",
       "      <td>159</td>\n",
       "      <td>180</td>\n",
       "      <td>887</td>\n",
       "      <td>275</td>\n",
       "      <td>181</td>\n",
       "      <td>206</td>\n",
       "      <td>153</td>\n",
       "      <td>581</td>\n",
       "      <td>770</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 149 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                    TCGA-43-5670-11A  TCGA-77-8008-11A  TCGA-18-3410-01A  \\\n",
       "feature-id                                                                 \n",
       "ENSG00000000003.13              4707              1016              2286   \n",
       "ENSG00000000005.5                  5                 1                 1   \n",
       "ENSG00000000419.11              2019              1731              2123   \n",
       "ENSG00000000457.12              1062               968              1883   \n",
       "ENSG00000000460.15               204               202              1923   \n",
       "\n",
       "                    TCGA-43-6771-11A  TCGA-66-2758-01A  TCGA-90-7767-01A  \\\n",
       "feature-id                                                                 \n",
       "ENSG00000000003.13              1238              2982              2795   \n",
       "ENSG00000000005.5                  5                 1                 0   \n",
       "ENSG00000000419.11              2121              3868              2722   \n",
       "ENSG00000000457.12               655               800              1987   \n",
       "ENSG00000000460.15               205               706              2191   \n",
       "\n",
       "                    TCGA-66-2795-01A  TCGA-77-7138-01A  TCGA-39-5019-01A  \\\n",
       "feature-id                                                                 \n",
       "ENSG00000000003.13              1516              5507              5504   \n",
       "ENSG00000000005.5                  0                 0                 2   \n",
       "ENSG00000000419.11              2130              4834              4557   \n",
       "ENSG00000000457.12               935               898              1014   \n",
       "ENSG00000000460.15               956               927              1347   \n",
       "\n",
       "                    TCGA-90-6837-11A  ...  TCGA-56-7823-01B  TCGA-77-7142-11A  \\\n",
       "feature-id                            ...                                       \n",
       "ENSG00000000003.13              2079  ...              1988              1059   \n",
       "ENSG00000000005.5                  3  ...                 1                 5   \n",
       "ENSG00000000419.11              1264  ...              1194              1449   \n",
       "ENSG00000000457.12               840  ...               449               883   \n",
       "ENSG00000000460.15               264  ...               748               159   \n",
       "\n",
       "                    TCGA-22-5483-11A  TCGA-77-8153-01A  TCGA-85-A4JC-01A  \\\n",
       "feature-id                                                                 \n",
       "ENSG00000000003.13              1297              1970              2450   \n",
       "ENSG00000000005.5                  4                 0                 0   \n",
       "ENSG00000000419.11              1382               747              1741   \n",
       "ENSG00000000457.12               614               678               305   \n",
       "ENSG00000000460.15               180               887               275   \n",
       "\n",
       "                    TCGA-39-5040-11A  TCGA-43-6143-11A  TCGA-77-7335-11A  \\\n",
       "feature-id                                                                 \n",
       "ENSG00000000003.13              1432              1500              1790   \n",
       "ENSG00000000005.5                  4                 3                 0   \n",
       "ENSG00000000419.11              3167              1454              1569   \n",
       "ENSG00000000457.12               705               917               658   \n",
       "ENSG00000000460.15               181               206               153   \n",
       "\n",
       "                    TCGA-85-7698-01A  TCGA-43-2581-01A  \n",
       "feature-id                                              \n",
       "ENSG00000000003.13              3722              2471  \n",
       "ENSG00000000005.5                  1                 0  \n",
       "ENSG00000000419.11              2477              1826  \n",
       "ENSG00000000457.12               961               470  \n",
       "ENSG00000000460.15               581               770  \n",
       "\n",
       "[5 rows x 149 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "feature_table = pd.read_csv(\n",
    "    \"input/TCGA_LUSC_expression_feature_table.tsv\",\n",
    "    sep=\"\\t\",\n",
    "    index_col=0,\n",
    ")\n",
    "print(feature_table.shape)\n",
    "feature_table.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1000, 149)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>TCGA-43-5670-11A</th>\n",
       "      <th>TCGA-77-8008-11A</th>\n",
       "      <th>TCGA-18-3410-01A</th>\n",
       "      <th>TCGA-43-6771-11A</th>\n",
       "      <th>TCGA-66-2758-01A</th>\n",
       "      <th>TCGA-90-7767-01A</th>\n",
       "      <th>TCGA-66-2795-01A</th>\n",
       "      <th>TCGA-77-7138-01A</th>\n",
       "      <th>TCGA-39-5019-01A</th>\n",
       "      <th>TCGA-90-6837-11A</th>\n",
       "      <th>...</th>\n",
       "      <th>TCGA-56-7823-01B</th>\n",
       "      <th>TCGA-77-7142-11A</th>\n",
       "      <th>TCGA-22-5483-11A</th>\n",
       "      <th>TCGA-77-8153-01A</th>\n",
       "      <th>TCGA-85-A4JC-01A</th>\n",
       "      <th>TCGA-39-5040-11A</th>\n",
       "      <th>TCGA-43-6143-11A</th>\n",
       "      <th>TCGA-77-7335-11A</th>\n",
       "      <th>TCGA-85-7698-01A</th>\n",
       "      <th>TCGA-43-2581-01A</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature-id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ENSG00000168484.11</th>\n",
       "      <td>972916</td>\n",
       "      <td>843573</td>\n",
       "      <td>223</td>\n",
       "      <td>1850407</td>\n",
       "      <td>1139</td>\n",
       "      <td>135</td>\n",
       "      <td>488</td>\n",
       "      <td>2352</td>\n",
       "      <td>1482</td>\n",
       "      <td>1815346</td>\n",
       "      <td>...</td>\n",
       "      <td>386</td>\n",
       "      <td>846652</td>\n",
       "      <td>1635957</td>\n",
       "      <td>42</td>\n",
       "      <td>6</td>\n",
       "      <td>1734787</td>\n",
       "      <td>2453372</td>\n",
       "      <td>1001158</td>\n",
       "      <td>699</td>\n",
       "      <td>490</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000185303.14</th>\n",
       "      <td>975800</td>\n",
       "      <td>887399</td>\n",
       "      <td>10662</td>\n",
       "      <td>1711785</td>\n",
       "      <td>29460</td>\n",
       "      <td>458</td>\n",
       "      <td>3623</td>\n",
       "      <td>4214</td>\n",
       "      <td>178301</td>\n",
       "      <td>686350</td>\n",
       "      <td>...</td>\n",
       "      <td>501</td>\n",
       "      <td>860576</td>\n",
       "      <td>2331898</td>\n",
       "      <td>1074</td>\n",
       "      <td>8</td>\n",
       "      <td>1001212</td>\n",
       "      <td>1513129</td>\n",
       "      <td>967935</td>\n",
       "      <td>2460</td>\n",
       "      <td>7480</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000198804.2</th>\n",
       "      <td>550882</td>\n",
       "      <td>567646</td>\n",
       "      <td>391264</td>\n",
       "      <td>765533</td>\n",
       "      <td>222015</td>\n",
       "      <td>88019</td>\n",
       "      <td>119269</td>\n",
       "      <td>206227</td>\n",
       "      <td>522146</td>\n",
       "      <td>380501</td>\n",
       "      <td>...</td>\n",
       "      <td>141559</td>\n",
       "      <td>449164</td>\n",
       "      <td>439839</td>\n",
       "      <td>252347</td>\n",
       "      <td>252814</td>\n",
       "      <td>353950</td>\n",
       "      <td>881008</td>\n",
       "      <td>166618</td>\n",
       "      <td>243607</td>\n",
       "      <td>458559</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000122852.13</th>\n",
       "      <td>786296</td>\n",
       "      <td>663972</td>\n",
       "      <td>8201</td>\n",
       "      <td>1509224</td>\n",
       "      <td>15981</td>\n",
       "      <td>1544</td>\n",
       "      <td>2141</td>\n",
       "      <td>3157</td>\n",
       "      <td>151564</td>\n",
       "      <td>752807</td>\n",
       "      <td>...</td>\n",
       "      <td>432</td>\n",
       "      <td>663762</td>\n",
       "      <td>2079926</td>\n",
       "      <td>522</td>\n",
       "      <td>16</td>\n",
       "      <td>1166295</td>\n",
       "      <td>1096501</td>\n",
       "      <td>906560</td>\n",
       "      <td>780</td>\n",
       "      <td>7326</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000198886.2</th>\n",
       "      <td>315275</td>\n",
       "      <td>499467</td>\n",
       "      <td>179894</td>\n",
       "      <td>626067</td>\n",
       "      <td>194809</td>\n",
       "      <td>115463</td>\n",
       "      <td>104261</td>\n",
       "      <td>279515</td>\n",
       "      <td>396785</td>\n",
       "      <td>404277</td>\n",
       "      <td>...</td>\n",
       "      <td>262923</td>\n",
       "      <td>442133</td>\n",
       "      <td>429011</td>\n",
       "      <td>309030</td>\n",
       "      <td>218573</td>\n",
       "      <td>241436</td>\n",
       "      <td>696461</td>\n",
       "      <td>131715</td>\n",
       "      <td>252738</td>\n",
       "      <td>276778</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 149 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                    TCGA-43-5670-11A  TCGA-77-8008-11A  TCGA-18-3410-01A  \\\n",
       "feature-id                                                                 \n",
       "ENSG00000168484.11            972916            843573               223   \n",
       "ENSG00000185303.14            975800            887399             10662   \n",
       "ENSG00000198804.2             550882            567646            391264   \n",
       "ENSG00000122852.13            786296            663972              8201   \n",
       "ENSG00000198886.2             315275            499467            179894   \n",
       "\n",
       "                    TCGA-43-6771-11A  TCGA-66-2758-01A  TCGA-90-7767-01A  \\\n",
       "feature-id                                                                 \n",
       "ENSG00000168484.11           1850407              1139               135   \n",
       "ENSG00000185303.14           1711785             29460               458   \n",
       "ENSG00000198804.2             765533            222015             88019   \n",
       "ENSG00000122852.13           1509224             15981              1544   \n",
       "ENSG00000198886.2             626067            194809            115463   \n",
       "\n",
       "                    TCGA-66-2795-01A  TCGA-77-7138-01A  TCGA-39-5019-01A  \\\n",
       "feature-id                                                                 \n",
       "ENSG00000168484.11               488              2352              1482   \n",
       "ENSG00000185303.14              3623              4214            178301   \n",
       "ENSG00000198804.2             119269            206227            522146   \n",
       "ENSG00000122852.13              2141              3157            151564   \n",
       "ENSG00000198886.2             104261            279515            396785   \n",
       "\n",
       "                    TCGA-90-6837-11A  ...  TCGA-56-7823-01B  TCGA-77-7142-11A  \\\n",
       "feature-id                            ...                                       \n",
       "ENSG00000168484.11           1815346  ...               386            846652   \n",
       "ENSG00000185303.14            686350  ...               501            860576   \n",
       "ENSG00000198804.2             380501  ...            141559            449164   \n",
       "ENSG00000122852.13            752807  ...               432            663762   \n",
       "ENSG00000198886.2             404277  ...            262923            442133   \n",
       "\n",
       "                    TCGA-22-5483-11A  TCGA-77-8153-01A  TCGA-85-A4JC-01A  \\\n",
       "feature-id                                                                 \n",
       "ENSG00000168484.11           1635957                42                 6   \n",
       "ENSG00000185303.14           2331898              1074                 8   \n",
       "ENSG00000198804.2             439839            252347            252814   \n",
       "ENSG00000122852.13           2079926               522                16   \n",
       "ENSG00000198886.2             429011            309030            218573   \n",
       "\n",
       "                    TCGA-39-5040-11A  TCGA-43-6143-11A  TCGA-77-7335-11A  \\\n",
       "feature-id                                                                 \n",
       "ENSG00000168484.11           1734787           2453372           1001158   \n",
       "ENSG00000185303.14           1001212           1513129            967935   \n",
       "ENSG00000198804.2             353950            881008            166618   \n",
       "ENSG00000122852.13           1166295           1096501            906560   \n",
       "ENSG00000198886.2             241436            696461            131715   \n",
       "\n",
       "                    TCGA-85-7698-01A  TCGA-43-2581-01A  \n",
       "feature-id                                              \n",
       "ENSG00000168484.11               699               490  \n",
       "ENSG00000185303.14              2460              7480  \n",
       "ENSG00000198804.2             243607            458559  \n",
       "ENSG00000122852.13               780              7326  \n",
       "ENSG00000198886.2             252738            276778  \n",
       "\n",
       "[5 rows x 149 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "feature_table_filt = feature_table.loc[\n",
    "    feature_table.sum(axis=1).sort_values(ascending=False).head(1000).index, :\n",
    "]\n",
    "print(feature_table_filt.shape)\n",
    "feature_table_filt.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also want to strip everything after the period in the Ensembl IDs. For example, `ENSG00000000003.13` should be converted to `ENSG00000000003`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>TCGA-43-5670-11A</th>\n",
       "      <th>TCGA-77-8008-11A</th>\n",
       "      <th>TCGA-18-3410-01A</th>\n",
       "      <th>TCGA-43-6771-11A</th>\n",
       "      <th>TCGA-66-2758-01A</th>\n",
       "      <th>TCGA-90-7767-01A</th>\n",
       "      <th>TCGA-66-2795-01A</th>\n",
       "      <th>TCGA-77-7138-01A</th>\n",
       "      <th>TCGA-39-5019-01A</th>\n",
       "      <th>TCGA-90-6837-11A</th>\n",
       "      <th>...</th>\n",
       "      <th>TCGA-56-7823-01B</th>\n",
       "      <th>TCGA-77-7142-11A</th>\n",
       "      <th>TCGA-22-5483-11A</th>\n",
       "      <th>TCGA-77-8153-01A</th>\n",
       "      <th>TCGA-85-A4JC-01A</th>\n",
       "      <th>TCGA-39-5040-11A</th>\n",
       "      <th>TCGA-43-6143-11A</th>\n",
       "      <th>TCGA-77-7335-11A</th>\n",
       "      <th>TCGA-85-7698-01A</th>\n",
       "      <th>TCGA-43-2581-01A</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature-id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ENSG00000168484</th>\n",
       "      <td>972916</td>\n",
       "      <td>843573</td>\n",
       "      <td>223</td>\n",
       "      <td>1850407</td>\n",
       "      <td>1139</td>\n",
       "      <td>135</td>\n",
       "      <td>488</td>\n",
       "      <td>2352</td>\n",
       "      <td>1482</td>\n",
       "      <td>1815346</td>\n",
       "      <td>...</td>\n",
       "      <td>386</td>\n",
       "      <td>846652</td>\n",
       "      <td>1635957</td>\n",
       "      <td>42</td>\n",
       "      <td>6</td>\n",
       "      <td>1734787</td>\n",
       "      <td>2453372</td>\n",
       "      <td>1001158</td>\n",
       "      <td>699</td>\n",
       "      <td>490</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000185303</th>\n",
       "      <td>975800</td>\n",
       "      <td>887399</td>\n",
       "      <td>10662</td>\n",
       "      <td>1711785</td>\n",
       "      <td>29460</td>\n",
       "      <td>458</td>\n",
       "      <td>3623</td>\n",
       "      <td>4214</td>\n",
       "      <td>178301</td>\n",
       "      <td>686350</td>\n",
       "      <td>...</td>\n",
       "      <td>501</td>\n",
       "      <td>860576</td>\n",
       "      <td>2331898</td>\n",
       "      <td>1074</td>\n",
       "      <td>8</td>\n",
       "      <td>1001212</td>\n",
       "      <td>1513129</td>\n",
       "      <td>967935</td>\n",
       "      <td>2460</td>\n",
       "      <td>7480</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000198804</th>\n",
       "      <td>550882</td>\n",
       "      <td>567646</td>\n",
       "      <td>391264</td>\n",
       "      <td>765533</td>\n",
       "      <td>222015</td>\n",
       "      <td>88019</td>\n",
       "      <td>119269</td>\n",
       "      <td>206227</td>\n",
       "      <td>522146</td>\n",
       "      <td>380501</td>\n",
       "      <td>...</td>\n",
       "      <td>141559</td>\n",
       "      <td>449164</td>\n",
       "      <td>439839</td>\n",
       "      <td>252347</td>\n",
       "      <td>252814</td>\n",
       "      <td>353950</td>\n",
       "      <td>881008</td>\n",
       "      <td>166618</td>\n",
       "      <td>243607</td>\n",
       "      <td>458559</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000122852</th>\n",
       "      <td>786296</td>\n",
       "      <td>663972</td>\n",
       "      <td>8201</td>\n",
       "      <td>1509224</td>\n",
       "      <td>15981</td>\n",
       "      <td>1544</td>\n",
       "      <td>2141</td>\n",
       "      <td>3157</td>\n",
       "      <td>151564</td>\n",
       "      <td>752807</td>\n",
       "      <td>...</td>\n",
       "      <td>432</td>\n",
       "      <td>663762</td>\n",
       "      <td>2079926</td>\n",
       "      <td>522</td>\n",
       "      <td>16</td>\n",
       "      <td>1166295</td>\n",
       "      <td>1096501</td>\n",
       "      <td>906560</td>\n",
       "      <td>780</td>\n",
       "      <td>7326</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000198886</th>\n",
       "      <td>315275</td>\n",
       "      <td>499467</td>\n",
       "      <td>179894</td>\n",
       "      <td>626067</td>\n",
       "      <td>194809</td>\n",
       "      <td>115463</td>\n",
       "      <td>104261</td>\n",
       "      <td>279515</td>\n",
       "      <td>396785</td>\n",
       "      <td>404277</td>\n",
       "      <td>...</td>\n",
       "      <td>262923</td>\n",
       "      <td>442133</td>\n",
       "      <td>429011</td>\n",
       "      <td>309030</td>\n",
       "      <td>218573</td>\n",
       "      <td>241436</td>\n",
       "      <td>696461</td>\n",
       "      <td>131715</td>\n",
       "      <td>252738</td>\n",
       "      <td>276778</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 149 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                 TCGA-43-5670-11A  TCGA-77-8008-11A  TCGA-18-3410-01A  \\\n",
       "feature-id                                                              \n",
       "ENSG00000168484            972916            843573               223   \n",
       "ENSG00000185303            975800            887399             10662   \n",
       "ENSG00000198804            550882            567646            391264   \n",
       "ENSG00000122852            786296            663972              8201   \n",
       "ENSG00000198886            315275            499467            179894   \n",
       "\n",
       "                 TCGA-43-6771-11A  TCGA-66-2758-01A  TCGA-90-7767-01A  \\\n",
       "feature-id                                                              \n",
       "ENSG00000168484           1850407              1139               135   \n",
       "ENSG00000185303           1711785             29460               458   \n",
       "ENSG00000198804            765533            222015             88019   \n",
       "ENSG00000122852           1509224             15981              1544   \n",
       "ENSG00000198886            626067            194809            115463   \n",
       "\n",
       "                 TCGA-66-2795-01A  TCGA-77-7138-01A  TCGA-39-5019-01A  \\\n",
       "feature-id                                                              \n",
       "ENSG00000168484               488              2352              1482   \n",
       "ENSG00000185303              3623              4214            178301   \n",
       "ENSG00000198804            119269            206227            522146   \n",
       "ENSG00000122852              2141              3157            151564   \n",
       "ENSG00000198886            104261            279515            396785   \n",
       "\n",
       "                 TCGA-90-6837-11A  ...  TCGA-56-7823-01B  TCGA-77-7142-11A  \\\n",
       "feature-id                         ...                                       \n",
       "ENSG00000168484           1815346  ...               386            846652   \n",
       "ENSG00000185303            686350  ...               501            860576   \n",
       "ENSG00000198804            380501  ...            141559            449164   \n",
       "ENSG00000122852            752807  ...               432            663762   \n",
       "ENSG00000198886            404277  ...            262923            442133   \n",
       "\n",
       "                 TCGA-22-5483-11A  TCGA-77-8153-01A  TCGA-85-A4JC-01A  \\\n",
       "feature-id                                                              \n",
       "ENSG00000168484           1635957                42                 6   \n",
       "ENSG00000185303           2331898              1074                 8   \n",
       "ENSG00000198804            439839            252347            252814   \n",
       "ENSG00000122852           2079926               522                16   \n",
       "ENSG00000198886            429011            309030            218573   \n",
       "\n",
       "                 TCGA-39-5040-11A  TCGA-43-6143-11A  TCGA-77-7335-11A  \\\n",
       "feature-id                                                              \n",
       "ENSG00000168484           1734787           2453372           1001158   \n",
       "ENSG00000185303           1001212           1513129            967935   \n",
       "ENSG00000198804            353950            881008            166618   \n",
       "ENSG00000122852           1166295           1096501            906560   \n",
       "ENSG00000198886            241436            696461            131715   \n",
       "\n",
       "                 TCGA-85-7698-01A  TCGA-43-2581-01A  \n",
       "feature-id                                           \n",
       "ENSG00000168484               699               490  \n",
       "ENSG00000185303              2460              7480  \n",
       "ENSG00000198804            243607            458559  \n",
       "ENSG00000122852               780              7326  \n",
       "ENSG00000198886            252738            276778  \n",
       "\n",
       "[5 rows x 149 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "feature_table_filt.index = [re.search(\"ENSG[0-9]*\", x).group() for x in feature_table_filt.index]\n",
    "feature_table_filt.index.name = \"feature-id\"\n",
    "feature_table_filt.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are 149 samples but 135 cases - some of the patients have both their primary tumor and normal present in the feature table. To better facilitate comparisons, we're going to only keep the normal samples for those patients who have both. In the barcode the last 3 digits represent the sample type. `01A` or `01B` correspond to tumor sample, while `11A` corresponds to solid tissue normal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "14"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from collections import Counter\n",
    "\n",
    "all_cases = [re.search(\"TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4}\", x).group() for x in feature_table_filt.columns]\n",
    "duplicated_cases = [barcode for barcode, count in Counter(all_cases).items() if count > 1]\n",
    "len(duplicated_cases)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "samples_to_remove = []\n",
    "for col in feature_table_filt:\n",
    "    case, sample = re.search(\"(TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4})-([0-1]{2}[AB])\", col).groups()\n",
    "    if case in duplicated_cases and sample.startswith(\"01\"):\n",
    "        samples_to_remove.append(col)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that each patient only has one sample represented in the feature table, we can use case barcodes (TCGA-XX-YYYY) instead of sample barcodes (TCGA-XX-YYYY-ZZ). This will allow us to more easily compare across our metadata conditions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1000, 135)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>TCGA-43-5670</th>\n",
       "      <th>TCGA-77-8008</th>\n",
       "      <th>TCGA-18-3410</th>\n",
       "      <th>TCGA-43-6771</th>\n",
       "      <th>TCGA-66-2758</th>\n",
       "      <th>TCGA-66-2795</th>\n",
       "      <th>TCGA-39-5019</th>\n",
       "      <th>TCGA-90-6837</th>\n",
       "      <th>TCGA-77-A5GB</th>\n",
       "      <th>TCGA-52-7810</th>\n",
       "      <th>...</th>\n",
       "      <th>TCGA-85-8582</th>\n",
       "      <th>TCGA-77-7142</th>\n",
       "      <th>TCGA-22-5483</th>\n",
       "      <th>TCGA-77-8153</th>\n",
       "      <th>TCGA-85-A4JC</th>\n",
       "      <th>TCGA-39-5040</th>\n",
       "      <th>TCGA-43-6143</th>\n",
       "      <th>TCGA-77-7335</th>\n",
       "      <th>TCGA-85-7698</th>\n",
       "      <th>TCGA-43-2581</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature-id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ENSG00000168484</th>\n",
       "      <td>972916</td>\n",
       "      <td>843573</td>\n",
       "      <td>223</td>\n",
       "      <td>1850407</td>\n",
       "      <td>1139</td>\n",
       "      <td>488</td>\n",
       "      <td>1482</td>\n",
       "      <td>1815346</td>\n",
       "      <td>4</td>\n",
       "      <td>68</td>\n",
       "      <td>...</td>\n",
       "      <td>156</td>\n",
       "      <td>846652</td>\n",
       "      <td>1635957</td>\n",
       "      <td>42</td>\n",
       "      <td>6</td>\n",
       "      <td>1734787</td>\n",
       "      <td>2453372</td>\n",
       "      <td>1001158</td>\n",
       "      <td>699</td>\n",
       "      <td>490</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000185303</th>\n",
       "      <td>975800</td>\n",
       "      <td>887399</td>\n",
       "      <td>10662</td>\n",
       "      <td>1711785</td>\n",
       "      <td>29460</td>\n",
       "      <td>3623</td>\n",
       "      <td>178301</td>\n",
       "      <td>686350</td>\n",
       "      <td>451</td>\n",
       "      <td>1731</td>\n",
       "      <td>...</td>\n",
       "      <td>236</td>\n",
       "      <td>860576</td>\n",
       "      <td>2331898</td>\n",
       "      <td>1074</td>\n",
       "      <td>8</td>\n",
       "      <td>1001212</td>\n",
       "      <td>1513129</td>\n",
       "      <td>967935</td>\n",
       "      <td>2460</td>\n",
       "      <td>7480</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000198804</th>\n",
       "      <td>550882</td>\n",
       "      <td>567646</td>\n",
       "      <td>391264</td>\n",
       "      <td>765533</td>\n",
       "      <td>222015</td>\n",
       "      <td>119269</td>\n",
       "      <td>522146</td>\n",
       "      <td>380501</td>\n",
       "      <td>163642</td>\n",
       "      <td>321512</td>\n",
       "      <td>...</td>\n",
       "      <td>198960</td>\n",
       "      <td>449164</td>\n",
       "      <td>439839</td>\n",
       "      <td>252347</td>\n",
       "      <td>252814</td>\n",
       "      <td>353950</td>\n",
       "      <td>881008</td>\n",
       "      <td>166618</td>\n",
       "      <td>243607</td>\n",
       "      <td>458559</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000122852</th>\n",
       "      <td>786296</td>\n",
       "      <td>663972</td>\n",
       "      <td>8201</td>\n",
       "      <td>1509224</td>\n",
       "      <td>15981</td>\n",
       "      <td>2141</td>\n",
       "      <td>151564</td>\n",
       "      <td>752807</td>\n",
       "      <td>265</td>\n",
       "      <td>947</td>\n",
       "      <td>...</td>\n",
       "      <td>168</td>\n",
       "      <td>663762</td>\n",
       "      <td>2079926</td>\n",
       "      <td>522</td>\n",
       "      <td>16</td>\n",
       "      <td>1166295</td>\n",
       "      <td>1096501</td>\n",
       "      <td>906560</td>\n",
       "      <td>780</td>\n",
       "      <td>7326</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000198886</th>\n",
       "      <td>315275</td>\n",
       "      <td>499467</td>\n",
       "      <td>179894</td>\n",
       "      <td>626067</td>\n",
       "      <td>194809</td>\n",
       "      <td>104261</td>\n",
       "      <td>396785</td>\n",
       "      <td>404277</td>\n",
       "      <td>172102</td>\n",
       "      <td>414434</td>\n",
       "      <td>...</td>\n",
       "      <td>174529</td>\n",
       "      <td>442133</td>\n",
       "      <td>429011</td>\n",
       "      <td>309030</td>\n",
       "      <td>218573</td>\n",
       "      <td>241436</td>\n",
       "      <td>696461</td>\n",
       "      <td>131715</td>\n",
       "      <td>252738</td>\n",
       "      <td>276778</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 135 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                 TCGA-43-5670  TCGA-77-8008  TCGA-18-3410  TCGA-43-6771  \\\n",
       "feature-id                                                                \n",
       "ENSG00000168484        972916        843573           223       1850407   \n",
       "ENSG00000185303        975800        887399         10662       1711785   \n",
       "ENSG00000198804        550882        567646        391264        765533   \n",
       "ENSG00000122852        786296        663972          8201       1509224   \n",
       "ENSG00000198886        315275        499467        179894        626067   \n",
       "\n",
       "                 TCGA-66-2758  TCGA-66-2795  TCGA-39-5019  TCGA-90-6837  \\\n",
       "feature-id                                                                \n",
       "ENSG00000168484          1139           488          1482       1815346   \n",
       "ENSG00000185303         29460          3623        178301        686350   \n",
       "ENSG00000198804        222015        119269        522146        380501   \n",
       "ENSG00000122852         15981          2141        151564        752807   \n",
       "ENSG00000198886        194809        104261        396785        404277   \n",
       "\n",
       "                 TCGA-77-A5GB  TCGA-52-7810  ...  TCGA-85-8582  TCGA-77-7142  \\\n",
       "feature-id                                   ...                               \n",
       "ENSG00000168484             4            68  ...           156        846652   \n",
       "ENSG00000185303           451          1731  ...           236        860576   \n",
       "ENSG00000198804        163642        321512  ...        198960        449164   \n",
       "ENSG00000122852           265           947  ...           168        663762   \n",
       "ENSG00000198886        172102        414434  ...        174529        442133   \n",
       "\n",
       "                 TCGA-22-5483  TCGA-77-8153  TCGA-85-A4JC  TCGA-39-5040  \\\n",
       "feature-id                                                                \n",
       "ENSG00000168484       1635957            42             6       1734787   \n",
       "ENSG00000185303       2331898          1074             8       1001212   \n",
       "ENSG00000198804        439839        252347        252814        353950   \n",
       "ENSG00000122852       2079926           522            16       1166295   \n",
       "ENSG00000198886        429011        309030        218573        241436   \n",
       "\n",
       "                 TCGA-43-6143  TCGA-77-7335  TCGA-85-7698  TCGA-43-2581  \n",
       "feature-id                                                               \n",
       "ENSG00000168484       2453372       1001158           699           490  \n",
       "ENSG00000185303       1513129        967935          2460          7480  \n",
       "ENSG00000198804        881008        166618        243607        458559  \n",
       "ENSG00000122852       1096501        906560           780          7326  \n",
       "ENSG00000198886        696461        131715        252738        276778  \n",
       "\n",
       "[5 rows x 135 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "feature_table_filt = feature_table_filt.drop(columns=samples_to_remove)\n",
    "feature_table_filt.columns = [\n",
    "    re.search(\"TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4}\", x).group() for x in feature_table_filt.columns\n",
    "]\n",
    "print(feature_table_filt.shape)\n",
    "feature_table_filt.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_table_filt.to_csv(\n",
    "    \"output/TCGA_LUSC_expression_feature_table_filt.tsv\",\n",
    "    sep=\"\\t\",\n",
    "    index=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Merging the sample sheet and some clinical/exposure information into a \"sample metadata\" file\n",
    "\n",
    "We want to include several fields in the sample metadata:\n",
    "\n",
    "1. Sample type (tumor or normal)\n",
    "2. Race\n",
    "3. Age at diagnosis\n",
    "4. Gender\n",
    "5. Cigarettes per day\n",
    "6. Years smoked\n",
    "\n",
    "Of course, there are certainly other sample metadata categories that could be worth including in the visualization -- again, this is just an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>File ID</th>\n",
       "      <th>File Name</th>\n",
       "      <th>Data Category</th>\n",
       "      <th>Data Type</th>\n",
       "      <th>Project ID</th>\n",
       "      <th>Case ID</th>\n",
       "      <th>Sample ID</th>\n",
       "      <th>Sample Type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>e4c62f17-d1e8-4543-9b7e-daa2b68306e0</td>\n",
       "      <td>bc5be208-5934-40dd-81df-567599ea2a51.htseq.cou...</td>\n",
       "      <td>Transcriptome Profiling</td>\n",
       "      <td>Gene Expression Quantification</td>\n",
       "      <td>TCGA-LUSC</td>\n",
       "      <td>TCGA-33-6737</td>\n",
       "      <td>TCGA-33-6737-01A</td>\n",
       "      <td>Primary Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>220a03f7-7ab6-4233-8f65-7ac5decca4b9</td>\n",
       "      <td>60040f95-8414-4956-bd8a-ec461a49207c.htseq.cou...</td>\n",
       "      <td>Transcriptome Profiling</td>\n",
       "      <td>Gene Expression Quantification</td>\n",
       "      <td>TCGA-LUSC</td>\n",
       "      <td>TCGA-18-3410</td>\n",
       "      <td>TCGA-18-3410-01A</td>\n",
       "      <td>Primary Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8894c42e-ce65-4088-88e3-921ce7165261</td>\n",
       "      <td>950e2ba0-a247-4bd6-8092-f97cc4018a79.htseq.cou...</td>\n",
       "      <td>Transcriptome Profiling</td>\n",
       "      <td>Gene Expression Quantification</td>\n",
       "      <td>TCGA-LUSC</td>\n",
       "      <td>TCGA-33-A4WN</td>\n",
       "      <td>TCGA-33-A4WN-01A</td>\n",
       "      <td>Primary Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>daa44ce1-1671-46b9-aa48-2f4155f0ee49</td>\n",
       "      <td>a998a5b1-397d-4497-a58c-9b9e1c7f491e.htseq.cou...</td>\n",
       "      <td>Transcriptome Profiling</td>\n",
       "      <td>Gene Expression Quantification</td>\n",
       "      <td>TCGA-LUSC</td>\n",
       "      <td>TCGA-56-7579</td>\n",
       "      <td>TCGA-56-7579-01A</td>\n",
       "      <td>Primary Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ef056c34-c2b9-47dd-afbf-ed81fc16dc74</td>\n",
       "      <td>840bb854-0669-485e-9d83-c4e1e4f10626.htseq.cou...</td>\n",
       "      <td>Transcriptome Profiling</td>\n",
       "      <td>Gene Expression Quantification</td>\n",
       "      <td>TCGA-LUSC</td>\n",
       "      <td>TCGA-34-5236</td>\n",
       "      <td>TCGA-34-5236-01A</td>\n",
       "      <td>Primary Tumor</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                File ID  \\\n",
       "0  e4c62f17-d1e8-4543-9b7e-daa2b68306e0   \n",
       "1  220a03f7-7ab6-4233-8f65-7ac5decca4b9   \n",
       "2  8894c42e-ce65-4088-88e3-921ce7165261   \n",
       "3  daa44ce1-1671-46b9-aa48-2f4155f0ee49   \n",
       "4  ef056c34-c2b9-47dd-afbf-ed81fc16dc74   \n",
       "\n",
       "                                           File Name            Data Category  \\\n",
       "0  bc5be208-5934-40dd-81df-567599ea2a51.htseq.cou...  Transcriptome Profiling   \n",
       "1  60040f95-8414-4956-bd8a-ec461a49207c.htseq.cou...  Transcriptome Profiling   \n",
       "2  950e2ba0-a247-4bd6-8092-f97cc4018a79.htseq.cou...  Transcriptome Profiling   \n",
       "3  a998a5b1-397d-4497-a58c-9b9e1c7f491e.htseq.cou...  Transcriptome Profiling   \n",
       "4  840bb854-0669-485e-9d83-c4e1e4f10626.htseq.cou...  Transcriptome Profiling   \n",
       "\n",
       "                        Data Type Project ID       Case ID         Sample ID  \\\n",
       "0  Gene Expression Quantification  TCGA-LUSC  TCGA-33-6737  TCGA-33-6737-01A   \n",
       "1  Gene Expression Quantification  TCGA-LUSC  TCGA-18-3410  TCGA-18-3410-01A   \n",
       "2  Gene Expression Quantification  TCGA-LUSC  TCGA-33-A4WN  TCGA-33-A4WN-01A   \n",
       "3  Gene Expression Quantification  TCGA-LUSC  TCGA-56-7579  TCGA-56-7579-01A   \n",
       "4  Gene Expression Quantification  TCGA-LUSC  TCGA-34-5236  TCGA-34-5236-01A   \n",
       "\n",
       "     Sample Type  \n",
       "0  Primary Tumor  \n",
       "1  Primary Tumor  \n",
       "2  Primary Tumor  \n",
       "3  Primary Tumor  \n",
       "4  Primary Tumor  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_sheet = pd.read_csv(\"input/gdc_sample_sheet.2020-02-13.tsv\", sep=\"\\t\")\n",
    "sample_sheet.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sample Type</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Case ID</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TCGA-33-6737</th>\n",
       "      <td>Primary Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-18-3410</th>\n",
       "      <td>Primary Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-33-A4WN</th>\n",
       "      <td>Primary Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-56-7579</th>\n",
       "      <td>Primary Tumor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-34-5236</th>\n",
       "      <td>Primary Tumor</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                Sample Type\n",
       "Case ID                    \n",
       "TCGA-33-6737  Primary Tumor\n",
       "TCGA-18-3410  Primary Tumor\n",
       "TCGA-33-A4WN  Primary Tumor\n",
       "TCGA-56-7579  Primary Tumor\n",
       "TCGA-34-5236  Primary Tumor"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_sheet_new = sample_sheet.set_index(\"Case ID\", drop=True)\n",
    "sample_sheet_new = sample_sheet_new[[\"Sample Type\"]]\n",
    "sample_sheet_new.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age_at_diagnosis</th>\n",
       "      <th>race</th>\n",
       "      <th>gender</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>submitter_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TCGA-43-5670</th>\n",
       "      <td>25652.0</td>\n",
       "      <td>white</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-43-5670</th>\n",
       "      <td>25652.0</td>\n",
       "      <td>white</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-66-2800</th>\n",
       "      <td>25810.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-66-2800</th>\n",
       "      <td>25810.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-66-2788</th>\n",
       "      <td>20637.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              age_at_diagnosis   race gender\n",
       "submitter_id                                \n",
       "TCGA-43-5670           25652.0  white   male\n",
       "TCGA-43-5670           25652.0  white   male\n",
       "TCGA-66-2800           25810.0    NaN   male\n",
       "TCGA-66-2800           25810.0    NaN   male\n",
       "TCGA-66-2788           20637.0    NaN   male"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clinical = pd.read_csv(\n",
    "    \"input/clinical.tsv\", \n",
    "    sep=\"\\t\", \n",
    "    na_values=[\"--\", \"not reported\"],\n",
    "    index_col=\"submitter_id\"\n",
    ")\n",
    "clinical = clinical[[\"age_at_diagnosis\", \"race\", \"gender\"]]\n",
    "clinical.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cigarettes_per_day</th>\n",
       "      <th>years_smoked</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>submitter_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TCGA-43-5670</th>\n",
       "      <td>1.643836</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-66-2800</th>\n",
       "      <td>4.109589</td>\n",
       "      <td>50.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-66-2788</th>\n",
       "      <td>4.383562</td>\n",
       "      <td>40.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-77-7338</th>\n",
       "      <td>2.260274</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-56-7222</th>\n",
       "      <td>1.260274</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              cigarettes_per_day  years_smoked\n",
       "submitter_id                                  \n",
       "TCGA-43-5670            1.643836          10.0\n",
       "TCGA-66-2800            4.109589          50.0\n",
       "TCGA-66-2788            4.383562          40.0\n",
       "TCGA-77-7338            2.260274           NaN\n",
       "TCGA-56-7222            1.260274           NaN"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "exposure = pd.read_csv(\n",
    "    \"input/exposure.tsv\",\n",
    "    sep=\"\\t\",\n",
    "    na_values=[\"--\", \"Not Reported\"],\n",
    "    index_col=\"submitter_id\"\n",
    ")\n",
    "exposure = exposure[[\"cigarettes_per_day\", \"years_smoked\"]]\n",
    "exposure.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(135, 6)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sample Type</th>\n",
       "      <th>age_at_diagnosis</th>\n",
       "      <th>race</th>\n",
       "      <th>gender</th>\n",
       "      <th>cigarettes_per_day</th>\n",
       "      <th>years_smoked</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sample ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TCGA-18-3410</th>\n",
       "      <td>Primary Tumor</td>\n",
       "      <td>29827.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>male</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-18-3411</th>\n",
       "      <td>Primary Tumor</td>\n",
       "      <td>23370.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>female</td>\n",
       "      <td>2.739726</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-18-3416</th>\n",
       "      <td>Primary Tumor</td>\n",
       "      <td>30435.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>male</td>\n",
       "      <td>2.191781</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-18-4086</th>\n",
       "      <td>Primary Tumor</td>\n",
       "      <td>23731.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>male</td>\n",
       "      <td>1.643836</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-18-5595</th>\n",
       "      <td>Primary Tumor</td>\n",
       "      <td>18611.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>male</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                Sample Type  age_at_diagnosis race  gender  \\\n",
       "Sample ID                                                    \n",
       "TCGA-18-3410  Primary Tumor           29827.0  NaN    male   \n",
       "TCGA-18-3411  Primary Tumor           23370.0  NaN  female   \n",
       "TCGA-18-3416  Primary Tumor           30435.0  NaN    male   \n",
       "TCGA-18-4086  Primary Tumor           23731.0  NaN    male   \n",
       "TCGA-18-5595  Primary Tumor           18611.0  NaN    male   \n",
       "\n",
       "              cigarettes_per_day  years_smoked  \n",
       "Sample ID                                       \n",
       "TCGA-18-3410                 NaN           NaN  \n",
       "TCGA-18-3411            2.739726           NaN  \n",
       "TCGA-18-3416            2.191781           NaN  \n",
       "TCGA-18-4086            1.643836           NaN  \n",
       "TCGA-18-5595                 NaN           NaN  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_sheet_clinical_exposure = sample_sheet_new.join(clinical).join(exposure)\n",
    "sample_sheet_clinical_exposure.index.name = \"Sample ID\"\n",
    "sample_sheet_clinical_exposure = sample_sheet_clinical_exposure.loc[\n",
    "    ~sample_sheet_clinical_exposure.index.duplicated(keep=\"first\"), :\n",
    "]\n",
    "print(sample_sheet_clinical_exposure.shape)\n",
    "sample_sheet_clinical_exposure.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_sheet_clinical_exposure.to_csv(\"output/sample_metadata.tsv\", sep=\"\\t\", index=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Running ALDEx2 on the count data\n",
    "\n",
    "[ALDEx2](https://bioconductor.org/packages/release/bioc/html/ALDEx2.html) is a tool used for determining differentially abundant features from a compositional dataset.\n",
    "\n",
    "For the sake of demonstration, we're going to compare the tumor samples to the normal samples (with the goal of finding differentially abundant features between these sample \"types\"). Of course, the way you run ALDEx2 may be more complicated if there are multiple sample metadata fields you'd like to include.\n",
    "\n",
    "We have provided a script, `run_aldex.R`, to run ALDEx2 on our processed feature table and output the results to a TSV file. Note that running ALDEx2 can take some time, depending on the size of the input dataset.\n",
    "\n",
    "[2] Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. *Microbiome, 2*(1), 15."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading required package: zCompositions\n",
      "Loading required package: MASS\n",
      "Loading required package: NADA\n",
      "Loading required package: survival\n",
      "\n",
      "Attaching package: ‘NADA’\n",
      "\n",
      "The following object is masked from ‘package:stats’:\n",
      "\n",
      "    cor\n",
      "\n",
      "Loading required package: truncnorm\n",
      "aldex.clr: generating Monte-Carlo instances and clr values\n",
      "operating in serial mode\n",
      "removed rows with sums equal to zero\n",
      "computing center with all features\n",
      "data format is OK\n",
      "dirichlet samples complete\n",
      "transformation complete\n",
      "aldex.ttest: doing t-test\n",
      "running tests for each MC instance:\n",
      "|------------(25%)----------(50%)----------(75%)----------|\n",
      "aldex.effect: calculating effect sizes\n",
      "operating in serial mode\n",
      "sanity check complete\n",
      "rab.all  complete\n",
      "rab.win  complete\n",
      "rab of samples complete\n",
      "within sample difference calculated\n",
      "between group difference calculated\n",
      "group summaries calculated\n",
      "effect size calculated\n",
      "summarizing output\n",
      "[1] \"ALDEx2 results written to output/TCGA_LUSC_aldex_results.tsv\"\n"
     ]
    }
   ],
   "source": [
    "!Rscript run_aldex.R"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert os.path.exists(\"output/TCGA_LUSC_aldex_results.tsv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Converting the feature table from TSV to BIOM\n",
    "\n",
    "Here, we're going to convert the original feature table from a tab-separated file (TSV) to a BIOM-formatted file. We need to do this because Qurro requires that the input table is in the BIOM format.\n",
    "\n",
    "### 4.1. About the BIOM format\n",
    "The BIOM format is a commonly used file format for storing and representing these sorts of feature tables. It's especially good at storing sparse datasets—that is, datasets containing a lot of zeroes. (This is useful for datasets obtained from metagenomic or marker gene sequencing, which are usually very sparse.)\n",
    "\n",
    "[3] McDonald, D., Clemente, J. C., Kuczynski, J., Rideout, J. R., Stombaugh, J., Wendel, D., ... & Knight, R. (2012). The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. *Gigascience, 1*(1), 2047-217X."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "!biom convert \\\n",
    "    -i output/TCGA_LUSC_expression_feature_table_filt.tsv \\\n",
    "    -o output/TCGA_LUSC_expression_feature_table_filt.biom \\\n",
    "    --table-type=\"Gene table\" \\\n",
    "    --to-hdf5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Optional: Creating a feature metadata file\n",
    "\n",
    "We might want to add some information about each gene for Qurro exploration. For this example, we'll include the HUGO gene symbol, gene name, and chromosome. Having this sort of information available in the Qurro visualization can be useful for a few purposes -- examples include checking the name of a gene that looks particularly interesting, or searching (using Qurro's filtering tools) for only genes on a certain chromosome.\n",
    "\n",
    "**However, feature metadata isn't required to run Qurro -- if you'd like, you're welcome to skip this section and go straight to the \"Running Qurro\" section.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['ENSG00000168484', 'ENSG00000185303', 'ENSG00000198804', 'ENSG00000122852', 'ENSG00000198886']\n"
     ]
    }
   ],
   "source": [
    "ensembl_ids = list(feature_table_filt.index)\n",
    "print(ensembl_ids[:5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "querying 1-1000...done.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>symbol</th>\n",
       "      <th>genomic_pos.chr</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature-id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ENSG00000168484</th>\n",
       "      <td>surfactant protein C</td>\n",
       "      <td>SFTPC</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000185303</th>\n",
       "      <td>surfactant protein A2</td>\n",
       "      <td>SFTPA2</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000198804</th>\n",
       "      <td>cytochrome c oxidase subunit I</td>\n",
       "      <td>COX1</td>\n",
       "      <td>MT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000122852</th>\n",
       "      <td>surfactant protein A1</td>\n",
       "      <td>SFTPA1</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000198886</th>\n",
       "      <td>NADH dehydrogenase subunit 4</td>\n",
       "      <td>ND4</td>\n",
       "      <td>MT</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           name  symbol genomic_pos.chr\n",
       "feature-id                                                             \n",
       "ENSG00000168484            surfactant protein C   SFTPC               8\n",
       "ENSG00000185303           surfactant protein A2  SFTPA2              10\n",
       "ENSG00000198804  cytochrome c oxidase subunit I    COX1              MT\n",
       "ENSG00000122852           surfactant protein A1  SFTPA1              10\n",
       "ENSG00000198886    NADH dehydrogenase subunit 4     ND4              MT"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mg = mygene.MyGeneInfo()\n",
    "\n",
    "genes_df = mg.getgenes(ensembl_ids, fields=[\"name\", \"symbol\", \"genomic_pos.chr\"], as_dataframe=True)\n",
    "genes_df.index.name = \"feature-id\"\n",
    "genes_df = genes_df[[\"name\", \"symbol\", \"genomic_pos.chr\"]]\n",
    "genes_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "genes_df.to_csv(\"output/feature_metadata.tsv\", sep=\"\\t\", index=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Running Qurro\n",
    "\n",
    "Now that we have everything ready, we can run Qurro!\n",
    "\n",
    "Note that if you skipped section 5 above (i.e. you didn't generate a `output/feature_metadata.tsv` file) you'll need to omit the `--feature-metadata output/feature_metadata.tsv \\` line below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully generated a visualization in the folder output/qurro.\r\n"
     ]
    }
   ],
   "source": [
    "!qurro \\\n",
    "    --ranks output/TCGA_LUSC_aldex_results.tsv \\\n",
    "    --table output/TCGA_LUSC_expression_feature_table_filt.biom \\\n",
    "    --sample-metadata output/sample_metadata.tsv \\\n",
    "    --feature-metadata output/feature_metadata.tsv \\\n",
    "    --output-dir output/qurro"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert os.path.exists(\"output/qurro/index.html\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.1. Interacting with the Qurro visualization\n",
    "Open the `output/qurro/index.html` page in the web browser of your choice (Firefox / Chrome recommended, but any modern browser should be ok). From here you can explore all the options Qurro has to offer!\n",
    "\n",
    "The remainder of this section outlines one simple analysis you could do at this point.\n",
    "\n",
    "![](imgs/1.png)\n",
    "\n",
    "#### 6.1.1. Selecting an ALDEx2 field of interest\n",
    "As an example, we're going to navigate to the `Differential` selection menu and select `diff:btw`. This column contains the \"median difference in CLR values\" between the sample types we're considering: in this case, between `Solid Tissue Normal` and `Primary Tumor` samples. See the \"Explaining the outputs\" section in [the ALDEx2 documentation here](https://bioconductor.org/packages/release/bioc/vignettes/ALDEx2/inst/doc/ALDEx2_vignette.html#52_Explaining_the_outputs) for details on what each of the `Differential` fields output by ALDEx2 mean -- this information is really important in guiding how you interpret the corresponding rank plots.\n",
    "\n",
    "Notice how when you change the selected `Differential` field, the y-axis values and x-axis orderings of each feature in the rank plot are updated accordingly.\n",
    "\n",
    "You may also want to check the box that says `Fit bar widths to a constant plot width?`, which will squish the rank plot so that it takes up less horizontal space on the screen.\n",
    "\n",
    "![](imgs/2.png)\n",
    "\n",
    "##### 6.1.1.1. Some notes about ALDEx2 outputs, and the term `Differential`\n",
    "Not all of the fields contained in the ALDEx2 output are strictly \"differentials,\" defined in Morton and Marotz et al. 2019 as \"the logarithm of the fold change in abundance of a [feature] between two conditions.\" For example, the expected p-values in ALDEx2's output (e.g. `we:ep`) obviously don't really fit that definition.\n",
    "\n",
    "However, these fields can still be interesting or useful to visualize in a rank plot. As always, it's important to keep in mind where the results you're looking at came from to make sure you're interpreting things accurately.\n",
    "\n",
    "On a less important note, you may have noticed that `diff:btw` is listed in the ALDEx2 documentation as `diff.btw`. The periods in these field names are replaced with colons in the Qurro interface due to [a technical issue](https://github.com/biocore/qurro#temporary-caveat).\n",
    "\n",
    "#### 6.1.2. Auto-selecting extremely ranked features\n",
    "\n",
    "Enter `5` in the `Autoselecting Features` section. The autoselection tools in Qurro let you easily take the log-ratio of very high and very low ranked features -- this is useful when we'd expect the selected ranking to do a good job \"distinguishing\" features based on their association with certain sample groups. As you might imagine, this is probably the case with `diff:btw`!\n",
    "\n",
    "Click `Apply` and you should see the rank plot on the top left highlight the selected features on the rankings. The numerator features have been colored red, and the denominator features have been colored blue. (...We didn't realize until a while into developing this that a lot of these plots look like the [Tricolour](https://en.wikipedia.org/wiki/Flag_of_France).)\n",
    "\n",
    "Anyway, selecting a log-ratio also updated the sample plot (on the top right of the screen). Let's play around with it!\n",
    "\n",
    "#### 6.1.3. Adjusting the sample plot\n",
    "We can use the controls below the sample plot to relate the currently selected log-ratio to our sample metadata -- in this case, how does this log-ratio look in relation to the tumor versus non-tumor samples?\n",
    "\n",
    "To investigate that question, change the `x-axis` field to `Sample Type` (if it isn't already set to that) and check the box that says `Use boxplots for categorical data?`. You should see a boxplot appear that shows a clear separation between `Solid Tissue Normal` and `Primary Tumor` samples, which makes sense!\n",
    "\n",
    "![](imgs/3.png)\n",
    "\n",
    "#### 6.1.4. Interpreting log-ratios across groups\n",
    "\n",
    "Notice how the log-ratio values for the `Solid Tissue Normal` samples seem generally higher than those for the `Primary Tumor` samples. We selected the log-ratio of the 5% highest and 5% lowest ranked features based on the `diff:btw` field, which was defined above as the median difference in CLR values between `Solid Tissue Normal` and `Primary Tumor` samples -- therefore, we know that the numerator features in our log-ratio should be somewhat more frequent in `Solid Tissue Normal` samples, and the denominator features should be somewhat less frequent in `Solid Tissue Normal` samples (relative to `Primary Tumor` samples).\n",
    "\n",
    "Hopefully it should make sense, then, that this log-ratio is generally higher in `Solid Tissue Normal` samples as compared to the `Primary Tumor` samples.\n",
    "\n",
    "As an exercise for the reader, try setting the rank plot to use the `rab:win:Primary:Tumor` and re-applying 5% autoselection. How does the sample plot look now? From consulting the ALDEx2 documentation on what the `rab.win` output fields mean, why do you think this might be the case?\n",
    "\n",
    "#### 6.1.5. Investigating high- and low- ranked features\n",
    "\n",
    "The selected features in the log-ratio are displayed in the tables at the bottom left of the screen.\n",
    "\n",
    "These tables are interactive [DataTables](https://datatables.net/), which are really nice in that you can click on any of a table's headers to re-sort the rows of the table accordingly (in either ascending or descending order, based on the direction of the black arrow in the header cell). With the `diff:btw` 5% autoselection from earlier applied, try sorting the tables based on features' `diff:btw` values.\n",
    "\n",
    "![](imgs/4.png)\n",
    "\n",
    "Now, we have a \"list\" of the highest- and lowest-ranked features in the selection. Try scrolling over to the side of the tables and looking through the `name`s of some of these features. Do you notice any interesting patterns?\n",
    "\n",
    "It looks like there are a lot of features with `surfactant` in their names that are highly-ranked in the numerator table, and it looks like there are a lot of features with `keratin` in their names that are lowly-ranked in the denominator table.\n",
    "\n",
    "That could be worth looking into! Let's see how the log-ratio of these groups of features looks by itself. Try using the controls under the `Selecting Features by Filtering` section to select a log-ratio of features with a `name` containing the text `surfactant` to features with a `name` containing the text `keratin`.\n",
    "\n",
    "![](imgs/5.png)\n",
    "\n",
    "Interestingly, this log-ratio seems to recapitulate the patterns we got from our initial autoselection: there's still a clear separation between the normal and tumor samples, and (for the most part) the features with `surfactant` and `keratin` in their name seem to be mostly highly and lowly ranked for `diff:btw`, respectively.\n",
    "\n",
    "**Does this make sense?** This is the point in the analysis where your domain knowledge comes in. From some basic literature reviewing it seems like `keratin`-named features being associated with tumor samples makes sense, at least according to a few papers (e.g. [Trask et al. 1990](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC53678/), [Relli et al. 2018](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6238974/)): but maybe you have a different opinion about what this trend means! Perhaps you think we could choose a better numerator for a discriminatory log-ratio than `surfactant`. In the end, it's up to you."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2. Finishing up\n",
    "That's as far as this tutorial goes for now, but we encourage you to try out some of the other things you can do in Qurro! For example, try selecting a different field in the sample plot's x-axis, or searching for genes by their chromosome.\n",
    "\n",
    "The controls in the Qurro interface enable more \"SOPs\" for exploring your data besides the basic one outlined above. **It's our hope that you take this example analysis not as a strict workflow but more as a starting point for determining how to answer your specific scientific question(s).**\n",
    "\n",
    "If you've made it this far, thanks for reading! As always, please feel free to [open an issue](https://github.com/biocore/qurro/issues) in Qurro's repository (or ask a question on the [QIIME 2 forums](https://forum.qiime2.org/)) if you have any questions, comments, or suggestions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}