{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualizing ALDEx2 feature differentials using Qurro\n", "\n", "In this example, we use transcriptomic data from [TCGA](https://portal.gdc.cancer.gov/repository). We downloaded \n", "100 gene expression files from lung squamous cell carcinoma (LUSC) primary tumors and 49 solid tissue normal expression files. We have pre-processed this data into a feature table for ease of use, but the gdc manifest file is also provided for your convenience as well as the script used to aggregate all the files.\n", "\n", "**Please note that this example is primarily intended to demonstrate how to use Qurro**, rather than to demonstrate \"best practices\" in using ALDEx2 or in analyzing transcriptomic data. We designed Qurro in the context of microbiome / metabolome data, but it should be applicable to arbitrary compositional datasets :)\n", "\n", "[1] Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R. M., Ozenberger, B. A., Ellrott, K., ... & Cancer Genome Atlas Rese to arch Network. (2013). The cancer genome atlas pan-cancer analysis project. *Nature Genetics, 45*(10), 1113.\n", "\n", "## Requirements:\n", "\n", "This notebook requires **qurro** and **mygene** to be installed on the Python side of things. The R package [ALDEx2](http://bioconductor.org/packages/release/bioc/html/ALDEx2.html) is also required for the `run_aldex.R` script.\n", "\n", "### 2022 note on running this notebook\n", "\n", "In previous years, we installed ALDEx2 through conda using the [`bioconductor-aldex2`](https://anaconda.org/bioconda/bioconductor-aldex2) conda package. However, it no longer seems easily possible to install this package (at least on my laptop). Therefore, we now recommend installing ALDEx2 directly through Bioconductor in R; see [ALDEx2's Bioconductor documentation](https://bioconductor.org/packages/release/bioc/html/ALDEx2.html) for details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Setting up\n", "In this section, we replace the output directory with an empty directory. This just lets us run this notebook multiple times, without any tools complaining about overwriting files." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Clear the output directory so we can write these files there\n", "!rm -rf output\n", "# Since git doesn't keep track of empty directories, create the output/ directory if it doesn't already exist\n", "# (if it does already exist, -p ensures that an error won't be thrown)\n", "!mkdir -p output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Processing the feature table\n", "\n", "The original feature table has over 60,000 features, which is too computationally expensive for a simple tutorial example like this. To speed things up, we'll filter this table to use the top 1000 genes by total abundance. (Again, this is just for demonstrative purposes -- in practice you should think carefully about when or when not to filter your data.)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "import re\n", "import mygene\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(60483, 149)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TCGA-43-5670-11ATCGA-77-8008-11ATCGA-18-3410-01ATCGA-43-6771-11ATCGA-66-2758-01ATCGA-90-7767-01ATCGA-66-2795-01ATCGA-77-7138-01ATCGA-39-5019-01ATCGA-90-6837-11A...TCGA-56-7823-01BTCGA-77-7142-11ATCGA-22-5483-11ATCGA-77-8153-01ATCGA-85-A4JC-01ATCGA-39-5040-11ATCGA-43-6143-11ATCGA-77-7335-11ATCGA-85-7698-01ATCGA-43-2581-01A
feature-id
ENSG00000000003.134707101622861238298227951516550755042079...1988105912971970245014321500179037222471
ENSG00000000005.55115100023...1540043010
ENSG00000000419.112019173121232121386827222130483445571264...119414491382747174131671454156924771826
ENSG00000000457.121062968188365580019879358981014840...449883614678305705917658961470
ENSG00000000460.15204202192320570621919569271347264...748159180887275181206153581770
\n", "

5 rows × 149 columns

\n", "
" ], "text/plain": [ " TCGA-43-5670-11A TCGA-77-8008-11A TCGA-18-3410-01A \\\n", "feature-id \n", "ENSG00000000003.13 4707 1016 2286 \n", "ENSG00000000005.5 5 1 1 \n", "ENSG00000000419.11 2019 1731 2123 \n", "ENSG00000000457.12 1062 968 1883 \n", "ENSG00000000460.15 204 202 1923 \n", "\n", " TCGA-43-6771-11A TCGA-66-2758-01A TCGA-90-7767-01A \\\n", "feature-id \n", "ENSG00000000003.13 1238 2982 2795 \n", "ENSG00000000005.5 5 1 0 \n", "ENSG00000000419.11 2121 3868 2722 \n", "ENSG00000000457.12 655 800 1987 \n", "ENSG00000000460.15 205 706 2191 \n", "\n", " TCGA-66-2795-01A TCGA-77-7138-01A TCGA-39-5019-01A \\\n", "feature-id \n", "ENSG00000000003.13 1516 5507 5504 \n", "ENSG00000000005.5 0 0 2 \n", "ENSG00000000419.11 2130 4834 4557 \n", "ENSG00000000457.12 935 898 1014 \n", "ENSG00000000460.15 956 927 1347 \n", "\n", " TCGA-90-6837-11A ... TCGA-56-7823-01B TCGA-77-7142-11A \\\n", "feature-id ... \n", "ENSG00000000003.13 2079 ... 1988 1059 \n", "ENSG00000000005.5 3 ... 1 5 \n", "ENSG00000000419.11 1264 ... 1194 1449 \n", "ENSG00000000457.12 840 ... 449 883 \n", "ENSG00000000460.15 264 ... 748 159 \n", "\n", " TCGA-22-5483-11A TCGA-77-8153-01A TCGA-85-A4JC-01A \\\n", "feature-id \n", "ENSG00000000003.13 1297 1970 2450 \n", "ENSG00000000005.5 4 0 0 \n", "ENSG00000000419.11 1382 747 1741 \n", "ENSG00000000457.12 614 678 305 \n", "ENSG00000000460.15 180 887 275 \n", "\n", " TCGA-39-5040-11A TCGA-43-6143-11A TCGA-77-7335-11A \\\n", "feature-id \n", "ENSG00000000003.13 1432 1500 1790 \n", "ENSG00000000005.5 4 3 0 \n", "ENSG00000000419.11 3167 1454 1569 \n", "ENSG00000000457.12 705 917 658 \n", "ENSG00000000460.15 181 206 153 \n", "\n", " TCGA-85-7698-01A TCGA-43-2581-01A \n", "feature-id \n", "ENSG00000000003.13 3722 2471 \n", "ENSG00000000005.5 1 0 \n", "ENSG00000000419.11 2477 1826 \n", "ENSG00000000457.12 961 470 \n", "ENSG00000000460.15 581 770 \n", "\n", "[5 rows x 149 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_table = pd.read_csv(\n", " \"input/TCGA_LUSC_expression_feature_table.tsv\",\n", " sep=\"\\t\",\n", " index_col=0,\n", ")\n", "print(feature_table.shape)\n", "feature_table.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1000, 149)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TCGA-43-5670-11ATCGA-77-8008-11ATCGA-18-3410-01ATCGA-43-6771-11ATCGA-66-2758-01ATCGA-90-7767-01ATCGA-66-2795-01ATCGA-77-7138-01ATCGA-39-5019-01ATCGA-90-6837-11A...TCGA-56-7823-01BTCGA-77-7142-11ATCGA-22-5483-11ATCGA-77-8153-01ATCGA-85-A4JC-01ATCGA-39-5040-11ATCGA-43-6143-11ATCGA-77-7335-11ATCGA-85-7698-01ATCGA-43-2581-01A
feature-id
ENSG00000168484.1197291684357322318504071139135488235214821815346...3868466521635957426173478724533721001158699490
ENSG00000185303.149758008873991066217117852946045836234214178301686350...5018605762331898107481001212151312996793524607480
ENSG00000198804.255088256764639126476553322201588019119269206227522146380501...141559449164439839252347252814353950881008166618243607458559
ENSG00000122852.137862966639728201150922415981154421413157151564752807...432663762207992652216116629510965019065607807326
ENSG00000198886.2315275499467179894626067194809115463104261279515396785404277...262923442133429011309030218573241436696461131715252738276778
\n", "

5 rows × 149 columns

\n", "
" ], "text/plain": [ " TCGA-43-5670-11A TCGA-77-8008-11A TCGA-18-3410-01A \\\n", "feature-id \n", "ENSG00000168484.11 972916 843573 223 \n", "ENSG00000185303.14 975800 887399 10662 \n", "ENSG00000198804.2 550882 567646 391264 \n", "ENSG00000122852.13 786296 663972 8201 \n", "ENSG00000198886.2 315275 499467 179894 \n", "\n", " TCGA-43-6771-11A TCGA-66-2758-01A TCGA-90-7767-01A \\\n", "feature-id \n", "ENSG00000168484.11 1850407 1139 135 \n", "ENSG00000185303.14 1711785 29460 458 \n", "ENSG00000198804.2 765533 222015 88019 \n", "ENSG00000122852.13 1509224 15981 1544 \n", "ENSG00000198886.2 626067 194809 115463 \n", "\n", " TCGA-66-2795-01A TCGA-77-7138-01A TCGA-39-5019-01A \\\n", "feature-id \n", "ENSG00000168484.11 488 2352 1482 \n", "ENSG00000185303.14 3623 4214 178301 \n", "ENSG00000198804.2 119269 206227 522146 \n", "ENSG00000122852.13 2141 3157 151564 \n", "ENSG00000198886.2 104261 279515 396785 \n", "\n", " TCGA-90-6837-11A ... TCGA-56-7823-01B TCGA-77-7142-11A \\\n", "feature-id ... \n", "ENSG00000168484.11 1815346 ... 386 846652 \n", "ENSG00000185303.14 686350 ... 501 860576 \n", "ENSG00000198804.2 380501 ... 141559 449164 \n", "ENSG00000122852.13 752807 ... 432 663762 \n", "ENSG00000198886.2 404277 ... 262923 442133 \n", "\n", " TCGA-22-5483-11A TCGA-77-8153-01A TCGA-85-A4JC-01A \\\n", "feature-id \n", "ENSG00000168484.11 1635957 42 6 \n", "ENSG00000185303.14 2331898 1074 8 \n", "ENSG00000198804.2 439839 252347 252814 \n", "ENSG00000122852.13 2079926 522 16 \n", "ENSG00000198886.2 429011 309030 218573 \n", "\n", " TCGA-39-5040-11A TCGA-43-6143-11A TCGA-77-7335-11A \\\n", "feature-id \n", "ENSG00000168484.11 1734787 2453372 1001158 \n", "ENSG00000185303.14 1001212 1513129 967935 \n", "ENSG00000198804.2 353950 881008 166618 \n", "ENSG00000122852.13 1166295 1096501 906560 \n", "ENSG00000198886.2 241436 696461 131715 \n", "\n", " TCGA-85-7698-01A TCGA-43-2581-01A \n", "feature-id \n", "ENSG00000168484.11 699 490 \n", "ENSG00000185303.14 2460 7480 \n", "ENSG00000198804.2 243607 458559 \n", "ENSG00000122852.13 780 7326 \n", "ENSG00000198886.2 252738 276778 \n", "\n", "[5 rows x 149 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_table_filt = feature_table.loc[\n", " feature_table.sum(axis=1).sort_values(ascending=False).head(1000).index, :\n", "]\n", "print(feature_table_filt.shape)\n", "feature_table_filt.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also want to strip everything after the period in the Ensembl IDs. For example, `ENSG00000000003.13` should be converted to `ENSG00000000003`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TCGA-43-5670-11ATCGA-77-8008-11ATCGA-18-3410-01ATCGA-43-6771-11ATCGA-66-2758-01ATCGA-90-7767-01ATCGA-66-2795-01ATCGA-77-7138-01ATCGA-39-5019-01ATCGA-90-6837-11A...TCGA-56-7823-01BTCGA-77-7142-11ATCGA-22-5483-11ATCGA-77-8153-01ATCGA-85-A4JC-01ATCGA-39-5040-11ATCGA-43-6143-11ATCGA-77-7335-11ATCGA-85-7698-01ATCGA-43-2581-01A
feature-id
ENSG0000016848497291684357322318504071139135488235214821815346...3868466521635957426173478724533721001158699490
ENSG000001853039758008873991066217117852946045836234214178301686350...5018605762331898107481001212151312996793524607480
ENSG0000019880455088256764639126476553322201588019119269206227522146380501...141559449164439839252347252814353950881008166618243607458559
ENSG000001228527862966639728201150922415981154421413157151564752807...432663762207992652216116629510965019065607807326
ENSG00000198886315275499467179894626067194809115463104261279515396785404277...262923442133429011309030218573241436696461131715252738276778
\n", "

5 rows × 149 columns

\n", "
" ], "text/plain": [ " TCGA-43-5670-11A TCGA-77-8008-11A TCGA-18-3410-01A \\\n", "feature-id \n", "ENSG00000168484 972916 843573 223 \n", "ENSG00000185303 975800 887399 10662 \n", "ENSG00000198804 550882 567646 391264 \n", "ENSG00000122852 786296 663972 8201 \n", "ENSG00000198886 315275 499467 179894 \n", "\n", " TCGA-43-6771-11A TCGA-66-2758-01A TCGA-90-7767-01A \\\n", "feature-id \n", "ENSG00000168484 1850407 1139 135 \n", "ENSG00000185303 1711785 29460 458 \n", "ENSG00000198804 765533 222015 88019 \n", "ENSG00000122852 1509224 15981 1544 \n", "ENSG00000198886 626067 194809 115463 \n", "\n", " TCGA-66-2795-01A TCGA-77-7138-01A TCGA-39-5019-01A \\\n", "feature-id \n", "ENSG00000168484 488 2352 1482 \n", "ENSG00000185303 3623 4214 178301 \n", "ENSG00000198804 119269 206227 522146 \n", "ENSG00000122852 2141 3157 151564 \n", "ENSG00000198886 104261 279515 396785 \n", "\n", " TCGA-90-6837-11A ... TCGA-56-7823-01B TCGA-77-7142-11A \\\n", "feature-id ... \n", "ENSG00000168484 1815346 ... 386 846652 \n", "ENSG00000185303 686350 ... 501 860576 \n", "ENSG00000198804 380501 ... 141559 449164 \n", "ENSG00000122852 752807 ... 432 663762 \n", "ENSG00000198886 404277 ... 262923 442133 \n", "\n", " TCGA-22-5483-11A TCGA-77-8153-01A TCGA-85-A4JC-01A \\\n", "feature-id \n", "ENSG00000168484 1635957 42 6 \n", "ENSG00000185303 2331898 1074 8 \n", "ENSG00000198804 439839 252347 252814 \n", "ENSG00000122852 2079926 522 16 \n", "ENSG00000198886 429011 309030 218573 \n", "\n", " TCGA-39-5040-11A TCGA-43-6143-11A TCGA-77-7335-11A \\\n", "feature-id \n", "ENSG00000168484 1734787 2453372 1001158 \n", "ENSG00000185303 1001212 1513129 967935 \n", "ENSG00000198804 353950 881008 166618 \n", "ENSG00000122852 1166295 1096501 906560 \n", "ENSG00000198886 241436 696461 131715 \n", "\n", " TCGA-85-7698-01A TCGA-43-2581-01A \n", "feature-id \n", "ENSG00000168484 699 490 \n", "ENSG00000185303 2460 7480 \n", "ENSG00000198804 243607 458559 \n", "ENSG00000122852 780 7326 \n", "ENSG00000198886 252738 276778 \n", "\n", "[5 rows x 149 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_table_filt.index = [re.search(\"ENSG[0-9]*\", x).group() for x in feature_table_filt.index]\n", "feature_table_filt.index.name = \"feature-id\"\n", "feature_table_filt.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 149 samples but 135 cases - some of the patients have both their primary tumor and normal present in the feature table. To better facilitate comparisons, we're going to only keep the normal samples for those patients who have both. In the barcode the last 3 digits represent the sample type. `01A` or `01B` correspond to tumor sample, while `11A` corresponds to solid tissue normal." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "14" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import Counter\n", "\n", "all_cases = [re.search(\"TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4}\", x).group() for x in feature_table_filt.columns]\n", "duplicated_cases = [barcode for barcode, count in Counter(all_cases).items() if count > 1]\n", "len(duplicated_cases)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "samples_to_remove = []\n", "for col in feature_table_filt:\n", " case, sample = re.search(\"(TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4})-([0-1]{2}[AB])\", col).groups()\n", " if case in duplicated_cases and sample.startswith(\"01\"):\n", " samples_to_remove.append(col)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that each patient only has one sample represented in the feature table, we can use case barcodes (TCGA-XX-YYYY) instead of sample barcodes (TCGA-XX-YYYY-ZZ). This will allow us to more easily compare across our metadata conditions." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1000, 135)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TCGA-43-5670TCGA-77-8008TCGA-18-3410TCGA-43-6771TCGA-66-2758TCGA-66-2795TCGA-39-5019TCGA-90-6837TCGA-77-A5GBTCGA-52-7810...TCGA-85-8582TCGA-77-7142TCGA-22-5483TCGA-77-8153TCGA-85-A4JCTCGA-39-5040TCGA-43-6143TCGA-77-7335TCGA-85-7698TCGA-43-2581
feature-id
ENSG000001684849729168435732231850407113948814821815346468...1568466521635957426173478724533721001158699490
ENSG000001853039758008873991066217117852946036231783016863504511731...2368605762331898107481001212151312996793524607480
ENSG00000198804550882567646391264765533222015119269522146380501163642321512...198960449164439839252347252814353950881008166618243607458559
ENSG0000012285278629666397282011509224159812141151564752807265947...168663762207992652216116629510965019065607807326
ENSG00000198886315275499467179894626067194809104261396785404277172102414434...174529442133429011309030218573241436696461131715252738276778
\n", "

5 rows × 135 columns

\n", "
" ], "text/plain": [ " TCGA-43-5670 TCGA-77-8008 TCGA-18-3410 TCGA-43-6771 \\\n", "feature-id \n", "ENSG00000168484 972916 843573 223 1850407 \n", "ENSG00000185303 975800 887399 10662 1711785 \n", "ENSG00000198804 550882 567646 391264 765533 \n", "ENSG00000122852 786296 663972 8201 1509224 \n", "ENSG00000198886 315275 499467 179894 626067 \n", "\n", " TCGA-66-2758 TCGA-66-2795 TCGA-39-5019 TCGA-90-6837 \\\n", "feature-id \n", "ENSG00000168484 1139 488 1482 1815346 \n", "ENSG00000185303 29460 3623 178301 686350 \n", "ENSG00000198804 222015 119269 522146 380501 \n", "ENSG00000122852 15981 2141 151564 752807 \n", "ENSG00000198886 194809 104261 396785 404277 \n", "\n", " TCGA-77-A5GB TCGA-52-7810 ... TCGA-85-8582 TCGA-77-7142 \\\n", "feature-id ... \n", "ENSG00000168484 4 68 ... 156 846652 \n", "ENSG00000185303 451 1731 ... 236 860576 \n", "ENSG00000198804 163642 321512 ... 198960 449164 \n", "ENSG00000122852 265 947 ... 168 663762 \n", "ENSG00000198886 172102 414434 ... 174529 442133 \n", "\n", " TCGA-22-5483 TCGA-77-8153 TCGA-85-A4JC TCGA-39-5040 \\\n", "feature-id \n", "ENSG00000168484 1635957 42 6 1734787 \n", "ENSG00000185303 2331898 1074 8 1001212 \n", "ENSG00000198804 439839 252347 252814 353950 \n", "ENSG00000122852 2079926 522 16 1166295 \n", "ENSG00000198886 429011 309030 218573 241436 \n", "\n", " TCGA-43-6143 TCGA-77-7335 TCGA-85-7698 TCGA-43-2581 \n", "feature-id \n", "ENSG00000168484 2453372 1001158 699 490 \n", "ENSG00000185303 1513129 967935 2460 7480 \n", "ENSG00000198804 881008 166618 243607 458559 \n", "ENSG00000122852 1096501 906560 780 7326 \n", "ENSG00000198886 696461 131715 252738 276778 \n", "\n", "[5 rows x 135 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_table_filt = feature_table_filt.drop(columns=samples_to_remove)\n", "feature_table_filt.columns = [\n", " re.search(\"TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4}\", x).group() for x in feature_table_filt.columns\n", "]\n", "print(feature_table_filt.shape)\n", "feature_table_filt.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "feature_table_filt.to_csv(\n", " \"output/TCGA_LUSC_expression_feature_table_filt.tsv\",\n", " sep=\"\\t\",\n", " index=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Merging the sample sheet and some clinical/exposure information into a \"sample metadata\" file\n", "\n", "We want to include several fields in the sample metadata:\n", "\n", "1. Sample type (tumor or normal)\n", "2. Race\n", "3. Age at diagnosis\n", "4. Gender\n", "5. Cigarettes per day\n", "6. Years smoked\n", "\n", "Of course, there are certainly other sample metadata categories that could be worth including in the visualization -- again, this is just an example." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
File IDFile NameData CategoryData TypeProject IDCase IDSample IDSample Type
0e4c62f17-d1e8-4543-9b7e-daa2b68306e0bc5be208-5934-40dd-81df-567599ea2a51.htseq.cou...Transcriptome ProfilingGene Expression QuantificationTCGA-LUSCTCGA-33-6737TCGA-33-6737-01APrimary Tumor
1220a03f7-7ab6-4233-8f65-7ac5decca4b960040f95-8414-4956-bd8a-ec461a49207c.htseq.cou...Transcriptome ProfilingGene Expression QuantificationTCGA-LUSCTCGA-18-3410TCGA-18-3410-01APrimary Tumor
28894c42e-ce65-4088-88e3-921ce7165261950e2ba0-a247-4bd6-8092-f97cc4018a79.htseq.cou...Transcriptome ProfilingGene Expression QuantificationTCGA-LUSCTCGA-33-A4WNTCGA-33-A4WN-01APrimary Tumor
3daa44ce1-1671-46b9-aa48-2f4155f0ee49a998a5b1-397d-4497-a58c-9b9e1c7f491e.htseq.cou...Transcriptome ProfilingGene Expression QuantificationTCGA-LUSCTCGA-56-7579TCGA-56-7579-01APrimary Tumor
4ef056c34-c2b9-47dd-afbf-ed81fc16dc74840bb854-0669-485e-9d83-c4e1e4f10626.htseq.cou...Transcriptome ProfilingGene Expression QuantificationTCGA-LUSCTCGA-34-5236TCGA-34-5236-01APrimary Tumor
\n", "
" ], "text/plain": [ " File ID \\\n", "0 e4c62f17-d1e8-4543-9b7e-daa2b68306e0 \n", "1 220a03f7-7ab6-4233-8f65-7ac5decca4b9 \n", "2 8894c42e-ce65-4088-88e3-921ce7165261 \n", "3 daa44ce1-1671-46b9-aa48-2f4155f0ee49 \n", "4 ef056c34-c2b9-47dd-afbf-ed81fc16dc74 \n", "\n", " File Name Data Category \\\n", "0 bc5be208-5934-40dd-81df-567599ea2a51.htseq.cou... Transcriptome Profiling \n", "1 60040f95-8414-4956-bd8a-ec461a49207c.htseq.cou... Transcriptome Profiling \n", "2 950e2ba0-a247-4bd6-8092-f97cc4018a79.htseq.cou... Transcriptome Profiling \n", "3 a998a5b1-397d-4497-a58c-9b9e1c7f491e.htseq.cou... Transcriptome Profiling \n", "4 840bb854-0669-485e-9d83-c4e1e4f10626.htseq.cou... Transcriptome Profiling \n", "\n", " Data Type Project ID Case ID Sample ID \\\n", "0 Gene Expression Quantification TCGA-LUSC TCGA-33-6737 TCGA-33-6737-01A \n", "1 Gene Expression Quantification TCGA-LUSC TCGA-18-3410 TCGA-18-3410-01A \n", "2 Gene Expression Quantification TCGA-LUSC TCGA-33-A4WN TCGA-33-A4WN-01A \n", "3 Gene Expression Quantification TCGA-LUSC TCGA-56-7579 TCGA-56-7579-01A \n", "4 Gene Expression Quantification TCGA-LUSC TCGA-34-5236 TCGA-34-5236-01A \n", "\n", " Sample Type \n", "0 Primary Tumor \n", "1 Primary Tumor \n", "2 Primary Tumor \n", "3 Primary Tumor \n", "4 Primary Tumor " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_sheet = pd.read_csv(\"input/gdc_sample_sheet.2020-02-13.tsv\", sep=\"\\t\")\n", "sample_sheet.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sample Type
Case ID
TCGA-33-6737Primary Tumor
TCGA-18-3410Primary Tumor
TCGA-33-A4WNPrimary Tumor
TCGA-56-7579Primary Tumor
TCGA-34-5236Primary Tumor
\n", "
" ], "text/plain": [ " Sample Type\n", "Case ID \n", "TCGA-33-6737 Primary Tumor\n", "TCGA-18-3410 Primary Tumor\n", "TCGA-33-A4WN Primary Tumor\n", "TCGA-56-7579 Primary Tumor\n", "TCGA-34-5236 Primary Tumor" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_sheet_new = sample_sheet.set_index(\"Case ID\", drop=True)\n", "sample_sheet_new = sample_sheet_new[[\"Sample Type\"]]\n", "sample_sheet_new.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
age_at_diagnosisracegender
submitter_id
TCGA-43-567025652.0whitemale
TCGA-43-567025652.0whitemale
TCGA-66-280025810.0NaNmale
TCGA-66-280025810.0NaNmale
TCGA-66-278820637.0NaNmale
\n", "
" ], "text/plain": [ " age_at_diagnosis race gender\n", "submitter_id \n", "TCGA-43-5670 25652.0 white male\n", "TCGA-43-5670 25652.0 white male\n", "TCGA-66-2800 25810.0 NaN male\n", "TCGA-66-2800 25810.0 NaN male\n", "TCGA-66-2788 20637.0 NaN male" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clinical = pd.read_csv(\n", " \"input/clinical.tsv\", \n", " sep=\"\\t\", \n", " na_values=[\"--\", \"not reported\"],\n", " index_col=\"submitter_id\"\n", ")\n", "clinical = clinical[[\"age_at_diagnosis\", \"race\", \"gender\"]]\n", "clinical.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cigarettes_per_dayyears_smoked
submitter_id
TCGA-43-56701.64383610.0
TCGA-66-28004.10958950.0
TCGA-66-27884.38356240.0
TCGA-77-73382.260274NaN
TCGA-56-72221.260274NaN
\n", "
" ], "text/plain": [ " cigarettes_per_day years_smoked\n", "submitter_id \n", "TCGA-43-5670 1.643836 10.0\n", "TCGA-66-2800 4.109589 50.0\n", "TCGA-66-2788 4.383562 40.0\n", "TCGA-77-7338 2.260274 NaN\n", "TCGA-56-7222 1.260274 NaN" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exposure = pd.read_csv(\n", " \"input/exposure.tsv\",\n", " sep=\"\\t\",\n", " na_values=[\"--\", \"Not Reported\"],\n", " index_col=\"submitter_id\"\n", ")\n", "exposure = exposure[[\"cigarettes_per_day\", \"years_smoked\"]]\n", "exposure.head()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(135, 6)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sample Typeage_at_diagnosisracegendercigarettes_per_dayyears_smoked
Sample ID
TCGA-18-3410Primary Tumor29827.0NaNmaleNaNNaN
TCGA-18-3411Primary Tumor23370.0NaNfemale2.739726NaN
TCGA-18-3416Primary Tumor30435.0NaNmale2.191781NaN
TCGA-18-4086Primary Tumor23731.0NaNmale1.643836NaN
TCGA-18-5595Primary Tumor18611.0NaNmaleNaNNaN
\n", "
" ], "text/plain": [ " Sample Type age_at_diagnosis race gender \\\n", "Sample ID \n", "TCGA-18-3410 Primary Tumor 29827.0 NaN male \n", "TCGA-18-3411 Primary Tumor 23370.0 NaN female \n", "TCGA-18-3416 Primary Tumor 30435.0 NaN male \n", "TCGA-18-4086 Primary Tumor 23731.0 NaN male \n", "TCGA-18-5595 Primary Tumor 18611.0 NaN male \n", "\n", " cigarettes_per_day years_smoked \n", "Sample ID \n", "TCGA-18-3410 NaN NaN \n", "TCGA-18-3411 2.739726 NaN \n", "TCGA-18-3416 2.191781 NaN \n", "TCGA-18-4086 1.643836 NaN \n", "TCGA-18-5595 NaN NaN " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_sheet_clinical_exposure = sample_sheet_new.join(clinical).join(exposure)\n", "sample_sheet_clinical_exposure.index.name = \"Sample ID\"\n", "sample_sheet_clinical_exposure = sample_sheet_clinical_exposure.loc[\n", " ~sample_sheet_clinical_exposure.index.duplicated(keep=\"first\"), :\n", "]\n", "print(sample_sheet_clinical_exposure.shape)\n", "sample_sheet_clinical_exposure.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "sample_sheet_clinical_exposure.to_csv(\"output/sample_metadata.tsv\", sep=\"\\t\", index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Running ALDEx2 on the count data\n", "\n", "[ALDEx2](https://bioconductor.org/packages/release/bioc/html/ALDEx2.html) is a tool used for determining differentially abundant features from a compositional dataset.\n", "\n", "For the sake of demonstration, we're going to compare the tumor samples to the normal samples (with the goal of finding differentially abundant features between these sample \"types\"). Of course, the way you run ALDEx2 may be more complicated if there are multiple sample metadata fields you'd like to include.\n", "\n", "We have provided a script, `run_aldex.R`, to run ALDEx2 on our processed feature table and output the results to a TSV file. Note that running ALDEx2 can take some time, depending on the size of the input dataset.\n", "\n", "[2] Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. *Microbiome, 2*(1), 15." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading required package: zCompositions\n", "Loading required package: MASS\n", "Loading required package: NADA\n", "Loading required package: survival\n", "\n", "Attaching package: ‘NADA’\n", "\n", "The following object is masked from ‘package:stats’:\n", "\n", " cor\n", "\n", "Loading required package: truncnorm\n", "aldex.clr: generating Monte-Carlo instances and clr values\n", "operating in serial mode\n", "removed rows with sums equal to zero\n", "computing center with all features\n", "data format is OK\n", "dirichlet samples complete\n", "transformation complete\n", "aldex.ttest: doing t-test\n", "running tests for each MC instance:\n", "|------------(25%)----------(50%)----------(75%)----------|\n", "aldex.effect: calculating effect sizes\n", "operating in serial mode\n", "sanity check complete\n", "rab.all complete\n", "rab.win complete\n", "rab of samples complete\n", "within sample difference calculated\n", "between group difference calculated\n", "group summaries calculated\n", "effect size calculated\n", "summarizing output\n", "[1] \"ALDEx2 results written to output/TCGA_LUSC_aldex_results.tsv\"\n" ] } ], "source": [ "!Rscript run_aldex.R" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "assert os.path.exists(\"output/TCGA_LUSC_aldex_results.tsv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Converting the feature table from TSV to BIOM\n", "\n", "Here, we're going to convert the original feature table from a tab-separated file (TSV) to a BIOM-formatted file. We need to do this because Qurro requires that the input table is in the BIOM format.\n", "\n", "### 4.1. About the BIOM format\n", "The BIOM format is a commonly used file format for storing and representing these sorts of feature tables. It's especially good at storing sparse datasets—that is, datasets containing a lot of zeroes. (This is useful for datasets obtained from metagenomic or marker gene sequencing, which are usually very sparse.)\n", "\n", "[3] McDonald, D., Clemente, J. C., Kuczynski, J., Rideout, J. R., Stombaugh, J., Wendel, D., ... & Knight, R. (2012). The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. *Gigascience, 1*(1), 2047-217X." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "!biom convert \\\n", " -i output/TCGA_LUSC_expression_feature_table_filt.tsv \\\n", " -o output/TCGA_LUSC_expression_feature_table_filt.biom \\\n", " --table-type=\"Gene table\" \\\n", " --to-hdf5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Optional: Creating a feature metadata file\n", "\n", "We might want to add some information about each gene for Qurro exploration. For this example, we'll include the HUGO gene symbol, gene name, and chromosome. Having this sort of information available in the Qurro visualization can be useful for a few purposes -- examples include checking the name of a gene that looks particularly interesting, or searching (using Qurro's filtering tools) for only genes on a certain chromosome.\n", "\n", "**However, feature metadata isn't required to run Qurro -- if you'd like, you're welcome to skip this section and go straight to the \"Running Qurro\" section.**" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['ENSG00000168484', 'ENSG00000185303', 'ENSG00000198804', 'ENSG00000122852', 'ENSG00000198886']\n" ] } ], "source": [ "ensembl_ids = list(feature_table_filt.index)\n", "print(ensembl_ids[:5])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "querying 1-1000...done.\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesymbolgenomic_pos.chr
feature-id
ENSG00000168484surfactant protein CSFTPC8
ENSG00000185303surfactant protein A2SFTPA210
ENSG00000198804cytochrome c oxidase subunit ICOX1MT
ENSG00000122852surfactant protein A1SFTPA110
ENSG00000198886NADH dehydrogenase subunit 4ND4MT
\n", "
" ], "text/plain": [ " name symbol genomic_pos.chr\n", "feature-id \n", "ENSG00000168484 surfactant protein C SFTPC 8\n", "ENSG00000185303 surfactant protein A2 SFTPA2 10\n", "ENSG00000198804 cytochrome c oxidase subunit I COX1 MT\n", "ENSG00000122852 surfactant protein A1 SFTPA1 10\n", "ENSG00000198886 NADH dehydrogenase subunit 4 ND4 MT" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mg = mygene.MyGeneInfo()\n", "\n", "genes_df = mg.getgenes(ensembl_ids, fields=[\"name\", \"symbol\", \"genomic_pos.chr\"], as_dataframe=True)\n", "genes_df.index.name = \"feature-id\"\n", "genes_df = genes_df[[\"name\", \"symbol\", \"genomic_pos.chr\"]]\n", "genes_df.head()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "genes_df.to_csv(\"output/feature_metadata.tsv\", sep=\"\\t\", index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Running Qurro\n", "\n", "Now that we have everything ready, we can run Qurro!\n", "\n", "Note that if you skipped section 5 above (i.e. you didn't generate a `output/feature_metadata.tsv` file) you'll need to omit the `--feature-metadata output/feature_metadata.tsv \\` line below." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Successfully generated a visualization in the folder output/qurro.\r\n" ] } ], "source": [ "!qurro \\\n", " --ranks output/TCGA_LUSC_aldex_results.tsv \\\n", " --table output/TCGA_LUSC_expression_feature_table_filt.biom \\\n", " --sample-metadata output/sample_metadata.tsv \\\n", " --feature-metadata output/feature_metadata.tsv \\\n", " --output-dir output/qurro" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "assert os.path.exists(\"output/qurro/index.html\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.1. Interacting with the Qurro visualization\n", "Open the `output/qurro/index.html` page in the web browser of your choice (Firefox / Chrome recommended, but any modern browser should be ok). From here you can explore all the options Qurro has to offer!\n", "\n", "The remainder of this section outlines one simple analysis you could do at this point.\n", "\n", "![](imgs/1.png)\n", "\n", "#### 6.1.1. Selecting an ALDEx2 field of interest\n", "As an example, we're going to navigate to the `Differential` selection menu and select `diff:btw`. This column contains the \"median difference in CLR values\" between the sample types we're considering: in this case, between `Solid Tissue Normal` and `Primary Tumor` samples. See the \"Explaining the outputs\" section in [the ALDEx2 documentation here](https://bioconductor.org/packages/release/bioc/vignettes/ALDEx2/inst/doc/ALDEx2_vignette.html#52_Explaining_the_outputs) for details on what each of the `Differential` fields output by ALDEx2 mean -- this information is really important in guiding how you interpret the corresponding rank plots.\n", "\n", "Notice how when you change the selected `Differential` field, the y-axis values and x-axis orderings of each feature in the rank plot are updated accordingly.\n", "\n", "You may also want to check the box that says `Fit bar widths to a constant plot width?`, which will squish the rank plot so that it takes up less horizontal space on the screen.\n", "\n", "![](imgs/2.png)\n", "\n", "##### 6.1.1.1. Some notes about ALDEx2 outputs, and the term `Differential`\n", "Not all of the fields contained in the ALDEx2 output are strictly \"differentials,\" defined in Morton and Marotz et al. 2019 as \"the logarithm of the fold change in abundance of a [feature] between two conditions.\" For example, the expected p-values in ALDEx2's output (e.g. `we:ep`) obviously don't really fit that definition.\n", "\n", "However, these fields can still be interesting or useful to visualize in a rank plot. As always, it's important to keep in mind where the results you're looking at came from to make sure you're interpreting things accurately.\n", "\n", "On a less important note, you may have noticed that `diff:btw` is listed in the ALDEx2 documentation as `diff.btw`. The periods in these field names are replaced with colons in the Qurro interface due to [a technical issue](https://github.com/biocore/qurro#temporary-caveat).\n", "\n", "#### 6.1.2. Auto-selecting extremely ranked features\n", "\n", "Enter `5` in the `Autoselecting Features` section. The autoselection tools in Qurro let you easily take the log-ratio of very high and very low ranked features -- this is useful when we'd expect the selected ranking to do a good job \"distinguishing\" features based on their association with certain sample groups. As you might imagine, this is probably the case with `diff:btw`!\n", "\n", "Click `Apply` and you should see the rank plot on the top left highlight the selected features on the rankings. The numerator features have been colored red, and the denominator features have been colored blue. (...We didn't realize until a while into developing this that a lot of these plots look like the [Tricolour](https://en.wikipedia.org/wiki/Flag_of_France).)\n", "\n", "Anyway, selecting a log-ratio also updated the sample plot (on the top right of the screen). Let's play around with it!\n", "\n", "#### 6.1.3. Adjusting the sample plot\n", "We can use the controls below the sample plot to relate the currently selected log-ratio to our sample metadata -- in this case, how does this log-ratio look in relation to the tumor versus non-tumor samples?\n", "\n", "To investigate that question, change the `x-axis` field to `Sample Type` (if it isn't already set to that) and check the box that says `Use boxplots for categorical data?`. You should see a boxplot appear that shows a clear separation between `Solid Tissue Normal` and `Primary Tumor` samples, which makes sense!\n", "\n", "![](imgs/3.png)\n", "\n", "#### 6.1.4. Interpreting log-ratios across groups\n", "\n", "Notice how the log-ratio values for the `Solid Tissue Normal` samples seem generally higher than those for the `Primary Tumor` samples. We selected the log-ratio of the 5% highest and 5% lowest ranked features based on the `diff:btw` field, which was defined above as the median difference in CLR values between `Solid Tissue Normal` and `Primary Tumor` samples -- therefore, we know that the numerator features in our log-ratio should be somewhat more frequent in `Solid Tissue Normal` samples, and the denominator features should be somewhat less frequent in `Solid Tissue Normal` samples (relative to `Primary Tumor` samples).\n", "\n", "Hopefully it should make sense, then, that this log-ratio is generally higher in `Solid Tissue Normal` samples as compared to the `Primary Tumor` samples.\n", "\n", "As an exercise for the reader, try setting the rank plot to use the `rab:win:Primary:Tumor` and re-applying 5% autoselection. How does the sample plot look now? From consulting the ALDEx2 documentation on what the `rab.win` output fields mean, why do you think this might be the case?\n", "\n", "#### 6.1.5. Investigating high- and low- ranked features\n", "\n", "The selected features in the log-ratio are displayed in the tables at the bottom left of the screen.\n", "\n", "These tables are interactive [DataTables](https://datatables.net/), which are really nice in that you can click on any of a table's headers to re-sort the rows of the table accordingly (in either ascending or descending order, based on the direction of the black arrow in the header cell). With the `diff:btw` 5% autoselection from earlier applied, try sorting the tables based on features' `diff:btw` values.\n", "\n", "![](imgs/4.png)\n", "\n", "Now, we have a \"list\" of the highest- and lowest-ranked features in the selection. Try scrolling over to the side of the tables and looking through the `name`s of some of these features. Do you notice any interesting patterns?\n", "\n", "It looks like there are a lot of features with `surfactant` in their names that are highly-ranked in the numerator table, and it looks like there are a lot of features with `keratin` in their names that are lowly-ranked in the denominator table.\n", "\n", "That could be worth looking into! Let's see how the log-ratio of these groups of features looks by itself. Try using the controls under the `Selecting Features by Filtering` section to select a log-ratio of features with a `name` containing the text `surfactant` to features with a `name` containing the text `keratin`.\n", "\n", "![](imgs/5.png)\n", "\n", "Interestingly, this log-ratio seems to recapitulate the patterns we got from our initial autoselection: there's still a clear separation between the normal and tumor samples, and (for the most part) the features with `surfactant` and `keratin` in their name seem to be mostly highly and lowly ranked for `diff:btw`, respectively.\n", "\n", "**Does this make sense?** This is the point in the analysis where your domain knowledge comes in. From some basic literature reviewing it seems like `keratin`-named features being associated with tumor samples makes sense, at least according to a few papers (e.g. [Trask et al. 1990](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC53678/), [Relli et al. 2018](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6238974/)): but maybe you have a different opinion about what this trend means! Perhaps you think we could choose a better numerator for a discriminatory log-ratio than `surfactant`. In the end, it's up to you." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.2. Finishing up\n", "That's as far as this tutorial goes for now, but we encourage you to try out some of the other things you can do in Qurro! For example, try selecting a different field in the sample plot's x-axis, or searching for genes by their chromosome.\n", "\n", "The controls in the Qurro interface enable more \"SOPs\" for exploring your data besides the basic one outlined above. **It's our hope that you take this example analysis not as a strict workflow but more as a starting point for determining how to answer your specific scientific question(s).**\n", "\n", "If you've made it this far, thanks for reading! As always, please feel free to [open an issue](https://github.com/biocore/qurro/issues) in Qurro's repository (or ask a question on the [QIIME 2 forums](https://forum.qiime2.org/)) if you have any questions, comments, or suggestions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 4 }