{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Selecting features for log-ratios in Qurro"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In Qurro, \"selecting features\" lets you choose the numerator and denominator feature(s) of a log-ratio. The values of this log-ratio will be shown on the y-axis of the sample plot.\n",
    "\n",
    "Qurro supports a few different methods of doing this. Here we'll try to describe these methods in detail, as well as some examples of how these can be useful.\n",
    "\n",
    "(For reference: all of the screenshots here were taken on the [Moving Pictures demo dataset](https://biocore.github.io/qurro/demos/q2_moving_pictures/index.html) using a development version of Qurro 0.7.0.)\n",
    "\n",
    "# 1. Autoselection\n",
    "\n",
    "Automatically selects an equal number of features from both sides of the rank plot as the numerator and denominator for log-ratio. This can be done using either a percentage (of the total number of features in the rank plot), or just using a literal number of features in the rank plot.\n",
    "\n",
    "As of version 0.7.0 of Qurro, you can enter in a negative number for autoselection (e.g. `-5` percent of features). This will _flip_ the selection, so that the numerator features are taken from the left side of the rank plot and the denominator features are taken from the right side of the rank plot. (This will have the effect of just switching the sign of each sample's log-ratio, since $\\ln\\big(\\dfrac{a}{b}\\big) = -\\ln\\big(\\dfrac{b}{a}\\big)$.)\n",
    "\n",
    "Other details: If the number of features specified isn't an integer (e.g. you request the top and bottom `5%` features, and there are 50 features in the rank plot), then the number of features selected from each side will be set to the _floor of the magnitude of this value_ (in this example, `2`, since $.05 \\times 50 = 2.5$). This behavior is consistent regardless of if the input number is positive or negative (if you enter `-5%` or `5%` all that will be different is which sides' features are set as the numerator / denominator).\n",
    "\n",
    "**Example:** selecting the log-ratio of the top 10% to bottom 10% of features for a given ranking.\n",
    "\n",
    "**In practice:** this is really useful for quickly looking at the top- and bottom-ranked features for a given ranking -- for example, if you want to see \"how well these rankings separate samples for a given metadata field.\"\n",
    "\n",
    "![](screenshots/autoselection.png)\n",
    "\n",
    "\n",
    "# 2. Clicking\n",
    "\n",
    "It's pretty basic compared to the other selection methods, but by clicking on the rank plot twice you can select simple log-ratios of single features. The first click sets the numerator feature for a log-ratio, and the second click sets the denominator feature for the log-ratio.\n",
    "\n",
    "**Example:** if you have a small number of features, and you just want to look at a log-ratio of two features quickly. (The [color data tutorial](https://nbviewer.jupyter.org/github/biocore/qurro/blob/master/example_notebooks/color_compositions/color_example.ipynb) is a good example of this.) \n",
    "\n",
    "**In practice:** this method isn't that useful compared to the other two, especially for datasets with more than a handful of features. If you're having a hard time clicking on features, increasing the bar width via the slider below the rank plot can help with this.\n",
    "\n",
    "![](screenshots/clicking.png)\n",
    "\n",
    "# 3. Filtering\n",
    "\n",
    "Filtering lets you select features based on focused searches through their IDs or metadata. The searches can be textual (e.g. selecting features where the `taxonomy` field contains the text `Ileibacterium`) or numeric (e.g. selecting features where some feature metadata value is less than 5). These searching methods are usually a good next step after autoselection.\n",
    "\n",
    "Keep in mind that you can mix and match these -- you don't have to use the same filtering method for the numerator and denominator selections.\n",
    "\n",
    "## 3.1. Textual Filtering\n",
    "\n",
    "All of the methods described here are case insensitive (i.e. they treat `abc`, `ABC`, and `AbC` as identical).\n",
    "\n",
    "### A small warning for textual filtering\n",
    "Certain types of data -- for example, taxonomy strings -- can be unexpectedly tricky to search. As an example, say you want to find all features classified in the genus `Streptococcus`. You can search for features with taxonomy annotations containing `Streptococcus`, but this will also give you hits for, for example, features classified in the genus `Peptostreptococcus` instead. (This may or may not be what you want.) You can get around this by including \"prefixes\" when searching (e.g. searching for `g__Streptococcus` should prevent this particular problem when your taxonomic classifications come from Greengenes), but **it's always a good idea to check the tables showing the selected numerator / denominator features to verify that things are working as you expected.**\n",
    "\n",
    "### 3.1.1. `contains the text`\n",
    "\n",
    "This will find features where the selected field contains some text. This input text can include starting/trailing whitespace (e.g. \" hamburger \"), and this whitespace will be included in the search being done (so a \"ahamburger.\" wouldn't be a match for \" hamburger \" since \"ahamburger.\" doesn't have the same surrounding whitespace).\n",
    "\n",
    "**Example:** selecting the log-ratio of all features with taxonomy annotations containing the text `Staphylococcus` over all features with taxonomy annotations containing the text `Propionibacterium`.\n",
    "\n",
    "**In practice:** useful for selecting individual groups of features. Also useful for exploring the rank plot (e.g. \"does this group of features seem to be mostly highly or lowly ranked?\")\n",
    "\n",
    "![](screenshots/containstext.png)\n",
    "\n",
    "### 3.1.2. `contains text separated by | (pipe)`\n",
    "\n",
    "This can be used to select features containing at least one of multiple possible sequences of text, which are separated in the input text by pipe characters (e.g. \"abc | def ghi | jklmnop\").\n",
    "\n",
    "Notably, this will strip whitespace surrounding the individual search terms: So in the example above, searches will only be done for \"abc\", \"def ghi\", and \"jklmnop\", not \"abc \", \" def ghi \", and \" jklmnop\". (That said, in most cases I don't think you should need to worry about whitespace all that much.)\n",
    "\n",
    "**Example:** selecting the log-ratio of all features with taxonomy annotations containing the text `Staphylococcus_aureus | Staphylococcus_epidermidis` over all features with taxonomy annotations containing the text `Propionibacterium`.\n",
    "\n",
    "**In practice:** useful for selecting multiple groups of features. This option and the `contains the text` option should be all you need for most text-based filtering.\n",
    "\n",
    "![](screenshots/containstext_pipe.png)\n",
    "\n",
    "### 3.1.3.  `is provided, and does not contain the text`\n",
    "\n",
    "This will select features where:\n",
    "\n",
    "1. The specified feature field (e.g. Feature ID) is provided for that feature in the feature metadata, and\n",
    "2. The specified feature field _does not_ contain the specified text.\n",
    "\n",
    "Note the first clause. If the selected feature field is not provided (e.g. no taxonomy information is provided for Feature A), then that feature won't show up in any results that involve this searching method. This behavior is the same as with other filtering methods, but we've explicitly specified it here so that it's clear (since you could argue that a non-existent taxonomy entry \"does not contain\" some text).\n",
    "\n",
    "**Example:** selecting the log-ratio of all features that with taxonomy annotations that contain the text `Bacteria` to all features with taxonomy annotations that *don't* contain the text `Bacteria`. (... Whether or not this is a good idea is another question ;)\n",
    "\n",
    "**In practice:** This can be useful for some very niche cases like the one shown above. For most \"normal\" analyses I don't think this will end up being that widely used.\n",
    "\n",
    "![](screenshots/containstext_negated.png)\n",
    "\n",
    "### 3.1.4. `contains the separated text fragment(s)`\n",
    "\n",
    "This will process the input text for searching by splitting it up by commas (`,`), semicolons (`;`), and whitespace. (So, e.g., `abc,def ; ghi` will turn into [`abc`, `def`, `ghi`].)\n",
    "\n",
    "Next, this will go through features and split up their value for the feature field in the same way (e.g. if you're searching on taxonomy, `D_0__Bacteria; D_1__Proteobacteria; D_2__Gammaproteobacteria; D_3__Oceanospirillales; D_4__Endozoicomonadaceae; D_5__Endozoicomonas; D_6__Candidatus Endonucleobacter bathymodioli` will get split into [`D_0__Bacteria`, `D_1__Proteobacteria`, `D_2__Gammaproteobacteria`, `D_3__Oceanospirillales`, `D_4__Endozoicomonadaceae`, `D_5__Endozoicomonas`, `D_6__Candidatus`, `Endonucleobacter`, `bathymodioli`].\n",
    "\n",
    "Finally, this will select features where there is at least one **exact** \"match\" between these split-up lists. (So, for example, if our input text is `D_3__Oceanospirillales`, then we'll get a match with a feature with the above taxonomy string; but if our input text is just `Oceanospirillales`, then we won't get a match, because the `D_3__` prefix is missing. Note that, weirdly, this means that an input text of `bathymodioli` will cause a match, because `bathymodioli` is separated by a space from the rest of the taxonomy string.)\n",
    "\n",
    "**Example:** Let's say you want to select all features classified in the `Staphylococcus` bacterial genus, but you're working with metagenomic data where various `Staphylococcus_phage` viruses are also present as features. This option will let you enter in `Staphylococcus` as your input text and just get results that have `Staphylococcus` separated by commas/semicolons/whitespace (e.g. `Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcaceae; Staphylococcus; Staphylococcus_aureus`) but not results that have `Staphylococcus` followed by other things (e.g. `Viruses; Caudovirales; Myoviridae; Spounalikevirus; Staphylococcus_phage_Sb_1`).\n",
    "\n",
    "**In practice:** This is an old method I implemented for doing this sort of searching before I added the `contains text separated by | (pipe)` method. I'm keeping this here for legacy purposes, but **I strongly suggest using the `contains text separated by | (pipe)` method now instead:** for cases like the Staphylococcus example above where the \"splitting\" is useful, you can simulate this yourself in the `contains text separated by | (pipe)` method by providing something like `Staphylococcus;` (purposely including the semicolon) as one of the terms in the input text.\n",
    "\n",
    "![](screenshots/containstext_fragments.png)\n",
    "\n",
    "## 3.2. Numeric Filtering\n",
    "\n",
    "These options are hopefully self-explanatory. These are designed to work with numeric feature metadata, or with the feature loading / differential values already present in the feature rankings you provide when constructing a Qurro visualization.\n",
    "\n",
    "**Example:** Say you want to select all features where the differential `Oxygen` is less than `-0.5`. You can do this by setting the feature field to `Oxygen`, selecting the `is less than` searching option, then entering in `-0.5` in the input text.\n",
    "\n",
    "**In practice:** This is mostly useful for selecting a certain number features from a particular side of the rank plot in a more targeted way than autoselection (there's an example of this in Figure 2 in [the Qurro paper](https://www.biorxiv.org/content/10.1101/2019.12.17.880047v1.full)), but I'm sure there could also be other uses if you have interesting numeric feature metadata.\n",
    "\n",
    "![](screenshots/numeric_filtering.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. Sidenote: computing log-ratios involving > 2 features\n",
    "In all of the selection methods above except for the \"clicking\" method, an arbitrary number of features can be present in the numerator or denominator of the log-ratio.\n",
    "\n",
    "In this case, the log-ratio is computed for a given sample by summing the abundances of the numerator features, summing the abundances of the denominator features, and then taking the log-ratio of these sums. Written out as a formula, this is `ln(top sum) - ln(bottom sum)` (or, [equivalently](https://en.wikipedia.org/wiki/List_of_logarithmic_identities#Using_simpler_operations), `ln(top sum / bottom sum)`).\n",
    "\n",
    "There are other ways of computing log-ratios that are commonly used -- for example, taking the geometric or arithmetic mean of all feature abundances in the numerator or denominator. Support for these alternatives is on our TODO list (...but PRs are always welcome because grad school is hard)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}