{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Repeat measure experimental designs (e.g. time series) are a valid and powerful method to control for inter-individual variation. However, conventional dimensionality reduction methods can not account for the high-correlation of each subject to itself at a later time point. This inherent correlation structure can cause subject grouping to confound or even outweigh important phenotype groupings. In the CTF tutorials we covered how tensor factorization can help. However, in many cases the time points are irregularly sampled or we may want to project new data into a previously run factorization. For these cases and more we introduced [TEMPTED](https://www.biorxiv.org/content/10.1101/2023.07.26.550749v1). There is an R implementation here but in this tutorial we will cover how to run an analysis with TEMPTED in QIIME2 / python. \n",
    "\n",
    "First we will download the processed data originally from [here](https://qiita.ucsd.edu/study/description/10249#). This data can be downloaded with the following links:\n",
    "\n",
    "* **Table** (table.qza) | [download](https://github.com/biocore/gemelli/tree/master/ipynb/tutorials/ECAM-Qiita-10249/table.qza)\n",
    "* **Sample Metadata** (metadata.tsv) | [download](https://github.com/biocore/gemelli/tree/master/ipynb/tutorials/ECAM-Qiita-10249/metadata.tsv)\n",
    "* **Feature Metadata** (taxonomy.qza) | [download](https://github.com/biocore/gemelli/tree/master/ipynb/tutorials/ECAM-Qiita-10249/taxonomy.qza)\n",
    "* **Tree** (sepp-insertion-tree.qza) | [download](https://github.com/biocore/gemelli/tree/master/ipynb/tutorials/ECAM-Qiita-10249/insertion-tree.qza)\n",
    "\n",
    "**Note**: This tutorial assumes you have installed [QIIME2](https://qiime2.org/) using one of the procedures in the [install documents](https://docs.qiime2.org/2020.2/install/). This tutorial also assumed you have installed, [Qurro](https://github.com/biocore/qurro) and [gemelli](https://github.com/biocore/gemelli).\n",
    "\n",
    "First, we will make a tutorial directory and download the data above and move the files to the `ECAM-Qiita-10249` directory:\n",
    "\n",
    "```bash\n",
    "mkdir ECAM-Qiita-10249\n",
    "```\n",
    "\n",
    "First we will import our data with the QIIME2 Python API. \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import warnings\n",
    "import qiime2 as q2\n",
    "# hide pandas Future/Deprecation Warning(s) for tutorial\n",
    "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) \n",
    "warnings.simplefilter(action='ignore', category=FutureWarning)\n",
    "\n",
    "# import table(s)\n",
    "table = q2.Artifact.load('ECAM-Qiita-10249/table.qza')\n",
    "# import metadata\n",
    "metadata = q2.Metadata.load('ECAM-Qiita-10249/metadata.tsv')\n",
    "# import tree\n",
    "tree = q2.Artifact.load('ECAM-Qiita-10249/insertion-tree.qza')\n",
    "# import taxonomy\n",
    "taxonomy = q2.Artifact.load('ECAM-Qiita-10249/taxonomy.qza')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run TEMPTED we only need to run one command. The required input requirements are:\n",
    "\n",
    "1. table\n",
    "    - The table is of type `FeatureTable[Frequency]` which is a table where the rows are features (e.g. ASVs/microbes), the columns are samples, and the entries are the number of sequences for each sample-feature pair.\n",
    "2. sample-metadata\n",
    "    - This is a QIIME2 formatted [metadata](https://docs.qiime2.org/2020.2/tutorials/metadata/) (e.g. tsv format) where the rows are samples matched to the (1) table and the columns are different sample data (e.g. time point).  \n",
    "3. individual-id-column\n",
    "    - This is the name of the column in the (2) metadata that indicates the individual subject/site (e.g. subject ID) that was sampled repeatedly.\n",
    "4. state-column\n",
    "    - This is the name of the column in the (2) metadata that indicates the numeric repeated measure (e.g., Time in months/days) or non-numeric category (i.e. decade/body-site). \n",
    "5. output-dir\n",
    "    - The desired location of the output. We will cover each output independently below.  \n",
    "\n",
    "There are also _optional_ input parameters:\n",
    "\n",
    "*  n-components \n",
    "    * The underlying low-rank structure. The input can be\n",
    "      an integer (suggested: 1 < rank < 10) [minimum 2].\n",
    "      Note: as the rank increases the runtime will\n",
    "      increase dramatically. [default: 3]\n",
    "* replicate-handling\n",
    "     * Choose how replicate samples are handled. If\n",
    "       replicates aredetected, \"error\" causes method to\n",
    "       fail; \"drop\" will discard all replicated samples;\n",
    "       \"random\" chooses one representative at random from\n",
    "       among replicates. [default: 'random']\n",
    "* svd-centralized (bool/flag)\n",
    "    * Removes the mean structure of the temporal tensor. [default: True]\n",
    "* n-components-centralize \n",
    "    * Rank of approximation for average matrix in svd-centralize. [default: 1]\n",
    "* smooth\n",
    "    * Smoothing parameter for RKHS norm. Larger means\n",
    "      smoother temporal loading functions. [default: 1e-06]\n",
    "* resolution\n",
    "    * Number of time points to evaluate the value of the\n",
    "      temporal loading function.[default: 101]\n",
    "* max-iterations \n",
    "    * Maximum number of iteration in for rank-1 calculation. [default: 20]\n",
    "* epsilon\n",
    "    * Convergence criteria for difference between iterations for each rank-1 calculation. [default: 0.0001]\n",
    "                                                             \n",
    "In this tutorial our individual-id-column is `host_subject_id` and our state-column is different time points denoted as `month` in the sample metadata. Now we are ready to run TEMPTED:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Results (name = value)\n",
       "-------------------------------------------------------------------------------------------------------------\n",
       "individual_biplot = <artifact: PCoAResults % Properties('biplot') uuid: bec45a0a-78d1-4f52-8247-c533c36bab39>\n",
       "state_loadings    = <artifact: SampleData[SampleTrajectory] uuid: 29d6d315-33e0-4fb5-8087-cd772dd95489>\n",
       "distance_matrix   = <artifact: DistanceMatrix uuid: fa902a42-24a3-41fc-a18e-e9d9c694457f>\n",
       "svd_center        = <artifact: SampleData[SampleTrajectory] uuid: c531b3ac-e87a-4ec3-8402-63e865891bf7>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from qiime2.plugins.gemelli.actions import (tempted_factorize, clr_transformation)\n",
    "\n",
    "table_transformed = clr_transformation(table).clr_table\n",
    "tempted_res = tempted_factorize(table_transformed,\n",
    "                                metadata,\n",
    "                                'host_subject_id',\n",
    "                                'month')\n",
    "tempted_res\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now cover the output files:\n",
    "\n",
    "* individual_biplot\n",
    "* state_loadings\n",
    "* distance_matrix\n",
    "* svd_center\n",
    "\n",
    "First, we will explore the `individual_biplot` which is an ordination where dots represent _subjects_ not _samples_ and arrows represent features (e.g. ASVs). First, we will need to aggregate the metadata by subject (i.e. collapsing the metadata of all samples from a given subject). This can be done by hand or using DataFrames in python (with pandas) or R like so:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>delivery</th>\n",
       "      <th>antiexposedall</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>#SampleID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1.0</th>\n",
       "      <td>Vaginal</td>\n",
       "      <td>y</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2.0</th>\n",
       "      <td>Cesarean</td>\n",
       "      <td>n</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4.0</th>\n",
       "      <td>Cesarean</td>\n",
       "      <td>n</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5.0</th>\n",
       "      <td>Cesarean</td>\n",
       "      <td>n</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7.0</th>\n",
       "      <td>Cesarean</td>\n",
       "      <td>n</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           delivery antiexposedall\n",
       "#SampleID                         \n",
       "1.0         Vaginal              y\n",
       "2.0        Cesarean              n\n",
       "4.0        Cesarean              n\n",
       "5.0        Cesarean              n\n",
       "7.0        Cesarean              n"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from qiime2 import Metadata\n",
    "\n",
    "# first we import the metdata into pandas\n",
    "mf = pd.read_csv('ECAM-Qiita-10249/metadata.tsv', sep='\\t',index_col=0)\n",
    "# next we aggregate by subjects (i.e. 'host_subject_id') \n",
    "# and keep the first instance of 'diagnosis_full' by subject.\n",
    "mf = mf.groupby('host_subject_id').agg({'delivery':'first','antiexposedall':'first'})\n",
    "# now we save the metadata in QIIME2 format.\n",
    "mf.index.name = '#SampleID'\n",
    "mf.index = mf.index.astype(str)\n",
    "mf.to_csv('ECAM-Qiita-10249/subject-metadata.tsv', sep='\\t')\n",
    "mf.head(5)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now with the collapsed `subject-metadata.tsv` table we are ready to plot with emperor: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ECAM-Qiita-10249/subject_biplot.qzv'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from qiime2.plugins.emperor.visualizers import biplot\n",
    "\n",
    "# plot subject biplot\n",
    "individual_biplot_emperor = biplot(tempted_res.individual_biplot,\n",
    "                                Metadata(mf),\n",
    "                                feature_metadata=taxonomy.view(Metadata),\n",
    "                                number_of_features=15)\n",
    "individual_biplot_emperor.visualization.save('ECAM-Qiita-10249/subject_biplot.qzv')\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From this visualization we can see that the birth mode clearly separate the subjects across time.\n",
    "\n",
    "![image.png](etc/tempted_img_1.png)\n",
    "\n",
    "We can also see that the birthmode grouping is separated entirely along the first PC (axis 2). We can now use [Qurro](https://github.com/biocore/qurro) to explore the feature loading partitions (arrows) in this biplot as a log-ratio of the original table counts. This allows us to relate these low-dimensional representations back to our original data. Additionally, log-ratios provide a nice set of data points for additional analysis such as LME models. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2961 feature(s) in the BIOM table were not present in the feature rankings.\n",
      "These feature(s) have been removed from the visualization.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'ECAM-Qiita-10249/qurro.qzv'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from qiime2.plugins.qurro.actions import loading_plot\n",
    "\n",
    "# run Qurro\n",
    "qurro_plot = loading_plot(tempted_res.individual_biplot, table,\n",
    "                          metadata,\n",
    "                          feature_metadata=taxonomy.view(q2.Metadata))\n",
    "# save visual\n",
    "qurro_plot.visualization.save('ECAM-Qiita-10249/qurro.qzv')\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the Qurro output `qurro.qzv` we will simply choose the PC2 loadings above and below zero as the numerator (red ranks) and denominator (blue ranks) to create a log-ratio that differentiates the samples by birth mode status. Log-ratios can also be chosen by taxonomy or sequence identifiers (see the Qurro tutorials [here](https://github.com/biocore/qurro#tutorials) for more information). We can plot this log-ratio in Qurro with the x-axis as time and the color as birth mode, which clearly shows nice separation between phenotypes. \n",
    "\n",
    "![image.png](etc/tempted_img_2.png)\n",
    "\n",
    "We can further explore these phenotype differences by exporting the `sample_plot_data.tsv` from Qurro (marked in a orange box above) which will provide the selected log-ratio values for each sample. We can then merge this `sample_plot_data` with our sample metadata in python or R. \n",
    "\n",
    "**Note:** Qurro will have an option to export all of the metadata or only the log-ratio data soon.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>month</th>\n",
       "      <th>host_subject_id</th>\n",
       "      <th>delivery</th>\n",
       "      <th>Current_Log_Ratio</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>#SampleID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10249.C018.14SS</th>\n",
       "      <td>16.0</td>\n",
       "      <td>18.0</td>\n",
       "      <td>Vaginal</td>\n",
       "      <td>1.604457</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10249.C020.22SS</th>\n",
       "      <td>20.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>Cesarean</td>\n",
       "      <td>3.479770</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 month  host_subject_id  delivery  Current_Log_Ratio\n",
       "#SampleID                                                           \n",
       "10249.C018.14SS   16.0             18.0   Vaginal           1.604457\n",
       "10249.C020.22SS   20.0             20.0  Cesarean           3.479770"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# import log-ratio data\n",
    "metadata_one = pd.read_csv('ECAM-Qiita-10249/metadata.tsv',\n",
    "                           sep='\\t', index_col=0)\n",
    "# import rest of the metadata\n",
    "metadata_two = pd.read_csv('ECAM-Qiita-10249/sample_plot_data.tsv',\n",
    "                           sep='\\t', index_col=0)[['Current_Log_Ratio']]\n",
    "# merge the data\n",
    "log_ratio_metdata = pd.concat([metadata_two, metadata_one], axis=1)\n",
    "# ensure no duplicate columns\n",
    "log_ratio_metdata = log_ratio_metdata.dropna(subset=['Current_Log_Ratio'])\n",
    "log_ratio_metdata.index = log_ratio_metdata.index.astype(str)\n",
    "# export in QIIME2 format\n",
    "log_ratio_metdata = log_ratio_metdata[['month','host_subject_id',\n",
    "                                       'delivery','Current_Log_Ratio']]\n",
    "\n",
    "log_ratio_metdata.index.name = '#SampleID'\n",
    "log_ratio_metdata.to_csv('ECAM-Qiita-10249/merged_sample_plot_data.tsv', sep='\\t')\n",
    "log_ratio_metdata.head(2)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see above the metadata now has the added column of `Current_Log_Ratio` from Qurro. So now we will continue to explore this log-ratio by first plotting it explicitly over time with q2-longitudinal.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ECAM-Qiita-10249/log_ratio_plot.qzv'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from qiime2.plugins.longitudinal.actions import (volatility, linear_mixed_effects) \n",
    "\n",
    "# make a time series plot of log-ratio\n",
    "temporal_plot = volatility(q2.Metadata(log_ratio_metdata),\n",
    "                           'month',\n",
    "                           individual_id_column='host_subject_id',\n",
    "                           default_group_column='delivery',\n",
    "                           default_metric='Current_Log_Ratio')\n",
    "temporal_plot.visualization.save('ECAM-Qiita-10249/log_ratio_plot.qzv')\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This demonstrates that we can recreate the separation by birth mode that we saw in both the `subject_biplot` & `state_subject_ordination`, allowing us to associate specific taxa (in the numerator or denominator) with a particular phenotype.\n",
    "\n",
    "![image.png](etc/tempted_img_3.png)\n",
    "\n",
    "We can test the statistical power of this log-ratio to differentiate samples by birth mode status using a linear mixed effects (LME) through q2-longitudinal. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/cameronmartino/miniconda2/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_longitudinal/_longitudinal.py:291: UserWarning: This is only a warning, and the results of this action are still valid. The column name \"predicted Current_Log_Ratio\" already exists in your metadata file. Any \"raw\" metadata that can be downloaded from the resulting visualization will contain overwritten values for this metadata column, not the original values.\n",
      "  warnings.warn(warning, UserWarning)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'ECAM-Qiita-10249/lme_log_ratio.qzv'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Run LME model on log-ratio\n",
    "lme_plot = linear_mixed_effects(q2.Metadata(log_ratio_metdata),\n",
    "                                'month',\n",
    "                                individual_id_column='host_subject_id',\n",
    "                                group_columns='delivery',\n",
    "                                metric='Current_Log_Ratio')\n",
    "lme_plot.visualization.save('ECAM-Qiita-10249/lme_log_ratio.qzv')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From this LME model we can see that indeed the birthmode grouping is significant across time. \n",
    "\n",
    "![image.png](etc/tempted_img_4.png)\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}