{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This program reads in and merges multiple output files from the TIME pairwise Dynamic Time Warping (DTW) Distance workflow (Workflow 5b). Dynamic Time Warping is a measure of similarity in longitudinal data, and the TIME version of the algorithm ranges from 0 to 1. 0 is the most similar, and 1 is the most different. For more on the algorithm, see the [relevant research paper here](https://www.frontiersin.org/articles/10.3389/fmicb.2018.00036/full)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Workflow 5b allows for calculations across all samples as well as by condition. Conditions are part of the metadata entered into the application. Here, for my example code, I used the repeated antibiotic perturbation example analysis provided by TIME.(To see a version of this program specific to the CF data, go to Create_TDTW_all_filtered.) [Click here to run this analysis](https://web.rniapps.net/time/index.php) now. To learn more about the antibiotic data, click [here to view the antibiotic research paper](https://www.pnas.org/content/108/Supplement_1/4554.long). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I chose 3 of the conditions: PreCp (before the first course of the antibiotic ciprofloxacin), FirstCp (after the first course), and FirstWPC (a week after the first course ended). I used the following settings: a taxonomic level of 'Genus', the default DTW constraint of 2 and the default 0.5 cutoff for ignoring rare taxa. I encourage users to run the example analyses, with the same settings or their own choice of settings, and then run this program on the output. I have also included the output files in a folder called 'Box Plot Data,' located in the 'Data' folder of this repository under 'Extras.'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output csv files are automatically generated, with names in the following format:\n",
    "<br>\n",
    "\n",
    "(participant id)\\_(taxonomic level)\\_mdtw\\_(cutoff for ignoring rare taxa)\\_(constraint)\\_(condition).csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, one of my output files was named 'E_Genus_mdtw_0.5_2_PreCp.csv', for the analysis at the genus level of only the samples taken from particpant E before the first course of antibiotics."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This program and Create_TDTW_all_filtered generate files which are used as input in my box plot programs. However, you cannot run Create_TDTW_all_filtered unedited without the CF data. With this in mind, I have included in this repository an altered version of the output from Create_TDTW_all_filtered called TDTW_all_filtered2.csv, using randomly generated values, so that users can run both versions of the box plot programs. If you would like to see the code I used to generate TDTW_all_filtered2.csv, please email me at vtalbot@lesley.edu or vrtalbot@yahoo.com.\n",
    "<br>\n",
    "\n",
    "This program creates the input for DTW_All_boxplots_example. Users can do everything themselves, from running the TIME workflow (as mentioned above) to creating the merged file here to plotting the data in DTW_All_boxplots_example. The output from this program also serves as input for DTW_boxplots_by_status, which does not have a counterpart for the CF data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code may be easily modified to read in and merge multiple files which differ in name only by certain strings or variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "#import necessary libraries\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "#make a list of the ID's of participants whose data were analyzed in the workflow\n",
    "IDs=['D','E','F']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read in the files\n",
    "If you have the files saved to your computer, use the first cell. If you want to use the files I included in the 'Extras' folder, go to the second cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "#code for when the files are saved to your computer in the 'Data' folder\n",
    "#edit file paths if the files are not in the 'Data' folder\n",
    "#if you're using Binder and have downloaded the files to your computer, just upload them into the 'Data' folder on Binder\n",
    "#we read the output files from the different types of samples into dictionaries\n",
    "#first, a dictionary for the pairwise distances across all the samples combined\n",
    "csv = {i: pd.read_csv('Data/{}_Genus_mdtw_0.5_2_All.csv'.format(i)) for i in IDs }\n",
    "#then for each of the three conditions individually\n",
    "csvPreCp = {i: pd.read_csv('Data/{}_Genus_mdtw_0.5_2_PreCp.csv'.format(i)) for i in IDs}\n",
    "csvFirstCp = {i: pd.read_csv('Data/{}_Genus_mdtw_0.5_2_FirstCp.csv'.format(i)) for i in IDs}\n",
    "csvFirstWPC = {i: pd.read_csv('Data/{}_Genus_mdtw_0.5_2_FirstWPC.csv'.format(i)) for i in IDs}\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#code for if you want to use the 'Extras' folder\n",
    "#read the output files from the different types of samples into dictionaries\n",
    "#first for the pairwise distances across all the samples combined\n",
    "csv = {i: pd.read_csv('Data/Extras/Box Plot Data/{}_Genus_mdtw_0.5_2_All.csv'.format(i)) for i in IDs }\n",
    "#then for each of the three conditions individually\n",
    "csvPreCp = {i: pd.read_csv('Data/Extras/Box Plot Data/{}_Genus_mdtw_0.5_2_PreCp.csv'.format(i)) for i in IDs}\n",
    "csvFirstCp = {i: pd.read_csv('Data/Extras/Box Plot Data/{}_Genus_mdtw_0.5_2_FirstCp.csv'.format(i)) for i in IDs}\n",
    "csvFirstWPC = {i: pd.read_csv('Data/Extras/Box Plot Data/{}_Genus_mdtw_0.5_2_FirstWPC.csv'.format(i)) for i in IDs}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Example of what the head of one of the tables looks like: \n",
    "\n",
    "<img src='https://imgur.com/4dSoQWe.png' style='height:180px'>\n",
    "\n",
    "<br>\n",
    "The first column shows one bacteria, the next column shows a second, and the third contains their pairwise distance - in this case across all samples, for the participant identified as 'D'.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "View it yourself, if you like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "csv['D'].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we process the files to prepare for merging into one data frame. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "#create a dictionary for the dictionaries, and a list of the keys\n",
    "#for the CF data I made the keys lower case because the long strings the key words went into were easier to read that way\n",
    "#here I don't see the need; the keys are short and mostly abbreviations rather than words\n",
    "conditions={'All': csv, 'PreCp':csvPreCp, 'FirstCp':csvFirstCp, 'FirstWPC':csvFirstWPC}\n",
    "keylist=list(conditions.keys())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "#combine the two bacteria name columns, and edit distance column to specify which samples it gives the pairwise distance for\n",
    "#unlike the CF data, the OTU's in this data set have no underscores before their names\n",
    "#we join the pairs with an underscore as a separator\n",
    "for i in IDs:\n",
    "    for j in keylist:\n",
    "        conditions[j][i]['Taxa1Taxa2'] = conditions[j][i][['Taxa1', 'Taxa2']].apply(lambda x: '_'.join(x), axis=1)\n",
    "        conditions[j][i]=conditions[j][i].drop(['Taxa1','Taxa2'],1)\n",
    "        conditions[j][i].rename(columns={'TIME_DTW_Distance': 'TDTW_{}_{}'.format(i,j)}, inplace=True)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is how, for each condition, one of the data frames looks now.\n",
    "<table><tr>\n",
    "<td> <img src='https://imgur.com/Sg8rh4z.png' style='height:180px'>\n",
    "    <img src='https://imgur.com/akc9Aej.png' style='height:180px'></td>\n",
    "<td> <img src='https://imgur.com/BVZEVy5.png' style='height:180px'>\n",
    "<img src='https://imgur.com/z6K8v4f.png' style='height:180px'></td>\n",
    "</tr></table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "View them yourself, if you like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "csv['D'].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "csvPreCp['E'].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "csvFirstCp['F'].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "csvFirstWPC['D'].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we merge everything into one data frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#merge the dataframes\n",
    "for i in IDs:\n",
    "    for j in keylist:\n",
    "        if i=='D' and j=='All':\n",
    "            result=conditions[j][i]\n",
    "        else:\n",
    "            result = pd.merge(result, conditions[j][i], on=\"Taxa1Taxa2\", how= \"outer\")\n",
    "#view the merged file, if desired\n",
    "#result.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first few columns and first few rows after the merge:\n",
    "<img src='https://imgur.com/B13Dz0f.png' style='height:180ppx'>\n",
    "The columns need reordering, and we want to tranpsose the table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "#sort the columns and transpose\n",
    "result = result.reindex(sorted(result.columns), axis=1)\n",
    "result=result.set_index('Taxa1Taxa2').T\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#check that it ran properly by looking at the head of the result data frame\n",
    "result.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What the first few columns and first few rows of the table should look like now:\n",
    "<img src='https://imgur.com/A2xxDm3.png' style='height:180ppx'>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "#save file, keeping the index because it has now it is now an identifying string\n",
    "result.to_csv(\"Data/TDTW_all_example.csv\", index=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}