{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the CF data, box plots showing the output from the TIME pairwise Dynamic Time Warping (DTW) Distance workflow (Workflow 5b) between individual pairs of bacteria grouped by clinical condition were not the best choice, but often one does want to make box plots grouped by a factor variable. With this in mind, this code creates box plots, grouped by condition, of the pairwise DTW distance for the repeated antibiotic perturbation example data available on the TIME website. Dynamic Time Warping is a measure of similarity in longitudinal data, and the TIME version of the algorithm ranges from 0 to 1. 0 is the most similar, and 1 is the most different. For more on the algorithm, see the [relevant research paper here](https://www.frontiersin.org/articles/10.3389/fmicb.2018.00036/full).\n",
    "<br>\n",
    "\n",
    "See Create_TDTW_all_example for more details on Workflow 5b and the repeated antibiotic perturbation data set, which shows differences in participant's microbiome before and after antibiotic use. Create_TDTW_all_example generates the input file for this program, merging output files downloaded from the TIME Workflow 5b. However, if you have not run it yet, I have provided the input file for this program, along with the input files for Create_TDTW_all_example, in the 'Extras' folder for your convenience.\n",
    "<br>\n",
    "\n",
    "[Click here to run the TIME example analysis](https://web.rniapps.net/time/index.php) now if you would like, or [here to view the antibiotic research paper](https://www.pnas.org/content/108/Supplement_1/4554.long) and learn more about this research. \n",
    "<br>\n",
    "\n",
    "Since the antibiotic example analysis only contains 3 participants, these may not be the most informative of box plots, but they demonstrate the plots well. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#import necessary libraries\n",
    "import pandas as pd\n",
    "from matplotlib import pyplot as plt\n",
    "%matplotlib inline\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#make a list of bacteria which are of special interest to us, which we wish to plot\n",
    "#these are not the same as the ones of interest in the CF research because they are in the gut rather than the lung\n",
    "bacteria=['Bacteroides','Roseburia','Clostridium']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first few cells modify the dataset to focus on only 3 OTU's (operational taxonomic units) in our list, creating a separate file for each of the three which shows the distance from the other OTU's to that one OTU for every condition. Three similar files were created for just the DTW distances which were caluclated for the 'All' condition in the programs DTW_All_boxplots and DTW_All_boxplots_example. Of course, you can focus on any number of OTU's by editing the list.\n",
    "<br>\n",
    "\n",
    "If the filtered files have already been created and saved for the bacteria you wish to plot, you can skip to Section 2 and read them in.\n",
    "<br>\n",
    "\n",
    "The first cell in this section reads in the file created in the program Create_TDTW_all_example. If you have not run that program yet, edit the cell as directed in the comments to use the version in the 'Extras' folder. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#read in the file, created in the program Create_TDTW_all_example, which contains all the output from Workflow 5b\n",
    "df = pd.read_csv('Data/TDTW_all_example.csv')\n",
    "#to use the ones in the 'Extras' folder, comment out the previous line and un-comment this one:\n",
    "#df = pd.read_csv(\"Data/Extras/TDTW_all_example.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#rename the first column, which has no name, to 'Sample' \n",
    "df.rename(columns={'Unnamed: 0': 'Sample'}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#extract the participant's clinical status from the sample identifier and change to title case\n",
    "#this particular example output is actually already in title case \n",
    "#I included str.title() for the convenience of those editing it to use on data with a lower case status (like the CF data)\n",
    "df['Status']=df['Sample'].str.split('_').str[2].str.title()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#extract the participant's ID number from the identifier\n",
    "df['ID']=df['Sample'].str.split('_').str[1]\n",
    "#review the changes to confirm that they worked\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#keep only the colums with at least 5 non-NaN's\n",
    "df.dropna(thresh=5, axis=1, inplace=True)\n",
    "#you can, of course, change the threshold number based on your data, skip this cell and leave all NaN's, \n",
    "#or drop all NaN's with df3.dropna(axis=1, inplace=True) \n",
    "#view which columns were omitted, if desired\n",
    "#df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In Create_TDTW_all_example, we chose to add a single underscore as a separator between bacteria names when we made the file TDTW_all_example. This means that, if you choose to edit this code to run it on your own data instead of the example data, the code below may need to be modified to reflect formatting differences in your data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#make a list of columns (pairs of OTU's) for only the 3 OTU's of interest, excluding the ones that pair them with themselves\n",
    "#we also include the 'Status' column to enable grouping by status\n",
    "clist=[]\n",
    "for i in range(len(bacteria)):\n",
    "    #in this line, you may need to do a minor edit such as removing some underscores if your bacteria names don't have them\n",
    "    #viewing the head of the table above should tell you if and how you need to change the str.format() function\n",
    "    j=[col for col in df.columns if col.startswith('{}'.format(bacteria[i]))\n",
    "       and col!= '{}_{}'.format(bacteria[i],bacteria[i]) or col=='Status']\n",
    "    clist.append(j)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the next cell as it is written if you want to save the files after creating them. That will enable you to skip Section 1 and go straight to Section 2 the next time you run this program. If you don't want to save the files, just comment out the line which saves them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#create the files and save, adding them to a dictionary\n",
    "selectdfs={}\n",
    "for i in range(len(bacteria)):\n",
    "    df2=df.filter(clist[i], axis=1)\n",
    "    #since each data frame is for the distance from one bacteria, change the column names to be just the second bacteria\n",
    "    #this code may need adjustment depending on how many underscores are in your bacteria names\n",
    "    df2.columns=[w.replace('{}_'.format(bacteria[i]),'') for w in df2.columns]\n",
    "    #save file if you want to skip Section 1 next time, or comment out the next line\n",
    "    df2.to_csv(\"Data/TDTW_{}_conditions_boxplots.csv\".format(bacteria[i]), index=False)\n",
    "    selectdfs[bacteria[i]]=df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you ran Section 1, you can skip the first cell in Section 2, which reads the files in if you have saved them previously and creates the dictionary which was made in Section 1's last cell. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#if you saved the files, read them into a dictionary here\n",
    "#if you ran Section 1, you already have the dictionary, so skip this cell\n",
    "selectdfs = {i: pd.read_csv(\"Data/TDTW_{}_conditions_boxplots.csv\".format(i)) for i in bacteria}\n",
    "#view one of the files to get a feel for what the data look like, if desired\n",
    "#selectdfs['Bacteroides'].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All of my plotting functions are written to save the plots to a folder called 'Plots,' which is in this repository as well. Adjust the file path if you want to save them somewhere else, or comment out the line of code which saves them if you prefer not to save."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#now we plot them, taking care not to plot the same pairwise distance twice\n",
    "skip=[]\n",
    "for i in bacteria:\n",
    "    clist=list(selectdfs[i].columns.values)\n",
    "    clist=[col for col in clist if col!='Status' and col not in skip]\n",
    "    for j in clist:\n",
    "        plt.figure(figsize=(10,8))\n",
    "        sns.boxplot(x='Status',y='{}'.format(j),data=selectdfs[i], order=['Precp','Firstcp','Firstwpc','All'])\n",
    "        plt.title('DTW Distance between {} and {}'.format(i,j), size=16)\n",
    "        plt.xlabel('Status', size=14)\n",
    "        plt.ylabel('Distance',size=14)\n",
    "        #save file if desired\n",
    "        plt.savefig(\"Plots/{}_{}.png\".format(i,j), format = \"png\")\n",
    "        plt.show()\n",
    "    skip.append(i) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sample of what one box plot will look like:\n",
    "<br>\n",
    "<img src='https://imgur.com/OTctuR0.png' style='height:500px'>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}