{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The program\n",
    "This file is a version of readsample27, updated for Python 3.7. It reads in raw output from [GP Microbiome](https://github.com/tare/GPMicrobiome), which is written for Python 2.7 and incorporates the Stan probalistic programming language via the Pystan module. For a full description of the tool and its output, see the relevant [research paper](https://academic.oup.com/bioinformatics/article/34/3/372/41574420). The pickle files that GP Microbiome created when I ran it on my Windows 10 computer needed to be reformatted before loading in Python 3 or with a Unix-based operating system such as Linux. Therefore, I created the pickle files used in this program by loading the original pickle files in Python 2.7 and re-pickling them using different encoding. (See readsample27 or readsample27_with_151_edit for more details, along with the actual code.) This program processes the all the output for the CF data simultaneously, saving the data to csv files for use in other programs in the repository. \n",
    "<br>\n",
    "\n",
    "Unlike most of the programs in this repository, it cannot be run unedited without the actual CF data. I included it to demonstrate how I handled a minor discrepancy that could have caused a significant (but not immediately obvious) error in some of my later plots of this output. See readsample37 to run an extended version of this program with example data, which is also a Python 2.7 program updated for Python 3.7.\n",
    "\n",
    "### The importance of consistent formatting\n",
    "When I created my GP Microbiome input files for the time points and prediction time points, I formatted them as time deltas, with units of days, for all but the first participant, who had ID number 151. I used 151's age in days at the time points and predicted time points. I decided after that first run that, after viewing the output, I preferred using time deltas because it facilitates side-by-side comparison of different participant's output tables. \n",
    "\n",
    "<br>\n",
    "\n",
    "When running the same program on multiple participants, as I did with GP Microbiome, I believe it is best practice to make the same formatting choices for each participant whenever possible. This facilitates comparison, prevents errors, and allows for simpler code. With this in mind, and because the output of this program becomes input for my plotting programs, I include this version of readsample37 to show how I created all the files at once while correcting the discrepancy in the output file samples_151.p (from participant 151), and the same function could have corrected for any other file which did not use time deltas. It transforms 151's time points into time deltas before creating the csv versions of the output files so that they match the others' formatting from this point on. At the end of this program, I include a comment with an 'if' statement which could have been inserted into my plotting functions, to illustrate how I could have handled this discrepancy another way. However, I prefer to make my formatting consistent in the files themselves rather than insert such a statement every time I run a program that uses the files as input.  \n",
    "\n",
    "<br>\n",
    "\n",
    "My function creates all of the files, regardless of formatting, quickly in a loop. Since this is a very specific edit to readsample37 for a single file (or specific alternative time point type), and is meant as a demonstration of how to handle discrepancies, I have kept the comments and explanations to a minimum. The aforementioned program readsample37 contains extensive comments and provides the option to explore the output before running a similar function without the special edit.\n",
    "\n",
    "<br>\n",
    "\n",
    "All of my code is written to be adaptable yet highly resistant to errors. However you choose to define your time points, take similar steps to ensure that, when you run functions like the ones in my plotting programs which generate 20 plots per participant simultaneously, those plots show exactly what they are intended to show. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output files contain different numbers of samples and predicted time points depending on the participant, but they all include noise-free compositions of 245 OTU's (operational taxonomic units) of bacteria. They are named consistently, using the format 'samples2_(participant ID).p'. This indicates that they are the re-pickled versions of files named with the format 'samples_(participant ID).p'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "#import libraries\n",
    "import pickle\n",
    "import pandas as pd\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#create one list for all the participants we wish to create files for, and a second for the ones with different time points\n",
    "#in our case, since we adjusted our formatting to be consistent after the first run, the edit list has only one item\n",
    "IDs=['151','708','759','764','768']\n",
    "edit_list=['151']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#The function can process all the output files at once using a loop\n",
    "def file_create(name):\n",
    "    T,T_p,samples = pickle.load(open('samples2_{}.p'.format(name),'rb'), encoding='bytes')\n",
    "    rows=len(samples['Theta_G'].mean(0).T)\n",
    "    if name in edit_list:\n",
    "        #find the age at the first time point and subtract\n",
    "        #we don't need to do any other type of transformation because the units are still days\n",
    "        d=T[0]\n",
    "        T=[i-d for i in T]\n",
    "        T_p=[i-d for i in T_p]\n",
    "    else:\n",
    "        T=T.tolist()\n",
    "        T_p=T_p.tolist()\n",
    "    df=pd.DataFrame(samples['Theta_G'].mean(0).T, columns=[i for i in range(len(T))], index=[i+1 for i in range(rows)])\n",
    "    df.loc[0] = T\n",
    "    df=df.sort_index()\n",
    "    df.to_csv('{}.csv'.format(name), index=False)\n",
    "    df2=pd.DataFrame(samples['Theta_G_i'].mean(0).T, columns=[i for i in range(len(T),len(T)+len(T_p))], index=[i+1 for i in range(rows)])\n",
    "    df2.loc[0] = T_p\n",
    "    df2 = df2.sort_index()\n",
    "    dfboth = pd.concat([df, df2], axis=1, sort=False)\n",
    "    dfboth.to_csv('{}_both.csv'.format(name), index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Example of running the function\n",
    "for i in IDs:\n",
    "    file_create(i)\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#the alternative correction method for the discrepancy in formatting 151's time points:\n",
    "#this 'if' statement would have been inserted only into the plotting functions that included predictions\n",
    "#the variable 'r' represents a dataframe which is a version of the 'both' csv, with the time points reordered\n",
    "#the comment would be inserted after the variable is defined, near the beginning of the function\n",
    "#if r.iloc[0,0]!=0:\n",
    "    #r.loc[0]-=r.iloc[0,0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}