{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The program\n",
    "This file, intended for Python 2.7, reads in raw output from [GP Microbiome](https://github.com/tare/GPMicrobiome), which is written for Python 2.7 and incorporates the Stan probabilistic programming language via the PyStan module. This program processes all of the output for the CF data at once, saving the data to csv files for use in the other programs in this repository, which run on up-to-date versions of Python. Unlike most of the programs in this repository, it cannot be run unedited without the actual CF data. I included it to demonstrate how I handled a minor discrepancy that could have caused a significant (but not immediately obvious) error in some of my later plots of this output. See readsample27 to run an extended version of this program with example data.\n",
    "<br>\n",
    "\n",
    "### Pickle issues\n",
    "After running GP Microbiome on my Windows 10 computer, the most straightforward way for me to process the output was to load it in Python 2.7 and convert the pickle file to a series of csv files, then reload them in my Python 3.7 programs. After all, I had a Python 2.7 environment on my computer already set up to run GP Microbiome in.\n",
    "\n",
    "<br>\n",
    "\n",
    "However, I wanted others to be able to run my programs. Windows writes pickle files such as those created by GP Microbiome differently than other operating systems do, because Python on Windows translates line endings when a file is opened in text mode, while Unix-based systems do not. There are also noteworthy incompatibilities between the versions of the pickle module used in Python 2 and Python 3. My solution to both issues: after loading each original pickle file on my Windows 10 machine, I dump (save) it again in binary mode, which pickle can read on any operating system.\n",
    "\n",
    "<br>\n",
    "\n",
    "As a result, while you may not be able to load the initial versions of the pickle files, you can just skip down to the cell that reads in the re-encoded versions instead. The rest of the program uses those modified pickle files. This enables people to run this program on Binder's Linux system with its version of Python 2.7. It also enables people to run the program in Python 2.7 on their own computers not only if they have Linux, but also if they have Mac or any other Unix-based operating system, because they all handle text files (the source of the initial discrepancy) in the same way.\n",
    "\n",
    "<br>\n",
    "\n",
    "Re-encoding the files also enables users to load the file with Python 3 using the edited versions of this program on any system. I have included in the repository the programs readsample37_with_151_edit and readsample37, which are versions of this program and readsample27, edited for use with Python 3.7 on any operating system.\n",
    "<br>\n",
    "\n",
    "### The importance of consistent formatting\n",
    "When I created my GP Microbiome input files for the time points and prediction time points, I formatted them as time deltas, in units of days, for all but the first participant, who had ID number 151. For 151, I used the participant's age in days at the time points and predicted time points. After viewing the output of that first run, I decided that I preferred time deltas because they facilitate side-by-side comparison of different participants' output tables.\n",
    "\n",
    "<br>\n",
    "\n",
    "When running the same program on multiple participants, as I did with GP Microbiome, I believe it is best practice to make the same formatting choices for each participant whenever possible. This facilitates comparison, prevents errors, and allows for simpler code. With this in mind, and because the output of this program becomes input for my plotting programs, I include this version of readsample27 to show how I created all the files at once while correcting the discrepancy in the output file samples_151.p (from participant 151). The same function could correct any other file that did not use time deltas. It transforms 151's time points into time deltas before creating the csv versions of the output files so that they match the others' formatting from this point on. At the end of this program, I include a commented-out 'if' statement that could have been inserted into my plotting functions, to illustrate another way I could have handled this discrepancy. However, I prefer to make the formatting consistent in the files themselves rather than insert such a statement into every program that uses the files as input.\n",
    "\n",
    "<br>\n",
    "\n",
    "My function creates all of the files, regardless of formatting, quickly in a loop. Since this is a very specific edit to readsample27 for a single file (or specific alternative time point type), and is meant as a demonstration of how to handle discrepancies, I have kept the comments and explanations to a minimum. The aforementioned program readsample27 contains extensive comments and provides the option to explore the output before running a similar function without the special edit. \n",
    "\n",
    "<br>\n",
    "\n",
    "All of my code is written to be adaptable yet highly resistant to errors. However you choose to define your time points, take similar steps to ensure that, when you run functions like the ones in my plotting programs which generate 20 plots per participant simultaneously, those plots show exactly what they are intended to show. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output files contain different numbers of samples and predicted time points depending on the participant, but they all include noise-free compositions of 245 OTUs (operational taxonomic units) of bacteria. They are named consistently, using the format 'samples_(participant ID).p'. The re-pickled versions use the format 'samples2_(participant ID).p'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "#import libraries\n",
    "import pickle\n",
    "import pandas as pd\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "#create one list for all the participants we wish to create files for, and a second for the ones with different time points\n",
    "#in our case, since we adjusted our formatting to be consistent after the first run, the edit list has only one item\n",
    "IDs=['151','708','759','764','768']\n",
    "edit_list=['151']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "#reformat the files to make it possible to load them with Unix-based operating systems or Python 3\n",
    "#this cell will not run on non-Windows operating systems or Binder, but you don't need to run it yourself\n",
    "#the rest of the program is written to use the reformatted versions of the files as input\n",
    "#that way it can run on those non-Windows operating systems\n",
    "for name in IDs:\n",
    "    #text mode ('r') is intentional here: it is how the original files load on Windows\n",
    "    with open('samples_{}.p'.format(name),'r') as f:\n",
    "        T,T_p,samples = pickle.load(f)\n",
    "    #binary mode ('wb') writes the bytes unchanged, so the new file loads on any operating system\n",
    "    with open('samples2_{}.p'.format(name), 'wb') as f:\n",
    "        pickle.dump((T,T_p,samples), f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "#The function processes one participant's output file, so a loop can process all the files at once\n",
    "def file_create(name):\n",
    "    with open('samples2_{}.p'.format(name),'rb') as f:\n",
    "        T,T_p,samples = pickle.load(f)\n",
    "    #average over axis 0, then transpose so that each column is a time point\n",
    "    theta=samples['Theta_G'].mean(0).T\n",
    "    theta_p=samples['Theta_G_i'].mean(0).T\n",
    "    rows=len(theta)\n",
    "    if name in edit_list:\n",
    "        #find the age at the first time point and subtract\n",
    "        #we don't need to do any other type of transformation because the units are still days\n",
    "        d=T[0]\n",
    "        T=[i-d for i in T]\n",
    "        T_p=[i-d for i in T_p]\n",
    "    else:\n",
    "        T=T.tolist()\n",
    "        T_p=T_p.tolist()\n",
    "    #time points: rows 1 and up hold the compositions, and row 0 holds the time points themselves\n",
    "    df=pd.DataFrame(theta, columns=range(len(T)), index=range(1,rows+1))\n",
    "    df.loc[0] = T\n",
    "    df=df.sort_index()\n",
    "    df.to_csv('{}.csv'.format(name), index=False)\n",
    "    #predicted time points: same layout, with column labels continuing where df's leave off\n",
    "    df2=pd.DataFrame(theta_p, columns=range(len(T),len(T)+len(T_p)), index=range(1,rows+1))\n",
    "    df2.loc[0] = T_p\n",
    "    df2 = df2.sort_index()\n",
    "    #place the two tables side by side and save the combined version\n",
    "    dfboth = pd.concat([df, df2], axis=1, sort=False)\n",
    "    dfboth.to_csv('{}_both.csv'.format(name), index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Example of running the function\n",
    "for i in IDs:\n",
    "    file_create(i)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#the alternative correction method for the discrepancy in formatting 151's time points:\n",
    "#this 'if' statement would have been inserted only into the plotting functions that included predictions\n",
    "#the variable 'r' represents a dataframe which is a version of the 'both' csv, with the time points reordered\n",
    "#the statement would be inserted after the variable is defined, near the beginning of the function\n",
    "#if the first time point is not 0, the time points are ages, so subtract the first one from the whole row\n",
    "#if r.iloc[0,0]!=0:\n",
    "#    r.loc[0]-=r.iloc[0,0]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}