{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This program is written to take a file with observed relative abundances of bacteria for several of our CF study participants and split it into separate files, one for each participant. These files will be used later in my plotting programs. The code can easily be modified to split up any file with column names which contain variables such as ID numbers. \n",
    "<br>\n",
    "\n",
    "The program demonstrates the process of splitting up the file with example data. The example data are designed to resemble our data in format and range, but are not real relative abundances. I have included both the example relative abundance file and a full explanation of how I created it in this repository, so that you can run it yourself. Although I encourage users to run this program and create the files themselves, I have also included a folder containing the output files in the 'Data' folder under 'Extras,' so that users can run the plotting programs without running this one first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Import necessary libraries\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Read in the relative abundance file, which contains relative abundances at all time points for all selected participants. \n",
    "#Columns take the form '_(subject id)_(time point of sample).' Example: _500_313\n",
    "rel = pd.read_csv(\"Data/Example_rel.csv\")\n",
    "#look at the head of the file to get a better idea of the format - note that time points are not necessarily in order\n",
    "rel.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Create a list of the unique prefixes in the column names by splitting them on the underscores and finding unique values.\n",
    "#In this data set, those prefixes are the participants' ID numbers\n",
    "#For differently formatted data, you might need to modify this code to split on a different string or change the index.\n",
    "#Exclude the '_NAME_' column, which is the non-essential index variable. It numbers the rows 1-245 as OTU_1 to OTU_245.\n",
    "IDs=[col.split('_')[1] for col in rel.columns if col!='_NAME_']\n",
    "IDs=list((set(IDs)))\n",
    "#View the list of unique prefixes\n",
    "IDs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Define a function to create relative abundance files, subsetting the data based on a string repeated in multiple columns\n",
    "def rel_create(st):\n",
    "    rel = pd.read_csv(\"Data/Example_rel.csv\")\n",
    "    #obtain a list of only the columns starting with the input variable 'st' - in the CF data, this is the participant ID\n",
    "    cols = [col for col in rel.columns if col.split('_')[1]==st]\n",
    "    #subset the data frame based on the input value\n",
    "    df=rel[cols]\n",
    "    #prepare to reorder the columns, which are numerical and represent time points, by redefining them as integers \n",
    "    #if they were pure strings, we could skip this step and sort by some other means (e.g. alphabetically)\n",
    "    num=[int(col.split('_{}_'.format(st))[1]) for col in cols]\n",
    "    df.columns=num\n",
    "    #sort the data frame so that the columns are in numerical (and therefore chronological) order\n",
    "    df = df.reindex(sorted(df.columns), axis=1)\n",
    "    #we don't want the index column, but if we needed the files to have that or any other common column we could add it here\n",
    "    #df['_NAME_']=rel['_NAME_']\n",
    "    #save the file to the desired folder - remember to add a file path if necessary\n",
    "    df.to_csv(\"Data/{}_Rel.csv\".format(st), index=False)\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Use a loop to create a file for each element in the ID's list (in our case, each participant ID)\n",
    "for i in IDs:\n",
    "    rel_create(i)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}