{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This program processes and plots example output from GP Microbiome using functions and loops, then saves the results. Similar to Plotsamples, which plots ouput from GP Microbiome with all time points included, this program is written for GP Microbiome output which results from inputting data without the values from at least one time point, then making a prediction for that time point(s) to test the accuracy of the predictions. Originally I used CF study participant 708's data and withheld the last time point, then predicted the last time point and 4 time points into the future. For this example, I followed a similar pattern with randomly generated data, created from two of my Plotsamples example files. I show how I created these files in Section 4.\n", "
\n", "\n", "For the first file, I once again predicted the last time point, but this time predicted only 3 in the future. For the second, I included predictions between time points and excluded both the second and last time points, then also predicted 3 in the future. See Plotsamples for a complete explanation of those example files, which take the form of GP Microbiome output and observed relative abundance data, and how I have provided them in the repository. \n", "
\n", "\n", "Like those in Plotsamples, the functions in this program produce as many as twenty plots for each participant, and when they are run in a loop they generate plots for all participants at once. The output data is plotted with the observed relative abundance data in a similar way, and the plots use the same colour-coded markers to indicate the participant's clinical condition at each time point. Additionally, yellow markers indicate the prediction(s) for the withheld time point(s), to make it easy to see the level of accuracy. \n", "
\n", "\n", "I have also included comments indicating what the original code looked like, as the only differences are file names and the fact that the original data were not randomly generated.\n", "
\n", "\n", "This code can also easily be adapted to plot output from other algorithms in situations where predictions have been made and validated with a leave-one-out (or leave-k-out, for any value of *k* ) scenario." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#import necessary libraries \n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 1\n", "The first few cells create the OTUkey_named file. If the file has already been created, you can skip down to Section 2." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Read in the original key file. We are going to add a column for only the bacteria's genus name to it.\n", "key = pd.read_excel(\"Data/OTUkey.xlsx\")\n", "#rename first column to avoid Excel mistaking it for a SYLK file due to the \"ID\" in the name\n", "key.rename(columns={'ID_OTU': 'OTU'}, inplace=True)\n", "#view the head, to get an idea of the rest of the format\n", "key.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#extract the genus from the taxonomic information and create the new 'Name' column for it\n", "pat = 'D_5__(?P.*)'\n", "key=key.join(key.Bacteria.str.extract(pat, expand=True))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#replace NaN, which occurs where the genus is \"Other\", with the word \"Other\"\n", "key['Name'].fillna('Other', inplace=True)\n", "#save edited file and review changes\n", "key.to_csv(\"Data/OTUkey_named.csv\", index=False)\n", "key.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 2\n", "Read in the OTUkey_named file below if you have previously created and saved it with Section 1. I have included a copy of it in the 'Extras' folder as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#read in OTUkey_named file, if it has already been created \n", "key=pd.read_csv(\"Data/OTUkey_named.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#make a list of the operational taxonomic unit (OTU) IDs for our bacteria of interest\n", "bacteria=[2,30,58,59,60,63,70,80,94,104,113,167,169,170,206,221,223,227,229,234]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 3\n", "This section creates a second version of the OTUkey_named file for selected bacteria for use in other programs, and can also be skipped once you have the file. We will use a version of the OTUkey_named_selection file, with plot specifcations added, in DTW_All_boxplots, where we create boxplots of the TIME Dynamic Time Warping output." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "#this cell can be skipped if the OTUkey_selection_named file already exists, as it is not used in this particular program.\n", "#creating a second key for selected bacteria\n", "selectkey=key.iloc[[i-1 for i in bacteria],:]\n", "selectkey.head()\n", "#saving the select key for later use in other programs\n", "selectkey.to_csv(\"Data/OTUkey_selection_named.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 4: How I generated the additional example output files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To generate new data, first we read in our example output file for fictional participants 480 and 453. These example files are intended to resemble real output and are not actually the results of running GP Microbiome. Had they been, they would have been generated from the raw output in the program readsample27. I have included in this repository both the example files and a full explanation of how I created them, for those who are interested. The code here to generate new versions of those files is very similar to the code I used when I originally generated them, but I took additional steps in that initial process to make a more realistic approximation of our data than I would have had if I'd used a completely random process. \n", "
\n", "\n", "This means that we can easily create new versions of the filest that are still realistic simply by applying random weights to the values and normalizing them to sum to 1. For the new version of 480's data set, I only predict the last time point, along with 3 future ones. For the new version of 453's data set, I include predictions between time points and withhold both the second and last time point, predicting for them as well. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I demonstrate two different techniques below for generating new data: for 480, I read in the file which does not contain predictions, apply weights and normalize the existing data and generate my predictions (for the withheld final time point and future time points) based on the withheld final time point. Then I combine the tables to create the file which includes 'predictions.' For 453, I reverse the process and read in the file with predictions included, apply weights and normalize, then cut the 'predictions' out to create a new version of 453's prediction-free file. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we generate the file for participant 480:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import normalize" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df=pd.read_csv(\"Data/480.csv\")\n", "#view the file, to get an idea of what it looks like\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#convert to an array, remove the first row and last column, apply random weights and normalize, then preview the new array\n", "np.random.seed(1)\n", "c = normalize(df.iloc[1:,:-1].to_numpy()*np.random.uniform(low=.5, high=1.5, size=(245,len(df.columns)-1)), axis=0, norm='l1')\n", "c" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#convert back to a data frame, adding the first row back in (without the last column)\n", "df2=pd.DataFrame(c)\n", "df3=pd.DataFrame(np.insert(df2.values, 0, values=df.iloc[0,:-1].to_list(), axis=0))\n", "df3.to_csv('Data/480b.csv', index=False)\n", "#view result\n", "df3.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#add predictions - I chose to make 3 in the future this time, so we have a total of 4\n", "np.random.seed(2)\n", "y=df.iloc[1:,10].to_numpy()\n", "x=normalize(y.repeat(4).reshape((245,4))*np.random.uniform(low=.5, high=1.5, size=(245,4)),axis=0, norm='l1')\n", "#alternative code, with same result using np.resize instead of np.reshape\n", "#x=normalize(np.resize(y.T,(4,245)).T*np.random.uniform(low=.5, high=1.5, size=(245,4)),axis=0, norm='l1')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Create second data frame and combine the two. 
Check the head to confirm that it looks right.\n", "#I don't reorder the columns yet.\n", "#This makes it possible to plot or otherwise analyse predicted values on their own without creating a separate file\n", "#Note that for this example I didn't predict between time points, so this will be in order anyway\n", "pred_df=pd.DataFrame(x)\n", "pred_df2=pd.DataFrame(np.insert(pred_df.values, 0, values=[180*i+df.iloc[0,-1] for i in range(4)], axis=0),\n", " columns=[i for i in range(len(df.columns)-1, len(df.columns)+3)])\n", "dfboth = pd.concat([df3, pred_df2], axis=1, sort=False)\n", "dfboth.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfboth.to_csv(\"Data/480b_both.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we generate the file for participant 453, using the second technique:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df=pd.read_csv(\"Data/453_both.csv\")\n", "#view the head of the file, noting the original time points and which column has the last actual time point in it\n", "#the actual time points are on the left and the predictions are on the right, starting after the last actual time point\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#convert to an array, apply the random weights and normalize, then preview the new array\n", "np.random.seed(3)\n", "c = normalize(df.iloc[1:,:].to_numpy()*np.random.uniform(low=.5, high=1.5, size=(245,len(df.columns))),axis=0, norm='l1')\n", "c" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#convert back to a data frame, add back the time points, and save\n", "df2=pd.DataFrame(c)\n", "df3=pd.DataFrame(np.insert(df2.values, 0, values=df.iloc[0].to_list(), axis=0))\n", "df3.to_csv('Data/453b_both.csv', index=False)\n", "#since 453_both.csv (like all similarly named files) has the actual time points on the left and the predictions on the right,\n", "#separating out the predictions to create a new file without predictions in it is easy\n", "#keep the first 7 columns, which drops the prediction columns and the last non-prediction column (a withheld time point)\n", "#then drop the second column (the other withheld time point), rename the columns with their index values, and save\n", "df3.iloc[:,:7].drop(df3.iloc[:,:7].columns[1], axis=1).set_axis([i for i in range(6)], axis=1, inplace=False).to_csv('Data/453b.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 5: Plots\n", "Read the files into dictionaries and plot them with functions. The dictionaries facilitate plotting multiple participants' output at once in loops. If you only have one file, you can run the function on just that participant ID or create a 1-item dictionary. \n", "
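\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For instance, a 1-item dictionary might be set up as in the (commented-out) sketch below, which assumes this example's file-naming convention." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#sketch of a 1-item dictionary for a single participant, using this example's file names\n", "#dfs = {'480': pd.read_csv('Data/480b.csv')}\n", "#the loops below would then simply run over that single entry" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "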
\n", "\n", "See Section 6 for the code to create legends, with options to save the legends as separate files (my preferred method for the CF output data) or to copy and paste into any of the functions in Section 5 at the indicated places." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part A: Read in files" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create a list for the ID numbers of the participants whose data we ran through GP Microbiome with time point(s) excluded\n", "IDs=['480','453']\n", "#Create dictionaries and read in each person's leave-k-out output for noise-free compositions without predictions, \n", "#and with predictions added in. We showed how we generated new example files for 480 and 453 in Section 4. \n", "#recall that my files for leave-k-out data differ in name from the ones for data with all time points only by the letter 'b' \n", "#I kept up this convention of simply adding a 'b' in the output from the program as well.\n", "dfs = {i: pd.read_csv('Data/{}b.csv'.format(i)) for i in IDs}\n", "both_dfs = {i: pd.read_csv('Data/{}b_both.csv'.format(i)) for i in IDs}\n", "#we need the files with all time points included too, to create a dictionary indicating which time point(s) were omitted\n", "observed_dfs = {i: pd.read_csv('Data/{}.csv'.format(i)) for i in IDs}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#if we were running this on the CF Data, we would simply change the IDs list:\n", "#IDs=['708']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#if desired, view the first few entries for one of the files without predictions to get a feel for the data\n", "dfs['480'].head() " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#rename the columns in the files containing both sets of time points based on the first row, which contains the time points\n", "#reorder the columns in the files to make the time points consecutive, and put them in a new dictionary\n", "#for 480 they are already in order because we did not predict between time points, but 453 will need to be reordered\n", "reordered_dfs={}\n", "for i in IDs:\n", " df=both_dfs[i].set_axis(both_dfs[i].loc[0].tolist(), axis=1, inplace=False)\n", " df=df.reindex(columns=sorted(df.columns))\n", " #save file if desired\n", " #df.to_csv('Data/{}b_both_reordered.csv'.format(i), index=False)\n", " #if you did save it, you could edit the first cell in Part A to read it directly into a dictionary \n", " #I opted not to do so because it takes so little time to reorder and I wanted to save space on my computer\n", " #I wanted the unordered version saved so that I could easily examine predicted values on their own \n", " reordered_dfs[i]=df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#view the resulting reordered data frame, confirming that the redordering code ran correctly \n", "reordered_dfs['453'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell reads in files created in the program Create_relative_abundance_files. If you have not run that program yet, edit the cell as directed in the comments to use the copy in the 'Extras' folder." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#read in the files containing the observed relative abundance data for each participant, adding them to a new dictionary\n", "#the columns are the age in days at the time of each sample, and we will use this information as well in the plots\n", "#the files were created and saved in the program Create_relative_abundance_files\n", "#however, they are also in the 'Extras' folder for your convenience\n", "rel_dfs = {i: pd.read_csv(\"Data/{}_Rel.csv\".format(i)) for i in IDs}\n", "#to use the ones in the 'Extras' folder, comment out the previous line and un-comment this one:\n", "#rel_dfs = {i: pd.read_csv(\"Data/Extras/Relative Abundance Files/{}_Rel.csv\".format(i)) for i in IDs}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#if desired, examine the head of one of the files to get a feel for the data\n", "rel_dfs['480'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part B: Markers\n", "In our plots, different coloured markers indicate a participant's clinical condition at each of the time points where samples were taken. In plots without predictions, every time point is marked. In plots with predictions, predicted time points are of course not marked. This means that markers need to be customised, and that having varying numbers of predictions between time points can make the markers especially tricky to define.\n", "
\n", "\n", "I did predict between time points as well as into the future when I ran GP Microbiome with data for all time points included, inserting 1 or 2 at even intervals. For the leave-k-out plots, actual participant 708 and example participant 480 only had predictions for the last time points and future ones but example participant 453 followed the same pattern of having varying numbers of predictions between time points. \n", "
\n", "\n", "Because we can exclude *k* time points for any value of _k_, and we can do so in any positions, the most straightforward way to create markers for these plots is to use two sets of markers. This avoids confusion and makes it easy to put yellow markers on the observed time point(s) which we predicted for. It also allows for more readable code. \n", "
\n", "\n", "There are two main ways of creating dictionaries for the markers: One is to import metadata and process it into dictionaries directly. The other is, after doing the first method once and saving the results to an Excel file (in my case, the same metadata file), to import those results into dictionaries. Since we are using an edited version of the data set, and since one might want to do these kinds of plots multiple times with different time points excluded, it doesn't make sense to save these markers and read them in. It's more efficient to create them directly, adding new dictionaries for the leave-k-out data and the withheld time points based on the metadata. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#read in the metadata file which includes the condition and time delta for each participant's samples \n", "status=pd.read_excel(\"Data/ExampleDeltaKey.xlsx\", sheet_name=\"Metadata and time deltas\")\n", "#for the CF Data, the name of the file is 'MetaDataKey.xlsx'\n", "#otherwise, the code is identical" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create a dictionary which, for each participant, lists the time deltas for samples taken while stable\n", "S_list={}\n", "for i in [int(x) for x in IDs]:\n", " #convert to a list, for each ID, the entries in the Time_Delta column for which the Visit_type was 'Stable'\n", " S_list[i]=list(status.query('Participant == {} and Visit_type == \"Stable\"'.format(i))['Time_Delta'])\n", "#display to confirm\n", "S_list" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create a dictionary which, for each participant, lists the time deltas for samples taken during exacerbations\n", "E_list={}\n", "for i in [int(x) for x in IDs]:\n", " #convert to a list, for each ID, the entries in the Time_Delta column for which the Visit_type was 'Exacerbation'\n", " E_list[i]=list(status.query('Participant == {} and Visit_type == \"Exacerbation\"'.format(i))['Time_Delta'])\n", "#display to confirm\n", "E_list" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create a dictionary of lists of the withheld time points for each participant\n", "out_dict={}\n", "for name in IDs:\n", " out_dict[name]=[i for i in list(observed_dfs[name].iloc[0]) if i not in list(dfs[name].iloc[0])]\n", "out_dict" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#make a dictionary depicting the order of the values for each status, to identify where to place markers\n", "#then make a version of the dictionary for the leave-k-out file\n", "#these are for use with plots without predictions\n", "#these positional values correspond to column names in the 'observed_dfs' dictionary, which are simple index values \n", "#if they weren't numbers, we would replace int(col) with dfs[name].columns.get_loc(col)\n", "#a 'g' or 'r' in the dictionary name indicates green or red, for stable or exacerbated status, respectively\n", "#the 'new' prefix indicates that a dictionary is for leave-k-out data\n", "markers_gdict={}\n", "new_gdict={}\n", "markers_rdict={}\n", "new_rdict={}\n", "for i in IDs:\n", "#make a dictionary depicting the order of the values in the lists contained in S_list, which will be for green markers\n", " markers_gdict[i]=[int(col) for col in observed_dfs[i].columns if observed_dfs[i][col][0] in S_list[int(i)]]\n", "#we also have a 
version of the dictionary for the leave-k-out file\n", " new_gdict[i]=[int(col) for col in dfs[i].columns if dfs[i][col][0] in S_list[int(i)] and dfs[i][col][0]\n", " not in out_dict[i]]\n", "#make a dictionary depicting the order of the values in the lists contained in E_list, which will be for red markers\n", " markers_rdict[i]=[int(col) for col in observed_dfs[i].columns if observed_dfs[i][col][0] in E_list[int(i)]]\n", "#we also have a version of the dictionary for the leave-k-out file\n", " new_rdict[i]=[int(col) for col in dfs[i].columns if dfs[i][col][0] in E_list[int(i)] and dfs[i][col][0]\n", " not in out_dict[i]]\n", "#display to confirm that this is consistent with the values we removed\n", "markers_gdict, markers_rdict, new_gdict, new_rdict" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#make versions of the dictionaries for the plots with predictions for the leave-k-out files\n", "#these correspond to index values of column names in the 'reordered_dfs' dictionary\n", "#which are time deltas for both actual and predicted values\n", "new_g1dict={}\n", "new_r1dict={}\n", "markers_ydict={}\n", "for i in IDs:\n", "#first make a dictionary depicting the order of the values in the lists contained in S_list, which will be for green markers\n", " new_g1dict[i]=[reordered_dfs[i].columns.get_loc(col) for col in reordered_dfs[i].columns \n", " if reordered_dfs[i][col][0] in S_list[int(i)] and reordered_dfs[i][col][0] not in out_dict[i]]\n", "#next make a dictionary depicting the order of the values in the lists contained in E_list, which will be for red markers\n", " new_r1dict[i]=[reordered_dfs[i].columns.get_loc(col) for col in reordered_dfs[i].columns \n", " if reordered_dfs[i][col][0] in E_list[int(i)] and reordered_dfs[i][col][0] not in out_dict[i]]\n", "#finally, make a dictionary to mark the predictions for the withheld points, which will have square yellow markers\n", " markers_ydict[i]=[reordered_dfs[i].columns.get_loc(col) for col in reordered_dfs[i].columns \n", " if reordered_dfs[i][col][0] in out_dict[i]] \n", "#display to confirm that all the markers are in the right places\n", "new_g1dict, new_r1dict, markers_ydict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you were to decide to save your markers to a file, you could insert here the same code used in Plotsamples and edit it easily to do so. I would recommend making a new sheet just for the leave-k-out markers, to avoid confusion." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part C: Creating the plots" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "My plotting functions create as many as 20 plots per participant, and when run in loops they plot all participants' data at the same time. Before running such a function, always make sure that your input data is formatted consistently for each participant, to ensure that the plots show what they are intended to show.\n", "
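\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One lightweight way to check that formatting is a quick sanity check like the sketch below. It is optional, and it assumes every observed sample is labelled either 'Stable' or 'Exacerbation' in the metadata and that the relative abundance files have one column per observed sample, as the example files here do." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#optional sanity check: confirm that each participant's tables and marker lists line up\n", "for i in IDs:\n", " n_observed = len(observed_dfs[i].columns)\n", " #the relative abundance file should have one column per observed sample\n", " assert len(rel_dfs[i].columns) == n_observed, 'column mismatch for {}'.format(i)\n", " #every observed time point should have either a green or a red marker\n", " assert len(markers_gdict[i]) + len(markers_rdict[i]) == n_observed, 'marker mismatch for {}'.format(i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "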
\n", "\n", "All of my plotting functions are written to save the plots to a folder called 'Plots,' which is in this repository as well. Adjust the file path if you want to save them somewhere else, or comment out the line of code which saves them. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#define custom colours for the plots - light and dark red and green, for noise-free and observed values respectively\n", "l_red='#FF5959'\n", "d_red='#A40000'\n", "l_green='#14AE0E'\n", "d_green='#0B5A08'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Given the distribution of our data, it didn't make sense to define the y-axis the same way for all the plots. \n", "#However, if you did wish to do so, you would add the following code to the plot specifications section of the function:\n", "#plt.ylim(min_value, max_value) \n", "#and substitute in the minimum and maximum values you wish to use" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we have a function for use in a loop with the dictionaries, plotting without predictions. I included the observed values for the omitted time points in these plots (on the solid line) for reference. Although you may not feel the need to save these, you may want to see them to note how they differ from the corresponding plots using all time points. Because of how GP Microbiome works, output from a smaller input data set will look similar but not identical at each common time point, and seeing how things change provides insight about the withheld data.\n", "
\n", "\n", "Since our leave-k-out example output data were randomly generated, these plots may not look as similar to the plots of all of the original example output data as they would have if they were actually the results of running GP Microbiome.\n", "
\n", "\n", "These plots only differ from the first parts of the plots with predictions if you predicted between time points. Plots with predictions follow." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#function for use in a loop with the dictionaries, plotting noise-free compositions without predictions\n", "def plot_loop(name):\n", " #divide the list of bacteria of interest into groups of 4 to facilitate plotting\n", " rows=[[2,30,58,59],[60,63,70,80],[94,104,113,167],[169,170,206,221],[223,227,229,234]]\n", " s=dfs[name]\n", " rel=rel_dfs[name]\n", " days=[int(x) for x in rel.columns]\n", " out_list=out_dict[name]\n", " markers_r = markers_rdict[name]\n", " markers_g = markers_gdict[name]\n", " new_r=new_rdict[name]\n", " new_g=new_gdict[name]\n", " ID=int(name) \n", " #run a loop to plot each group of 4 in a 2 by 2 format with our custom markers, then save the file\n", " for j in range(5):\n", " fig = plt.figure(figsize=(18,14))\n", " for i in range(4):\n", " ax = fig.add_subplot(2,2,i+1)\n", " #because I made my markers slightly transparent, I need separate plots for lines and red markers\n", " #this avoids having the line become transparent\n", " #slightly transparent markers make it easier to see subtle differences between the lines \n", " #if you opt to set alpha at the default of 1 (non-transparent), you can combine the first 2 red plots this way:\n", " #ax.plot([age for age in days if age-days[0] not in out_list], s.iloc[rows[j][i]],'-gD', \n", " #markevery=new_r, markerfacecolor=l_red, markersize=8, \n", " #linewidth=2,dashes=[2, 2,5,2], c='black')\n", " #there's no built-in way to customise marker colours by variables, so green markers need a dummy line \n", " #we need to exclude the withheld time points in the plot for noise-free compositions of bacteria\n", " ax.plot([age for age in days if age-days[0] not in out_list], s.iloc[rows[j][i]],'-gD', \n", " markevery=new_r, markerfacecolor='none',markersize=8, linewidth=2,dashes=[2, 2,5,2], c='black')\n", " ax.plot([age for age in days if age-days[0] not in out_list], s.iloc[rows[j][i]],'-gD', \n", " markevery=new_r, markerfacecolor=l_red, alpha=0.75, markersize=8,c='none')\n", " ax.plot([age for age in days if age-days[0] not in out_list], s.iloc[rows[j][i]],'-gD', \n", " markevery=new_g, markerfacecolor=l_green,alpha=0.75, markersize=8, c='none')\n", " #for the observed values, we leave the withheld time points in for reference\n", " #again, if you prefer alpha=1 you can combine the two lines for red markers:\n", " #ax.plot(days,rel.iloc[rows[j][i]-1],'-gD', markevery=markers_r,markerfacecolor=d_red, markersize=8, \n", " #linewidth=2, c='black')\n", " ax.plot(days,rel.iloc[rows[j][i]-1],'-gD', markevery=markers_r,markerfacecolor='none',markersize=8, \n", " linewidth=2, c='black')\n", " ax.plot(days,rel.iloc[rows[j][i]-1],'-gD', markevery=markers_r,markerfacecolor=d_red,alpha=0.75, markersize=8, \n", " c='none')\n", " ax.plot(days,rel.iloc[rows[j][i]-1],'-gD', markevery=markers_g,markerfacecolor=d_green,alpha=0.75, markersize=8, \n", " c='none')\n", " #optional: insert code from Section 6 to add a legend for each plot - remember to make size/fit adjustments\n", " plt.title('{} Composition'.format(key['Name'][rows[j][i]-1]), size=15)\n", " plt.xlabel(\"Age (Days) of Participant {}\".format(ID), size=13)\n", " plt.ylabel(\"Relative Abundance\", size=13)\n", " plt.savefig(\"Plots/{}b_{}.png\".format(ID,j), format='png')\n", " plt.show()" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "#run the first function in a loop\n", "for name in IDs:\n", " plot_loop(name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All subsequent plots in this section contain predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#function to plot with predictions, for use in a loop with the dictionaries\n", "def plot_pred_loop(name):\n", " #divide the list of bacteria of interest into groups of 4 to facilitate plotting\n", " rows=[[2,30,58,59],[60,63,70,80],[94,104,113,167],[169,170,206,221],[223,227,229,234]]\n", " r=reordered_dfs[name]\n", " rel=rel_dfs[name]\n", " days=[int(x) for x in rel.columns]\n", " out_list=out_dict[name]\n", " markers_r = markers_rdict[name]\n", " markers_g = markers_gdict[name]\n", " new_r1 = new_r1dict[name]\n", " new_g1 = new_g1dict[name]\n", " markers_y=markers_ydict[name]\n", " ID=int(name) \n", " #run a loop to plot each group of 4 in a 2 by 2 format with our custom markers, then save the file\n", " for j in range(5):\n", " fig = plt.figure(figsize=(18,14))\n", " for i in range(4):\n", " ax = fig.add_subplot(2,2,i+1)\n", " #because I made my markers slightly transparent, I need separate plots for lines and red markers\n", " #this avoids having the line become transparent\n", " #slightly transparent markers make it easier to see subtle differences between the lines\n", " #if you opt to set alpha at the default of 1 (not transparent), you can combine the first two red plots:\n", " #ax.plot(r.loc[0]+days[0], r.iloc[rows[j][i]],'-gD', markevery=new_r1, \n", " #markerfacecolor=l_red,markersize=8, linewidth=2,dashes=[2, 2,5,2], c='black') \n", " #there's no built-in way to customise marker colours by variables, so green and yellow markers always need a dummy line\n", " ax.plot(r.loc[0]+days[0], r.iloc[rows[j][i]],'-gD', markevery=new_r1, \n", " markerfacecolor='none', markersize=8, linewidth=2,dashes=[2, 2,5,2], c='black')\n", " ax.plot(r.loc[0]+days[0], r.iloc[rows[j][i]],'-gD', markevery=new_r1, \n", " markerfacecolor=l_red, alpha=0.75, markersize=8, c='none')\n", " ax.plot(r.loc[0]+days[0], r.iloc[rows[j][i]],'-gD', markevery=new_g1, \n", " markerfacecolor=l_green,alpha=0.75, markersize=8, c='none')\n", " #square yellow markers show the predictions for time points which were withheld\n", " ax.plot(r.loc[0]+days[0], r.iloc[rows[j][i]],'-gs', markevery=markers_y, \n", " markerfacecolor='yellow', alpha=0.9, markersize=11, c='none')\n", " #again, if you prefer alpha=1 you can combine the two lines for red markers:\n", " #ax.plot(days,rel.iloc[rows[j][i]-1],'-gD', markevery=markers_r,markerfacecolor=d_red, markersize=8, \n", " #linewidth=2, c='black')\n", " ax.plot(days, rel.iloc[rows[j][i]-1],'-gD', markevery=markers_r,markerfacecolor='none',markersize=8, \n", " linewidth=2, c='black')\n", " ax.plot(days, rel.iloc[rows[j][i]-1],'-gD', markevery=markers_r,markerfacecolor=d_red, alpha=0.75, markersize=8, \n", " c='none')\n", " ax.plot(days,rel.iloc[rows[j][i]-1],'-gD', markevery=markers_g,markerfacecolor=d_green,alpha=0.75, markersize=8, \n", " c='none')\n", " #optional: insert code from Section 6 to add a legend for each plot - remember to make size/fit adjustments\n", " plt.title('{} Composition with Predictions'.format(key['Name'][rows[j][i]-1]), size=15)\n", " plt.xlabel(\"Age (Days) of Participant {}\".format(ID), size=13)\n", " plt.ylabel(\"Relative Abundance\", size=13)\n", " 
plt.savefig(\"Plots/{}b_pred_{}.png\".format(ID,j), format='png')\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#run the function with predictions in a loop\n", "for name in IDs:\n", " plot_pred_loop(name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#code to plot with predictions for the just 3 most important bacteria in a row\n", "def plot_pred_rows(name):\n", " rows=[94,113,229]\n", " r=reordered_dfs[name]\n", " rel=rel_dfs[name]\n", " days=[int(x) for x in rel.columns]\n", " out_list=out_dict[name]\n", " markers_r = markers_rdict[name]\n", " markers_g = markers_gdict[name]\n", " new_r1 = new_r1dict[name]\n", " new_g1 = new_g1dict[name]\n", " markers_y=markers_ydict[name]\n", " ID=int(name) \n", " fig=plt.figure(figsize=(26,7))\n", " for i in range(3):\n", " ax = fig.add_subplot(1,3,i+1)\n", " #because I made my markers slightly transparent, I need separate plots for lines and red markers\n", " #this avoids having the line become transparent\n", " #slightly transparent markers make it easier to see subtle differences between the lines\n", " #if you opt to set alpha at the default of 1 (not transparent), you can combine the first two red plots:\n", " #ax.plot(r.loc[0]+days[0], r.iloc[rows[i]],'-gD', markevery=new_r1, markerfacecolor=l_red,markersize=8, \n", " #linewidth=2,dashes=[2, 2,5,2], c='black') \n", " #there's no built-in way to customise marker colours by variables, so the green markers always need a dummy line \n", " ax.plot(r.loc[0]+days[0], r.iloc[rows[i]],'-gD', markevery=new_r1, markerfacecolor='none',markersize=8, \n", " linewidth=2,dashes=[2, 2,5,2], c='black')\n", " ax.plot(r.loc[0]+days[0], r.iloc[rows[i]],'-gD', markevery=new_r1, markerfacecolor=l_red, alpha=0.75, \n", " markersize=8, c='none')\n", " ax.plot(r.loc[0]+days[0], r.iloc[rows[i]],'-gD', markevery=new_g1, markerfacecolor=l_green, alpha=0.75,\n", " markersize=8, c='none') \n", " #square yellow markers show the predictions for time points which were withheld\n", " ax.plot(r.loc[0]+days[0], r.iloc[rows[i]],'-gs', markevery=markers_y, \n", " markerfacecolor='yellow', alpha=0.9, markersize=11, c='none')\n", " #again, if you prefer alpha=1 you can combine the two lines for red markers:\n", " #ax.plot(days,rel.iloc[rows[i]-1],'-gD', markevery=markers_r,markerfacecolor=d_red, markersize=8, \n", " #linewidth=2, c='black')\n", " ax.plot(days, rel.iloc[rows[i]-1],'-gD', markevery=markers_r,markerfacecolor='none',markersize=8, \n", " linewidth=2, c='black')\n", " ax.plot(days, rel.iloc[rows[i]-1],'-gD', markevery=markers_r,markerfacecolor=d_red, alpha=0.75, markersize=8, \n", " c='none')\n", " ax.plot(days,rel.iloc[rows[i]-1],'-gD', markevery=markers_g,markerfacecolor=d_green,alpha=0.75, markersize=8, \n", " c='none')\n", " #optional: insert code from Section 6 to add a legend for each plot - remember to make size/fit adjustments\n", " plt.title('{} Composition with Predictions'.format(key['Name'][rows[i]-1]), size=24)\n", " plt.xlabel(\"Age (Days) of Participant {}\".format(ID), size=18)\n", " plt.ylabel(\"Relative Abundance\", size=18)\n", " plt.setp(ax.get_xticklabels(), size=14)\n", " plt.setp(ax.get_yticklabels(), size=14)\n", " #the tight_layout function reduces white space in the image. \n", " #If you turn off tight_layout you may need to adjust your text size etc. 
\n", " plt.tight_layout() \n", " plt.savefig(\"Plots/{}b_pred_rows.png\".format(ID), format='png')\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#test-run the function on one participant\n", "plot_pred_rows('453')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#plotting with predictions for 2 in a row - simple edit to plot_pred_rows\n", "def plot_pred_two(name):\n", " rows=[94,229]\n", " r=reordered_dfs[name]\n", " rel=rel_dfs[name]\n", " days=[int(x) for x in rel.columns]\n", " out_list=out_dict[name]\n", " markers_r = markers_rdict[name]\n", " markers_g = markers_gdict[name]\n", " new_r1 = new_r1dict[name]\n", " new_g1 = new_g1dict[name]\n", " markers_y=markers_ydict[name]\n", " ID=int(name) \n", " fig=plt.figure(figsize=(15,6))\n", " for i in range(2):\n", " ax = fig.add_subplot(1,2,i+1)\n", " #because I made my markers slightly transparent, I need separate plots for lines and red markers\n", " #this avoids having the line become transparent\n", " #slightly transparent markers make it easier to see subtle differences between the lines\n", " #if you opt to set alpha at the default of 1 (not transparent), you can combine the first two red plots:\n", " #ax.plot(r.loc[0]+days[0], r.iloc[rows[i]],'-gD', markevery=new_r1, markerfacecolor=l_red,markersize=8, \n", " #linewidth=2,dashes=[2, 2,5,2], c='black') \n", " #there's no built-in way to customise marker colours by variables, so the green markers always need a dummy line \n", " ax.plot(r.loc[0]+days[0], r.iloc[rows[i]],'-gD', markevery=new_r1, markerfacecolor='none',markersize=8, \n", " linewidth=2,dashes=[2, 2,5,2], c='black')\n", " ax.plot(r.loc[0]+days[0], r.iloc[rows[i]],'-gD', markevery=new_r1, markerfacecolor=l_red, alpha=0.75, \n", " markersize=8, c='none')\n", " ax.plot(r.loc[0]+days[0], r.iloc[rows[i]],'-gD', markevery=new_g1, markerfacecolor=l_green, alpha=0.75,\n", " markersize=8, c='none') \n", " #square yellow markers show the predictions for time points which were withheld\n", " ax.plot(r.loc[0]+days[0], r.iloc[rows[i]],'-gs', markevery=markers_y, \n", " markerfacecolor='yellow', alpha=0.9, markersize=11, c='none')\n", " #again, if you prefer alpha=1 you can combine the two lines for red markers:\n", " #ax.plot(days,rel.iloc[rows[i]-1],'-gD', markevery=markers_r,markerfacecolor=d_red, markersize=8, \n", " #linewidth=2, c='black')\n", " ax.plot(days, rel.iloc[rows[i]-1],'-gD', markevery=markers_r,markerfacecolor='none',markersize=8, \n", " linewidth=2, c='black')\n", " ax.plot(days, rel.iloc[rows[i]-1],'-gD', markevery=markers_r,markerfacecolor=d_red, alpha=0.75, markersize=8, \n", " c='none')\n", " ax.plot(days,rel.iloc[rows[i]-1],'-gD', markevery=markers_g,markerfacecolor=d_green,alpha=0.75, markersize=8, \n", " c='none')\n", " #optional: insert code from Section 6 to add a legend for each plot - remember to make size/fit adjustments\n", " plt.title('{} Composition with Predictions'.format(key['Name'][rows[i]-1]), size=20)\n", " plt.xlabel(\"Age (Days) of Participant {}\".format(ID), size=16)\n", " #alternative x axis label, if the participant's ID is in the title\n", " #plt.xlabel(\"Age(Days)\", size=16)\n", " plt.ylabel(\"Relative Abundance\", size=16)\n", " plt.setp(ax.get_xticklabels(), size=12)\n", " plt.setp(ax.get_yticklabels(), size=12)\n", " #the tight_layout function reduces white space in the image. \n", " #If you turn off tight_layout you may need to adjust your text size etc. 
\n", " plt.tight_layout()\n", " plt.savefig(\"Plots/{}b_pred_two.png\".format(ID), format='png')\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#test-run the function in a loop this time \n", "for name in IDs:\n", " plot_pred_two(name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 6: Legends\n", "Here we have code for creating legends for the plots in this program to be saved as separate files. Then we provide a template code which can be copied and pasted into the functions, then adjusted accordingly to give every plot its own legend. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create a legend for leave-k-out plots with predictions and save to a separate file using dummy plots\n", "fig = plt.figure()\n", "fig.patch.set_alpha(0.0)\n", "ax = fig.add_subplot()\n", "ax.plot([], [], linewidth=2, c='black',dashes=[2, 2,5,2], label=\"Noise-Free with Predictions\")\n", "ax.plot([], [], 'gD', color=l_red,alpha=0.75,label=\"Noise-Free Exacerbated\")\n", "ax.plot([], [], 'gD', color=l_green,alpha=0.75,label=\"Noise-Free Stable\")\n", "ax.plot([], [], 'gs', color='yellow',alpha=0.9, label=\"Prediction for Withheld Sample\")\n", "ax.plot([], [], linewidth=2, c='black', label=\"Observed\")\n", "ax.plot([], [], 'gD', color=d_red,alpha=0.75,label=\"Observed Exacerbated\")\n", "ax.plot([], [], 'gD', color=d_green,alpha=0.75,label=\"Observed Stable\")\n", "ax.legend(loc='center', shadow=True, ncol=2)\n", "plt.gca().set_axis_off()\n", "plt.savefig(\"Plots/Legend_with_Pred_b.png\", format='png')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create a legend for plots without predictions and save to a separate file using dummy plots\n", "#you may already have this legend saved from the program Plotsamples, in which case you can skip this\n", "fig = plt.figure()\n", "fig.patch.set_alpha(0.0)\n", "ax = fig.add_subplot()\n", "ax.plot([], [], linewidth=2, c='black',dashes=[2, 2,5,2], label=\"Noise-Free\")\n", "ax.plot([], [], 'gD', color=l_red,alpha=0.75,label=\"Noise-Free Exacerbated\")\n", "ax.plot([], [], 'gD', color=l_green,alpha=0.75,label=\"Noise-Free Stable\")\n", "ax.plot([], [], linewidth=2, c='black', label=\"Observed\")\n", "ax.plot([], [], 'gD', color=d_red,alpha=0.75,label=\"Observed Exacerbated\")\n", "ax.plot([], [], 'gD', color=d_green,alpha=0.75,label=\"Observed Stable\")\n", "ax.legend(loc='center', shadow=True, ncol=2)\n", "plt.gca().set_axis_off()\n", "plt.savefig(\"Plots/Basic_Legend.png\", format='png')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#code to paste into the functions at the indicated places - legends for leave-k-out plots with predictions\n", "#it is written to place the legend outside the plot, where it won't interfere\n", "#you may wish to change the position of the legend box or make other adjustments to the figsize, or make other edits\n", "#it is created using dummy plots with the same features as our actual plots\n", "ax.plot([], [], linewidth=2, c='black',dashes=[2, 2,5,2], label=\"Noise-Free with Predictions\")\n", "ax.plot([], [], 'gD', color=l_red,alpha=0.75,label=\"Noise-Free Exacerbated\")\n", "ax.plot([], [], 'gD', color=l_green,alpha=0.75,label=\"Noise-Free Stable\")\n", "ax.plot([], [], 'gs', color='yellow',alpha=0.9, label=\"Prediction for Withheld Sample\")\n", "ax.plot([], [], 
linewidth=2, c='black', label=\"Observed\")\n", "ax.plot([], [], 'gD', color=d_red,alpha=0.75,label=\"Observed Exacerbated\")\n", "ax.plot([], [], 'gD', color=d_green,alpha=0.75,label=\"Observed Stable\")\n", "chartBox = ax.get_position()\n", "ax.set_position([chartBox.x0, chartBox.y0, chartBox.width*0.6, chartBox.height])\n", "#the tuple (1.25, 0.8) refers to the position relative to the width and height of the plot\n", "ax.legend(loc='upper center', bbox_to_anchor=(1.25, 0.8), shadow=True, ncol=2)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#code to paste into the functions at the indicated places - legends for plots without predictions\n", "#it is written to place the legend outside the plot, where it won't interfere\n", "#you may wish to change the position of the legend box or make other adjustments to the figsize, or make other edits\n", "#it is created using dummy plots with the same features as our actual plots\n", "ax.plot([], [], linewidth=2, c='black',dashes=[2, 2,5,2], label=\"Noise-Free\")\n", "ax.plot([], [], 'gD', color=l_red,alpha=0.75,label=\"Noise-Free Exacerbated\")\n", "ax.plot([], [], 'gD', color=l_green,alpha=0.75,label=\"Noise-Free Stable\")\n", "ax.plot([], [], linewidth=2, c='black', label=\"Observed\")\n", "ax.plot([], [], 'gD', color=d_red,alpha=0.75,label=\"Observed Exacerbated\")\n", "ax.plot([], [], 'gD', color=d_green,alpha=0.75,label=\"Observed Stable\")\n", "chartBox = ax.get_position()\n", "ax.set_position([chartBox.x0, chartBox.y0, chartBox.width*0.6, chartBox.height])\n", "#the tuple (1.25, 0.8) refers to the position relative to the width and height of the plot\n", "ax.legend(loc='upper center', bbox_to_anchor=(1.25, 0.8), shadow=True, ncol=2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }