{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "source": [ "Lab 3: Exploratory Data Analysis for Classification using Pandas and Matplotlib" ], "cell_type": "heading", "metadata": {}, "level": 1 }, { "source": [ "### Preliminary plotting stuff to get things going" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 1, "input": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 2, "input": [ "!~/anaconda/bin/pip install brewer2mpl" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 3, "input": [ "import brewer2mpl\n", "from matplotlib import rcParams\n", "\n", "#colorbrewer2 Dark2 qualitative color table\n", "dark2_cmap = brewer2mpl.get_map('Dark2', 'Qualitative', 7)\n", "dark2_colors = dark2_cmap.mpl_colors\n", "\n", "rcParams['figure.figsize'] = (10, 6)\n", "rcParams['figure.dpi'] = 150\n", "rcParams['axes.color_cycle'] = dark2_colors\n", "rcParams['lines.linewidth'] = 2\n", "rcParams['axes.facecolor'] = 'white'\n", "rcParams['font.size'] = 14\n", "rcParams['patch.edgecolor'] = 'white'\n", "rcParams['patch.facecolor'] = dark2_colors[0]\n", "rcParams['font.family'] = 'StixGeneral'\n", "\n", "\n", "def remove_border(axes=None, top=False, right=False, left=True, bottom=True):\n", " \"\"\"\n", " Minimize chartjunk by stripping out unnecesasry plot borders and axis ticks\n", " \n", " The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn\n", " \"\"\"\n", " ax = axes or plt.gca()\n", " ax.spines['top'].set_visible(top)\n", " ax.spines['right'].set_visible(right)\n", " ax.spines['left'].set_visible(left)\n", " ax.spines['bottom'].set_visible(bottom)\n", " \n", " #turn off all ticks\n", " ax.yaxis.set_ticks_position('none')\n", " ax.xaxis.set_ticks_position('none')\n", " \n", " #now re-enable visibles\n", " if top:\n", " ax.xaxis.tick_top()\n", " if bottom:\n", " ax.xaxis.tick_bottom()\n", " if left:\n", " ax.yaxis.tick_left()\n", " if right:\n", " ax.yaxis.tick_right()" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 4, "input": [ "pd.set_option('display.width', 500)\n", "pd.set_option('display.max_columns', 100)" ], "metadata": {} }, { "source": [ "##1. The Olive Oils dataset" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "Some of the following text is taken from the rggobi book (http://www.ggobi.org/book/). It is an excellent book on visualization and EDA for classification, and is available freely as a pdf from Hollis for those with a Harvard Id. Even though the book uses ggobi, a lot of the same analysis can be done in Mondrian or directly in Matplotlib/Pandas (albeit not interactively)." ], "cell_type": "markdown", "metadata": {} }, { "source": [ "
\n", "\n", "\"The Olive Oils data has eight explanatory variables (levels of fatty acids in the oils) and nine classes (areas of Italy). The goal of the analysis is to develop rules that reliably distinguish oils from the nine different areas. It is a problem of practical interest, because oil from some areas is more highly valued and unscrupulous suppliers sometimes make false claims about the origin of their oil. The content of the oils is a subject of study in its own right: Olive oil has high nutritional value, and some of its constituent fatty acids are considered to be more beneficial than others.\"\n", "\n", "In addition, fatty acid contents vary with climate: this information is important in deciding which varieties to grow where.\n", "\n" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "\"Source: Forina, M., Armanino, C., Lanteri, S. & Tiscornia, E. (1983), Classification of Olive Oils from their Fatty Acid Composition, in Martens, H. and\n", "Russwurm Jr., H., eds, Food Research and Data Analysis, Applied Science\n", "Publishers, London, pp. 189\u2013214. It was brought to our attention by Glover\n", "& Hopke (1992).\n", "\n", "Number of rows: 572\n", "\n", "Number of variables: 10\n", "\n", "Description: This data consists of the percentage composition of fatty acids\n", "found in the lipid fraction of Italian olive oils. The data arises from a study\n", "to determine the authenticity of an olive oil.\"\n", "
" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 5, "input": [ "from IPython.display import Image\n", "Image(filename='Italy.png')" ], "metadata": {} }, { "source": [ "In working with python I always remember: a python is a duck.\n", "\n", "In working with pandas and matplotlib I dont always remember the syntax. A programmer is a good tool for converting Stack Overflow snippets into code. I almost always put what I am trying to do into google and go from there. \n", "\n", "That said, I found the following links very useful in understanding the Pandas mode, how things work.\n", "\n", "* http://blog.yhathq.com/posts/R-and-pandas-and-what-ive-learned-about-each.html\n", "* http://www.bearrelroll.com/2013/05/python-pandas-tutorial/\n", "* http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "##2. Loading and Cleaning" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "Let's load the olive oil dataset into a pandas dataframe and have a look at the first 5 rows." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 6, "input": [ "df=pd.read_csv(\"data/olive.csv\")\n", "df.head(5)" ], "metadata": {} }, { "source": [ "Let's rename that ugly first column. \n", "\n", "*Hint*: A Google search for 'python pandas dataframe rename' points you at this documentation." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 7, "input": [ "print df.columns\n", "df.rename(columns={df.columns[0]:'areastring'}, inplace=True)\n", "df.columns" ], "metadata": {} }, { "source": [ "Let's explore. Which unique regions and areas are contained in the dataset?" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 8, "input": [ "print 'regions\\t', df.region.unique()\n", "print 'areas\\t', df.area.unique()" ], "metadata": {} }, { "source": [ "Let's create a *crosstab*ulation or contingency table of the factors.\n", "\n", "*Hint*: A Google search for 'python pandas cross tabulation' points you at this documentation.\n" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 9, "input": [ "pd.crosstab(df.area, df.region)" ], "metadata": {} }, { "source": [ "Do we need to clean the dataset before we can explore it a little more? Let's have a look." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 10, "input": [ "df.head()" ], "metadata": {} }, { "source": [ "Let's get rid of the junk numbering in `df.areastring`. For single column Pandas Series we use `map`. Here's the documentation." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 11, "input": [ "df.areastring=df.areastring.map(lambda x: x.split('.')[-1])\n", "df.head()" ], "metadata": {} }, { "source": [ "To access a specific subset of columns of a dataframe, we can use list indexing." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 12, "input": [ "df[['palmitic','oleic']].head()" ], "metadata": {} }, { "source": [ "Notice that this returns a new object of type DataFrame." ], "cell_type": "markdown", "metadata": {} }, { "source": [ "To access the series of entries of a single column, we could do the following." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 13, "input": [ "print df['palmitic']" ], "metadata": {} }, { "source": [ "Notice the difference in the syntax. In the first example where we used list indexing we got a new DataFrame. In the second example we got a series corresponding to the column. " ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 14, "input": [ "print \"type of df[['palmitic']]:\\t\", type(df[['palmitic']]) \n", "print \"type of df['palmitic']:\\t\\t\", type(df['palmitic'])" ], "metadata": {} }, { "source": [ "To access the series of values of a single column more conveniently, we can do this:" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 15, "input": [ "df.palmitic" ], "metadata": {} }, { "source": [ "### YOUR TURN NOW (10 minutes)" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "Get the unique areastrings of the dataframe `df`." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 16, "input": [ "#your code here\n" ], "metadata": {} }, { "source": [ "Create a new dataframe `dfsub` by taking the list of acids and using pandas' `apply` function to divide all values by 100.\n", "\n", "If you're not familiar with pandas' `apply` function, a Google search for 'python pandas dataframe apply' points you to the documentation" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 17, "input": [ "acidlist=['palmitic', 'palmitoleic', 'stearic', 'oleic', 'linoleic', 'linolenic', 'arachidic', 'eicosenoic']\n", "\n", "#your code here\n" ], "metadata": {} }, { "source": [ "Notice that we can replace part of the dataframe by this new frame. Since we need the percentages, let's do this. The `Oleic` percentages should be in the 70s and 80s if you did this right." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 18, "input": [ "df[acidlist]=dfsub\n", "df.head()" ], "metadata": {} }, { "source": [ "##2. Quick Intro to Matplotlib" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "This is just a quick and dirty intro. Please read the excellent tutorial here." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 19, "input": [ "fig=plt.figure()\n", "plt.scatter(df.palmitic, df.linolenic)\n", "axis = fig.gca() #get current axis\n", "axis.set_title('linolenic vs palmitic')\n", "axis.set_xlabel('palmitic')\n", "axis.set_ylabel('linolenic')\n", "#fig can be got with fig.gcf()" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 20, "input": [ "plt.hist(df.palmitic)" ], "metadata": {} }, { "source": [ "There are many many more kinds of plots." ], "cell_type": "markdown", "metadata": {} }, { "source": [ "A more object oriented approach sees us using the `subplots` function to set both figure and axis." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 21, "input": [ "fig, axes=plt.subplots(figsize=(10,10), nrows=2, ncols=2)\n", "axes[0][0].plot(df.palmitic, df.linolenic)\n", "axes[0][1].plot(df.palmitic, df.linolenic, '.')\n", "axes[1][0].scatter(df.palmitic, df.linolenic)\n", "axes[1][1].hist(df.palmitic)\n", "fig.tight_layout()" ], "metadata": {} }, { "source": [ "###YOUR TURN NOW (10 minutes)\n", "\n", "Make scatterplots of the acids in the list `yacids` against the acids in the list `xacids`. As the names suggest, plot the acids in `yacids` along the y axis and the acids in `xacids` along the x axis. Label the axes with the respective acid name. Set it up as a grid with 3 rows and 2 columns." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 22, "input": [ "xacids=['oleic','linolenic','eicosenoic']\n", "yacids=['stearic','arachidic']\n", "\n", "#your code here\n" ], "metadata": {} }, { "source": [ "##3. Pandas Data Munging" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "The first concept we deal with here is pandas `groupby`. The idea is to group a dataframe by the values of a particular factor variable. The documentation can be found here." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 23, "input": [ "region_groupby = df.groupby('region')\n", "print type(region_groupby)\n", "region_groupby.head()" ], "metadata": {} }, { "source": [ "The function `groupby` gives you a dictionary-like object, with the keys being the values of the factor, and the values being the corresponding subsets of the dataframe." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 24, "input": [ "for key, value in region_groupby:\n", " print \"( key, type(value) ) = (\", key, \",\", type(value), \")\"\n", " v=value\n", "\n", "v.head()" ], "metadata": {} }, { "source": [ "The `groupby` function also acts like an object that can be **mapped**. After the mapping is complete, the rows are put together (**reduced**) into a larger dataframe. For example, using the `describe` function. The documentation of the `describe` function can be found here." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 25, "input": [ "dfrd=region_groupby.describe()\n", "print type(dfrd)\n", "dfrd.head(20)" ], "metadata": {} }, { "source": [ "So, one may iterate through the groupby 'dictionary', get the pandas series from each sub-dataframe, and compute the standard deviation using the `std` function (find documentation of `std` here):" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 26, "input": [ "vecs=[]\n", "keys=[]\n", "for key, value in region_groupby:\n", " k=key\n", " v=value.std()\n", "print k, type(v), v" ], "metadata": {} }, { "source": [ "Or one might let pandas take care of concatenating the series obtained by running `std` on each dataframe back into a dataframe for us. Notice that the output dataframe is automatically indexed by region for us!" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 27, "input": [ "dfbystd=df.groupby('region').std()\n", "dfbystd.head()" ], "metadata": {} }, { "source": [ "Or one can use `aggregate` to pass an arbitrary function of to the sub-dataframe. The function is applied columnwise." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 28, "input": [ "dfbymean=region_groupby.aggregate(np.mean)\n", "dfbymean.head()" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 29, "input": [ "region_groupby.aggregate(lambda x: x.palmitic.sum()) #probably not what u had in mind :-)" ], "metadata": {} }, { "source": [ "Or one can use `apply` to pass an arbitrary function to the sub-dataframe. This one takes the dataframe as argument." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 30, "input": [ "region_groupby.apply(lambda f: f.mean())" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 31, "input": [ "region_groupby.apply(lambda f: f.palmitic.mean())" ], "metadata": {} }, { "source": [ "Let's rename the columns in `dfbymean` and `dfbystd`." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 32, "input": [ "renamedict_std={k:k+\"_std\" for k in acidlist}\n", "renamedict_mean={k:k+\"_mean\" for k in acidlist}\n", "dfbystd.rename(inplace=True, columns=renamedict_std)\n", "dfbymean.rename(inplace=True, columns=renamedict_mean) \n", "dfbystd.head()" ], "metadata": {} }, { "source": [ "Pandas can do general merges. When we do that along an index, it's called a `join` (documentation). Here we make two sub-dataframes and join them on the common region index." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 33, "input": [ "dfpalmiticmean = dfbymean[['palmitic_mean']] \n", "dfpalmiticstd = dfbystd[['palmitic_std']] \n", "\n", "newdfbyregion=dfpalmiticmean.join(dfpalmiticstd)\n", "newdfbyregion.head()" ], "metadata": {} }, { "source": [ "###YOUR TURN NOW (10 minutes)\n", "\n", "Let's weight the palmitic acids content by a random weight. We'll first extract a subset of columns from `df` and then you will write a function to weigh the palmitic content by this random weight, delivering a weighted palmitic mean in the final dataframe." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 34, "input": [ "df.shape" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 35, "input": [ "weights=np.random.uniform(size=df.shape[0])\n", "smallerdf=df[['palmitic']]\n", "otherdf=df[['region']]\n", "otherdf['weight'] = weights\n", "otherdf.head()" ], "metadata": {} }, { "source": [ "Join `smallerdf` and `otherdf` on the index, into smallerdf" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 36, "input": [ "#your code here\n" ], "metadata": {} }, { "source": [ "Now lets use these weights to compute a weighted average over the palmitic column." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 37, "input": [ "#your code here\n" ], "metadata": {} }, { "source": [ "Finally aggregate the column percentages by summing them up over the regions." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 38, "input": [ "#your code here\n" ], "metadata": {} }, { "source": [ "## One Dimensional Exploratory Data Analysis (EDA) with Pandas" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 39, "input": [ "rkeys=[1,2,3]\n", "rvals=['South','Sardinia','North']\n", "rmap={e[0]:e[1] for e in zip(rkeys,rvals)}\n", "rmap" ], "metadata": {} }, { "source": [ "Let's get a dataframe with just the acids." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 40, "input": [ "mdf2=df.groupby('region').aggregate(np.mean)\n", "mdf2=mdf2[acidlist]\n", "mdf2.head()" ], "metadata": {} }, { "source": [ "Let's make a bar plot of the relative mean percentages of the acids. In pandas this is as simple as:" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 41, "input": [ "ax=mdf2.plot(kind='barh', stacked=True)\n", "ax.set_yticklabels(rvals)\n", "ax.set_xlim([0,100])" ], "metadata": {} }, { "source": [ "Well, that's kind of ugly. In the appendix we have some code showing how you can clean this plot up." ], "cell_type": "markdown", "metadata": {} }, { "source": [ "The above graph get's proportions of all the acids in each region. We can ask the opposite question: for each acid, what's the distribution of regions?" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 42, "input": [ "fig, axes=plt.subplots(figsize=(10,20), nrows=len(acidlist), ncols=1)\n", "i=0\n", "colors=[dark2_cmap.mpl_colormap(col) for col in [1.0,0.5,0.0]]\n", "for ax in axes.flatten():\n", " acid=acidlist[i]\n", " seriesacid=df[acid]#get the Pandas series\n", " minmax=[seriesacid.min(), seriesacid.max()]\n", " counts=[]\n", " nbins=30\n", " histbinslist = np.linspace(minmax[0], minmax[1], nbins)\n", " counts=-np.diff([seriesacid[seriesacid>x].count() for x in histbinslist]).min()\n", " for k,g in df.groupby('region'):\n", " style = {'histtype':'step', 'color':colors[k-1], 'alpha':1.0, 'bins':histbinslist, 'label':rmap[k]}\n", " ax.hist(g[acid],**style)\n", " ax.set_xlim(minmax)\n", " ax.set_title(acid)\n", " ax.grid(False)\n", " #construct legend\n", " ax.set_ylim([0, counts])\n", " ax.legend()\n", " i=i+1\n", "fig.tight_layout()\n" ], "metadata": {} }, { "source": [ "You can make a mask!" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 43, "input": [ "mask=(df.eicosenoic < 0.05)\n", "mask" ], "metadata": {} }, { "source": [ "The first gives a count, the second is a shortcut to get a probability!" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 44, "input": [ "np.sum(mask), np.mean(mask)" ], "metadata": {} }, { "source": [ "Pandas supports conditional indexing: documentation" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 45, "input": [ "loweico=df[df.eicosenoic < 0.02]\n", "pd.crosstab(loweico.area, loweico.region)" ], "metadata": {} }, { "source": [ "### YOUR TURN NOW (10 minutes)\n", "\n", "You can see that oleic dominates, and doesn't let us see much about the other acids. Remove it and let's draw bar plots again." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 46, "input": [ "acidlistminusoleic=['palmitic', 'palmitoleic', 'stearic', 'linoleic', 'linolenic', 'arachidic', 'eicosenoic']\n", "#your code here\n" ], "metadata": {} }, { "source": [ "**Note that there are no eicosenoic acids in regions 2 and 3, which are Sardinia and the North respectively**" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "## Two-dimensional EDA with Pandas" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "Let's write code to scatterplot acid against acid color coded by region. A more polished version is in the appendix" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 47, "input": [ "# just do the boxplot without the marginals to split the north out\n", "def make2d(df, scatterx, scattery, by=\"region\", labeler={}):\n", " figure=plt.figure(figsize=(8,8))\n", " ax=plt.gca()\n", " cs=list(np.linspace(0,1,len(df.groupby(by))))\n", " xlimsd={}\n", " ylimsd={}\n", " xs={}\n", " ys={}\n", " cold={}\n", " for k,g in df.groupby(by):\n", " col=cs.pop()\n", " x=g[scatterx]\n", " y=g[scattery]\n", " xs[k]=x\n", " ys[k]=y\n", " c=dark2_cmap.mpl_colormap(col)\n", " cold[k]=c\n", " ax.scatter(x, y, c=c, label=labeler.get(k,k), s=40, alpha=0.4);\n", " xlimsd[k]=ax.get_xlim()\n", " ylimsd[k]=ax.get_ylim()\n", " xlims=[min([xlimsd[k][0] for k in xlimsd.keys()]), max([xlimsd[k][1] for k in xlimsd.keys()])]\n", " ylims=[min([ylimsd[k][0] for k in ylimsd.keys()]), max([ylimsd[k][1] for k in ylimsd.keys()])]\n", " ax.set_xlim(xlims)\n", " ax.set_ylim(ylims)\n", " ax.set_xlabel(scatterx)\n", " ax.set_ylabel(scattery)\n", " ax.grid(False)\n", " return ax\n", "a=make2d(df, \"linoleic\",\"arachidic\", labeler=rmap)\n", "a.legend(loc='upper right');" ], "metadata": {} }, { "source": [ "**A nonlinear classifier could separate the north from Sardinia!**" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "We use the really ugly trellis rplot interface in Pandas to do some hierarchical digging. We plot oleic against linoleic. **We can split Sardinia. We might be able to split East Liguria out but there could be significant misclassification.**" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 48, "input": [ "import pandas.tools.rplot as rplot\n", "dfcopy=df.copy()\n", "dfcopy['region']=dfcopy['region'].map(rmap)\n", "imap={e[0]:e[1] for e in zip (df.area.unique(), df.areastring.unique())}\n", "#dfcopy['area']=dfcopy['area'].map(imap)\n", "plot = rplot.RPlot(dfcopy, x='linoleic', y='oleic');\n", "plot.add(rplot.TrellisGrid(['region', '.']))\n", "plot.add(rplot.GeomPoint(size=40.0, alpha=0.3, colour=rplot.ScaleRandomColour('area')));\n", "\n", "fig=plot.render()\n", "print df.areastring.unique()\n" ], "metadata": {} }, { "source": [ "### YOUR TURN NOW (10 minutes)" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "Plot palmitoleic against palimitic. **What can you separate?** Use the `dfcopy` dataframe." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 49, "input": [ "#your code here\n" ], "metadata": {} }, { "source": [ "## Appendix: For you to try at home" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "### Marginal data: Rug plots and histograms " ], "cell_type": "markdown", "metadata": {} }, { "source": [ "This code allows you to plot marginals using rug plots and histograms" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 50, "input": [ "#adapted from https://github.com/roban/quarum/blob/master/margplot.py\n", "from mpl_toolkits.axes_grid1 import make_axes_locatable\n", "def setup_mhist(axes, figure):\n", " ax1=axes\n", " divider = make_axes_locatable(ax1)\n", " ax2 = divider.append_axes(\"top\", 1.5, pad=0.0, sharex=ax1)\n", " ax3 = divider.append_axes(\"right\", 1.5, pad=0.0, sharey=ax1)\n", " #xscale=yscale='log'\n", " #ax2.set_yscale(yscale)\n", " #ax3.set_xscale(xscale)\n", " #ax2.set_ylim([0,1])\n", " #ax3.set_xlim([0,5])\n", " ax2.grid(False)\n", " ax3.grid(False)\n", " ax2.grid(axis=\"y\", color=\"white\", linestyle='-', lw=1)\n", " ax3.grid(axis=\"x\", color=\"white\", linestyle='-', lw=1)\n", " remove_border(ax2, right=True, left=False)\n", " remove_border(ax3, right=False, left=True, bottom=False, top=True)\n", " figure.subplots_adjust(left=0.15, right=0.95)\n", " return [ax1,ax2,ax3]\n", "\n", "#BUG: need to get appropriate min and max amongst the multiple marginal hists\n", "#BUG: need to get highest frequency marked as label when we do this.\n", "def make_mhist(axeslist, x, y, color='b', mms=8):\n", " ax1 = axeslist[0]\n", " ax2 = axeslist[1]\n", " ax3 = axeslist[2]\n", " #print list(ax2.get_yticklabels())\n", " for tl in (ax2.get_xticklabels() + ax2.get_yticklabels() +\n", " ax3.get_xticklabels() + ax3.get_yticklabels()):\n", " tl.set_visible(False)\n", " #for tl in ( ax2.get_xticklabels() + ax3.get_yticklabels()):\n", " # tl.set_visible(False)\n", " histbinslist = [np.ceil(len(x)/20.), np.ceil(len(y)/20.)]\n", " histbinslist = copy.copy(histbinslist)\n", " #style = {'histtype':'stepfilled', 'color':color, 'alpha':0.6, 'normed':True, 'stacked':True}\n", " style = {'histtype':'stepfilled', 'color':color, 'alpha':0.4}\n", " nbins = histbinslist[0]\n", " x_range = [np.min(x), np.max(x)]\n", " histbinslist[0] = np.linspace(x_range[0], x_range[1], nbins)\n", "\n", " ax2.hist(x, histbinslist[0], **style)\n", "\n", " nbins = histbinslist[1]\n", " y_range = [np.min(y), np.max(y)]\n", " histbinslist[1] = np.linspace(y_range[0], y_range[1], nbins)\n", " ax3.hist(y, histbinslist[1], orientation='horizontal', **style)" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 51, "input": [ "import random\n", "import copy\n", "def scatter_by(df, scatterx, scattery, by=None, figure=None, axes=None, colorscale=dark2_cmap, labeler={}, mfunc=None, setupfunc=None, mms=8):\n", " cs=copy.deepcopy(colorscale.mpl_colors)\n", " if not figure:\n", " figure=plt.figure(figsize=(8,8))\n", " if not axes:\n", " axes=figure.gca()\n", " x=df[scatterx]\n", " y=df[scattery]\n", " if not by:\n", " col=random.choice(cs)\n", " axes.scatter(x, y, cmap=colorscale, c=col)\n", " if setupfunc:\n", " axeslist=setupfunc(axes, figure)\n", " else:\n", " axeslist=[axes]\n", " if mfunc:\n", " mfunc(axeslist,x,y,color=col, mms=mms)\n", " else:\n", " cs=list(np.linspace(0,1,len(df.groupby(by))))\n", " xlimsd={}\n", " ylimsd={}\n", " xs={}\n", " ys={}\n", " cold={}\n", " for k,g in df.groupby(by):\n", " col=cs.pop()\n", " x=g[scatterx]\n", " y=g[scattery]\n", " xs[k]=x\n", " ys[k]=y\n", " c=colorscale.mpl_colormap(col)\n", " cold[k]=c\n", " axes.scatter(x, y, c=c, label=labeler.get(k,k), s=40, alpha=0.3);\n", " xlimsd[k]=axes.get_xlim()\n", " ylimsd[k]=axes.get_ylim()\n", " xlims=[min([xlimsd[k][0] for k in xlimsd.keys()]), max([xlimsd[k][1] for k in xlimsd.keys()])]\n", " ylims=[min([ylimsd[k][0] for k in ylimsd.keys()]), max([ylimsd[k][1] for k in ylimsd.keys()])]\n", " axes.set_xlim(xlims)\n", " axes.set_ylim(ylims)\n", " if setupfunc:\n", " axeslist=setupfunc(axes, figure)\n", " else:\n", " axeslist=[axes]\n", " if mfunc:\n", " for k in xs.keys():\n", " mfunc(axeslist,xs[k],ys[k],color=cold[k], mms=mms);\n", " axes.set_xlabel(scatterx);\n", " axes.set_ylabel(scattery);\n", " \n", " return axes\n", "\n", "def make_rug(axeslist, x, y, color='b', mms=8):\n", " axes=axeslist[0]\n", " zerosx1=np.zeros(len(x))\n", " zerosx2=np.zeros(len(x))\n", " xlims=axes.get_xlim()\n", " ylims=axes.get_ylim()\n", " zerosx1.fill(ylims[1])\n", " zerosx2.fill(xlims[1])\n", " axes.plot(x, zerosx1, marker='|', color=color, ms=mms)\n", " axes.plot(zerosx2, y, marker='_', color=color, ms=mms)\n", " axes.set_xlim(xlims)\n", " axes.set_ylim(ylims)\n", " return axes\n", " \n", "#BUG: remove ticks and maybe even border on top and right" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 52, "input": [ "ax=scatter_by(df, 'linoleic', 'eicosenoic', by='region', labeler=rmap, mfunc=make_rug, mms=20)\n", "ax.grid(False)\n", "ax.legend(loc='upper right');" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 53, "input": [ "ax=scatter_by(df, 'linoleic', 'arachidic', by='region', labeler=rmap, setupfunc=setup_mhist, mfunc=make_mhist, mms=20)\n", "ax.grid(False)\n", "ax.legend(loc='upper right');" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 54, "input": [ "ax=scatter_by(df, 'linoleic', 'eicosenoic', by='region', labeler=rmap, setupfunc=setup_mhist, mfunc=make_mhist, mms=20)\n", "ax.grid(False)\n", "ax.legend(loc='upper right');" ], "metadata": {} }, { "source": [ "### Probability distributions" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 55, "input": [ "import scipy.stats as stats\n", "mu=0.\n", "sigma=1.\n", "samples=np.random.normal(mu, sigma, 10000)\n", "plt.hist(samples,bins=25, normed=True)\n", "nd=stats.norm()\n", "plt.hist(nd.rvs(size=10000), bins=25, alpha=0.5,normed=True)\n", "x=np.linspace(-4.0,4.0,100)\n", "plt.plot(x,nd.pdf(x))\n", "plt.plot(x,nd.cdf(x))" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 56, "input": [ "mean = [0,0]\n", "cov = [[1,0],[0,5]] # diagonal covariance, points lie on x or y-axis\n", "m=300\n", "nrvs = np.random.multivariate_normal(mean,cov,(m,m))\n", "duets=nrvs.reshape(m*m,2)\n", "print duets[:,1]\n", "normaldf=pd.DataFrame(dict(x=duets[:,0], y=duets[:,1]))\n", "normaldf.head()\n", "ax=scatter_by(normaldf, 'x', 'y', figure=plt.figure(figsize=(8,10)),setupfunc=setup_mhist, mfunc=make_mhist, mms=20)\n", "#ax.grid(False)" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 57, "input": [ "H, xedges, yedges = np.histogram2d(normaldf.x, normaldf.y, bins=(50, 50), normed=True)\n", "extent = [xedges[0], xedges[-1], yedges[-1], yedges[0]]\n", "plt.imshow(H, extent=extent, interpolation='nearest')\n", "plt.colorbar()" ], "metadata": {} }, { "source": [ "### Miscellaneous Pandas Plotting tools: scatters, boxplots, and parallel co-ordinates" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 58, "input": [ "from pandas.tools.plotting import scatter_matrix\n", "scatter_matrix(df[['linoleic','arachidic','eicosenoic']], alpha=0.3, figsize=(10, 10), diagonal='kde');" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 59, "input": [ "plt.figure(figsize=(24,5))\n", "for key, group in df.groupby('region'):\n", " plt.subplot(int('13'+str(key)))\n", " group[acidlistminusoleic].boxplot(grid=False)\n", " ax=plt.gca()\n", " ax.set_title(rvals[key-1])\n", " remove_border(ax, left=False, bottom=False)\n", " ax.grid(axis=\"y\", color=\"gray\", linestyle=':', lw=1)\n" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 60, "input": [ "from pandas.tools.plotting import parallel_coordinates\n", "dfna=df[['region', 'palmitic', 'palmitoleic', 'stearic', 'oleic', 'linolenic', 'linoleic', 'arachidic', 'eicosenoic']]\n", "dfna_norm = (dfna - dfna.mean()) / (dfna.max() - dfna.min())\n", "dfna_norm['region']=df['region'].map(lambda x: rmap[x])\n", "parallel_coordinates(dfna_norm, 'region', colors=[dark2_cmap.mpl_colormap(col) for col in [1.0,0.5,0.0]], alpha=0.05)" ], "metadata": {} }, { "source": [ "### Improving the pandas histograms" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 61, "input": [ "ax2=mdf2.plot(kind='barh', stacked=True, color=dark2_colors, grid=False, legend=False)\n", "remove_border(ax2, left=False, bottom=False)\n", "ax2.grid(axis=\"x\", color=\"white\", linestyle='-', lw=1)\n", "ax2.legend(loc='right', bbox_to_anchor=(1.3,0.5))\n", "labels2=['South','Sardinia','North']\n", "ax2.set_yticklabels(labels2)\n", "ax2.set_ylabel('');\n", "ax2.set_xlim(right=100.0);" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 62, "input": [ "#your code here\n" ], "metadata": {} }, { "source": [ "### More details are only good at times" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "It's hard to understand the graph below. A hierarchical approach as we have used is better." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 63, "input": [ "fig=plt.figure(figsize=(10,10))\n", "ax=scatter_by(df, 'linoleic', 'arachidic', by='area', figure=fig, labeler=imap, setupfunc=setup_mhist, mfunc=make_mhist, mms=20)\n", "ax.grid(False)\n", "ax.legend(loc='right', bbox_to_anchor=(1.7,0.5));" ], "metadata": {} }, { "source": [ "On the other hand, inspecting loads of scatter plots is not a bad idea!" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 64, "input": [ "indices=np.tril_indices(8)\n", "plts=[]\n", "for i,j in zip(indices[0], indices[1]):\n", " if i!=j:\n", " plts.append((i,j))\n", "print plts" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 65, "input": [ "fig, axes = plt.subplots(nrows=14, ncols=2, figsize=(14,40));\n", "k=0\n", "af=axes.flatten()\n", "for a in af:\n", " i,j=plts[k]\n", " a=scatter_by(df, acidlist[i], acidlist[j], by='region', axes=a, labeler=rmap, mfunc=make_rug, mms=20);\n", " a.grid(False);\n", " k=k+1\n", "af[0].legend(loc='best');\n", "fig.tight_layout();" ], "metadata": {} } ], "metadata": {} } ] }