{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "source": [ "Lab 3: Exploratory Data Analysis for Classification using Pandas and Matplotlib" ], "cell_type": "heading", "metadata": {}, "level": 1 }, { "source": [ "### Preliminary plotting stuff to get things going" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 1, "input": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 2, "input": [ "!~/anaconda/bin/pip install brewer2mpl" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 3, "input": [ "import brewer2mpl\n", "from matplotlib import rcParams\n", "\n", "#colorbrewer2 Dark2 qualitative color table\n", "dark2_cmap = brewer2mpl.get_map('Dark2', 'Qualitative', 7)\n", "dark2_colors = dark2_cmap.mpl_colors\n", "\n", "rcParams['figure.figsize'] = (10, 6)\n", "rcParams['figure.dpi'] = 150\n", "rcParams['axes.color_cycle'] = dark2_colors\n", "rcParams['lines.linewidth'] = 2\n", "rcParams['axes.facecolor'] = 'white'\n", "rcParams['font.size'] = 14\n", "rcParams['patch.edgecolor'] = 'white'\n", "rcParams['patch.facecolor'] = dark2_colors[0]\n", "rcParams['font.family'] = 'StixGeneral'\n", "\n", "\n", "def remove_border(axes=None, top=False, right=False, left=True, bottom=True):\n", " \"\"\"\n", " Minimize chartjunk by stripping out unnecessary plot borders and axis ticks\n", " \n", " The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn\n", " \"\"\"\n", " ax = axes or plt.gca()\n", " ax.spines['top'].set_visible(top)\n", " ax.spines['right'].set_visible(right)\n", " ax.spines['left'].set_visible(left)\n", " ax.spines['bottom'].set_visible(bottom)\n", " \n", 
" #turn off all ticks\n", " ax.yaxis.set_ticks_position('none')\n", " ax.xaxis.set_ticks_position('none')\n", " \n", " #now re-enable ticks on the visible borders\n", " if top:\n", "     ax.xaxis.tick_top()\n", " if bottom:\n", "     ax.xaxis.tick_bottom()\n", " if left:\n", "     ax.yaxis.tick_left()\n", " if right:\n", "     ax.yaxis.tick_right()" ], "metadata": {} }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 4, "input": [ "pd.set_option('display.width', 500)\n", "pd.set_option('display.max_columns', 100)" ], "metadata": {} }, { "source": [ "## 1. The Olive Oils dataset" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "Some of the following text is taken from the rggobi book (http://www.ggobi.org/book/). It is an excellent book on visualization and EDA for classification, and is freely available as a PDF from Hollis for those with a Harvard ID. Even though the book uses ggobi, much of the same analysis can be done in Mondrian or directly in Matplotlib/Pandas (albeit not interactively)." ], "cell_type": "markdown", "metadata": {} }, { "source": [ "