{ "metadata": { "name": "", "signature": "sha256:58a7340594b76df16e55dc6affcd468a95360232c6cbfa80034fd59c6082e4b9" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Interacting with Glue from IPython" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Glue](http://glueviz.org) makes it easy to build linked, interactive statistical graphs from files and python datasets. One of Glue's nice features for interactive data analysis is the ability to run Glue and a \"normal\" python session in parallel. This lets you extract data, plots, and data selections from Glue, or send information back to Glue. Here's a demo, using a catalog of FBI Crime Statistics." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting up IPython\n", "\n", "Glue is a Qt program, and we need to run a special IPython magic function to properly setup \n", "interaction between Qt and IPython. Without it, IPython will be unresponsive while Glue is running." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from glue import qglue\n", "import pandas as pd\n", "\n", "# set up IPython/Qt integration\n", "# NOTE: this cell takes a second to run. For some reason,\n", "# IPython will stall if you try to run the next cell before this one completes\n", "%gui qt4 " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading Data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "states = pd.read_csv('state_crime.csv')\n", "states.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearPopulationViolent Crime rateMurderRapeRobberyAssaultPropertyBurglaryLarcenyVehicularState
0 1960 3266740 186.6 12.4 8.6 27.5 138.1 1035.4 355.9 592.1 87.3 Alabama
1 1961 3302000 168.5 12.9 7.6 19.1 128.9 985.5 339.3 569.4 76.8 Alabama
2 1962 3358000 157.3 9.4 6.5 22.5 119.0 1067.0 349.1 634.5 83.4 Alabama
3 1963 3347000 182.7 10.2 5.7 24.7 142.1 1150.9 376.9 683.4 90.6 Alabama
4 1964 3407000 213.1 9.3 11.7 29.1 163.0 1358.7 466.6 784.1 108.0 Alabama
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ " Year Population Violent Crime rate Murder Rape Robbery Assault \\\n", "0 1960 3266740 186.6 12.4 8.6 27.5 138.1 \n", "1 1961 3302000 168.5 12.9 7.6 19.1 128.9 \n", "2 1962 3358000 157.3 9.4 6.5 22.5 119.0 \n", "3 1963 3347000 182.7 10.2 5.7 24.7 142.1 \n", "4 1964 3407000 213.1 9.3 11.7 29.1 163.0 \n", "\n", " Property Burglary Larceny Vehicular State \n", "0 1035.4 355.9 592.1 87.3 Alabama \n", "1 985.5 339.3 569.4 76.8 Alabama \n", "2 1067.0 349.1 634.5 83.4 Alabama \n", "3 1150.9 376.9 683.4 90.6 Alabama \n", "4 1358.7 466.6 784.1 108.0 Alabama " ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sending to Glue\n", "\n", "[qglue](http://www.glueviz.org/en/latest/glue_from_python.html?highlight=qglue#quickly-send-data-to-glue-with-qglue) is an easy way to send python data structures (Numpy arrays, Pandas dataframes, Astropy tables, others) to glue. It returns an application object wich contains lots of state about the application. One of the most important pieces of this state is the data collection:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "app = qglue(states=states)\n", "dc = app.data_collection" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data Collections are list-like, and contain each dataset passed to Glue (only one in our case):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print dc" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "DataCollection (1 data set)\n", "\t 0: states\n" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "data = dc[0]\n", "print type(data)\n", "print data" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Data Set: statesNumber of dimensions: 1\n", "Shape: 2751\n", "Components:\n", " 0) Year\n", " 1) Pixel Axis 0\n", " 2) World 0\n", " 3) Population\n", " 4) Violent Crime rate\n", " 5) Murder\n", " 6) Rape\n", " 7) Robbery\n", " 8) Assault\n", " 9) Property\n", " 10) Burglary\n", " 11) Larceny\n", " 12) Vehicular\n", " 13) State\n" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Individual datasets in Glue are dictionary-like: we extract arrays using bracket notation" ] }, { "cell_type": "code", "collapsed": false, "input": [ "data['Murder']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "array([ 12.4, 12.9, 9.4, ..., 4.8, 4.7, 4.7])" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "I've created a few basic graphs in Glue, which look like this" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use Scikit-learn to run a simple K-means clustering on the data, and send \n", "the cluster IDs back to Glue as new subsets" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cluster import KMeans\n", "import numpy as np\n", "\n", "# extract data into Numpy [N,3] array\n", "X = np.column_stack((data['Robbery'], data['Rape'], data['Murder']))\n", "clusters = KMeans(n_clusters=3).fit_predict(X)\n", "\n", "# add cluster_id as a new attribute\n", "c = data.add_component(clusters, 'cluster_id')\n", "\n", "# create 3 new subsets, that select each value in clusters\n", "dc.new_subset_group(label='Cluster 1', subset_state = (c == 0))\n", "dc.new_subset_group(label='Cluster 2', subset_state = (c == 1))\n", "dc.new_subset_group(label='Cluster 3', subset_state = (c == 2))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plots update automatically, coloring the new clusters\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting Data From Glue" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data objects also have a `to_dataframe()` method which convert the Glue data back to a DataFrame. Note that the new `cluster_id` attribute is included in the output" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df = data.to_dataframe()\n", "print df.columns\n", "cuts = pd.cut(df.Robbery, 10)\n", "df.groupby(cuts).Murder.mean()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Index([u'Assault', u'Burglary', u'Larceny', u'Murder', u'Pixel Axis 0', u'Population', u'Property', u'Rape', u'Robbery', u'State', u'Vehicular', u'Violent Crime rate', u'World 0', u'Year', u'cluster_id'], dtype='object')\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "Robbery\n", "(0.267, 165.22] 5.288493\n", "(165.22, 328.54] 8.553755\n", "(328.54, 491.86] 11.371795\n", "(491.86, 655.18] 16.909091\n", "(655.18, 818.5] 31.235714\n", "(818.5, 981.82] 37.100000\n", "(981.82, 1145.14] 41.614286\n", "(1145.14, 1308.46] 64.050000\n", "(1308.46, 1471.78] 31.100000\n", "(1471.78, 1635.1] 34.350000\n", "Name: Murder, dtype: float64" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Subsets can also be extracted from Glue, and converted into boolean masks. This is useful for investigating selections defined by hand:\n", "\n", "" ] }, { "cell_type": "code", "collapsed": false, "input": [ "outliers = data.subsets[0].to_mask()\n", "print \"Selected %i rows\" % outliers.sum()\n", "print outliers\n", "df[outliers].head()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Selected 189 rows\n", "[False False False ..., False False False]\n" ] }, { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AssaultBurglaryLarcenyMurderPixel Axis 0PopulationPropertyRapeRobberyStateVehicularViolent Crime rateWorld 0Yearcluster_id
53 45.1 332.1 970.5 10.2 53 226167 1544.9 20.8 28.3 Alaska 242.3 104.3 53 1960 1
55 54.5 351.6 985.4 4.5 55 246000 1564.6 18.7 13.8 Alaska 227.6 91.5 55 1962 1
56 66.1 381.5 1213.7 6.5 56 248000 1952.8 14.9 22.2 Alaska 357.7 109.7 56 1963 1
104 466.1 394.0 2052.1 4.1 104 723860 2637.8 60.2 79.6 Alaska 191.7 610.1 104 2011 1
105 433.2 403.3 2128.0 4.1 105 731449 2739.4 79.7 86.1 Alaska 208.1 603.2 105 2012 1
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ " Assault Burglary Larceny Murder Pixel Axis 0 Population Property \\\n", "53 45.1 332.1 970.5 10.2 53 226167 1544.9 \n", "55 54.5 351.6 985.4 4.5 55 246000 1564.6 \n", "56 66.1 381.5 1213.7 6.5 56 248000 1952.8 \n", "104 466.1 394.0 2052.1 4.1 104 723860 2637.8 \n", "105 433.2 403.3 2128.0 4.1 105 731449 2739.4 \n", "\n", " Rape Robbery State Vehicular Violent Crime rate World 0 Year \\\n", "53 20.8 28.3 Alaska 242.3 104.3 53 1960 \n", "55 18.7 13.8 Alaska 227.6 91.5 55 1962 \n", "56 14.9 22.2 Alaska 357.7 109.7 56 1963 \n", "104 60.2 79.6 Alaska 191.7 610.1 104 2011 \n", "105 79.7 86.1 Alaska 208.1 603.2 105 2012 \n", "\n", " cluster_id \n", "53 1 \n", "55 1 \n", "56 1 \n", "104 1 \n", "105 1 " ] } ], "prompt_number": 19 } ], "metadata": {} } ] }